Error Context - a stack trace for data

Introduction

This article describes how you can capture local variables in an error context stack that gets logged on a crash. It outlines how to implement such a feature and introduces the feature in my logging library Loguru.

The problem

Logging is a great tool for, among other things, debugging crashes. By crash I mean "unscheduled exit" of any kind, be it a SIGSEGV or an assertion failure. If – oh, let's face it: when – your program crashes a log will help you:

  • See what happened that lead up to the crash (previous log lines).
  • See what the program was doing at the time of a crash (the stack trace).
  • See direct cause of crash (SIGSEGV, what assert failed, ...)

That's great, but sometimes it is not enough. Often a crash is the result of unexpected data. For instance, a crash could be caused by an extra space at the end of a string, or an integer being negative when your code didn't expect it to be. Or maybe the crash happened during the processing of a bunch of files and you want to know the name of the file that caused the crash. What you really want is the ability to inspect the values of arguments and locals at different parts of the stack. (Of course a debugger is the best tool for this job, but often you are using the log file to debug a crash in production or a crash you can't yet reproduce).

The naïve approach is to log all data that might be interesting. That will certainly help with the debugging but it has two huge downsides:

  • It will fill up the log file with a lot of spam, drastically decreasing its readability.
  • It might slow down your program significantly to log that much information.

What you really want is to only log that data if there is a crash. That's exactly like the stack trace, by the way: you don't constantly log which function is being executed, only when there is a crash do you want that info. But can we design something like that for data?

Solution outline

It turns out we can! At least in C++ with the help of RAII. Here's the outline of a solution:

struct DataInfo
{
	const char* description;
	const char* value;
};

thread_local std::vector<DataInfo> thread_error_stack;

struct ErrorContextScope
{
	ErrorContextScope(const char* description, const char* value)
	{
		thread_error_stack.push_back(DataInfo{description, value});
	}
	~ErrorContextScope()
	{
		thread_error_stack.pop_back();
	}
};

void print_error_context()
{
	for (const DataInfo& info : thread_error_stack) {
		fprintf(stderr, "%s: %s", info.description, info.value);
	}
}

MyLoggingLibrary::add_crash_callback(print_error_context);

// --------------------- Usage:

void production_code(const char* file_name, const char* unit_id)
{
	ErrorContextScope ec_file_name("file_name", file_name);
	ErrorContextScope ec_unit_id("unit_id", unit_id);
	crashy_code();
}

Now, if crashy_code crashes, the values of file_name and unit_id will be written to the log file. If it doesn't crash, nothing is logged and the code did not slow down by much at all. I first saw this idea in the game engine Bitsquid but this direct approach has several limitations:

  • Coming up with dummy stack variable names like ec_file_name and ec_unit_id is tedious.
  • You might want to also print file:line of where the x/y variables are captured.
  • Supports only one type (const char*) - you probably want to capture floats, integers and your own custom types.

A few weeks ago I decided to come up with a solution to these limitations. The first two problems can be easily solved by a macro, but supporting custom types is more difficult — especially if you still want to keep the code snappy. It took me a while to figure out how to do that in a library-friendly manner, but in the end I came up with an approach where not only the data is captured for each stack entry, but also a pointer to a function for formatting that data into a string. That way different types will be formatted correctly.

Loguru solution

I've added support for Error Contexts in my logging library Loguru. Out of the box they can capture const char*, char, all integer types and float/double/long double. Furthermore, you can add support for your own types with a few lines of code (see loguru::ec_to_text in loguru.hpp for details). Here's what using it looks like:

void process_customers(const std::string& filename)
{
    ERROR_CONTEXT("Processing file", filename.c_str());
    for (size_t i = 0; i < num_customers; ++i) {
        ERROR_CONTEXT("Customer index", i);
        if (i == 42) { crashy_code(); }
    }
}

And when you get a crash:

[...stack trace...]
------------------------------------------------
[ErrorContext]     main.cpp:3   Processing file:    'customers.json '
[ErrorContext]     main.cpp:6   Customer index:     42
------------------------------------------------

Notice how the quotes around 'customers.json ' helps you discover that there is an extra space there, which could be the cause of the crash.

EDIT

Thanks to this great comment on reddit I've modified the Loguru code quite a bit. One problem with the std::vector<ErrorContextScope> approach was that I had a hard-coded upper limit on the size of the data that could fit in an ErrorContextScope. The new solution solves this wonderfully:

I've broken up ErrorContextScope into an abstract base class (EcEntryBase) and a templated subclass (template<typename T> class EcEntryData : public EcEntryBase). Insteaf of a std::vector each thread now keep has a pointer to the latest EcEntryBase, and each EcEntryBase then keeps a pointer to the previous EcEntryBase. This forms a linked list of EcEntryBase where each node (EcEntryData) has a size that depends on the type. So if you write ERROR_CONTEXT("char", 'x') that error context will take up less space than a ERROR_CONTEXT("double", 3.14). This is even more important when the user prints custom data, e.g. large or non-POD objects.

The rewrite led to a lot cleaner code without any dynamic allocations (all allocation is done on the actual call stack).

In the future I'd love to be able to add support for capturing a lambda that gets executed only on a crash, returning a string to be written as the error context data. For instance: ERROR_CONTEXT_LAMBDA("Foo", [&](){ return calculate_debug_string(a,b,c) });. However, I have yet to figure out how to capture such a lambda in a fast and clean way.

Also, I've done some meassurements. On my MacBook Pro each ERROR_CONTEXT costs about 2ns. That super-cheap! Writing a single line to the log file costs about 4ms - 2000x slower!

Limitations

As great a tool as error contexts are, they do have a few limitations:

  • You need to explicitly capture each variable with an ERROR_CONTEXT.
  • For speed, you probably want to capture large data by reference (like filename.c_str() above where Loguru does not copy the string, just stores a pointer to it). This means you have to be careful not to change or destroy the data during the scope of the ERROR_CONTEXT.