Post-Mortem Memory Debugging

In our previous post, “Memory Management Bugs: An Introduction,” we discussed common errors in manual memory management. These errors are some of the most time-consuming and difficult to identify and resolve. At Backtrace, we’ve built automated analysis and classification into our platform to highlight important signals and reduce the pain associated with these and other classes of errors. This post introduces Backtrace’s memory allocator analysis and highlights use cases it serves that existing technologies do not. Through this analysis, we’ve detected over 48,000 heap errors in our users’ production and testing environments.

Introduction

Run-time memory error detection has received much attention over the last decade, producing powerful technologies such as Valgrind and AddressSanitizer. Unfortunately, all existing techniques involve significant trade-offs in memory or processor utilization and are limited in the environments that they support. For these reasons, it is not always feasible to achieve complete memory error coverage for an application with these tools.

Backtrace supports the default memory allocators of FreeBSD, NetBSD and Linux to help you debug memory errors and understand memory allocations. More specifically, support is provided for jemalloc, tcmalloc, ptmalloc and the FreeBSD kernel allocator. When a snapshot of a live process or a coredump is requested, Backtrace crawls, indexes and analyzes available allocator metadata. Allocation information for all encountered variables is extracted, crucial allocator statistics are archived and internal inconsistencies are highlighted.

All of this is possible without any modification to your memory allocator or run-time environments. Simply provide your debug symbols and a process identifier or core dump, and Backtrace takes care of the rest.

Case Study

Let’s narrow down a heap management error that was found in our object store during development. The data set used in testing exceeded various map limits in Valgrind, so Valgrind could not be used here.

The fault occurs when deallocating memory associated with a query response, as seen below. The Backtrace bt tool is used to extract the callstack of the faulting thread.

PID: 11274
--------------------------------------------------------------------------------
Thread 11322
  [  0] libpthread-2.23.so        __GI_pause (../sysdeps/unix/syscall-template.S:84)
  [  1] libc-2.23.so              0x7f5854fd94a0 (../sysdeps/posix/killpg.c:35)
* |  2| libjemalloc.so.2          je_tcache_dalloc_small (tcache.h:421)
        -----------------------------------------------------------------------
  |  3| libjemalloc.so.2          je_arena_dalloc (arena.h:1428)
  |  4| libjemalloc.so.2          je_idalloctm (jemalloc_internal.h:1070)
  |  5| libjemalloc.so.2          je_iqalloc (jemalloc_internal.h:1087)
  [  6] libjemalloc.so.2          ifree (jemalloc.c:1811)
  [  7] libjemalloc.so.2          free (src/jemalloc.c:1931)
  [  8] coronerd                  handler_query_response (/home/user/w/backtrace/coroner/server/src/url_handlers.c:315)
  [  9] coronerd                  cr_pu_master_base (/home/user/w/backtrace/coroner/server/src/pu.c:1759)
  [ 10] coronerd                  cr_io_enter (/home/user/w/backtrace/coroner/server/src/io.c:413)
  [ 11] coronerd                  cr_pu_enter (/home/user/w/backtrace/coroner/server/src/pu.c:1364)
  [ 12] coronerd                  httpd_server (/home/user/w/backtrace/coroner/server/src/url_handlers.c:857)
  [ 13] coronerd                  cr_pu_spawn_thread (/home/user/w/backtrace/coroner/server/src/pu.c:1007)
  [ 14] libpthread-2.23.so        start_thread

Let’s look at this issue in gdb. We begin by going to the relevant frame in handler_query_response. The line of code responsible for the fault is free(response). The string that response points to looks valid, yet free is clearly not happy about something.

(gdb) p response
$3 = 0x7f584af2cf5c "{\"version\":\"1.0\",\"encoding\":\"rle\",\"columns\":[\"fingerprint\",\"hostname\",\"dc\",\"environment\",\"tag\",\"tag_owner\"],\"objects\":[[\"1b4852a5b9e3bdba494a3b02b3b4e03b57d3c23c3638a1ad7028a2033d7022d6\", [[14312]]],["...

Even though some heap corruption or mismanagement occurred, it is unclear what form it took (see the list of common heap errors). Did an unrelated heap overflow corrupt heap metadata and cause the crash? Did some other heap misuse, such as a double free, trigger it? Are there other instances where this fault occurred due to unrelated corruption? If the cause were mismanagement of response itself, we could begin by auditing all code paths involving its memory management. Let’s look at this issue in Backtrace to see if we can narrow down the possibilities.

The Backtrace web console immediately highlights the fact that this particular error contains an invalid-free. In other words, free was called with an invalid address. With Backtrace, we also know that the callstack is unique and no recent heap corruption issues occurred before this work. This tells us that this is a regression introduced during development.

The developer then views the snapshot using the coroner get command. No external assets, such as debug symbols or the faulting application executable, are needed. A list of warnings is presented, and the one beginning Critical: Heap region starting at 0x7fbb... is selected so that the relevant variable and frame are highlighted. We then jump to the calling frame that interacts with this variable and immediately see that response is in a free state. In other words, this is a typical double-free condition. See below for a demo of this debugging session.

Evidence strongly suggests that the response object is being mismanaged, with no intervening reuse of its memory. Additional heap metadata made available by Backtrace tells us that the object was not recently freed on the same thread, which means the earlier free involved another thread. In this particular case, the response buffer is actually generated by another thread. We quickly narrowed the problem down to the likely possibility that another thread deallocates the buffer before the faulting thread deallocates it again. An audit of the error paths in the worker thread functions points us at an erroneous error label that causes premature destruction of the response buffer, seen below.

leave:
error:
        free(ct->result);
        json_object_put(query);
        return ct;
}

Rather than auditing all code paths involving memory management of response, we narrowed it down to one particular function with the help of heap classifiers, warnings and allocator state.
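To make the failure mode concrete, here is a minimal sketch of a cleanup path in which the success path skips the error-only cleanup. It is not the coronerd source: struct task, run_query and the scratch buffer are hypothetical stand-ins, and the label ordering is just one common way to avoid this kind of fall-through, assuming the result buffer should only be released on failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the worker's task context. */
struct task {
	char *result;	/* response buffer later freed by the consumer */
	int failed;	/* non-zero when the query could not be served */
};

static struct task *
run_query(struct task *ct, const char *input)
{
	char *scratch;

	assert(ct != NULL);
	ct->result = NULL;
	ct->failed = 0;

	/* Per-call temporary: released on every path. */
	scratch = malloc(64);
	if (scratch == NULL || input == NULL)
		goto error;

	ct->result = malloc(strlen(input) + 1);
	if (ct->result == NULL)
		goto error;
	strcpy(ct->result, input);

	/* Success: skip the error-only cleanup. */
	goto leave;

error:
	ct->failed = 1;
	free(ct->result);	/* free(NULL) is a no-op */
	ct->result = NULL;	/* leave no stale pointer behind */
leave:
	free(scratch);		/* shared cleanup for both paths */
	return ct;
}
```

With the labels in this order, the error path falls through into the shared cleanup, while the success path never touches the buffer that the consumer thread will free later.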

Capabilities

Below are descriptions of some of our memory allocator integration features, including error detection. Screenshots of some real-world heap errors that Backtrace has detected are included towards the end of this section.

Variable Allocation Information

Quickly rule out theories on memory mismanagement by being able to directly view variable allocation state. Detailed allocation information is provided for the selected variable including allocation size, status and auxiliary allocator information. Useful context is included such as whether or not a variable was likely to have been recycled recently.

Heap Statistics

Extract valuable information about the state of memory allocator usage. For user-space allocators, this includes detailed per-arena usage statistics across all sizes as well as statistics on interaction with your operating system. This helps you diagnose memory leaks and potentially other performance problems. For slab allocators, valuable information is provided for all slabs.

Classification

Classifiers are automatically added for successfully detected mismanagement errors such as double-free conditions, invalid pointers, size violations and inconsistent metadata. Reliably determining whether data corruption stems from an invalid pointer, a heap overflow or a double free is not easy. It requires either a holistic understanding of program state or consistency checks across a wide array of allocator data structures. Backtrace automates allocator consistency checks and, coupled with variable allocation information, helps you rule out possible sources of errors more effectively.

Double Free Detection

Active cases of double free conditions are highlighted. Additional information is included such as whether multiple threads may be involved as well as clues on the recency of the allocation.

Invalid Free Detection

If a pointer that does not refer to an allocation boundary is deallocated, undefined behavior may occur, ranging from silent corruption to a crash.
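As a rough illustration of what “allocation boundary” means, the sketch below classifies an address against a table of known regions. The struct region table and classify_pointer are hypothetical and chosen for this example. A real allocator's metadata is far more involved, and this is not a description of how Backtrace implements the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one allocated region. */
struct region {
	uintptr_t base;	/* address returned by the allocator */
	size_t size;	/* usable size of the region in bytes */
};

enum ptr_class {
	PTR_BOUNDARY,	/* exactly the start of a region: valid to free */
	PTR_INTERIOR,	/* inside a region: freeing it is an invalid free */
	PTR_UNKNOWN	/* not covered by any known region */
};

static enum ptr_class
classify_pointer(const struct region *regions, size_t n, uintptr_t p)
{
	assert(regions != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (p == regions[i].base)
			return PTR_BOUNDARY;
		if (p > regions[i].base && p - regions[i].base < regions[i].size)
			return PTR_INTERIOR;
	}
	return PTR_UNKNOWN;
}
```

Only a PTR_BOUNDARY address is a legitimate argument to free. The other two classes correspond to the invalid pointers described above.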

Type Mismatches

Stale pointers commonly manifest as region size mismatches as memory gets allocated. Backtrace makes sure detectable cases are highlighted.
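The idea can be sketched as a comparison between the size of the region a pointer lands in and the size of the type the program believes it points to. The size classes, the enum and the function names below are made up for illustration. Real allocators define their own classes, and a mismatch is only a hint (the pointer may legitimately refer to an array, for example).

```c
#include <assert.h>
#include <stddef.h>

enum size_verdict {
	SIZE_OK,		/* region matches the expected size class */
	SIZE_DIFFERENT_CLASS,	/* region is big enough but unexpected */
	SIZE_TOO_SMALL		/* region cannot hold the type at all */
};

/* Made-up size classes; real allocators define their own. */
static const size_t size_classes[] = { 8, 16, 32, 48, 64, 96, 128 };

/* Smallest class that fits the request, or 0 if it exceeds the table. */
static size_t
size_class_for(size_t request)
{
	size_t n = sizeof(size_classes) / sizeof(size_classes[0]);

	for (size_t i = 0; i < n; i++) {
		if (request <= size_classes[i])
			return size_classes[i];
	}
	return 0;
}

static enum size_verdict
check_region_size(size_t region_size, size_t type_size)
{
	size_t expected;

	assert(type_size > 0);
	if (region_size < type_size)
		return SIZE_TOO_SMALL;

	expected = size_class_for(type_size);
	if (expected == 0)
		return SIZE_OK;	/* beyond the table: no judgement made */
	if (expected != region_size)
		return SIZE_DIFFERENT_CLASS;
	return SIZE_OK;
}
```

A SIZE_TOO_SMALL verdict is the strong signal here: a 40-byte structure cannot live in a 16-byte region, so the pointer most likely outlived the allocation it once referred to.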

Heap Metadata Consistency

Heap overflows may manifest through heap metadata corruption. Internal memory allocator data structures are checked for inconsistencies and highlighted if any are detected.
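One simple kind of consistency check is verifying that a doubly linked list of free chunks agrees with itself in both directions. The sketch below uses a toy struct chunk that does not match the layout of any real allocator. It only shows the shape of such a check, under the assumption that an overflow has overwritten one of the link fields.

```c
#include <assert.h>
#include <stddef.h>

/* Toy free-chunk header; not the layout of any real allocator. */
struct chunk {
	struct chunk *prev;
	struct chunk *next;
	size_t size;
};

/*
 * Walk the list from head and count links where next->prev does not
 * point back at the current chunk. The walk is bounded by max_nodes so
 * that a corrupted list containing a cycle cannot loop forever.
 */
static size_t
count_broken_links(const struct chunk *head, size_t max_nodes)
{
	size_t broken = 0;
	const struct chunk *c = head;

	for (size_t i = 0; c != NULL && i < max_nodes; i++) {
		if (c->next != NULL && c->next->prev != c)
			broken++;
		c = c->next;
	}
	return broken;
}
```

A non-zero count means the list no longer describes a consistent set of chunks, which is the kind of inconsistency that would be highlighted.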

Gallery of Heap Errors

Stale pointer: {% img center http://backtrace.io/images/hydra_error.png 300 %}

Use-after-free: {% img center http://backtrace.io/images/hydra-ck.png 300 %}

Double free: {% img center http://backtrace.io/images/hydra-df-krn.png 300 %}

Invalid free: {% img center http://backtrace.io/images/hydra-ip.png 300 %}

Limitations

Backtrace heuristics are not a silver bullet. Backtrace relies on allocator internals to extract variable information at run-time as well as detect errors. If a variable is uninitialized or points into an offset of an allocated region, ptmalloc may not allow for reliable extraction of allocation state.

Error detection currently uses a single point in time of application state. If an application issues a particular series of allocations, it may erase evidence of heap mismanagement. For example, one vector for double-free detection is checking thread-local freelists for duplicate or overlapping regions. If a program were to allocate from the same freelist several times in a row, the duplicate entries would be consumed, and this mechanism would be defeated if used in isolation. If a stale pointer lives through many allocations, it is possible that the region of memory containing the memory allocator metadata is no longer valid. In this case, no allocation information would be present (but lack of allocator information is in itself a signal). For these reasons, several consistency checkers are required, and they involve checking state from multiple sources of data. In some cases, a user may want to generate multiple snapshots over time to catch such errors earlier in their lifecycle.
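The freelist check mentioned above can be sketched as a scan for duplicate or overlapping entries. The sketch assumes a freelist modelled as a flat array of addresses that all belong to one fixed-size bin. That model and the function name are simplifications for illustration, not the data structures of jemalloc, tcmalloc or ptmalloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if any two freelist entries are identical or closer
 * together than region_size, which means the same memory would be
 * handed out twice. Quadratic, which is fine for short lists.
 */
static bool
freelist_has_overlap(const uintptr_t *entries, size_t n, size_t region_size)
{
	assert(region_size > 0);

	for (size_t i = 0; i < n; i++) {
		for (size_t j = i + 1; j < n; j++) {
			uintptr_t lo = entries[i] < entries[j] ? entries[i] : entries[j];
			uintptr_t hi = entries[i] < entries[j] ? entries[j] : entries[i];

			if (hi - lo < region_size)
				return true;
		}
	}
	return false;
}
```

If the program allocates from this list before the snapshot is taken, the duplicate entries are popped and the scan finds nothing. That is why this check cannot be relied on in isolation.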

There are many other ways for important breadcrumbs to be removed, but in practice allocation- and pointer-intensive applications typically fault before all evidence has been removed.

The heuristics used by Backtrace are not perfect, but they have successfully detected thousands of latent bugs that customers did not know about, as well as live production errors.

Conclusion

We are constantly adding new analysis methods to help engineers prioritize and fix issues more effectively. Check in on our blog soon to learn about our stale pointer analysis that helps you find the sources of transient pointer and data corruption in your software.

If you’ve dealt with memory corruption before, we encourage you to try Backtrace today and see whether we can help make your life easier.