Have you ever had an assert trigger, only to end up with a useless core dump with missing variable information or an invalid call stack? Common factors that go into selecting a C or C++ compiler are availability, correctness, compilation speed and application performance. A factor that is often neglected is debug information quality, which symbolic debuggers rely on to reconcile the state of the executable with the source-code form that is familiar to most software engineers. When production builds of an application fail, the level of access to program state directly impacts the ability of a software engineer to investigate and fix a bug. If a compiler has optimized out a variable or is unable to express to a symbolic debugger how to reconstruct the value of a variable, the engineer’s investigation is significantly hampered. The engineer has to either attempt to recreate the problem, iterate through speculative fixes or perform prohibitively expensive debugging, such as reconstructing program state through executable code analysis.

Debug information quality is not, in fact, proportional to the quality of the generated executable code, and it varies wildly from compiler to compiler. This blog post compares debug information quality between two popular compilers: gcc and clang. We introduce the topic of compiler optimizations and highlight examples of their impact on debuggability. This post is part of a longer series; in the next post, we’ll do a finer-grained analysis directly comparing gcc and clang on real-world and synthetic programs.

Introduction

A compiler translates source-code into executable code that interacts with memory, a limited set of registers and simple control structures such as conditional jumps. A compiler also emits debug information that enables a symbolic debugger to map the state of memory and registers back to a representation that includes source-code structure, variables and types. The format of this debug information is complex because it is designed to be as flexible as possible in order to support most programming languages, architectures and compiler optimizations. The format is actually Turing-complete! If you want to learn more about debug formats, then I recommend the following resources:

The examples below are compiled at the -O2 optimization level.

Optimizations and debug quality

Different compilers emit debug information at varying levels of quality and accuracy. However, some optimizations will inevitably impact any debugger’s ability to generate accurate stack traces or extract variable values. This section briefly touches on some of these optimizations.

Variables and optimization

The compiler’s register allocator is responsible for mapping a larger number of program variables onto a smaller set of processor registers. Register accesses are significantly faster than memory accesses, but registers are scarce. Executable code will juggle between spilling (writing values from registers to memory) and filling (reading values from memory into registers) to make efficient use of these registers. The value of a variable may exist in a register, in memory, in a combination of the two, or in the debug information itself if it is a constant. If a compiler detects that the value of a variable is no longer needed (see live variable analysis), then the generated executable code may not save the value of the variable. In this situation, the variable is optimized out and the value is irretrievable.

Take the following program:

#include <ck_pr.h>
#include <unistd.h>
 
int
main(int argc, const char **argv)
{
 
    ck_pr_load_ptr(argv);
    pause();
    return 0;
}

In the above program, the value of argv is extracted and then the program is paused. The ck_pr_load_ptr function performs a read from the region of memory pointed to by argv, in a manner that prevents the compiler from performing optimization on it. This ensures that the memory access occurs and for this reason, the value of argv must be accessible by the time ck_pr_load_ptr is executed.

When compiled with gcc, the debugger fails to find the value of the variable. The compiler determines that the value of argv is no longer needed after the ck_pr_load_ptr operation and so doesn’t bother paying the cost of saving the value.

Below, we see the output of a debugger.

sbahra@skylake:~/projects/cdqs/src$ bt `pgrep spill` 2> /dev/null
PID: 4216
--------------------------------------------------------------------------------
Thread 4216
  [  0] libc-2.23.so              __libc_pause (../sysdeps/unix/syscall-template.S:84)
  [  1] spill_00                  main (spill_00.c:9)
     argc = -- [optimized out]
     argv = -- [optimized out]

  [  2] libc-2.23.so              __libc_start_main

However, if we modify the program to the following:

#include <ck_pr.h>
#include <unistd.h>

int
main(int argc, const char **argv)
{

    pause();
    ck_pr_load_ptr(argv);
    return 0;
}

The debugger is able to successfully extract the value, as seen below.

PID: 4462
--------------------------------------------------------------------------------
Thread 4462
  [  0] libc-2.23.so              __libc_pause (../sysdeps/unix/syscall-template.S:84)
  [  1] spill_00                  main (spill_00.c:8)
     argc = -- [optimized out]
     argv = (parameter) reference(0, 0x7ffe022232c8)
       {pointer(const char)} -><> = reference(0x7ffe022232c8, 0x7ffe02224433)
          {const char} -><> = string(0x7ffe02224433, 10, [./spill_00])

  [  2] libc-2.23.so              __libc_start_main

The executable code now has to keep the value of argv alive across the call to pause. In this particular situation, when main is called, the value of argv is in the %rsi register. The compiler saves the value of %rsi in the %rbx register, a callee-saved register whose value pause is required to restore before it returns.

Dump of assembler code for function main:
   0x0000000000400430 <+0>:     push   %rbx
   0x0000000000400431 <+1>:     mov    %rsi,%rbx               # Save the value of argv in %rbx.
   0x0000000000400434 <+4>:     callq  0x400410 <pause@plt>
   0x0000000000400439 <+9>:     mov    (%rbx),%rax             # Load through argv (ck_pr_load_ptr).
   0x000000000040043c <+12>:    xor    %eax,%eax
   0x000000000040043e <+14>:    pop    %rbx
   0x000000000040043f <+15>:    retq

Call stack and optimization

Some optimizations generate executable code from which a debugger cannot reconstruct a call stack that mirrors that of the source program. Two common culprits are tail call optimization and basic block commoning.

Basic Block Commoning

Let’s examine how basic block commoning impacts the accuracy of extracting stack traces from the following program.

#include <ck_pr.h>
#include <stdlib.h>
#include <unistd.h>

static void
function(const char *string)
{

    ck_pr_load_ptr(&string);
    pause();
    return;
}

static int
f(int x)
{

    if (x == 1) {
        function("a");
    } else if (x == 2) {
        function("b");
    }

    return 0;
}

int
main(int argc, const char *argv[])
{

    return f(atoi(argv[1]));
}

If the program receives a first argument of 1, then function is called with the argument of "a". If the program receives a first argument of 2, then function is called with the argument of "b". However, if we compile this program with clang, the stack traces in both cases are identical! clang informs the debugger that the function f invoked the function("b") branch where x = 2 even if x = 1.

sbahra@skylake:~$ bt `pgrep cbe_00` 2> /dev/null
PID: 14406
--------------------------------------------------------------------------------
Thread 14406
  [  0] libc-2.23.so              __libc_pause (../sysdeps/unix/syscall-template.S:84)
  |  1| cbe_00                    function (cbe_00.c:10)

  |  2| cbe_00                    f (cbe_00.c:21)

  [  3] cbe_00                    main (cbe_00.c:31)
     argc = -- [no location entry found: 0]
     argv = -- [no location entry found: 0]

  [  4] libc-2.23.so              __libc_start_main

With basic block commoning, the compiler may merge the two calls to function in the branches of f into a single call instruction. This means the stack is unwound to the same instruction in both cases (identical line numbers in f regardless of whether "a" or "b" is passed to function).
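To make the effect concrete, here is a minimal sketch in C of the two shapes involved. It is an illustration under stated assumptions, not compiler output: record is a hypothetical stand-in for function (it remembers its argument instead of calling pause), and f_commoned is a hand-written approximation of what the optimizer may do to f.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for function(): remembers what it was called with. */
const char *last_argument = NULL;
int call_count = 0;

void
record(const char *string)
{
    assert(string != NULL);
    last_argument = string;
    call_count++;
}

/* The shape written in the source: two distinct call sites. */
int
f_source(int x)
{
    if (x == 1) {
        record("a");
    } else if (x == 2) {
        record("b");
    }

    return 0;
}

/* Approximation of the commoned form: the branches only select the
 * argument, and a single shared call site performs the call. */
int
f_commoned(int x)
{
    const char *string;

    if (x == 1)
        string = "a";
    else if (x == 2)
        string = "b";
    else
        return 0;

    record(string);    /* one return address for both inputs */
    return 0;
}
```

Both versions pass the same string for the same input, but f_commoned has only one call site, so a return address found on the stack can no longer say which branch was taken.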

Tail Call Optimization

If the last operation executed in a function is a self-contained call to another function, the compiler may have the executable code jump into the target function without allocating additional stack space. In certain situations, this will mean the debugger will not have sufficient information to unwind the function call stack. Take the following program, where the function factorial is implemented in tail recursive form.

#include <ck_pr.h>
#include <unistd.h>

static int
factorial(int vr_ac, int vr_n)
{

    if (--vr_n == 1) {
        pause();
        return vr_ac;
    }

    return factorial(vr_ac * vr_n, vr_n);
}

int
main(void)
{
    int vr_value;

    ck_pr_store_int(&vr_value, 10);
    return factorial(vr_value, vr_value);
}

When the program is compiled with optimizations enabled, under both gcc and clang, the debugger reports the following call stack:

PID: 9373
--------------------------------------------------------------------------------
Thread 9373
  [  0] libc-2.23.so              __libc_pause (../sysdeps/unix/syscall-template.S:84)
  |  1| tco_00                    factorial (tco_00.c:9)
  [  2] tco_00                    main (tco_00.c:22)
  [  3] libc-2.23.so              __libc_start_main

The call stack should actually look like the following:

PID: 9385
--------------------------------------------------------------------------------
Thread 9385
  [  0] libc-2.23.so              __libc_pause (../sysdeps/unix/syscall-template.S:84)
  [  1] tco_00                    factorial (tco_00.c:9)
  [  2] tco_00                    factorial (tco_00.c:13)
  [  3] tco_00                    factorial (tco_00.c:13)
  [  4] tco_00                    factorial (tco_00.c:13)
  [  5] tco_00                    factorial (tco_00.c:13)
  [  6] tco_00                    factorial (tco_00.c:13)
  [  7] tco_00                    factorial (tco_00.c:13)
  [  8] tco_00                    factorial (tco_00.c:13)
  [  9] tco_00                    factorial (tco_00.c:13)
  [ 10] tco_00                    main (tco_00.c:22)
  [ 11] libc-2.23.so              __libc_start_main

The compilers were smart enough to eliminate the tail call and inline the function into the following loop:

[...]
   0x00000000004004d0 <+48>:    imul   %eax,%ebx
   0x00000000004004d3 <+51>:    sub    $0x1,%eax
   0x00000000004004d6 <+54>:    cmp    $0x1,%eax
   0x00000000004004d9 <+57>:    jne    0x4004d0 <main+48>
   0x00000000004004db <+59>:    callq  0x400480 <pause@plt>
[...]

The emitted debug information contains information about both the caller and the inlined instance of the function. This is insufficient to reconstruct a call stack with associated state. In this case, the debugger is only able to disambiguate the innermost invocation of the function call.
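Seen from the source level, the transformation is roughly the one sketched below. This is a hand-written approximation rather than compiler output: the names factorial_recursive and factorial_loop are hypothetical, and the call to pause is left out so the functions can be run directly. Like the original, both assume the second argument is at least 2.

```c
#include <assert.h>

/* Same structure as factorial() in the program above, minus pause(). */
int
factorial_recursive(int vr_ac, int vr_n)
{
    assert(vr_n >= 2);

    if (--vr_n == 1)
        return vr_ac;

    return factorial_recursive(vr_ac * vr_n, vr_n);
}

/* Approximation of the optimized form: the tail call becomes a jump
 * back to the top of the function, so no new stack frame is created
 * and there is nothing left for an unwinder to walk through. */
int
factorial_loop(int vr_ac, int vr_n)
{
    assert(vr_n >= 2);

    while (--vr_n != 1)
        vr_ac *= vr_n;

    return vr_ac;
}
```

Both functions compute the same result for the same arguments, but only the recursive form leaves one frame per invocation on the stack.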

Debug Information Quality

Though some optimizations will certainly impact the accuracy of a symbolic debugger, some compilers simply lack the ability to generate debug information in the presence of certain optimizations. One common optimization is induction variable elimination. An induction variable is a variable that’s incremented or decremented by a constant on every iteration of a loop, or one that’s derived from another variable that follows this pattern.

Take the following snippet.

static unsigned int
count(char *buffer, size_t n)
{
        unsigned int sum = 0;
        size_t i;

        for (i = 0; i < n; i++)
                sum += buffer[i] == 'w';

        return sum;
}

This function will return the count of “w” characters in a string as seen below.

$ gcc -o wc wc.c -O2 -ggdb
$ ./wc /etc/passwd
16

In this particular case, the function is invoked using:

count(buffer, 4096);

Coupled with other optimizations, the compiler is then able to generate code that doesn’t actually rely on a dedicated counter variable i for maintaining the current offset into buffer. An approximate semantic mapping from the source-code to the generated executable code is below.

[Figure: approximate semantic mapping from the source-code to the generated executable code]

As you can see, i is completely optimized out. The compiler determines it doesn’t have to pay the cost of maintaining the induction variable i. It maintains the pointer in the register %rdi. The code is effectively rewritten to something closer to this:

static unsigned int
count(char *buffer)
{
        unsigned int sum = 0;
        char *buffer_end = buffer + 4096;

        while (buffer != buffer_end)
                sum += *buffer++ == 'w';

        return sum;
}
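The relationship between the two forms can be sketched as follows. This is an illustration, not compiler output: count_indexed and count_pointer are hypothetical names, the length is kept as a parameter instead of the constant 4096, and the local i in count_pointer exists only to show that the index remains derivable from the pointer, which is the kind of relationship debug information would need to describe.

```c
#include <assert.h>
#include <stddef.h>

/* The loop as written in the source, with the induction variable i. */
unsigned int
count_indexed(const char *buffer, size_t n)
{
    unsigned int sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += buffer[i] == 'w';

    return sum;
}

/* The pointer-based form: no counter is maintained by the loop itself. */
unsigned int
count_pointer(const char *buffer, size_t n)
{
    const char *cursor = buffer;
    const char *end = buffer + n;
    unsigned int sum = 0;

    while (cursor != end) {
        /* The eliminated index can still be recomputed on demand. */
        size_t i = (size_t)(cursor - buffer);

        assert(i < n);
        sum += *cursor++ == 'w';
    }

    return sum;
}
```

The two functions return the same count for the same input; the only difference is whether i is stored anywhere while the loop runs.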

Both gcc and clang will end up generating similar executable code for this program. Debug information must support aggressive compiler optimizations and for that reason is highly expressive. For example, let’s look at the debug information generated by gcc (using the dwarfdump tool).

[Figure: dwarfdump output for the gcc build, with the location entry for i highlighted]

The highlighted line indicates how to interpret the current state of registers to extract the value of the variable i when the instruction pointer is between memory addresses 4008cc and 4008ce. The highlighted line below is the instruction at address 4008cc.

[Figure: objdump disassembly, with the instruction at address 4008cc highlighted]

The debug information (in the DWARF format) expresses the value of i as a program for a simple stack machine. The highlighted debug information in the first screenshot expresses that the debugger should push the value of the rdi register onto the stack, then the value of the rdx register, subtract the two, and then add the value 4095 to find the value of i. Note that the debug information does not describe the value of i in all regions of executable code where it is live (meaning a debugger would be unable to retrieve the value in those regions). clang, on the other hand, is unable to express this and, for some variables, may simply provide invalid information rather than indicate that the value is optimized out. See below for an invocation of a debugger on versions of this program compiled under gcc and clang.

[Figure: debugger output for the gcc and clang builds of this program]

gcc is able to recover the value of the i variable, depending on the instruction being executed by the program at the time a debugger attempts to extract its value. clang, on the other hand, emits erroneous debug information and expresses the values of both sum and i as a constant of 0.
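The location expression described above (push rdi, push rdx, subtract, add 4095) can be modeled with a few lines of C. This is a toy evaluator for illustration only: the operation names, the register indices and the register values in the usage note are assumptions made up for this sketch, and they are not the DWARF encoding or the values from the screenshots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation set, loosely modeled on the steps in the text. */
enum op_kind { OP_PUSH_REG, OP_MINUS, OP_PLUS_CONST };

struct op {
    enum op_kind kind;
    uint64_t operand;    /* register index or constant; unused for OP_MINUS */
};

/* Hypothetical register indices for this sketch only. */
enum { REG_RDI, REG_RDX, REG_COUNT };

#define STACK_DEPTH 16

uint64_t
evaluate(const struct op *ops, size_t n_ops, const uint64_t *regs)
{
    uint64_t stack[STACK_DEPTH];
    size_t depth = 0;

    for (size_t k = 0; k < n_ops; k++) {
        switch (ops[k].kind) {
        case OP_PUSH_REG:
            assert(depth < STACK_DEPTH && ops[k].operand < REG_COUNT);
            stack[depth++] = regs[ops[k].operand];
            break;
        case OP_MINUS:
            /* Pop two entries, push (second from top) - (top). */
            assert(depth >= 2);
            stack[depth - 2] -= stack[depth - 1];
            depth--;
            break;
        case OP_PLUS_CONST:
            assert(depth >= 1);
            stack[depth - 1] += ops[k].operand;
            break;
        }
    }

    assert(depth >= 1);
    return stack[depth - 1];    /* the result is the top of the stack */
}

/* The steps described in the text: rdi - rdx + 4095. */
const struct op location_of_i[] = {
    { OP_PUSH_REG, REG_RDI },
    { OP_PUSH_REG, REG_RDX },
    { OP_MINUS, 0 },
    { OP_PLUS_CONST, 4095 },
};

uint64_t
value_of_i(uint64_t rdi, uint64_t rdx)
{
    uint64_t regs[REG_COUNT];

    regs[REG_RDI] = rdi;
    regs[REG_RDX] = rdx;
    return evaluate(location_of_i,
        sizeof(location_of_i) / sizeof(location_of_i[0]), regs);
}
```

For example, with the made-up register values rdi = 0x100a and rdx = 0x1fff, value_of_i evaluates to 10 under this model, since 0x100a - 0x1fff + 4095 wraps around to 10 in unsigned arithmetic.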

In addition, with optimizations turned on, clang is unable to express certain data types, such as bit fields. A more exhaustive comparison between the two compilers will be presented in an upcoming blog post.

Conclusion

We have shown some common optimizations that may get in the way of the debuggability of your application and demonstrated a disparity in debug information quality across two popular compilers. In the next blog post of this series, we will examine how gcc and clang stack up with regard to debug information quality across a range of synthetic and real-world applications.

If you’re interested in better debugging capabilities for your applications including C++ crash reporting and native crash reporting, check us out at our website.

Follow me on Twitter at @0xf390.