Performance Improvements for FreeBSD kernel debugging

We previously explored FreeBSD userspace coredumps. Backtrace’s debugging platform supports FreeBSD kernel coredumps too, and their traces share many features. They are constructed somewhat differently, and in the process of adding support for them, we found a way to improve performance for automated programs accessing them.

Read on to learn how information is extracted from a FreeBSD kernel core, and how we improved performance for this mechanism.

Generating kernel core files

A kernel core is typically only generated in exceptional circumstances. Unlike userspace processes, kernel routines cannot fault without sacrificing the machine’s availability. This means things like page faults and illegal instructions inside the kernel stop the machine, instead of just one process. At that point, in most cases, it is only usable enough to inspect its state in a debugger, or to generate a core file.

On FreeBSD, kernel core files can be saved for any machine that has dumpdev=AUTO in its /etc/rc.conf, and a sufficiently large swap partition, configured at boot time.

On FreeBSD, a kernel core file can be generated by intentionally crashing:

# sysctl debug.kdb.panic=1

For machines that are configured to enter the on-line kernel debugger (aka kdb), the following command is also needed, at the system console:

kdb> call doadump

Otherwise, the kernel’s dump routines will automatically save the core file to a swap partition, if configured.

Upon reboot, the savecore rc startup script will automatically identify the core file in the swap partition, and save it to /var/crash by default.

What’s in a FreeBSD kernel core?

A FreeBSD kernel core file can be formatted in several different ways. This depends on which type of dump was performed. Full core dumps are ELF files, similar in structure to userspace core files.

However, as RAM size grew, this became more difficult to manage. In 2006, FreeBSD introduced minidumps, which are much smaller without making the core file useless. This has been the default dump type since FreeBSD 6.0.

Unlike userspace cores & full kernel cores, minidumps are not ELF files. They have a custom structure that starts with a header, and are followed by several sections of data explained by the header. For example, on amd64:

#define MINIDUMP_MAGIC   "minidump FreeBSD/amd64"
#define MINIDUMP_VERSION 2

struct minidumphdr {
	char magic[24];
	uint32_t version;
	uint32_t msgbufsize;
	uint32_t bitmapsize;
	uint32_t pmapsize;
	uint64_t kernbase;
	uint64_t dmapbase;
	uint64_t dmapend;
}

The version element allows minidump formats to change over time. At the time of this article, this has only happened once for amd64.

The next three uint32_t elements describe the size of the three sections of memory included in the core file:

The msgbuf (“message buffer”) section contains the contents that would be printed by dmesg if it were run at the time the core file was saved.
The bitmap section tells the debugger which pages of physical memory are included in the pmap section. This means bit N indicates the presence of the page covering 4K bytes starting at N * 4K, in physical memory.
The pmap section consists of a sparse dump of all pages of physical memory. Each page in this section has its bit set in the above bitmap.

The uint64_t elements describe the boundaries of the two kernel virtual address segments. On amd64, every process’s virtual address space has these sections reserved for the kernel.

Kernel virtual address space layout

There are two sections of kernel virtual address (KVA) space:

The kernel map. This is the primary map, used for critical kernel context. In particular, all thread stacks are stored here. Kernel map addresses require using page tables in order to find which physical page corresponds to a particular virtual address.
The direct map. This is an amd64-specific feature; it takes advantage of the 64-bit address space to enable looking up the corresponding physical pages of memory directly, via bitmask. All UMA-managed pages are mapped this way.

The minidump, strictly speaking, only includes physical pages that are in the kernel map (which includes the page tables for it). Because the direct map allows accessing all physical pages in memory, all kernel map addresses will also have direct map addresses. However, the converse is not true.

The minidump includes the direct map address range so whenever a debugger encounters a direct map address, it can attempt to look up their corresponding physical address.

In addition to partitioning the KVA space in this manner, kernel subsystems can indicate whether a particular page should not be ‘dumpable’. For minidumps, these non-dumpable pages are also not included.

Looking up information in a kernel core

Let’s try something a bit more concrete. The kernel debugger (kgdb) can be run like this:

freebsd-head% /usr/local/bin/kgdb ./kernel/kernel ./vmcore.0
GNU gdb (GDB) 7.11.1 [GDB v7.11.1 for FreeBSD]
[...]
Reading symbols from ./kernel/kernel...Reading symbols from /usr/home/will/tmp/amd64/kernel/kernel.symbols...done.
done.

Unread portion of the kernel message buffer:
panic: kdb_sysctl_panic
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff8098e390 at kdb_backtrace+0x60
#1 0xffffffff80951066 at vpanic+0x126
#2 0xffffffff80950f33 at panic+0x43
#3 0xffffffff8098eb81 at kdb_sysctl_panic+0x61
#4 0xffffffff8095ca5e at sysctl_root+0x24e
#5 0xffffffff8095d038 at userland_sysctl+0x1d8
#6 0xffffffff8095ce24 at sys___sysctl+0x74
#7 0xffffffff80d5694f at amd64_syscall+0x40f
#8 0xffffffff80d3bbbb at Xfast_syscall+0xfb
Uptime: 1m5s
Dumping 223 out of 4071 MB:..8%..15%..22%..36%..43%..51%..65%..72%..86%..94%

__curthread () at ./machine/pcpu.h:219
219     ./machine/pcpu.h: No such file or directory.
(kgdb)

kgdb provides a basic stack trace for the faulting thread. This is an example of what it looks like when forced as described earlier. Because the trace includes a frame for kdb_sysctl_panic (the handler for the command we ran), the cause of this particular crash is clear.

Let’s extract a bit more information by running bt (short for backtrace) on the faulting thread:

(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:219
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:298
#2  0xffffffff80950cc2 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff809510a5 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:889
#4  0xffffffff80950f33 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:818
#5  0xffffffff8098eb81 in kdb_sysctl_panic (oidp=<optimized out>, arg1=<unavailable>, arg2=<unavailable>, req=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:190
#6  0xffffffff8095ca5e in sysctl_root (arg1=<optimized out>, arg2=<optimized out>, oidp=<optimized out>, req=<optimized out>) at /usr/src/sys/kern/kern_sysctl.c:1531
#7  0xffffffff8095d038 in userland_sysctl (td=<optimized out>, name=0xfffffe00f69a2820, namelen=<optimized out>, old=<optimized out>, oldlenp=<optimized out>, inkernel=<optimized out>, new=<optimized out>, newlen=<optimized out>, retval=<optimized out>, flags=0) at /usr/src/sys/kern/kern_sysctl.c:1641
#8  0xffffffff8095ce24 in sys___sysctl (td=0xfffff800031f0000, uap=0xfffffe00f69a2940) at /usr/src/sys/kern/kern_sysctl.c:1567
#9  0xffffffff80d5694f in syscallenter (td=<optimized out>, sa=<optimized out>, td=<optimized out>, sa=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:141
#10 amd64_syscall (td=0xfffff800031f0000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:959
#11 <signal handler called>
#12 0x000000080096a65a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd9f8

Let’s see more information about the current thread. Frame #8 includes the td local. We can use this to print out the name of the thread:

(kgdb) f 8
#8  0xffffffff8095ce24 in sys___sysctl (td=0xfffff800031f0000, uap=0xfffffe00f69a2940) at /usr/src/sys/kern/kern_sysctl.c:1567
1567    /usr/src/sys/kern/kern_sysctl.c: No such file or directory.
(kgdb) p &td->td_name
$4 = (char (*)[20]) 0xfffff800031f0284
(kgdb) p td->td_name
$2 = "sysctl", '\000' <repeats 13 times>

To access this information, kgdb asks libkvm for the location in the core file for the virtual address 0xfffff800031f0284. Because the debugger has access to the DWARF metadata explaining the structure of td, it would have figured out that it needs to access this address rather than the td base address 0xfffff800031f0000.

Libkvm will first determine whether the virtual address lies within the kernel or direct maps. If it lies in the kernel map, libkvm will consult the page table pages to discover the corresponding physical address. If it lies in the direct map, it can simply mask off the direct map base address. If neither of these applies, the address is illegal. This process is encapsulated by va_to_pa, or “virtual address to physical address”.

Once the physical address is determined, libkvm consults the core file’s bitmap to figure out where in the core file it is located.

One way libkvm can scale better

Recall that minidumps include a sparse bitmap indicating the pages that are included. These pages are dumped sequentially in the last section. Because they are sparse in a not entirely predictable way, figuring the offset into the dump for a particular physical address cannot be reduced to a trivial formula.

When minidumps were originally created, the solution adopted was to create a hash table that maps physical addresses (one per 4K page) to dump file offsets. This hash table was initialized with 1024 buckets. A minidump with 8GB of physical pages would, therefore, have 2 million entries with an average of 2048 per bucket. This is the best case in such a scenario.

On top of this high load factor, the offset lookup access pattern is not randomly distributed, which means frequently-accessed addresses can sometimes be particularly expensive to look up. This is reflected in automated processes taking significantly longer to run.

A valid formula for calculating the offset looks like this (where pa is the physical address):

offset_for(pa) = bits_set_before(pa) * 4096

Here bits_set_before represents the number of bits set in the bitmap before the page containing the indicated physical address. So the trick is finding a better solution for this function.

The “number of bits set” has a special term: the Hamming weight, or population count. So bits_set_before is just giving us the Hamming weight for the minidump’s bitmap, from bit 0 to bit N-1.

It would be inefficient to count all of those bits every time, hence the hash table implementation.

Caching the Hamming weight

Dumps will never have a predictable pattern in this bitmap. However, we can scan the bitmap and generate a lookup table that indicates the number of bits set. By doing so, we convert the expensive operation (“get number of bits set before bit N”) into a cheap one (“lookup pre-computed value for bit N”).

We have a trade-off to make: We could store this value for every bit, but that would consume a lot of memory. We can instead substitute a small amount of CPU time at lookup time to compute the number of bits set between a cached value and the desired bit.

Modern CPUs offer a POPCNT instruction that returns the number of bits set in a fixed width integer value. There are also generic equivalents available. Older and non-x86 systems supported by FreeBSD also use libkvm, so any optimization must not affect their functionality. GCC and Clang both support a __bitcount64 primitive to bridge this gap. So that’s what we actually used to perform the calculation.

We’ve already reduced the number of counts from N to N/64. We can go one step farther. Given a page N, where X <= N < Y, we can also compute the number of bits forwards (from X) and backwards (from Y). This reduces the number of counts by half again.

Finally, in order to further limit the memory usage of this cache, we can call __bitcount64 a few times. By adopting a cache interval of 1024 bits, we require, on average (in a perfectly random access pattern), 4 counts worth of CPU time, and one count per 1024 bits in memory.

Putting it together

To illustrate the relationship between the cache and the core’s bitmap:

   cache:  |         A         |        B          |         C         | ...     |
                     |                  |                    |
                     \-----------\      \------------\       \-----------\
                                 |                   |                   |
   bits:   | A0, A1, ... A1023 | B0, B1, ... B1023 | C0, C1, ... C1023 | D0, ... |

While generating the cache, each value is computed in turn. Cache value A is the total count of all bits set through A1023 inclusive, B is the total of all bits set through B1023 inclusive, and so forth.

To find the offset for page B122, we look up the cached count A, then compute and add the Hamming weight between bits B0 and B121, to find the total number of pages before page B122.

Similarly, to find the offset for page B700, we look up the cached count C, then compute the number of bits set between B700 and B1023, and subtract that.

The offset calculation now looks like this:

  pa = va_to_pa(va)
  popcount_bin = round(pa / 1024)
  bb = bits_before(pa) = cache[popcount_bin] + popcount(pa, popcount_bin * 1024)
  offset_for(pa) = pmap_base_offset + bb * 4096

First, we find the physical address for the requested virtual address. Then we determine the cache popcount bin, rounding up to the next bin in case we’re counting down. popcount encapsulates both counting up and down as well as counting the bits in the given range. This in turn is converted to a relative offset simply by multiplying by page size of 4096. Finally, the relative offset is added to the sparse page array’s base offset (pmap_base_offset), to provide the location of the requested address.

This means the formula is now just an array index, which is O(1), while popcount is a small number of CPU instructions not linearly related to the requested bit number N. This compares to previously, where each page required a separate hash table entry, and each lookup required computing a hash on the address and an O(log n) search of the matching hash table bucket.

Results

For typical manual debugger use, the impact of this change isn’t noticeable, which is probably why the hash table implementation has been in use for 10 years. However, for any automated debugging process, the extra latency adds up quickly.

The most obvious use case for stock FreeBSD is the crashinfo utility. This program is run by the savecore utility mentioned in the introduction. crashinfo is a script that processes a kernel core dump, and runs other utilities that can operate on core dumps, including: ps, vmstat, pstat, iostat, ipcs, nfsstat, netstat, and fstat.

For a sample 8GB kernel core file (generated on a 128GB server), crashinfo improves from (as reported by /usr/bin/time -l):

       43.72 real        33.25 user         2.94 sys
    187884  maximum resident set size
       348  average shared memory size
        28  average unshared data size
       126  average unshared stack size
    505312  page reclaims
       183  page faults
         0  swaps
     20131  block input operations
      7650  block output operations
         2  messages sent
         2  messages received
         0  signals received
     30741  voluntary context switches
       275  involuntary context switches

to:

        9.43 real         2.58 user         1.47 sys
    134648  maximum resident set size
      1906  average shared memory size
       127  average unshared data size
       126  average unshared stack size
    220114  page reclaims
         0  page faults
         0  swaps
        11  block input operations
      6828  block output operations
         2  messages sent
         2  messages received
         0  signals received
      9080  voluntary context switches
        50  involuntary context switches

So in this case, crashinfo now takes 75% less time and 30% less memory.

The benefits are similarly obvious for automated debugger frameworks like the one developed by Backtrace.

With a manually-driven debugger like kgdb, a developer enters commands and processes the output manually. This means only a handful of lookups are required in a relatively long (to a CPU) amount of time. Backtrace debuggers, on the other hand, issue thousands of such lookups per second.

Epilogue

Backtrace began shipping a version of this performance improvement in ptrace in February 2016. This enables us to also offer significantly faster tracing of FreeBSD kernel cores to customers running current and older releases of FreeBSD.

On July 17, 2016, our work improving libkvm scaling was committed to FreeBSD/head. It will ship with FreeBSD 12.0.