We previously explored FreeBSD userspace coredumps. Backtrace’s debugging platform supports FreeBSD kernel coredumps too, and their traces share many features. They are constructed somewhat differently, and in the process of adding support for them, we found a way to improve performance for automated programs accessing them.
Read on to learn how information is extracted from a FreeBSD kernel core, and how we improved performance for this mechanism.
Generating kernel core files
A kernel core is typically only generated in exceptional circumstances. Unlike userspace processes, kernel routines cannot fault without sacrificing the machine’s availability. This means things like page faults and illegal instructions inside the kernel stop the machine, instead of just one process. At that point, in most cases, it is only usable enough to inspect its state in a debugger, or to generate a core file.
On FreeBSD, kernel core files can be saved for any machine that has dumpdev=AUTO in its /etc/rc.conf and a sufficiently large swap partition, configured at boot time.
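For example, the relevant lines of /etc/rc.conf might look like the following (dumpdir is optional and shown only to make the destination explicit, since /var/crash is already the default):
dumpdev="AUTO"          # dump kernel memory to the configured swap device on panic
dumpdir="/var/crash"    # where the core file will be recovered to after reboot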
A kernel core file can be generated on demand by intentionally crashing the machine:
# sysctl debug.kdb.panic=1
For machines that are configured to enter the on-line kernel debugger (aka kdb), the following command is also needed, at the system console:
kdb> call doadump
Otherwise, the kernel’s dump routines will automatically save the core file to a swap partition, if configured.
Upon reboot, the savecore rc startup script will automatically identify the core file in the swap partition and save it to /var/crash by default.
What’s in a FreeBSD kernel core?
A FreeBSD kernel core file can be formatted in several different ways, depending on which type of dump was performed. Full core dumps are ELF files, similar in structure to userspace core files.
However, as RAM sizes grew, full dumps became more difficult to manage. In 2006, FreeBSD introduced minidumps, which are much smaller while still preserving the information a debugger needs. This has been the default dump type since FreeBSD 6.0.
Unlike userspace cores & full kernel cores, minidumps are not ELF files. They have a custom structure that starts with a header, and are followed by several sections of data explained by the header. For example, on amd64:
#define MINIDUMP_MAGIC "minidump FreeBSD/amd64"
#define MINIDUMP_VERSION 2
struct minidumphdr {
    char magic[24];
    uint32_t version;
    uint32_t msgbufsize;
    uint32_t bitmapsize;
    uint32_t pmapsize;
    uint64_t kernbase;
    uint64_t dmapbase;
    uint64_t dmapend;
};
The version element allows minidump formats to change over time. At the time of this article, this has only happened once for amd64. The next three uint32_t elements describe the size of the three sections of memory included in the core file:
- The msgbuf (“message buffer”) section contains the contents that would be printed by dmesg if it were run at the time the core file was saved.
- The bitmap section tells the debugger which pages of physical memory are included in the pmap section. This means bit N indicates the presence of the page covering 4K bytes starting at N * 4K in physical memory.
- The pmap section consists of a sparse dump of all pages of physical memory. Each page in this section has its bit set in the above bitmap.
The uint64_t elements describe the boundaries of the two kernel virtual address segments. On amd64, every process’s virtual address space has these sections reserved for the kernel.
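To make the layout concrete, here is a small sketch (in the spirit of libkvm, but not its actual code) that reads the header and locates the sections that follow it. It assumes the header occupies the first page of the file and that each section is padded out to a page boundary, which matches the amd64 layout; struct minidumphdr and the MINIDUMP_* constants are the ones shown above.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE     4096
#define ROUND_PAGE(x) (((uint64_t)(x) + PAGE_SIZE - 1) & ~(uint64_t)(PAGE_SIZE - 1))

/* Locate the msgbuf, bitmap and pmap sections of an amd64 minidump. */
static int
read_minidump_layout(FILE *fp)
{
    struct minidumphdr hdr;
    uint64_t msgbuf_off, bitmap_off, pmap_off;

    if (fread(&hdr, sizeof(hdr), 1, fp) != 1)
        return (-1);
    if (strncmp(hdr.magic, MINIDUMP_MAGIC, sizeof(hdr.magic)) != 0 ||
        hdr.version != MINIDUMP_VERSION)
        return (-1);                        /* not an amd64 version-2 minidump */

    msgbuf_off = PAGE_SIZE;                 /* the header fills the first page */
    bitmap_off = msgbuf_off + ROUND_PAGE(hdr.msgbufsize);
    pmap_off   = bitmap_off + ROUND_PAGE(hdr.bitmapsize);

    printf("msgbuf at %#jx, bitmap at %#jx, pmap at %#jx\n",
        (uintmax_t)msgbuf_off, (uintmax_t)bitmap_off, (uintmax_t)pmap_off);
    return (0);
}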
Kernel virtual address space layout
There are two sections of kernel virtual address (KVA) space:
- The kernel map. This is the primary map, used for critical kernel context. In particular, all thread stacks are stored here. Kernel map addresses require using page tables in order to find which physical page corresponds to a particular virtual address.
- The direct map. This is an amd64-specific feature; it takes advantage of the 64-bit address space to enable looking up the corresponding physical pages of memory directly, via bitmask. All UMA-managed pages are mapped this way.
The minidump, strictly speaking, only includes physical pages that are in the kernel map (which includes the page tables for it). Because the direct map allows accessing all physical pages in memory, all kernel map addresses will also have direct map addresses. However, the converse is not true.
The minidump includes the direct map address range so that whenever a debugger encounters a direct map address, it can attempt to look up its corresponding physical address.
In addition to partitioning the KVA space in this manner, kernel subsystems can mark particular pages as not ‘dumpable’. For minidumps, these non-dumpable pages are also excluded.
Looking up information in a kernel core
Let’s try something a bit more concrete. The kernel debugger (kgdb) can be run like this:
freebsd-head% /usr/local/bin/kgdb ./kernel/kernel ./vmcore.0
GNU gdb (GDB) 7.11.1 [GDB v7.11.1 for FreeBSD]
[...]
Reading symbols from ./kernel/kernel...Reading symbols from /usr/home/will/tmp/amd64/kernel/kernel.symbols...done.
done.
Unread portion of the kernel message buffer:
panic: kdb_sysctl_panic
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff8098e390 at kdb_backtrace+0x60
#1 0xffffffff80951066 at vpanic+0x126
#2 0xffffffff80950f33 at panic+0x43
#3 0xffffffff8098eb81 at kdb_sysctl_panic+0x61
#4 0xffffffff8095ca5e at sysctl_root+0x24e
#5 0xffffffff8095d038 at userland_sysctl+0x1d8
#6 0xffffffff8095ce24 at sys___sysctl+0x74
#7 0xffffffff80d5694f at amd64_syscall+0x40f
#8 0xffffffff80d3bbbb at Xfast_syscall+0xfb
Uptime: 1m5s
Dumping 223 out of 4071 MB:..8%..15%..22%..36%..43%..51%..65%..72%..86%..94%
__curthread () at ./machine/pcpu.h:219
219 ./machine/pcpu.h: No such file or directory.
(kgdb)
kgdb provides a basic stack trace for the faulting thread. This is an example of what it looks like when a panic is forced as described earlier. Because the trace includes a frame for kdb_sysctl_panic (the handler for the command we ran), the cause of this particular crash is clear.
Let’s extract a bit more information by running bt (short for backtrace) on the faulting thread:
(kgdb) bt
#0 __curthread () at ./machine/pcpu.h:219
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:298
#2 0xffffffff80950cc2 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff809510a5 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:889
#4 0xffffffff80950f33 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:818
#5 0xffffffff8098eb81 in kdb_sysctl_panic (oidp=<optimized out>, arg1=<unavailable>, arg2=<unavailable>, req=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:190
#6 0xffffffff8095ca5e in sysctl_root (arg1=<optimized out>, arg2=<optimized out>, oidp=<optimized out>, req=<optimized out>) at /usr/src/sys/kern/kern_sysctl.c:1531
#7 0xffffffff8095d038 in userland_sysctl (td=<optimized out>, name=0xfffffe00f69a2820, namelen=<optimized out>, old=<optimized out>, oldlenp=<optimized out>, inkernel=<optimized out>, new=<optimized out>, newlen=<optimized out>, retval=<optimized out>, flags=0) at /usr/src/sys/kern/kern_sysctl.c:1641
#8 0xffffffff8095ce24 in sys___sysctl (td=0xfffff800031f0000, uap=0xfffffe00f69a2940) at /usr/src/sys/kern/kern_sysctl.c:1567
#9 0xffffffff80d5694f in syscallenter (td=<optimized out>, sa=<optimized out>, td=<optimized out>, sa=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:141
#10 amd64_syscall (td=0xfffff800031f0000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:959
#11 <signal handler called>
#12 0x000000080096a65a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd9f8
Let’s see more information about the current thread. Frame #8 includes the td local. We can use this to print out the name of the thread:
(kgdb) f 8
#8 0xffffffff8095ce24 in sys___sysctl (td=0xfffff800031f0000, uap=0xfffffe00f69a2940) at /usr/src/sys/kern/kern_sysctl.c:1567
1567 /usr/src/sys/kern/kern_sysctl.c: No such file or directory.
(kgdb) p &td->td_name
$4 = (char (*)[20]) 0xfffff800031f0284
(kgdb) p td->td_name
$2 = "sysctl", '\000' <repeats 13 times>
To access this information, kgdb asks libkvm for the location in the core file of the virtual address 0xfffff800031f0284. Because the debugger has access to the DWARF metadata describing the structure of td, it knows to ask for this address (the td_name member, 0x284 bytes into the structure) rather than the td base address 0xfffff800031f0000.
Libkvm will first determine whether the virtual address lies within the kernel or direct maps. If it lies in the kernel map, libkvm will consult the page table pages to discover the corresponding physical address. If it lies in the direct map, it can simply mask off the direct map base address. If neither of these applies, the address is illegal. This process is encapsulated by va_to_pa, or “virtual address to physical address”.
Once the physical address is determined, libkvm consults the core file’s bitmap to figure out where in the core file it is located.
One way libkvm can scale better
Recall that minidumps include a sparse bitmap indicating the pages that are included. These pages are dumped sequentially in the last section. Because they are sparse in a not entirely predictable way, figuring the offset into the dump for a particular physical address cannot be reduced to a trivial formula.
When minidumps were originally created, the solution adopted was to create a hash table that maps physical addresses (one per 4K page) to dump file offsets. This hash table was initialized with 1024 buckets. A minidump with 8GB of physical pages would, therefore, have 2 million entries with an average of 2048 per bucket. This is the best case in such a scenario.
On top of this high load factor, the offset lookup access pattern is not randomly distributed, which means frequently-accessed addresses can sometimes be particularly expensive to look up. This is reflected in automated processes taking significantly longer to run.
A valid formula for calculating the offset looks like this (where pa is the physical address):
offset_for(pa) = bits_set_before(pa) * 4096
Here bits_set_before represents the number of bits set in the bitmap before the page containing the indicated physical address. So the trick is finding a better solution for this function.
The “number of bits set” has a special term: the Hamming weight, or population count. So bits_set_before is just giving us the Hamming weight for the minidump’s bitmap, from bit 0 to bit N-1.
It would be inefficient to count all of those bits every time, hence the hash table implementation.
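For a sense of scale, a naive bits_set_before might look like the sketch below (an illustration, not the historical libkvm code). For the 8GB dump described above, a lookup near the end of the bitmap would scan on the order of two million bits.
#include <stdint.h>

/*
 * Naive approach: count every set bit below 'bit' on each lookup.
 * This is O(N) in the size of the bitmap, which is why the original
 * implementation used a hash table instead.
 */
static uint64_t
bits_set_before_naive(const uint64_t *bitmap, uint64_t bit)
{
    uint64_t count = 0;

    for (uint64_t b = 0; b < bit; b++)
        if (bitmap[b / 64] & (1ULL << (b % 64)))
            count++;
    return (count);
}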
Caching the Hamming weight
Dumps will never have a predictable pattern in this bitmap. However, we can scan the bitmap and generate a lookup table that indicates the number of bits set. By doing so, we convert the expensive operation (“get number of bits set before bit N”) into a cheap one (“lookup pre-computed value for bit N”).
We have a trade-off to make: we could store this value for every bit, but that would consume a lot of memory. Instead, we can trade a small amount of CPU time at lookup time to compute the number of bits set between a cached value and the desired bit.
Modern CPUs offer a POPCNT instruction that returns the number of bits set in a fixed-width integer value. There are also generic equivalents available. Older and non-x86 systems supported by FreeBSD also use libkvm, so any optimization must not affect their functionality. GCC and Clang both provide builtins for this, which FreeBSD exposes as the __bitcount64 primitive to bridge the gap; that’s what we actually used to perform the calculation.
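For illustration, a typical software fallback of the kind those generic equivalents use is the textbook parallel-bits (“SWAR”) popcount below; it is shown here only to make “generic equivalent” concrete, not as the code FreeBSD ships.
#include <stdint.h>

/* Portable population count for systems without a POPCNT instruction. */
static inline unsigned
popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return ((unsigned)((x * 0x0101010101010101ULL) >> 56));
}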
We’ve already reduced the number of counts from N to N/64. We can go one step further. Given a page N, where X <= N < Y, we can also count the bits either forwards (from X) or backwards (from Y), whichever boundary is closer. This reduces the number of counts by half again.
Finally, in order to further limit the memory usage of this cache, we accept calling __bitcount64 a few times per lookup. By adopting a cache interval of 1024 bits, we require, on average (in a perfectly random access pattern), 4 counts’ worth of CPU time per lookup (the target bit is, on average, 256 bits, or four 64-bit words, away from the nearer cached boundary), and one cached count per 1024 bits of bitmap in memory.
Putting it together
To illustrate the relationship between the cache and the core’s bitmap:
cache: |         A         |         B         |         C         |   ...   |
                 |                   |                   |
bits:  | A0, A1, ... A1023 | B0, B1, ... B1023 | C0, C1, ... C1023 | D0, ... |
While generating the cache, each value is computed in turn. Cache value A is the total count of all bits set through A1023 inclusive, B is the total of all bits set through B1023 inclusive, and so forth.
To find the offset for page B122, we look up the cached count A, then compute and add the Hamming weight between bits B0 and B121, to find the total number of pages before page B122.
Similarly, to find the offset for page B700, we look up the cached count B (the count through B1023), then compute the number of bits set between B700 and B1023, and subtract that.
The offset calculation now looks like this:
pa = va_to_pa(va)
bit = pa / 4096
popcount_bin = round(bit / 1024)
bb = bits_before(pa) = cache[popcount_bin] + popcount(bit, popcount_bin * 1024)
offset_for(pa) = pmap_base_offset + bb * 4096
First, we find the physical address for the requested virtual address, and from it the bit number in the bitmap. Then we determine the cache popcount bin, rounding up to the next bin in case we’re counting down. popcount encapsulates both counting up and down, as well as counting the bits in the given range. This in turn is converted to a relative offset simply by multiplying by the page size of 4096. Finally, the relative offset is added to the sparse page array’s base offset (pmap_base_offset), to provide the location of the requested address.
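Here is a compact sketch of the whole idea in C. It mirrors the description above rather than the committed libkvm code: only the forward-counting case is shown, the 1024-bit interval is hard-coded, and __builtin_popcountll stands in for __bitcount64.
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE       4096
#define POPCOUNT_BITS   1024    /* the cache interval described above */

/*
 * Build the cache: counts[i] holds the number of bits set before the
 * first bit of interval i.  Assumes nbits is a multiple of 64.
 */
static uint32_t *
build_popcount_cache(const uint64_t *bitmap, uint64_t nbits)
{
    uint64_t nbins = (nbits + POPCOUNT_BITS - 1) / POPCOUNT_BITS;
    uint32_t *counts = calloc(nbins, sizeof(*counts));
    uint64_t total = 0;

    if (counts == NULL)
        return (NULL);
    for (uint64_t bit = 0; bit < nbits; bit += 64) {
        if (bit % POPCOUNT_BITS == 0)
            counts[bit / POPCOUNT_BITS] = (uint32_t)total;
        total += __builtin_popcountll(bitmap[bit / 64]);
    }
    return (counts);
}

/* Bits set below 'bit': one array lookup plus a short local count. */
static uint64_t
bits_before(const uint64_t *bitmap, const uint32_t *counts, uint64_t bit)
{
    uint64_t start = (bit / POPCOUNT_BITS) * POPCOUNT_BITS;
    uint64_t count = counts[bit / POPCOUNT_BITS];

    for (uint64_t b = start; b + 64 <= bit; b += 64)
        count += __builtin_popcountll(bitmap[b / 64]);
    if (bit % 64 != 0)
        count += __builtin_popcountll(bitmap[bit / 64] &
            ((1ULL << (bit % 64)) - 1));
    return (count);
}

/* File offset of the dumped page holding physical address 'pa'. */
static uint64_t
offset_for(const uint64_t *bitmap, const uint32_t *counts,
    uint64_t pmap_base_offset, uint64_t pa)
{
    return (pmap_base_offset +
        bits_before(bitmap, counts, pa / PAGE_SIZE) * PAGE_SIZE);
}
A production version, like the one now in libkvm, also counts backwards from the next cached value when the target bit is in the upper half of its interval, which is what keeps the average near four 64-bit counts per lookup.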
This means the formula is now just an array index, which is O(1), while popcount costs a small number of CPU instructions that does not grow with the requested bit number N. Compare this to the previous implementation, where each page required a separate hash table entry, and each lookup required computing a hash on the address and then linearly searching the matching hash table bucket.
Results
For typical manual debugger use, the impact of this change isn’t noticeable, which is probably why the hash table implementation has been in use for 10 years. However, for any automated debugging process, the extra latency adds up quickly.
The most obvious use case for stock FreeBSD is the crashinfo utility. This program is run from the savecore rc startup script mentioned in the introduction. crashinfo is a script that processes a kernel core dump, and runs other utilities that can operate on core dumps, including: ps, vmstat, pstat, iostat, ipcs, nfsstat, netstat, and fstat.
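To regenerate that report by hand for a saved core, something like the following should work (the path assumes the default /var/crash location and dump number 0):
# crashinfo /var/crash/vmcore.0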
For a sample 8GB kernel core file (generated on a 128GB server), crashinfo improves from (as reported by /usr/bin/time -l):
43.72 real 33.25 user 2.94 sys
187884 maximum resident set size
348 average shared memory size
28 average unshared data size
126 average unshared stack size
505312 page reclaims
183 page faults
0 swaps
20131 block input operations
7650 block output operations
2 messages sent
2 messages received
0 signals received
30741 voluntary context switches
275 involuntary context switches
to:
9.43 real 2.58 user 1.47 sys
134648 maximum resident set size
1906 average shared memory size
127 average unshared data size
126 average unshared stack size
220114 page reclaims
0 page faults
0 swaps
11 block input operations
6828 block output operations
2 messages sent
2 messages received
0 signals received
9080 voluntary context switches
50 involuntary context switches
So in this case, crashinfo now takes roughly 78% less time and uses 28% less memory.
The benefits are similarly obvious for automated debugger frameworks like the one developed by Backtrace.
With a manually-driven debugger like kgdb, a developer enters commands and processes the output manually. This means only a handful of lookups are required in a relatively long (to a CPU) amount of time. Backtrace debuggers, on the other hand, issue thousands of such lookups per second.
Epilogue
Backtrace began shipping a version of this performance improvement in ptrace in February 2016. This enables us to also offer significantly faster tracing of FreeBSD kernel cores to customers running current and older releases of FreeBSD.
On July 17, 2016, our work improving libkvm scaling was committed to FreeBSD/head. It will ship with FreeBSD 12.0.