A core file represents the state of a process at a point in time. It contains all the information that an engineer needs to inspect the process and its state even after the process has exited. This information includes thread information, mapped memory, register state, and more. By using a debugger with the core file, engineers can interact with and inspect the state of the process as if they had attached a debugger to the process at the time the core file was generated.
Ever wonder what exactly is contained in a core dump and how debuggers interact with one? This post is for you. We will explore how cores are generated and how software interacts with them.
Generating Core Files
Both FreeBSD and Linux initiate a core dump for a process when that process receives certain unhandled signals. Both create core dumps for SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGFPE, SIGSEGV, SIGBUS, SIGSYS, and SIGEMT; Linux also creates core dumps for SIGXCPU and SIGXFSZ. Users can also initiate core dumps manually using tools such as gcore. Each operating system supports a variety of core file formats, but the default and most common format is ELF, so we will focus on that.
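Besides using gcore, the simplest way to produce a core is to let a process die from one of the signals above. The following minimal sketch just calls abort(); where the kernel writes the resulting file depends on system configuration (for example, the kern.corefile sysctl on FreeBSD or /proc/sys/kernel/core_pattern on Linux), and core dumps must not be disabled by resource limits (ulimit -c):
/* crash.c: a minimal sketch of producing a core file without gcore by
 * dying from an unhandled SIGABRT. */
#include <stdlib.h>

int
main(void)
{
    abort();    /* raises SIGABRT; with no handler installed, the kernel dumps core */
}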
ELF Core Files
Let’s start by generating a core file on FreeBSD (examples in this post refer to FreeBSD unless otherwise indicated):
$ gcore -c vim.core `pgrep vim`
Because the core file is an ELF file, we can inspect it using the readelf tool:
$ readelf --file-header vim.core
ELF Header:
Magic: 7f 45 4c 46 02 01 01 09 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - FreeBSD
ABI Version: 0
Type: CORE (Core file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x0
Start of program headers: 64 (bytes into file)
Start of section headers: 0 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 79
Size of section headers: 64 (bytes)
Number of section headers: 0
Section header string table index: 0
Note that the ELF file type is “Core file”. Although the ELF specification reserves a specific type value for core files, it unfortunately does not specify anything about the contents of such files, and formal specifications in this area are hard to come by. In practice, both Linux and FreeBSD store most data about the process in a series of ELF notes. Let’s see what notes are present in the core file we just created:
$ readelf --notes vim.core
Notes at offset 0x00001188 with length 0x0000ad18:
Owner Data size Description
FreeBSD 0x00000078 NT_PRPSINFO (prpsinfo structure)
FreeBSD 0x000000e0 NT_PRSTATUS (prstatus structure)
FreeBSD 0x00000200 NT_FPREGSET (floating point registers)
FreeBSD 0x00000018 NT_THRMISC (thrmisc structure)
FreeBSD 0x000000e0 NT_PRSTATUS (prstatus structure)
FreeBSD 0x00000200 NT_FPREGSET (floating point registers)
FreeBSD 0x00000018 NT_THRMISC (thrmisc structure)
FreeBSD 0x00000884 NT_PROCSTAT_PROC (proc data)
FreeBSD 0x0000156c NT_PROCSTAT_FILES (files data)
FreeBSD 0x00008574 NT_PROCSTAT_VMMAP (vmmap data)
FreeBSD 0x00000008 NT_PROCSTAT_GROUPS (groups data)
FreeBSD 0x00000006 NT_PROCSTAT_UMASK (umask data)
FreeBSD 0x000000d4 NT_PROCSTAT_RLIMIT (rlimit data)
FreeBSD 0x00000008 NT_PROCSTAT_OSREL (osreldate data)
FreeBSD 0x0000000c NT_PROCSTAT_PSSTRINGS (ps_strings data)
FreeBSD 0x00000114 NT_PROCSTAT_AUXV (auxv data)
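Each of these entries is an ELF note: a small header giving the owner-name size, descriptor size, and type, followed by the owner name and the descriptor (the actual payload), each padded to 4-byte alignment. Here is a minimal sketch (no error handling; assumes a 64-bit, little-endian core and FreeBSD headers; on Linux the same types come from <elf.h>) that walks the PT_NOTE segment and prints each note:
/* notes_walk.c: find the PT_NOTE segment of a core file and print each
 * note's owner, type, and descriptor size. */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/elf64.h>  /* Elf64_Ehdr, Elf64_Phdr, Elf64_Nhdr; use <elf.h> on Linux */
#include <fcntl.h>
#include <stdio.h>

#define ROUNDUP4(x) (((x) + 3) & ~(size_t)3)

int
main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDONLY);
    struct stat st;

    fstat(fd, &st);
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    Elf64_Ehdr *ehdr = (Elf64_Ehdr *)base;
    Elf64_Phdr *phdr = (Elf64_Phdr *)(base + ehdr->e_phoff);

    for (int i = 0; i < ehdr->e_phnum; i++) {
        if (phdr[i].p_type != PT_NOTE)
            continue;
        char *p = base + phdr[i].p_offset;
        char *end = p + phdr[i].p_filesz;

        while (p < end) {
            Elf64_Nhdr *nhdr = (Elf64_Nhdr *)p;
            char *name = p + sizeof(*nhdr);

            /* The descriptor follows the (padded) owner name. */
            printf("%-8s type=%#x size=%#x\n", name,
                (unsigned)nhdr->n_type, (unsigned)nhdr->n_descsz);
            p = name + ROUNDUP4(nhdr->n_namesz) + ROUNDUP4(nhdr->n_descsz);
        }
    }
    return 0;
}
Run against vim.core, this should print one line per note, matching the readelf output above.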
General Process Information
The first note is a binary blob representing an instance of struct prpsinfo, defined on FreeBSD in /usr/include/sys/procfs.h:
typedef struct prpsinfo {
    int     pr_version;             /* Version number of struct (1) */
    size_t  pr_psinfosz;            /* sizeof(prpsinfo_t) (1) */
    char    pr_fname[PRFNAMESZ+1];  /* Command name, null terminated (1) */
    char    pr_psargs[PRARGSZ+1];   /* Arguments, null terminated (1) */
} prpsinfo_t;
This struct just contains some general information about the process; most of the meaty data is stored elsewhere.
Thread Information
The next notes are NT_PRSTATUS, NT_FPREGSET, and NT_THRMISC, twice each. These note types contain various data about threads, and there is a separate instance of each for every thread in the process. Because this process had two threads when we generated the core file, we see two instances of each type. Each NT_PRSTATUS note contains a binary blob representing an instance of struct prstatus, defined in the same header file:
typedef struct prstatus {
    int         pr_version;     /* Version number of struct (1) */
    size_t      pr_statussz;    /* sizeof(prstatus_t) (1) */
    size_t      pr_gregsetsz;   /* sizeof(gregset_t) (1) */
    size_t      pr_fpregsetsz;  /* sizeof(fpregset_t) (1) */
    int         pr_osreldate;   /* Kernel version (1) */
    int         pr_cursig;      /* Current signal (1) */
    pid_t       pr_pid;         /* Process ID (1) */
    gregset_t   pr_reg;         /* General purpose registers (1) */
} prstatus_t;
Here we have information that allows us to determine the state of each thread, most notably register values and signal number (if applicable). The register set here only contains values for general registers and certain control registers. For example, on FreeBSD running on x86-64, gregset_t is defined as:
struct __reg64 {
    __int64_t   r_r15;
    __int64_t   r_r14;
    __int64_t   r_r13;
    __int64_t   r_r12;
    __int64_t   r_r11;
    __int64_t   r_r10;
    __int64_t   r_r9;
    __int64_t   r_r8;
    __int64_t   r_rdi;
    __int64_t   r_rsi;
    __int64_t   r_rbp;
    __int64_t   r_rbx;
    __int64_t   r_rdx;
    __int64_t   r_rcx;
    __int64_t   r_rax;
    __uint32_t  r_trapno;
    __uint16_t  r_fs;
    __uint16_t  r_gs;
    __uint32_t  r_err;
    __uint16_t  r_es;
    __uint16_t  r_ds;
    __int64_t   r_rip;
    __int64_t   r_cs;
    __int64_t   r_rflags;
    __int64_t   r_rsp;
    __int64_t   r_ss;
};
The values of floating point registers are contained in the NT_FPREGSET notes. The NT_THRMISC notes really only contain the name of each thread:
typedef struct thrmisc {
    char    pr_tname[MAXCOMLEN+1];  /* Thread name, null terminated (1) */
    u_int   _pad;                   /* Convenience pad, 0-filled (1) */
} thrmisc_t;
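To make this concrete, here is a small sketch (FreeBSD/amd64 only, and assuming you already have a pointer to the descriptor of an NT_PRSTATUS note, for example from the note-walking sketch earlier) that interprets the blob as a prstatus_t and prints the thread's signal and a couple of registers:
/* A sketch, assuming a FreeBSD/amd64 core: interpret the descriptor of an
 * NT_PRSTATUS note as a prstatus_t and print the state of the thread. */
#include <sys/types.h>
#include <sys/procfs.h>   /* prstatus_t; pr_reg is a gregset_t (struct reg) */
#include <machine/reg.h>  /* struct __reg64 field names (r_rip, r_rsp, ...) */
#include <stdio.h>

static void
print_thread_state(const void *desc)
{
    const prstatus_t *st = desc;

    printf("pid %d, cursig %d, rip %#lx, rsp %#lx\n",
        (int)st->pr_pid, st->pr_cursig,
        (long)st->pr_reg.r_rip, (long)st->pr_reg.r_rsp);
}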
libprocstat
All of the notes we have seen so far are common between FreeBSD and Linux, although the exact details of the internal structs may vary. However, on FreeBSD, after these notes we see a number of notes whose types start with NT_PROCSTAT. These notes are meant to be opaque to general users and accessible only via libprocstat, a FreeBSD library whose basic API is available here.
On Linux, these notes are replaced with equivalent structs whose contents are transparent to users. Most of the NT_PROCSTAT notes correspond directly with libprocstat API calls. For example, the NT_PROCSTAT_PROC note contains the data that is exposed by procstat_getprocs. This call returns an array of struct kinfo_proc, which is defined on FreeBSD in /usr/include/sys/user.h and contains a wide variety of metadata about a thread, such as signal masks, stack size, and start time.
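As a quick illustration (a sketch with no error handling, much like the larger example below), opening a core file with procstat_open_core and calling procstat_getprocs is enough to recover basic process metadata from the NT_PROCSTAT_PROC note:
/* A sketch: read the NT_PROCSTAT_PROC data from a core file via libprocstat
 * and print a few kinfo_proc fields. */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/socket.h>
#include <sys/sysctl.h>
#include <sys/user.h>
#include <libprocstat.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    struct procstat *ps = procstat_open_core(argv[1]);
    unsigned int count;
    struct kinfo_proc *procs = procstat_getprocs(ps, KERN_PROC_PROC, 0, &count);

    printf("%s: pid %d, %d threads\n",
        procs[0].ki_comm, (int)procs[0].ki_pid, procs[0].ki_numthreads);
    return 0;
}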
Memory Mappings
So far we have enough information to examine the threads in a process and determine what state they are in. However, we still don’t have any information about the process’ memory, so we can’t yet determine the values of any variables (except those stored in registers). FreeBSD provides information about mapped memory segments via the procstat_getvmmap call, while Linux stores it as binary data in a note of type NT_FILE. This information is logically equivalent to the output of procstat -v on FreeBSD or the contents of /proc/<pid>/maps on Linux. The FreeBSD procstat_getvmmap call returns an array of the following struct:
struct kinfo_vmentry {
    int         kve_structsize;         /* Variable size of record. */
    int         kve_type;               /* Type of map entry. */
    uint64_t    kve_start;              /* Starting address. */
    uint64_t    kve_end;                /* Finishing address. */
    uint64_t    kve_offset;             /* Mapping offset in object */
    uint64_t    kve_vn_fileid;          /* inode number if vnode */
    uint32_t    kve_vn_fsid;            /* dev_t of vnode location */
    int         kve_flags;              /* Flags on map entry. */
    int         kve_resident;           /* Number of resident pages. */
    int         kve_private_resident;   /* Number of private pages. */
    int         kve_protection;         /* Protection bitmask. */
    int         kve_ref_count;          /* VM obj ref count. */
    int         kve_shadow_count;       /* VM obj shadow count. */
    int         kve_vn_type;            /* Vnode type. */
    uint64_t    kve_vn_size;            /* File size. */
    uint32_t    kve_vn_rdev;            /* Device id if device. */
    uint16_t    kve_vn_mode;            /* File mode. */
    uint16_t    kve_status;             /* Status flags. */
    int         _kve_ispare[12];        /* Space for more stuff. */
    /* Truncated before copyout in sysctl */
    char        kve_path[PATH_MAX];     /* Path to VM obj, if any. */
};
Let’s create a simple program using the libprocstat API to examine the process’ virtual memory map and look at the output (note: do not use this code in production; for brevity’s sake it does not include any error checking or cleanup):
$ cat vm_print.c
/* These includes are required by libprocstat.h */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/socket.h>
#include <sys/user.h>
#include <libprocstat.h>
#include <stdio.h>
#include <sys/sysctl.h>
static const char *vm_type_names[] = {
    [KVME_TYPE_NONE] = "NONE",
    [KVME_TYPE_DEFAULT] = "DEFAULT",
    [KVME_TYPE_VNODE] = "VNODE",
    [KVME_TYPE_SWAP] = "SWAP",
    [KVME_TYPE_DEVICE] = "DEVICE",
    [KVME_TYPE_PHYS] = "PHYS",
    [KVME_TYPE_DEAD] = "DEAD",
    [KVME_TYPE_SG] = "SG",
    [KVME_TYPE_MGTDEVICE] = "MGTDEVICE",
    [KVME_TYPE_UNKNOWN] = "UNKNOWN",
};

int
main(int argc, char **argv)
{
    struct procstat *ps = procstat_open_core(argv[1]);
    unsigned int n_proc, n_vm, i;
    struct kinfo_proc *procs = procstat_getprocs(ps, KERN_PROC_PROC, 0,
        &n_proc);
    struct kinfo_vmentry *vms = procstat_getvmmap(ps, &procs[0], &n_vm);

    printf("Start\t\tEnd\t\tProtection\tDump Core\tType\t\tPath\n");
    for (i = 0; i < n_vm; ++i) {
        struct kinfo_vmentry *vm = &vms[i];

        printf("%#llx\t%#llx\t%c%c%c\t\t%c\t\t%s\t\t%s\n",
            (unsigned long long)vm->kve_start,
            (unsigned long long)vm->kve_end,
            (vm->kve_protection & KVME_PROT_READ) ? 'r' : '-',
            (vm->kve_protection & KVME_PROT_WRITE) ? 'w' : '-',
            (vm->kve_protection & KVME_PROT_EXEC) ? 'x' : '-',
            (vm->kve_flags & KVME_FLAG_NOCOREDUMP) ? 'n' : 'y',
            vm_type_names[vm->kve_type],
            vm->kve_path);
    }
    return 0;
}
$ clang -o vm_print vm_print.c -lprocstat
$ ./vm_print vim.core | head -n 30
Start End Protection Dump Core Type Path
0x400000 0x695000 r-x n VNODE /usr/local/bin/vim
0x895000 0x8aa000 rw- y VNODE /usr/local/bin/vim
0x8aa000 0x8b8000 rw- y DEFAULT
0x800895000 0x8008b0000 r-x n VNODE /libexec/ld-elf.so.1
0x8008b0000 0x8008b9000 rw- y DEFAULT
0x8008b9000 0x80091b000 rw- y DEFAULT
0x80091b000 0x800924000 rw- y DEFAULT
0x800ab0000 0x800ab2000 rw- y DEFAULT
0x800ab2000 0x800be5000 r-x n VNODE /usr/local/lib/libX11.so.6.3.0
0x800be5000 0x800de5000 --- n DEFAULT
0x800de5000 0x800deb000 rw- y VNODE /usr/local/lib/libX11.so.6.3.0
0x800deb000 0x800dfd000 r-x n VNODE /usr/local/lib/libXpm.so.4.11.0
0x800dfd000 0x800ffd000 --- n DEFAULT
0x800ffd000 0x800ffe000 rw- y VNODE /usr/local/lib/libXpm.so.4.11.0
0x800ffe000 0x80105a000 r-x n VNODE /usr/local/lib/libXt.so.6.0.0
0x80105a000 0x80125a000 --- n DEFAULT
0x80125a000 0x801260000 rw- y VNODE /usr/local/lib/libXt.so.6.0.0
0x801260000 0x801687000 r-x n VNODE /usr/local/lib/libgtk-x11-2.0.so.0.2400.27
0x801687000 0x801886000 --- n DEFAULT
0x801886000 0x801893000 rw- y VNODE /usr/local/lib/libgtk-x11-2.0.so.0.2400.27
0x801893000 0x801895000 rw- y NONE
0x801895000 0x801946000 r-x n VNODE /usr/local/lib/libgdk-x11-2.0.so.0.2400.27
0x801946000 0x801b45000 --- n DEFAULT
0x801b45000 0x801b4b000 rw- y VNODE /usr/local/lib/libgdk-x11-2.0.so.0.2400.27
0x801b4b000 0x801b64000 r-x n VNODE /lib/libthr.so.3
0x801b64000 0x801d63000 --- n DEFAULT
0x801d63000 0x801d65000 rw- y VNODE /lib/libthr.so.3
0x801d65000 0x801d70000 rw- y DEFAULT
0x801d70000 0x801d8f000 r-x n VNODE /usr/local/lib/libgdk_pixbuf-2.0.so.0.3100.2
Now we know the addresses of the process’ mapped memory segments and, where appropriate, which file the segments were mapped from. Suppose that we’re interested in actually reading data from these segments. Both Linux and FreeBSD store the contents of (some of) these as ELF file segments. Let’s take a look at the ELF program headers in our core file, which provide information about the file’s segments:
$ readelf --segments vim.core | head -n 40
Elf file type is CORE (Core file)
Entry point 0x0
There are 79 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
NOTE 0x0000000000001188 0x0000000000000000 0x0000000000000000
0x000000000000ad18 0x0000000000000000 R 4
LOAD 0x000000000000c000 0x0000000000895000 0x0000000000000000
0x0000000000015000 0x0000000000015000 RW 1000
LOAD 0x0000000000021000 0x00000000008aa000 0x0000000000000000
0x000000000000e000 0x000000000000e000 RW 1000
LOAD 0x000000000002f000 0x00000008008b0000 0x0000000000000000
0x0000000000009000 0x0000000000009000 RW 1000
LOAD 0x0000000000038000 0x00000008008b9000 0x0000000000000000
0x0000000000062000 0x0000000000062000 RW 1000
LOAD 0x000000000009a000 0x000000080091b000 0x0000000000000000
0x0000000000009000 0x0000000000009000 RW 1000
LOAD 0x00000000000a3000 0x0000000800ab0000 0x0000000000000000
0x0000000000002000 0x0000000000002000 RW 1000
LOAD 0x00000000000a5000 0x0000000800de5000 0x0000000000000000
0x0000000000006000 0x0000000000006000 RW 1000
LOAD 0x00000000000ab000 0x0000000800ffd000 0x0000000000000000
0x0000000000001000 0x0000000000001000 RW 1000
LOAD 0x00000000000ac000 0x000000080125a000 0x0000000000000000
0x0000000000006000 0x0000000000006000 RW 1000
LOAD 0x00000000000b2000 0x0000000801886000 0x0000000000000000
0x000000000000d000 0x000000000000d000 RW 1000
LOAD 0x00000000000bf000 0x0000000801b45000 0x0000000000000000
0x0000000000006000 0x0000000000006000 RW 1000
LOAD 0x00000000000c5000 0x0000000801d63000 0x0000000000000000
0x0000000000002000 0x0000000000002000 RW 1000
LOAD 0x00000000000c7000 0x0000000801d65000 0x0000000000000000
0x000000000000b000 0x000000000000b000 RW 1000
LOAD 0x00000000000d2000 0x0000000801f8f000 0x0000000000000000
0x0000000000001000 0x0000000000001000 RW 1000
LOAD 0x00000000000d3000 0x00000008021de000 0x0000000000000000
0x0000000000003000 0x0000000000003000 RW 1000
The first segment is of type NOTE, and it contains all of the note data we’ve been looking at so far. However, there are also many segments with ELF type LOAD. Let’s look specifically at the starting virtual memory addresses for the LOAD segments:
$ readelf --segments vim.core | awk '$1~/LOAD/ { sub(/x0*/, "x", $3); print $3 }' | head -n 20
0x895000
0x8aa000
0x8008b0000
0x8008b9000
0x80091b000
0x800ab0000
0x800de5000
0x800ffd000
0x80125a000
0x801886000
0x801b45000
0x801d63000
0x801d65000
0x801f8f000
0x8021de000
0x802426000
0x80272e000
0x80293a000
0x802b41000
0x802d58000
Which segments are included?
Note that the above addresses all correspond to virtual memory segments reported by the libprocstat API. However, some segments returned by libprocstat are missing in this list, such as those starting at 0x800895000, 0x800be5000, and 0x801893000. On FreeBSD, there are several criteria that mapped memory segments must satisfy in order to be included in a core file:
- They must have at least one of read, write, and execute permissions. If ELF legacy coredump mode is enabled (via the elf64_legacy_coredump or elf32_legacy_coredump sysctl), then the segment must have both read and write permissions.
- They must not have been marked as exempt from core dumps. Users can mark memory maps as exempt from core dumps using the MAP_NOCORE flag with mmap or the MADV_NOCORE flag with madvise (see the sketch after this list).
- They must not be submaps. In practice, this means that kernel submaps, such as signal trampolines, will be excluded.
- They must be backed by physical memory (either on disk or in volatile memory). For instance, segments backed by devices or files in a procfs filesystem will not be included in a core file.
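The exemption in the second criterion is under the program's control. A minimal sketch of both approaches on FreeBSD (on Linux the closest analogue is madvise with MADV_DONTDUMP):
/* A sketch of excluding a mapping from core dumps on FreeBSD: pass
 * MAP_NOCORE to mmap at allocation time, or mark an existing region with
 * madvise(MADV_NOCORE). */
#include <sys/mman.h>
#include <stddef.h>

#define REGION_SIZE (1024 * 1024)

void *
alloc_nocore_buffer(void)
{
    /* Excluded from any future core file from the start. */
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE | MAP_NOCORE, -1, 0);

    /* Equivalent after-the-fact marking of an existing mapping. */
    madvise(p, REGION_SIZE, MADV_NOCORE);
    return p;
}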
Executable segments from libraries and other binary files are typically mapped with the MAP_ENTRY_NOCOREDUMP flag (the kernel-internal equivalent of MAP_NOCORE), so they are not included in core dumps. Data from these segments is sometimes of interest to libraries such as libthread_db and libunwind, but these segments can be quite large and their data is read-only, so if needed we can just read the data directly from the binary files on disk. The bad news here is that if these files have been modified since the core file was created, then we won’t be able to read data (or worse, will read incorrect data) from their segments; the good news is that these segments only contain executable instructions and not variable data, so missing them won’t prevent us from determining the values of any variables.
Reading from process memory
For memory segments stored in the core file, the Offset and FileSiz fields in the program header tell us where to actually read the data for the mapping. Per the ELF specification, the Offset field indicates the offset within the ELF file where the segment's contents start, and the FileSiz field indicates how many bytes the segment occupies within the file (in the output above, FileSiz and MemSiz are equal for every LOAD segment, since each mapping is dumped in full).
For example, suppose we wanted to read a variable whose address in the original process was 0x8008b0100. First, we have to find the virtual memory segment containing the data we want based on start and end addresses. In this case it is the segment 0x8008b0000-0x8008b9000. Then we have to determine where to find the contents of that segment, which we can read from the corresponding ELF segment with the matching start address. This segment starts at file offset 0x2f000, and the position of the desired memory within that segment is 0x8008b0100 - 0x8008b0000, or 0x100. After adding these together, we get an overall position of 0x2f100 in the core file, which is where we can find this variable’s value.
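This lookup is mechanical enough to sketch in a few lines. Assuming the core file has been mmap'ed as in the note-walking sketch earlier, the following hypothetical helper finds the LOAD segment containing an address and returns a pointer to its bytes within the core file:
/* A sketch of the address-to-file-offset computation described above: find
 * the PT_LOAD segment containing a virtual address and return a pointer to
 * the corresponding bytes in the mmap'ed core file (or NULL if the address
 * was not dumped). */
#include <sys/types.h>
#include <sys/elf64.h>  /* use <elf.h> on Linux */
#include <stddef.h>
#include <stdint.h>

static const void *
core_resolve_addr(const char *base, uint64_t addr)
{
    const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *)base;
    const Elf64_Phdr *phdr = (const Elf64_Phdr *)(base + ehdr->e_phoff);

    for (int i = 0; i < ehdr->e_phnum; i++) {
        if (phdr[i].p_type != PT_LOAD)
            continue;
        if (addr >= phdr[i].p_vaddr &&
            addr < phdr[i].p_vaddr + phdr[i].p_memsz)
            /* file offset = segment offset + offset within segment */
            return base + phdr[i].p_offset + (addr - phdr[i].p_vaddr);
    }
    return NULL;    /* address not present in the core file */
}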
Memory segments in other files
As mentioned earlier, some memory segments are not actually included in the core file. In the case of executable segments, we can instead read the relevant data directly from the binary file. On FreeBSD, the kve_path field of struct kinfo_vmentry gives us the path of the file to read from and the kve_offset field tells us the offset of the segment within that file. On Linux, each entry in the NT_FILE note has corresponding fields. Given these pieces of information, we can apply the same logic as above to find the target memory location.
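The arithmetic is the same as before, just applied to the backing file instead of the core. A sketch using the FreeBSD libprocstat fields mentioned above (on Linux you would take the equivalent values from the NT_FILE note entry):
/* A sketch: for a mapping that was not dumped, compute where in the backing
 * file (kve_path) the bytes for a given virtual address live. */
#include <sys/param.h>
#include <sys/user.h>   /* struct kinfo_vmentry */
#include <stdint.h>

static uint64_t
backing_file_offset(const struct kinfo_vmentry *vm, uint64_t addr)
{
    /* mapping's offset within the file + the address's offset within the mapping */
    return vm->kve_offset + (addr - vm->kve_start);
}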
Limitations
ELF core files store memory mapping information as program headers, and the field in the ELF file header indicating the number of program headers is stored as a 16-bit unsigned integer, even in the 64-bit version of ELF. Therefore any program with more than (2^16) - 1 (65535) memory maps would cause an overflow in this field. Older core dump implementations did not account for this, and as a result core files appeared to be missing many of their memory segments. Newer versions of the Linux kernel get around this by setting this field (e_phnum) to 0xffff for any map count that would overflow, and then storing the actual count in the sh_info field (stored as a 32-bit unsigned integer) of the first section header. However, the problem still persists in FreeBSD.
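For core files that do use this extended numbering (i.e. newer Linux kernels), recovering the real count looks roughly like the sketch below; libelf's elf_getphdrnum() is intended to perform the same dance for you:
/* A sketch of reading the real program header count from an mmap'ed ELF
 * file: when e_phnum holds the sentinel PN_XNUM (0xffff), the true count is
 * in the sh_info field of section header 0. */
#include <sys/types.h>
#include <sys/elf64.h>  /* use <elf.h> on Linux */
#include <stdint.h>

#ifndef PN_XNUM
#define PN_XNUM 0xffff
#endif

static uint64_t
elf_phdr_count(const char *base)
{
    const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *)base;

    if (ehdr->e_phnum != PN_XNUM)
        return ehdr->e_phnum;

    /* Extended numbering: consult the first section header. */
    const Elf64_Shdr *shdr0 = (const Elf64_Shdr *)(base + ehdr->e_shoff);
    return shdr0->sh_info;
}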
Signal Information
Another very important part of process state is information about signals sent to the process. When a signal is caught by a user-supplied signal handler, the stack trace will make it clear that a signal is present because the stack will contain a frame for the signal trampoline. However, when the operating system initiates a core dump because of an unhandled signal such as SIGABRT or SIGSEGV, the stack trace will provide no indication of a signal. Linux core files provide a note of type NT_SIGINFO for each thread; each of these notes is an instance of siginfo_t, which contains the signal number, code, and additional signal-specific details (such as the faulting address in the case of SIGSEGV).
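A sketch of what a tool might do with such a note on Linux, assuming a pointer to the note's descriptor:
/* A sketch, assuming a Linux core: the descriptor of an NT_SIGINFO note is
 * a siginfo_t, so we can report the signal number, code, and, for faulting
 * signals such as SIGSEGV, the faulting address. */
#include <signal.h>
#include <stdio.h>

static void
print_siginfo(const void *desc)
{
    const siginfo_t *si = desc;

    printf("signal %d, code %d, fault address %p\n",
        si->si_signo, si->si_code, si->si_addr);
}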
Signals on FreeBSD
FreeBSD does not provide such detailed information. Instead, as shown earlier, the struct prstatus provided for each thread includes a field int pr_cursig whose value represents the number of the signal received by the process (or 0 if there is no signal present). It does not provide any further information about the signal, though. This field has the same value across all threads, so if a process dumps core due to an unhandled SIGABRT, for instance, every single struct prstatus will have its pr_cursig field set to SIGABRT. However, if a specific thread causes a signal (for example, a thread dereferences a NULL pointer, causing a SIGSEGV), then that thread will be the first one listed in the core file.
Use for Debugging
The information discussed so far is typically sufficient for traditional debugging purposes, i.e. determining the cause of a specific fault and inspecting the state of a process at the time of that fault. Some work is required to synthesize this information into a useful form, such as unwinding stack frames and determining the values of specific variables, but in terms of the raw information that needs to be stored in a core file, nothing else is really needed. The instructions for synthesizing the information described above are typically contained in DWARF debug information, either in the original executable file or in a standalone file containing only this debug information.
Comparison with Backtrace Snapshots
The Backtrace snapshot format is extensible and, among other things, contains detailed callstacks (with compiler optimizations accounted for), objects such as variables, threads, and system information. Our debuggers try to determine which regions of memory are required to get to the root cause and are much more selective about what to persist in a snapshot. In contrast, core files contain raw data such as register values and memory contents. The additional information in Backtrace files enables various forms of large-scale trend analysis across multiple crashes. Let’s compare the time and memory used to generate each type of file for a Chromium process running on Linux on my laptop:
|                    | Time (s) | Memory (kB) |
|--------------------|----------|-------------|
| Coredump           | 58.73    | 468825      |
| Backtrace Snapshot | 1.65     | 365         |
The Backtrace file was generated more than 35 times faster and occupies less than 0.1% of the disk space of the core file, despite also analyzing DWARF debug information on the fly, evaluating variable values, and performing various types of analysis on the resulting data. For instance, in addition to information about the traced process, the Backtrace snapshot also collects system-level data such as operating system version, overall system memory usage, and other data that could be useful to engineers analyzing a crash after the fact; identifies common classes of crashes such as NULL pointer dereferences and stack overflows; and annotates specific variables that could be of special interest to engineers.
Another advantage of the Backtrace file format is that it is entirely self-contained and can be viewed on any machine with the Backtrace tools, unlike core files, which typically must be viewed in the same environment in which they were created. The snapshot can also have all sorts of blobs attached to it, so engineers can recreate the faulting environment's assets in one command.