A core file represents the state of a process at a point in time. It contains all the information that an engineer needs to inspect the process and its state even after the process has exited. This information includes thread information, mapped memory, register state, and more. By using a debugger with the core file, engineers can interact with and inspect the state of the process as if they had attached a debugger to the process at the time the core file was generated.
Ever wonder what exactly is contained in a core dump and how debuggers interact with one? This post is for you. We will explore how cores are generated and how software interacts with them.
Generating Core Files
Both FreeBSD and Linux initiate a core dump for a process when that process receives certain unhandled signals. Both create core dumps for SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGFPE, SIGSEGV, SIGBUS, SIGSYS, and SIGEMT; Linux also creates core dumps for SIGXCPU and SIGXFSZ. Users can also initiate core dumps manually using tools such as gcore. Each operating system supports a variety of core file formats, but the default and most common format is ELF, so we will focus on that.
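Besides using gcore, the simplest way to produce a core is to let a process die from one of the signals above. The following minimal sketch just calls abort(); where the kernel writes the resulting file depends on system configuration (for example, the kern.corefile sysctl on FreeBSD or /proc/sys/kernel/core_pattern on Linux), and core dumps must not be disabled by resource limits (ulimit -c):
/* crash.c: a minimal sketch of producing a core file without gcore by
 * dying from an unhandled SIGABRT. */
#include <stdlib.h>

int
main(void)
{
    abort();    /* raises SIGABRT; with no handler installed, the kernel dumps core */
}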
ELF Core Files
Let’s start by generating a core file on FreeBSD (examples in this post refer to FreeBSD unless otherwise indicated):
$ gcore -c vim.core `pgrep vim`
Because the core file is an ELF file, we can inspect it using the readelf tool:
$ readelf --file-header vim.core
ELF Header:
Magic: 7f 45 4c 46 02 01 01 09 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - FreeBSD
ABI Version: 0
Type: CORE (Core file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x0
Start of program headers: 64 (bytes into file)
Start of section headers: 0 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 79
Size of section headers: 64 (bytes)
Number of section headers: 0
Section header string table index: 0
Note that the ELF file type is “Core file”. Although the ELF specification reserves a specific type value for core files, it unfortunately does not specify anything about the contents of such files, and formal specifications in this area are hard to come by. In practice, both Linux and FreeBSD store most data about the process in a series of ELF notes. Let’s see what notes are present in the core file we just created:
$ readelf --notes vim.core
Notes at offset 0x00001188 with length 0x0000ad18:
Owner Data size Description
FreeBSD 0x00000078 NT_PRPSINFO (prpsinfo structure)
FreeBSD 0x000000e0 NT_PRSTATUS (prstatus structure)
FreeBSD 0x00000200 NT_FPREGSET (floating point registers)
FreeBSD 0x00000018 NT_THRMISC (thrmisc structure)
FreeBSD 0x000000e0 NT_PRSTATUS (prstatus structure)
FreeBSD 0x00000200 NT_FPREGSET (floating point registers)
FreeBSD 0x00000018 NT_THRMISC (thrmisc structure)
FreeBSD 0x00000884 NT_PROCSTAT_PROC (proc data)
FreeBSD 0x0000156c NT_PROCSTAT_FILES (files data)
FreeBSD 0x00008574 NT_PROCSTAT_VMMAP (vmmap data)
FreeBSD 0x00000008 NT_PROCSTAT_GROUPS (groups data)
FreeBSD 0x00000006 NT_PROCSTAT_UMASK (umask data)
FreeBSD 0x000000d4 NT_PROCSTAT_RLIMIT (rlimit data)
FreeBSD 0x00000008 NT_PROCSTAT_OSREL (osreldate data)
FreeBSD 0x0000000c NT_PROCSTAT_PSSTRINGS (ps_strings data)
FreeBSD 0x00000114 NT_PROCSTAT_AUXV (auxv data)
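Each of these entries is an ELF note: a small header giving the owner-name size, descriptor size, and type, followed by the owner name and the descriptor (the actual payload), each padded to 4-byte alignment. Here is a minimal sketch (no error handling; assumes a 64-bit, little-endian core and FreeBSD headers; on Linux the same types come from <elf.h>) that walks the PT_NOTE segment and prints each note:
/* notes_walk.c: find the PT_NOTE segment of a core file and print each
 * note's owner, type, and descriptor size. */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/elf64.h>  /* Elf64_Ehdr, Elf64_Phdr, Elf64_Nhdr; use <elf.h> on Linux */
#include <fcntl.h>
#include <stdio.h>

#define ROUNDUP4(x) (((x) + 3) & ~(size_t)3)

int
main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDONLY);
    struct stat st;

    fstat(fd, &st);
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    Elf64_Ehdr *ehdr = (Elf64_Ehdr *)base;
    Elf64_Phdr *phdr = (Elf64_Phdr *)(base + ehdr->e_phoff);

    for (int i = 0; i < ehdr->e_phnum; i++) {
        if (phdr[i].p_type != PT_NOTE)
            continue;
        char *p = base + phdr[i].p_offset;
        char *end = p + phdr[i].p_filesz;

        while (p < end) {
            Elf64_Nhdr *nhdr = (Elf64_Nhdr *)p;
            char *name = p + sizeof(*nhdr);

            /* The descriptor follows the (padded) owner name. */
            printf("%-8s type=%#x size=%#x\n", name,
                (unsigned)nhdr->n_type, (unsigned)nhdr->n_descsz);
            p = name + ROUNDUP4(nhdr->n_namesz) + ROUNDUP4(nhdr->n_descsz);
        }
    }
    return 0;
}
Run against vim.core, this should print one line per note, matching the readelf output above.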
General Process Information
The first note is a binary blob representing an instance of struct prpsinfo, defined on FreeBSD in /usr/include/sys/procfs.h:
typedef struct prpsinfo {
    int     pr_version;             /* Version number of struct (1) */
    size_t  pr_psinfosz;            /* sizeof(prpsinfo_t) (1) */
    char    pr_fname[PRFNAMESZ+1];  /* Command name, null terminated (1) */
    char    pr_psargs[PRARGSZ+1];   /* Arguments, null terminated (1) */
} prpsinfo_t;
This struct just contains some general information about the process; most of the meaty data is stored elsewhere.
Thread Information
The next notes are NT_PRSTATUS, NT_FPREGSET, and NT_THRMISC, twice each. These note types contain various data about threads, and there is a separate instance of each for every thread in the process. Because this process had two threads when we generated the core file, we see two instances of each type. Each NT_PRSTATUS note contains a binary blob representing an instance of struct prstatus, defined in the same header file:
typedef struct prstatus {
    int         pr_version;     /* Version number of struct (1) */
    size_t      pr_statussz;    /* sizeof(prstatus_t) (1) */
    size_t      pr_gregsetsz;   /* sizeof(gregset_t) (1) */
    size_t      pr_fpregsetsz;  /* sizeof(fpregset_t) (1) */
    int         pr_osreldate;   /* Kernel version (1) */
    int         pr_cursig;      /* Current signal (1) */
    pid_t       pr_pid;         /* Process ID (1) */
    gregset_t   pr_reg;         /* General purpose registers (1) */
} prstatus_t;
Here we have information that allows us to determine the state of each thread, most notably register values and signal number (if applicable). The register set here only contains values for general registers and certain control registers. For example, on FreeBSD running on x86-64, gregset_t is defined as:
struct __reg64 {
    __int64_t   r_r15;
    __int64_t   r_r14;
    __int64_t   r_r13;
    __int64_t   r_r12;
    __int64_t   r_r11;
    __int64_t   r_r10;
    __int64_t   r_r9;
    __int64_t   r_r8;
    __int64_t   r_rdi;
    __int64_t   r_rsi;
    __int64_t   r_rbp;
    __int64_t   r_rbx;
    __int64_t   r_rdx;
    __int64_t   r_rcx;
    __int64_t   r_rax;
    __uint32_t  r_trapno;
    __uint16_t  r_fs;
    __uint16_t  r_gs;
    __uint32_t  r_err;
    __uint16_t  r_es;
    __uint16_t  r_ds;
    __int64_t   r_rip;
    __int64_t   r_cs;
    __int64_t   r_rflags;
    __int64_t   r_rsp;
    __int64_t   r_ss;
};
The values of floating point registers are contained in the NT_FPREGSET notes. The NT_THRMISC notes really only contain the name of each thread:
typedef struct thrmisc {
    char    pr_tname[MAXCOMLEN+1];  /* Thread name, null terminated (1) */
    u_int   _pad;                   /* Convenience pad, 0-filled (1) */
} thrmisc_t;
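To make this concrete, here is a small sketch (FreeBSD/amd64 only, and assuming you already have a pointer to the descriptor of an NT_PRSTATUS note, for example from the note-walking sketch earlier) that interprets the blob as a prstatus_t and prints the thread's signal and a couple of registers:
/* A sketch, assuming a FreeBSD/amd64 core: interpret the descriptor of an
 * NT_PRSTATUS note as a prstatus_t and print the state of the thread. */
#include <sys/types.h>
#include <sys/procfs.h>   /* prstatus_t; pr_reg is a gregset_t (struct reg) */
#include <machine/reg.h>  /* struct __reg64 field names (r_rip, r_rsp, ...) */
#include <stdio.h>

static void
print_thread_state(const void *desc)
{
    const prstatus_t *st = desc;

    printf("pid %d, cursig %d, rip %#lx, rsp %#lx\n",
        (int)st->pr_pid, st->pr_cursig,
        (long)st->pr_reg.r_rip, (long)st->pr_reg.r_rsp);
}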
libprocstat
All of the notes we have seen so far are common between FreeBSD and Linux, although the exact details of the internal structs may vary. However, on FreeBSD, after these notes we see a number of notes whose types start with NT_PROCSTAT. These notes are meant to be opaque to general users and accessible only via libprocstat, a FreeBSD library whose basic API is available here.
On Linux, these notes are replaced with equivalent structs whose contents are transparent to users. Most of the NT_PROCSTAT notes correspond directly with libprocstat API calls. For example, the NT_PROCSTAT_PROC note contains the data that is exposed by procstat_getprocs. This call returns an array of struct kinfo_proc, which is defined on FreeBSD in /usr/include/sys/user.h and contains a wide variety of metadata about a thread, such as signal masks, stack size, and start time.
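As a quick illustration (a sketch with no error handling, much like the larger example below), opening a core file with procstat_open_core and calling procstat_getprocs is enough to recover basic process metadata from the NT_PROCSTAT_PROC note:
/* A sketch: read the NT_PROCSTAT_PROC data from a core file via libprocstat
 * and print a few kinfo_proc fields. */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/socket.h>
#include <sys/sysctl.h>
#include <sys/user.h>
#include <libprocstat.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    struct procstat *ps = procstat_open_core(argv[1]);
    unsigned int count;
    struct kinfo_proc *procs = procstat_getprocs(ps, KERN_PROC_PROC, 0, &count);

    printf("%s: pid %d, %d threads\n",
        procs[0].ki_comm, (int)procs[0].ki_pid, procs[0].ki_numthreads);
    return 0;
}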
Memory Mappings
So far we have enough information to examine the threads in a process and determine what state they are in. However, we still don’t have any information about the process’ memory, so we can’t yet determine the values of any variables (except those stored in registers). FreeBSD provides information about mapped memory segments via the procstat_getvmmap call, while Linux stores it as binary data in a note of type NT_FILE. This information is logically equivalent to the output of procstat -v on FreeBSD or the contents of /proc/<pid>/maps on Linux. The FreeBSD procstat_getvmmap call returns an array of the following struct:
struct kinfo_vmentry {
    int         kve_structsize;         /* Variable size of record. */
    int         kve_type;               /* Type of map entry. */
    uint64_t    kve_start;              /* Starting address. */
    uint64_t    kve_end;                /* Finishing address. */
    uint64_t    kve_offset;             /* Mapping offset in object */
    uint64_t    kve_vn_fileid;          /* inode number if vnode */
    uint32_t    kve_vn_fsid;            /* dev_t of vnode location */
    int         kve_flags;              /* Flags on map entry. */
    int         kve_resident;           /* Number of resident pages. */
    int         kve_private_resident;   /* Number of private pages. */
    int         kve_protection;         /* Protection bitmask. */
    int         kve_ref_count;          /* VM obj ref count. */
    int         kve_shadow_count;       /* VM obj shadow count. */
    int         kve_vn_type;            /* Vnode type. */
    uint64_t    kve_vn_size;            /* File size. */
    uint32_t    kve_vn_rdev;            /* Device id if device. */
    uint16_t    kve_vn_mode;            /* File mode. */
    uint16_t    kve_status;             /* Status flags. */
    int         _kve_ispare[12];        /* Space for more stuff. */
    /* Truncated before copyout in sysctl */
    char        kve_path[PATH_MAX];     /* Path to VM obj, if any. */
};
Let’s create a simple program using the libprocstat API to examine the process’ virtual memory map and look at the output (note: do not use this code in production; for brevity’s sake it does not include any error checking or cleanup):
$ cat vm_print.c
/* These includes are required by libprocstat.h */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/socket.h>
#include <sys/user.h>
#include <libprocstat.h>
#include <stdio.h>
#include <sys/sysctl.h>
static const char *vm_type_names[] = {
    [KVME_TYPE_NONE] = "NONE",
    [KVME_TYPE_DEFAULT] = "DEFAULT",
    [KVME_TYPE_VNODE] = "VNODE",
    [KVME_TYPE_SWAP] = "SWAP",
    [KVME_TYPE_DEVICE] = "DEVICE",
    [KVME_TYPE_PHYS] = "PHYS",
    [KVME_TYPE_DEAD] = "DEAD",
    [KVME_TYPE_SG] = "SG",
    [KVME_TYPE_MGTDEVICE] = "MGTDEVICE",
    [KVME_TYPE_UNKNOWN] = "UNKNOWN",
};

int
main(int argc, char **argv)
{
    struct procstat *ps = procstat_open_core(argv[1]);
    unsigned int n_proc, n_vm, i;
    struct kinfo_proc *procs = procstat_getprocs(ps, KERN_PROC_PROC, 0,
        &n_proc);
    struct kinfo_vmentry *vms = procstat_getvmmap(ps, &procs[0], &n_vm);

    printf("Start\t\tEnd\t\tProtection\tDump Core\tType\t\tPath\n");
    for (i = 0; i < n_vm; ++i) {
        struct kinfo_vmentry *vm = &vms[i];

        printf("%#llx\t%#llx\t%c%c%c\t\t%c\t\t%s\t\t%s\n",
            (unsigned long long)vm->kve_start,
            (unsigned long long)vm->kve_end,
            (vm->kve_protection & KVME_PROT_READ) ? 'r' : '-',
            (vm->kve_protection & KVME_PROT_WRITE) ? 'w' : '-',
            (vm->kve_protection & KVME_PROT_EXEC) ? 'x' : '-',
            (vm->kve_flags & KVME_FLAG_NOCOREDUMP) ? 'n' : 'y',
            vm_type_names[vm->kve_type],
            vm->kve_path);
    }
    return 0;
}
$ clang -o vm_print vm_print.c -lprocstat
$ ./vm_print vim.core | head -n 30
Start End Protection Dump Core Type Path
0x400000 0x695000 r-x n VNODE /usr/local/bin/vim
0x895000 0x8aa000 rw- y VNODE /usr/local/bin/vim
0x8aa000 0x8b8000 rw- y DEFAULT
0x800895000 0x8008b0000 r-x n VNODE /libexec/ld-elf.so.1
0x8008b0000 0x8008b9000 rw- y DEFAULT
0x8008b9000 0x80091b000 rw- y DEFAULT
0x80091b000 0x800924000 rw- y DEFAULT
0x800ab0000 0x800ab2000 rw- y DEFAULT
0x800ab2000 0x800be5000 r-x n VNODE /usr/local/lib/libX11.so.6.3.0
0x800be5000 0x800de5000 --- n DEFAULT
0x800de5000 0x800deb000 rw- y VNODE /usr/local/lib/libX11.so.6.3.0
0x800deb000 0x800dfd000 r-x n VNODE /usr/local/lib/libXpm.so.4.11.0
0x800dfd000 0x800ffd000 --- n DEFAULT
0x800ffd000 0x800ffe000 rw- y VNODE /usr/local/lib/libXpm.so.4.11.0
0x800ffe000 0x80105a000 r-x n VNODE /usr/local/lib/libXt.so.6.0.0
0x80105a000 0x80125a000 --- n DEFAULT
0x80125a000 0x801260000 rw- y VNODE /usr/local/lib/libXt.so.6.0.0
0x801260000 0x801687000 r-x n VNODE /usr/local/lib/libgtk-x11-2.0.so.0.2400.27
0x801687000 0x801886000 --- n DEFAULT
0x801886000 0x801893000 rw- y VNODE /usr/local/lib/libgtk-x11-2.0.so.0.2400.27
0x801893000 0x801895000 rw- y NONE
0x801895000 0x801946000 r-x n VNODE /usr/local/lib/libgdk-x11-2.0.so.0.2400.27
0x801946000 0x801b45000 --- n DEFAULT
0x801b45000 0x801b4b000 rw- y VNODE /usr/local/lib/libgdk-x11-2.0.so.0.2400.27
0x801b4b000 0x801b64000 r-x n VNODE /lib/libthr.so.3
0x801b64000 0x801d63000 --- n DEFAULT
0x801d63000 0x801d65000 rw- y VNODE /lib/libthr.so.3
0x801d65000 0x801d70000 rw- y DEFAULT
0x801d70000 0x801d8f000 r-x n VNODE /usr/local/lib/libgdk_pixbuf-2.0.so.0.3100.2
Now we know the addresses of the process’ mapped memory segments and, where appropriate, which file the segments were mapped from. Suppose that we’re interested in actually reading data from these segments. Both Linux and FreeBSD store the contents of (some of) these as ELF file segments. Let’s take a look at the ELF program headers in our core file, which provide information about the file’s segments:
$ readelf --segments vim.core | head -n 40
Elf file type is CORE (Core file)
Entry point 0x0
There are 79 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
NOTE 0x0000000000001188 0x0000000000000000 0x0000000000000000
0x000000000000ad18 0x0000000000000000 R 4
LOAD 0x000000000000c000 0x0000000000895000 0x0000000000000000
0x0000000000015000 0x0000000000015000 RW 1000
LOAD 0x0000000000021000 0x00000000008aa000 0x0000000000000000
0x000000000000e000 0x000000000000e000 RW 1000
LOAD 0x000000000002f000 0x00000008008b0000 0x0000000000000000
0x0000000000009000 0x0000000000009000 RW 1000
LOAD 0x0000000000038000 0x00000008008b9000 0x0000000000000000
0x0000000000062000 0x0000000000062000 RW 1000
LOAD 0x000000000009a000 0x000000080091b000 0x0000000000000000
0x0000000000009000 0x0000000000009000 RW 1000
LOAD 0x00000000000a3000 0x0000000800ab0000 0x0000000000000000
0x0000000000002000 0x0000000000002000 RW 1000
LOAD 0x00000000000a5000 0x0000000800de5000 0x0000000000000000
0x0000000000006000 0x0000000000006000 RW 1000
LOAD 0x00000000000ab000 0x0000000800ffd000 0x0000000000000000
0x0000000000001000 0x0000000000001000 RW 1000
LOAD 0x00000000000ac000 0x000000080125a000 0x0000000000000000
0x0000000000006000 0x0000000000006000 RW 1000
LOAD 0x00000000000b2000 0x0000000801886000 0x0000000000000000
0x000000000000d000 0x000000000000d000 RW 1000
LOAD 0x00000000000bf000 0x0000000801b45000 0x0000000000000000
0x0000000000006000 0x0000000000006000 RW 1000
LOAD 0x00000000000c5000 0x0000000801d63000 0x0000000000000000
0x0000000000002000 0x0000000000002000 RW 1000
LOAD 0x00000000000c7000 0x0000000801d65000 0x0000000000000000
0x000000000000b000 0x000000000000b000 RW 1000
LOAD 0x00000000000d2000 0x0000000801f8f000 0x0000000000000000
0x0000000000001000 0x0000000000001000 RW 1000
LOAD 0x00000000000d3000 0x00000008021de000 0x0000000000000000
0x0000000000003000 0x0000000000003000 RW 1000
The first segment is of type NOTE, and it contains all of the note data we’ve been looking at so far. However, there are also many segments with ELF type LOAD. Let’s look specifically at the starting virtual memory addresses for the LOAD segments:
$ readelf --segments vim.core | awk '$1~/LOAD/ { sub(/x0*/, "x", $3); print $3 }' | head -n 20
0x895000
0x8aa000
0x8008b0000
0x8008b9000
0x80091b000
0x800ab0000
0x800de5000
0x800ffd000
0x80125a000
0x801886000
0x801b45000
0x801d63000
0x801d65000
0x801f8f000
0x8021de000
0x802426000
0x80272e000
0x80293a000
0x802b41000
0x802d58000
Which segments are included?
Note that the above addresses all correspond to virtual memory segments reported by the libprocstat API. However, some segments returned by libprocstat are missing in this list, such as those starting at 0x800895000, 0x800be5000, and 0x801893000. On FreeBSD, there are several criteria that mapped memory segments must satisfy in order to be included in a core file:
- They must have at least one of read, write, and execute permissions. If ELF legacy coredump mode is enabled (via the elf64_legacy_coredump or elf32_legacy_coredump sysctl), then the segment must have both read and write permissions.
- They must not have been marked as exempt from core dumps. Users can mark memory maps as exempt from core dumps using the MAP_NOCORE flag with mmap or the MADV_NOCORE flag with madvise (see the sketch after this list).
- They must not be submaps. In practice, this means that kernel submaps, such as signal trampolines, will be excluded.
- They must be backed by physical memory (either on disk or in volatile memory). For instance, segments backed by devices or files in a procfs filesystem will not be included in a core file.
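The exemption in the second criterion is under the program's control. A minimal sketch of both approaches on FreeBSD (on Linux the closest analogue is madvise with MADV_DONTDUMP):
/* A sketch of excluding a mapping from core dumps on FreeBSD: pass
 * MAP_NOCORE to mmap at allocation time, or mark an existing region with
 * madvise(MADV_NOCORE). */
#include <sys/mman.h>
#include <stddef.h>

#define REGION_SIZE (1024 * 1024)

void *
alloc_nocore_buffer(void)
{
    /* Excluded from any future core file from the start. */
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE | MAP_NOCORE, -1, 0);

    /* Equivalent after-the-fact marking of an existing mapping. */
    madvise(p, REGION_SIZE, MADV_NOCORE);
    return p;
}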
Executable segments from libraries and other binary files are typically mapped with the MAP_ENTRY_NOCOREDUMP flag (the kernel-internal equivalent of MAP_NOCORE), so they are not included in core dumps. Data from these segments is sometimes of interest to libraries such as libthread_db and libunwind, but these segments can be quite large and their data is read-only, so if needed we can just read the data directly from the binary files on disk. The bad news here is that if these files have been modified since the core file was created, then we won’t be able to read data (or worse, will read incorrect data) from their segments; the good news is that these segments only contain executable instructions and not variable data, so missing them won’t prevent us from determining the values of any variables.
Reading from process memory
For memory segments stored in the core file, the Offset and FileSiz fields in the program header tell us where to actually read the data for the mapping. Per the ELF specification, the Offset field indicates the offset within the ELF file where the segment's contents start, and the FileSiz field indicates how many bytes the segment occupies within the file (in the output above, FileSiz and MemSiz are equal for every LOAD segment, since each mapping is dumped in full).
For example, suppose we wanted to read a variable whose address in the original process was 0x8008b0100. First, we have to find the virtual memory segment containing the data we want based on start and end addresses. In this case it is the segment 0x8008b0000-0x8008b9000. Then we have to determine where to find the contents of that segment, which we can read from the corresponding ELF segment with the matching start address. This segment starts at file offset 0x2f000, and the position of the desired memory within that segment is 0x8008b0100 - 0x8008b0000, or 0x100. After adding these together, we get an overall position of 0x2f100 in the core file, which is where we can find this variable’s value.
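This lookup is mechanical enough to sketch in a few lines. Assuming the core file has been mmap'ed as in the note-walking sketch earlier, the following hypothetical helper finds the LOAD segment containing an address and returns a pointer to its bytes within the core file:
/* A sketch of the address-to-file-offset computation described above: find
 * the PT_LOAD segment containing a virtual address and return a pointer to
 * the corresponding bytes in the mmap'ed core file (or NULL if the address
 * was not dumped). */
#include <sys/types.h>
#include <sys/elf64.h>  /* use <elf.h> on Linux */
#include <stddef.h>
#include <stdint.h>

static const void *
core_resolve_addr(const char *base, uint64_t addr)
{
    const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *)base;
    const Elf64_Phdr *phdr = (const Elf64_Phdr *)(base + ehdr->e_phoff);

    for (int i = 0; i < ehdr->e_phnum; i++) {
        if (phdr[i].p_type != PT_LOAD)
            continue;
        if (addr >= phdr[i].p_vaddr &&
            addr < phdr[i].p_vaddr + phdr[i].p_memsz)
            /* file offset = segment offset + offset within segment */
            return base + phdr[i].p_offset + (addr - phdr[i].p_vaddr);
    }
    return NULL;    /* address not present in the core file */
}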
Memory segments in other files
As mentioned earlier, some memory segments are not actually included in the core file. In the case of executable segments, we can instead read the relevant data directly from the binary file. On FreeBSD, the kve_path field of struct kinfo_vmentry gives us the path of the file to read from and the kve_offset field tells us the offset of the segment within that file. On Linux, each entry in the NT_FILE note has corresponding fields. Given these pieces of information, we can apply the same logic as above to find the target memory location.
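The arithmetic is the same as before, just applied to the backing file instead of the core. A sketch using the FreeBSD libprocstat fields mentioned above (on Linux you would take the equivalent values from the NT_FILE note entry):
/* A sketch: for a mapping that was not dumped, compute where in the backing
 * file (kve_path) the bytes for a given virtual address live. */
#include <sys/param.h>
#include <sys/user.h>   /* struct kinfo_vmentry */
#include <stdint.h>

static uint64_t
backing_file_offset(const struct kinfo_vmentry *vm, uint64_t addr)
{
    /* mapping's offset within the file + the address's offset within the mapping */
    return vm->kve_offset + (addr - vm->kve_start);
}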
Limitations
ELF core files store memory mapping information as program headers, and the field in the ELF file header indicating the number of program headers is stored as a 16-bit unsigned integer, even in the 64-bit version of ELF. Therefore any program with more than (2^16) - 1 (65535) memory maps would cause an overflow in this field. Older core dump implementations did not account for this, and as a result core files appeared to be missing many of their memory segments. Newer versions of the Linux kernel get around this by setting this field (e_phnum) to 0xffff for any map count that would overflow, and then storing the actual count in the sh_info field (stored as a 32-bit unsigned integer) of the first section header. However, the problem still persists in FreeBSD.
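For core files that do use this extended numbering (i.e. newer Linux kernels), recovering the real count looks roughly like the sketch below; libelf's elf_getphdrnum() is intended to perform the same dance for you:
/* A sketch of reading the real program header count from an mmap'ed ELF
 * file: when e_phnum holds the sentinel PN_XNUM (0xffff), the true count is
 * in the sh_info field of section header 0. */
#include <sys/types.h>
#include <sys/elf64.h>  /* use <elf.h> on Linux */
#include <stdint.h>

#ifndef PN_XNUM
#define PN_XNUM 0xffff
#endif

static uint64_t
elf_phdr_count(const char *base)
{
    const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *)base;

    if (ehdr->e_phnum != PN_XNUM)
        return ehdr->e_phnum;

    /* Extended numbering: consult the first section header. */
    const Elf64_Shdr *shdr0 = (const Elf64_Shdr *)(base + ehdr->e_shoff);
    return shdr0->sh_info;
}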
Signal Information
Another very important part of process state is information about signals sent to the process. When a signal is caught by a user-supplied signal handler, the stack trace will make it clear that a signal is present because the stack will contain a frame for the signal trampoline. However, when the operating system initiates a core dump because of an unhandled signal such as SIGABRT or SIGSEGV, the stack trace will provide no indication of a signal. Linux core files provide a note of type NT_SIGINFO for each thread; each of these notes is an instance of siginfo_t, which contains the signal number, code, and additional signal-specific details (such as the faulting address in the case of SIGSEGV).
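A sketch of what a tool might do with such a note on Linux, assuming a pointer to the note's descriptor:
/* A sketch, assuming a Linux core: the descriptor of an NT_SIGINFO note is
 * a siginfo_t, so we can report the signal number, code, and, for faulting
 * signals such as SIGSEGV, the faulting address. */
#include <signal.h>
#include <stdio.h>

static void
print_siginfo(const void *desc)
{
    const siginfo_t *si = desc;

    printf("signal %d, code %d, fault address %p\n",
        si->si_signo, si->si_code, si->si_addr);
}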
Signals on FreeBSD
FreeBSD does not provide such detailed information. Instead, as shown earlier, the struct prstatus provided for each thread includes a field int pr_cursig whose value represents the number of the signal received by the process (or 0 if there is no signal present). It does not provide any further information about the signal, though. This field has the same value across all threads, so if a process dumps core due to an unhandled SIGABRT, for instance, every single struct prstatus will have its pr_cursig field set to SIGABRT. However, if a specific thread causes a signal (for example, a thread dereferences a NULL pointer, causing a SIGSEGV), then that thread will be the first one listed in the core file.
Use for Debugging
The information discussed so far is typically sufficient for traditional debugging purposes, i.e. determining the cause of a specific fault and inspecting the state of a process at the time of that fault. Some work is required to synthesize this information into a useful form, such as unwinding stack frames and determining the values of specific variables, but in terms of the raw information that needs to be stored in a core file, nothing else is really needed. The instructions for synthesizing the information described above are typically contained in DWARF debug information, either in the original executable file or in a standalone file containing only this debug information.
Comparison with Backtrace Snapshots
The Backtrace snapshot format is extensible and, among other things, contains detailed callstacks (with compiler optimizations accounted for), objects such as variables, threads, and system information. Our debuggers try to determine which regions of memory are required to get to the root cause and are much more selective about what to persist in a snapshot. In contrast, core files contain raw data such as register values and memory contents. The additional information in Backtrace files enables various forms of large-scale trend analysis across multiple crashes. Let’s compare the time and memory used to generate each type of file for a Chromium process running on Linux on my laptop:
|                    | Time (s) | Memory (kB) |
|--------------------|----------|-------------|
| Coredump           | 58.73    | 468825      |
| Backtrace Snapshot | 1.65     | 365         |
The Backtrace file was generated more than 35 times faster and occupies less than 0.1% of the disk space of the core file, despite also analyzing DWARF debug information on the fly, evaluating variable values, and performing various types of analysis on the resulting data. For instance, in addition to information about the traced process, the Backtrace snapshot also collects system-level data such as operating system version, overall system memory usage, and other data that could be useful to engineers analyzing a crash after the fact; identifies common classes of crashes such as NULL pointer dereferences and stack overflows; and annotates specific variables that could be of special interest to engineers.
Another advantage of the Backtrace file format is that it is entirely self-contained and can be viewed on any machine with the Backtrace tools, unlike core files, which typically must be viewed in the same environment in which they were created. The snapshot can also have all sorts of blobs attached to it, so engineers can recreate the faulting environment's assets in one command.