We are happy to announce the first release of Coresnap, a suite of tools that intercept and aggregate coredumps as they occur on Linux and FreeBSD systems. With Coresnap, both operations and software engineers benefit from a holistic view of the state of faults on their systems and across their environments. Backtrace assistive debugging analyzes these dumps to make sure the state most relevant to the fault is not missed by incident responders and engineers. A concise, annotated version of each dump is generated that requires orders of magnitude less disk space, and centralized aggregation means that local disk space is not wasted on coredumps.
Read on to learn more about Coresnap and the benefits of automated dump analysis.
Demo
We have installed the coresnapd package with:
sudo yum install backtrace-coresnapd
A configuration file for the object store is dropped in /usr/local/etc/coroner.cf.
[universe]
name = backtrace
write = https://faults.backtrace.io:6098
read = faults.backtrace.io:4097
read.ssl.enabled = true
[token]
blackhole = e5f1a3a23f756f89cfd9f363647291fd9ff1d43f179e959d3aa2aed78af81663
Finally, the service is enabled so that all faults are analyzed by Backtrace and routed to the object store at faults.backtrace.io.
$ systemctl start coresnapd
$ systemctl status coresnapd
● coresnapd.service - Backtrace coredump aggregation service
Loaded: loaded (/lib/systemd/system/coresnapd.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2016-04-24 15:33:10 EDT; 1h 56min ago
Process: 777 ExecStart=/opt/backtrace/sbin/coresnapd $CORESNAPD_OPTS (code=exited, status=0/SUCCESS)
Main PID: 787 (coresnapd)
CGroup: /system.slice/coresnapd.service
└─787 /opt/backtrace/sbin/coresnapd
Apr 24 15:33:10 broadwell systemd[1]: Starting Backtrace coredump aggregation service...
Apr 24 15:33:10 broadwell systemd[1]: Started Backtrace coredump aggregation service.
For the purposes of the demo, the following simple program is run. The jemalloc
memory allocator is being used in this example.
#include <stdlib.h>
#include <string.h>

/* Not shown in the original snippet; declared here so the example compiles. */
static char buffer[8];

static void
program(void)
{
	void **r;
	char *poop;

	r = malloc(sizeof(void *));
	*r = buffer;
	free(r);

	/* The freed 8-byte region is likely reused for this allocation. */
	poop = malloc(8);
	strcpy(poop, "AAAAAA");
	free(poop);

	/* Use-after-free: the memory r points to was freed above and its
	 * contents have since been overwritten, so this write goes through
	 * a garbage pointer. */
	**(char **)r = '0';
	return;
}
This is a simple use-after-free bug in which a pointer stored in a freed region of memory is used. There are many different manifestations of this form of bug; this is just one. Here, the freed allocation that held the pointer is likely reused for the second malloc, so the strcpy clobbers the stored pointer and the final write dereferences garbage. The program is run and crashes immediately on start-up.
$ ./simple
Segmentation fault (core dumped)
This fault is picked up and processed by Coresnap.
Apr 24 17:39:31 broadwell coresnap[4616]: Crash dump archived in /var/coresnap/archive/pending/d1cbec97-9ee2-413b-8663-3ad797f40a08
Apr 24 17:39:31 broadwell coresnap[4617]: Executing slave: /opt/backtrace/bin/ptrace --faulted -o/var/coresnap/archive/sending/d1cbec97-9ee2-413b-8663-3ad797f40a08 --kv=coresnap.object:d1cbec97-9ee2-413b-8663-3ad797f40a08 --resource=/var/coresnap/archive/assets/d1cbec97-9ee2-413b-8663-3ad797f40a08 --core /var/coresnap/archive/pending/d1cbec97-9ee2-413b-8663-3ad797f40a08 /home/sbahra/projects/heap/src/simple
Apr 24 17:39:31 broadwell coresnap[4618]: Executing slave: /opt/backtrace/bin/coroner -c /usr/local/etc/coroner.cf put blackhole blackhole /var/coresnap/archive/sending/d1cbec97-9ee2-413b-8663-3ad797f40a08
Apr 24 17:39:36 broadwell coresnap[4459]: crash d1cbec97-9ee2-413b-8663-3ad797f40a08 processed in 6 seconds
In this case, our object store is integrated with Slack and JIRA. A JIRA ticket is created since this is a fault that has never been seen before. In addition, the errors are aggregated and reported in Slack in real time.
{% img center http://backtrace.io/blog/images/slack-example.png %}
This group of faults is also visible in the web console.
{% img center http://backtrace.io/blog/images/simple.png %} {% img center http://backtrace.io/blog/images/simple-1.png %}
The coresnap command is used below to view the state of fault processing on the current system.
$ coresnap list -a simple
...
sending/2d022186-8b4... -sa simple Sun Apr 24 18:05:21 2016 18.81kB
sending/815cc75a-8f1... -sa simple Sun Apr 24 18:05:21 2016 18.77kB
sending/2fa2905e-f25... -sa simple Sun Apr 24 18:05:24 2016 18.82kB
pending/ec023b58-84e... c-a simple Sun Apr 24 18:05:24 2016 20.30mB
pending/b8d60d72-820... c-a simple Sun Apr 24 18:05:24 2016 20.30mB
pending/93964307-843... c-a simple Sun Apr 24 18:05:24 2016 20.30mB
pending/808f6e25-dda... c-a simple Sun Apr 24 18:05:24 2016 20.30mB
pending/ced4e78d-3be... c-a simple Sun Apr 24 18:05:24 2016 20.30mB
pending/f5ee226a-5a2... c-a simple Sun Apr 24 18:05:24 2016 20.30mB
pending/a28e2761-d7c... c-a simple Sun Apr 24 18:05:24 2016 20.30mB
[pending]/49O/994.93mB [sending]/30O/563.54kB
The sending/2d022186-8b4... dump is currently in the process of being sent to the object store. The system has a Backtrace snapshot and metadata for it stored locally (indicated by s and a, respectively); this snapshot uses 18.81kB of disk space. The pending objects are raw coredumps that are awaiting processing. After running a couple hundred instances of this program, the system has 49 coredumps pending processing, requiring 994MB of disk space, and 30 Backtrace snapshots being sent to the object store, requiring 563KB.
A command-line client is also available to query the state of faults in your system. For example, below we request the latest crash in the simple application.
$ coroner list blackhole --filter=application=simple -R -H1 -i1
[b3e85c13b23bfbe79702e5dc2c4102ae5615d81ba4df1608b44791a0e6279f80]
Date: Sun Apr 24 17:32:11 2016 - Sun Apr 24 18:05:26 2016
Occurrences: 20 (over 0 days)
Mar 25 : █ : Apr 24
Attributes:
hostname (1 buckets)
application (1 buckets)
simple 20 100.00% ███████████████
Classification:
memory.write (20 buckets)
use-after-free (19 buckets)
null (1 buckets)
Frames:
thread_fn
Objects:
[431991dbda48453fbddf92d8700611ba] (Sun Apr 24 18:05:26 2016)
Classification: memory.write use-after-free
Attributes: hostname=broadwell application=simple
The coroner get sub-command can be used to download and view snapshots. Let’s download and view the 431991dbda48453fbddf92d8700611ba instance of this fault.
$ coroner get blackhole 431991dbda48453fbddf92d8700611ba
This opens the snapshot in the Backtrace snapshot viewer. Below is a screenshot of the 431991dbda48453fbddf92d8700611ba snapshot.
{% img center http://backtrace.io/blog/images/hydra-simple.png %}
Below is another example of a snapshot that involves heap corruption in the FreeBSD kernel.
{% img center http://backtrace.io/blog/images/hydra-freebsd-1.png %}
Overview
Installation
Many applications lack infrastructure for post-mortem invocation of tools. Coresnap solves this problem by integrating with the operating system for crash handling. The Coresnap installation process consists of three simple steps:
- Install the package.
- Specify a configuration file for object store archival.
- Enable the service.
Refer to the documentation for additional information.
Archival
On Linux systems, coresnap integrates with core_pattern so that the Linux kernel automatically routes coredumps to the Coresnap archive tool over a pipe. The archive tool applies various consistency checks to ensure that the file metadata corresponds to the coredump and that disk resources are not exhausted (at block-level granularity). It also captures additional context, such as the state of the system and process resources (gathered through /proc), at the time of the fault. Dumps are written out as the coresnap user and group, along with a tar archive containing additional assets. All dumps are written out as sparse files in order to minimize disk activity.
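To make the mechanism concrete, below is a minimal sketch of what a core_pattern pipe handler can look like: the kernel executes the handler with the coredump streamed on stdin, and the handler writes it out as a sparse file by seeking over all-zero blocks. The handler path, archive directory, and block size are illustrative assumptions; this is not Coresnap's implementation.

```c
/*
 * Hypothetical core_pattern pipe handler, registered with something like:
 *   echo '|/usr/local/bin/core-handler %p' > /proc/sys/kernel/core_pattern
 * The kernel runs the handler with the coredump streamed on stdin.
 * Paths and sizes are illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK 4096

static int all_zero(const char *p, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (p[i] != 0)
			return 0;
	return 1;
}

int main(int argc, char **argv)
{
	char path[256], buf[BLOCK];
	ssize_t r;
	int fd;

	snprintf(path, sizeof(path), "/var/coresnap/archive/pending/%s",
	    argc > 1 ? argv[1] : "core");
	fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
	if (fd == -1)
		return 1;

	/* Copy stdin (the coredump) to disk, skipping all-zero blocks so the
	 * resulting file is sparse and touches fewer disk blocks. */
	while ((r = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
		if (all_zero(buf, (size_t)r)) {
			lseek(fd, r, SEEK_CUR);	/* leave a hole instead of data */
		} else if (write(fd, buf, r) != r) {
			return 1;
		}
	}

	/* Make sure the file size accounts for a trailing hole. */
	if (ftruncate(fd, lseek(fd, 0, SEEK_CUR)) == -1)
		return 1;
	close(fd);
	return 0;
}
```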
The coresnapd daemon is notified of dumps once they are committed to disk. The daemon sits idle in the absence of dump generation. Once a dump is generated, it begins processing the dump in a journaled fashion. Dumps are routed to our snapshot tool to generate a minified version of the state of the process at the time of the fault, and then to our object store client so that errors are rolled up into a centralized console. Faults are grouped according to the objects that are relevant to the fault; in the case of a crash, this is a normalized form of the callstack.
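As a rough sketch of that grouping idea, a deduplication key for a crash can be derived by stripping instruction offsets from each frame and hashing the remaining function names, so two crashes at slightly different offsets along the same path land in the same group. The normalization Backtrace actually performs is more involved; the FNV-1a hash and frame format below are assumptions for illustration.

```c
/*
 * Illustrative callstack deduplication: hash the frame names with
 * instruction offsets stripped. Not Backtrace's actual normalization.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t
callstack_fingerprint(const char **frames, size_t n_frames)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a offset basis */

	for (size_t i = 0; i < n_frames; i++) {
		const char *f = frames[i];
		size_t len = strcspn(f, "+");	/* drop "+0x2f"-style offsets */

		for (size_t j = 0; j < len; j++) {
			h ^= (uint8_t)f[j];
			h *= 0x100000001b3ULL;	/* FNV-1a prime */
		}
	}

	return h;
}

int main(void)
{
	/* Two crashes through the same functions, at different offsets. */
	const char *a[] = { "strcpy+0x10", "program+0x2f", "main+0x11" };
	const char *b[] = { "strcpy+0x10", "program+0x33", "main+0x11" };

	printf("%llx\n%llx\n",
	    (unsigned long long)callstack_fingerprint(a, 3),
	    (unsigned long long)callstack_fingerprint(b, 3));
	return 0;
}
```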
All input and output files can be removed at various stages of the pipeline to minimize disk utilization. For example, a snapshot file is likely to be sufficient for root cause investigation of a fault. Once a snapshot file is generated, the relevant coredump is purged or archived according to the various policies supported by coresnapd. Processing is journaled so that coresnapd is able to resume coredump processing in the face of failures.
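One way to picture such a journaled, resumable pipeline, using the pending/ and sending/ directories seen in the demo, is to treat on-disk state itself as the journal: each stage is idempotent and its completion is visible in the filesystem, so a restarted daemon can simply rescan the archive and continue. The sketch below is an assumption-laden illustration, not coresnapd's implementation.

```c
/*
 * Sketch of journaled, resumable dump processing over the archive layout
 * seen above (pending/ holds raw coredumps, sending/ holds snapshots).
 * Progress is recorded as filesystem state. Illustrative only.
 */
#include <dirent.h>
#include <stdio.h>
#include <unistd.h>

#define PENDING "/var/coresnap/archive/pending"
#define SENDING "/var/coresnap/archive/sending"

static void process_dump(const char *id)
{
	char core[512], snap[512];

	snprintf(core, sizeof(core), PENDING "/%s", id);
	snprintf(snap, sizeof(snap), SENDING "/%s", id);

	/* Stage 1: generate a snapshot from the raw coredump unless one is
	 * already present. The stage is idempotent, so an interrupted run is
	 * simply repeated on resume. */
	if (access(snap, F_OK) == -1) {
		/* ... invoke snapshot generation: core -> snap ... */
	}

	/* Stage 2: once the snapshot is on disk, the raw coredump can be
	 * purged or archived according to the configured policy. */
	if (access(snap, F_OK) == 0)
		unlink(core);

	/* Stage 3: upload snap to the object store and unlink it once the
	 * upload is acknowledged. */
}

int main(void)
{
	DIR *d = opendir(PENDING);
	struct dirent *e;

	if (d == NULL)
		return 1;

	/* On start-up, resume any dumps left over from a previous run. */
	while ((e = readdir(d)) != NULL) {
		if (e->d_name[0] != '.')
			process_dump(e->d_name);
	}

	closedir(d);
	return 0;
}
```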
Analysis
Backtrace brings automation to incident response and investigation. Currently, our debugging technology analyzes the post-mortem state of faulted applications so that crucial yet easy-to-miss signals are not ignored by responders. Backtrace leaps into action when an application has failed and has no impact on run-time performance. Our analysis results in:
- Annotations on points of interest such as variables and other process state.
- Classification for prioritization and setting the framework of investigation.
- Deduplication so that faults are grouped according to uniqueness.
At the time of writing, Backtrace detects many forms of heap corruption, various security issues (including malware), and architectural constraint violations, and it disambiguates faults to include additional information about memory regions relevant to the fault. We also perform various tasks such as alias detection, so that all reachable variables across all threads that are relevant to the fault are highlighted.
Example classifiers include security, assert, machine-check, double-free, invalid-free, invalid-pointer and more. The goal of our technology is to bring domain expert knowledge to all engineers. Various heuristics associate dumps with a quality score that corresponds to the comprehensibility of the dump. This allows engineers to focus their investigation on dumps that contain more information.
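To give a flavor of what such a heuristic could look like, the toy scoring function below rewards dumps whose faulting callstack is fully symbolicated, whose variables were recoverable, and which were not truncated. The inputs and weights are invented for illustration; Backtrace's actual heuristics are not described here.

```c
#include <stdio.h>

/* Toy quality-score inputs; the fields and weights are invented. */
struct dump_stats {
	unsigned int frames;		/* frames in the faulting callstack */
	unsigned int symbolicated;	/* frames resolved to function names */
	unsigned int variables;		/* variables recovered from the dump */
	int truncated;			/* non-zero if the dump was cut short */
};

/* Returns a score in [0, 100]; higher means easier to investigate. */
static unsigned int
dump_quality(const struct dump_stats *s)
{
	unsigned int score = 0;

	if (s->frames > 0)
		score += 60 * s->symbolicated / s->frames;	/* symbol coverage */
	if (s->variables > 0)
		score += 30;					/* state to inspect */
	if (!s->truncated)
		score += 10;					/* complete dump */

	return score;
}

int main(void)
{
	struct dump_stats good = { 12, 12, 40, 0 };
	struct dump_stats poor = { 12, 3, 0, 1 };

	printf("good=%u poor=%u\n", dump_quality(&good), dump_quality(&poor));
	return 0;
}
```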
Our tooling analyzes a coredump to output a minified, annotated snapshot that contains all variables and the results of analysis. The snapshot may also contain additional state such as system statistics, directory trees, the state of the kernel stack, and more.