UMASH: a fast and universal enough hash

We accidentally a whole hash function… but we had a good reason! Our MIT-licensed UMASH hash function is a decently fast non-cryptographic hash function that guarantees a worst-case bound on the probability of collision between any two inputs generated independently of the UMASH parameters. On the 2.5 GHz Intel 8175M servers that power Backtrace’s hosted offering, UMASH computes a 64-bit hash for short cached inputs of up to 64 bytes in 9-22 ns, and for longer ones at up to 22 GB/s, while guaranteeing that two distinct inputs of at most \(s\) bytes collide with probability less than \(\lceil s / 2048 \rceil \cdot 2^{-56}\)....
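
To put that bound in perspective, here is simple arithmetic on the formula quoted above (not an additional claim from the post): for inputs of at most 2 KiB,
\[
\lceil 2048 / 2048 \rceil \cdot 2^{-56} = 2^{-56} \approx 1.4 \times 10^{-17},
\]
and even for megabyte-long inputs,
\[
\lceil 2^{20} / 2048 \rceil \cdot 2^{-56} = 2^{9} \cdot 2^{-56} = 2^{-47} \approx 7.1 \times 10^{-15},
\]
i.e., still comfortably below one collision in a trillion pairs.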

Triplefault: Djordje Todorovic on LLVM and debug information

Join us July 23rd at 4PM ET (8PM UTC), as Djordje Todorovic joins me to discuss the state of debug information in LLVM and the work that is happening to improve it. The event will be livestreamed at Triplefault. LLVM is a popular compiler backend that is the basis for toolchains across the Apple ecosystem, popular game consoles, and more. Compared to compilers such as GCC, the quality of debug information in the presence of optimizations has historically been very poor for LLVM-based compilers....

How hard is it to guide test case generators with branch coverage feedback?

To make a long story short, it’s almost easy, now that tracing is available on commodity hardware, especially with a small library to handle platform-specific setup. Here’s a graph of the empirical distribution functions for the number of calls into OpenSSL 1.0.1f it takes for Hypothesis to find Heartbleed, given a description of the format for Heartbeat requests (“grammar”), and the same with additional branch coverage feedback (“bts”, for Intel Branch Trace Store)....
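
The post is about Hypothesis and hardware tracing, but the feedback loop itself is simple enough to sketch. The snippet below only illustrates that loop, with placeholder stubs of my own (new_branches_hit and mutate) standing in for the real coverage measurement (Intel BTS) and grammar-aware generation; it is not code from the post.

#include <stdlib.h>
#include <string.h>

/* Placeholder: in the real setup this would run the target (e.g. an
 * OpenSSL Heartbeat handler) under branch tracing and return how many
 * previously unseen branches the input exercised. */
static size_t
new_branches_hit(const unsigned char *input, size_t len)
{
    (void)input;
    (void)len;
    return rand() % 2; /* stand-in for real coverage measurement */
}

/* Placeholder: a grammar-aware generator would go here; a bit flip is
 * enough to show the shape of the loop. */
static void
mutate(unsigned char *input, size_t len)
{
    input[rand() % len] ^= 1u << (rand() % 8);
}

int
main(void)
{
    unsigned char best[64] = "heartbeat request seed";
    unsigned char cand[64];

    for (int i = 0; i < 100000; i++) {
        memcpy(cand, best, sizeof(cand));
        mutate(cand, sizeof(cand));

        /* The feedback: keep a candidate only when it reaches new
         * branches, biasing generation toward unexplored code. */
        if (new_branches_hit(cand, sizeof(cand)) > 0)
            memcpy(best, cand, sizeof(best));
    }
    return 0;
}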

Talk: What every programmer should know about malloc

At CppCon 2019, Samy Al Bahra, Paul Khuong and Hannes Frederic Sowa collaborated on presenting the foundations of malloc and the tips that helped them improve performance in the real world. Abstract Memory allocators are a critical part of the run-time system for languages such as C++. Over the last few decades, the requirements for general-purpose memory allocation have become significantly more complex, with allocators evolving to handle workloads that vary by process age, concurrency, allocation patterns, object sizes, access patterns and more....

Talk: Abusing your memory model for fun and profit

The most efficient concurrent C++ data structures used in the wild today usually achieve breakneck performance by either constraining their workload or constraining correctness to a particular memory model. At CppCon 2019, Paul Khuong and Samy Al Bahra collaborated on a talk sharing success stories of workload specialization in real-world systems over the last few years. Abstract The most efficient concurrent C++ data structures used in the wild today usually achieve breakneck performance by either constraining their workload or constraining correctness to a particular memory model....

A Workflow for Memory Error Checking

Memory error detection technologies such as Valgrind and AddressSanitizer are a critical part of the toolkit for programmers doing manual memory management. These tools allow you to detect memory access errors, memory management bugs, race conditions, and more in your programs. They are heavily used at companies that develop high-quality software, such as Google, Facebook, Mozilla and more. A general problem with all these tools is that there is no easy way to aggregate their errors and act on them....
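
For a concrete (illustrative, not from the post) example of the kind of bug these tools catch, the program below contains an off-by-one heap write and a leak; it usually appears to run fine on its own, but AddressSanitizer (compile with -fsanitize=address) or Valgrind's memcheck will report both.

#include <stdlib.h>

int
main(void)
{
    int *values = malloc(8 * sizeof(*values));
    if (values == NULL)
        return 1;

    /* Off-by-one: valid indices are 0..7, so i == 8 writes one element
     * past the end of the allocation (a heap buffer overflow). */
    for (int i = 0; i <= 8; i++)
        values[i] = i;

    /* The allocation is never freed, so leak checking fires as well. */
    return 0;
}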

Debug Information is Huge and What to do About It

Debug information is used by symbolic debuggers and the Backtrace platform to reconcile process executable state to the source-code form that is familiar to most software engineers. This information maps memory addresses and register values to function names and source code locations, and describes variables. Some environments choose to omit debug information. Of the many reasons given, the typically valid ones are disk utilization and, arguably, intellectual property protection....

Compile Once Debug Twice: Picking a compiler for debuggability

Have you ever had an assert fire, only to be left with a useless core dump with missing variable information or an invalid call stack? Common factors that go into selecting a C or C++ compiler are availability, correctness, compilation speed and application performance. A factor that is often neglected is debug information quality, which symbolic debuggers use to reconcile application executable state to the source-code form that is familiar to most software engineers....
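
To make the debug-information concern concrete, here is a tiny crash reproducer of my own (not from the post). Build it twice, e.g. cc -O0 -g and cc -O2 -g, let the assert abort, and open the core dump in a debugger: at -O2 the local variable is commonly optimized out and compute() may be inlined, which is exactly the kind of quality difference between compilers and optimization levels that matters here.

#include <assert.h>

/* At higher optimization levels, `scaled` (and even the compute()
 * frame itself, if inlined) may be missing from the core dump. */
static int
compute(int scale, int value)
{
    int scaled = scale * value;

    assert(scaled < 100);  /* fails for the inputs passed from main() */
    return scaled;
}

int
main(void)
{
    return compute(7, 20);
}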

Building a Go Debugger

Earlier this year we published a post titled Implementing A Debugger: The Fundamentals. This post gave an overview of debuggers, what they do, and how they work. In today’s post, we build upon that knowledge and talk about our journey of extending Backtrace’s debugger to support Go. Intro If you have the time and haven’t read Implementing A Debugger: The Fundamentals, we highly recommend you do so. This post builds upon terms and knowledge discussed there....

Post-Mortem Analysis: Stale Pointer Detection

We’ve previously posted a few blogs covering memory management, memory bugs, and post-mortem memory analysis. Now, we are excited to add another memory debugging tool to Backtrace’s offering: stale pointer analysis. In this post, I will explain how this analysis works, its strengths, and its limitations. Introduction Among the many classes of memory errors, stale pointers are some of the subtlest and most difficult to debug. They arise when a memory region is freed or reallocated but old references (aliases) to that memory are not updated properly....
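
For illustration, here is a minimal stale-pointer example of my own (not one from the post): alias is a second reference to a heap block that realloc() is later allowed to move, so reading through it afterwards is a use of freed memory.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    char *buf = malloc(16);
    if (buf == NULL)
        return 1;
    strcpy(buf, "hello");

    char *alias = buf;            /* second reference (alias) to the region */

    buf = realloc(buf, 1 << 20);  /* may move the block and free the old one... */
    if (buf == NULL)
        return 1;

    /* ...in which case `alias` is now stale: dereferencing it is
     * undefined behavior, even though it often appears to "work". */
    printf("%c\n", alias[0]);

    free(buf);
    return 0;
}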