High thread counts and slow process maps

A few months ago, a friend noticed a significant increase in the time required to read /proc/<pid>/maps on Linux, caused by a kernel change made a few years earlier. A patch introduced in 2012 (kernels after 3.2) marks thread stacks in /proc/<pid>/maps output. Previously, these regions were indistinguishable from other anonymous memory.

Unfortunately, this additional output comes at a high cost for applications that read maps. In the current implementation, each anonymous VMA requires scanning every thread in the thread group to check whether the VMA serves as a thread stack (see source). A process with n thread stacks therefore triggers n scans of the thread group list, each visiting up to n threads, for on the order of n² thread checks. This gets very slow very quickly.
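
The quadratic growth can be illustrated with a toy model. This is not kernel code: the function name count_thread_checks is made up for this sketch, and it assumes the worst case, where every anonymous VMA is compared against every thread and any early exit the real code may take is ignored.

```python
def count_thread_checks(num_threads, num_anon_vmas):
    """Count thread visits when each anonymous VMA is checked against every thread.

    Toy model only: it counts comparisons and does not touch /proc or the kernel.
    """
    checks = 0
    for _vma in range(num_anon_vmas):
        # One walk over the thread group per anonymous VMA.
        for _thread in range(num_threads):
            checks += 1
    return checks


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # Assume one anonymous stack VMA per thread, so both counts equal n.
        print(n, count_thread_checks(n, n))
```

Under these assumptions, growing the thread count by a factor of 10 multiplies the work by 100, which matches the n² behavior described above.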

The good news is that this logic is not applied to /proc/<pid>/task/<tid>/maps, since that file does not mark the stacks of other threads. You can use this to your advantage with the bt --map option if you do not need to distinguish all thread stacks and prefer not to use cached maps. In the session below, pointing bt at the per-task file cuts wall-clock time from 0.858s to 0.075s.

shell$ uname -r
3.15.4-x86_64-linode45
shell$ time bt $pid --thread 2821
[...]
real	0m0.858s
user	0m0.013s
sys	0m0.370s
shell$ time bt $pid --thread 2821 --map /proc/$pid/task/$pid/maps
[...]
real	0m0.075s
user	0m0.003s
sys	0m0.027s
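
To check whether a particular process is affected without going through bt, one option is to time a single read of each file directly. The following is a minimal sketch: the helper names time_read and compare_maps are made up for illustration, the thread ID defaults to the process ID (the main thread), and files that do not exist are skipped. Results will depend on the kernel version and on how many threads the target has.

```python
import os
import sys
import time


def time_read(path):
    """Read path once and return (elapsed_seconds, line_count)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()
    elapsed = time.perf_counter() - start
    return elapsed, data.count(b"\n")


def compare_maps(pid, tid=None):
    """Time the process-wide maps file against one per-task maps file."""
    tid = pid if tid is None else tid
    paths = [f"/proc/{pid}/maps", f"/proc/{pid}/task/{tid}/maps"]
    # Skip files that are absent (non-Linux system, or the process has exited).
    return {p: time_read(p) for p in paths if os.path.exists(p)}


if __name__ == "__main__":
    # With no argument, inspect this script's own process.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for path, (elapsed, lines) in compare_maps(pid).items():
        print(f"{path}: {lines} lines in {elapsed * 1000:.3f} ms")
```

Run it with the PID of a heavily threaded process as the only argument. A single read is noisy, so repeat the measurement a few times before drawing conclusions.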