Valgrind on ARM
If you have a problem with memory bloat or leaks, the first-choice tool is Valgrind. Briefly, it wraps the standard allocation calls (malloc, new, etc.) with its own implementations and tracks every memory allocation in your code. By default, Valgrind checks whether all memory allocated by the program has been freed by the time it exits. A more sophisticated tool is Massif: it records all allocations to profile memory consumption over time. After the program has run, you can inspect the results with ms_print or the massif-visualizer GUI.
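For reference, a typical session looks roughly like this (./my_app is just a placeholder for your binary):

valgrind --leak-check=full ./my_app   # default Memcheck tool: reports unfreed memory at exit
valgrind --tool=massif ./my_app       # Massif: records heap allocations over time
ms_print massif.out.<pid>             # render the collected snapshots as a text report

Massif writes its snapshots to a massif.out.<pid> file in the current directory; both ms_print and massif-visualizer read that file.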
At my last job I noticed that my application’s memory usage was constantly growing, so I ran Valgrind’s Massif to find out why. Unfortunately, I couldn’t find the reason, because all the stacks were empty: all I could see was the amount of memory consumed, with no stack traces at all. The listing below shows the ms_print output.
[ms_print graph: heap usage (in KB) climbs steadily to a peak of 556.3 KB over about 1.908 Gi of executed instructions]
Number of snapshots: 88
Detailed snapshots: [4, 5, 6, 15, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 40, 43, 45, 46, 47, 48, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87 (peak)]
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
0 0 0 0 0 0
1 46,537,185 69,888 60,082 9,806 0
2 80,589,950 98,024 85,567 12,457 0
3 122,171,391 89,984 74,107 15,877 0
4 167,353,809 119,272 99,911 19,361 0
83.77% (99,911B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
5 199,890,853 127,048 105,155 21,893 0
82.77% (105,155B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
6 221,582,101 132,288 108,651 23,637 0
82.13% (108,651B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
As you can see, no stacks were collected. I spent two days investigating what was going on. I won’t describe the whole process here, but I eliminated the most obvious suspects, such as missing debug symbols and compiler options. The last resort was reading the docs, where I found these options:
--unw-stack-scan-thresh=<number> [default: 0],
--unw-stack-scan-frames=<number> [default: 5]
Stack-scanning support is available only on ARM targets.
These flags enable and control stack unwinding by stack
scanning. When the normal stack unwinding mechanisms -- usage
of Dwarf CFI records, and frame-pointer following -- fail,
stack scanning may be able to recover a stack trace.
Note that stack scanning is an imprecise, heuristic mechanism
that may give very misleading results, or none at all. It
should be used only in emergencies, when normal unwinding
fails, and it is important to nevertheless have stack traces.
Stack scanning is a simple technique: the unwinder reads
words from the stack, and tries to guess which of them might
be return addresses, by checking to see if they point just
after ARM or Thumb call instructions. If so, the word is
added to the backtrace.
The main danger occurs when a function call returns, leaving
its return address exposed, and a new function is called, but
the new function does not overwrite the old address. The
result of this is that the backtrace may contain entries for
functions which have already returned, and so be very
confusing.
A second limitation of this implementation is that it will
scan only the page (4KB, normally) containing the starting
stack pointer. If the stack frames are large, this may result
in only a few (or not even any) being present in the trace.
Also, if you are unlucky and have an initial stack pointer
near the end of its containing page, the scan may miss all
interesting frames.
By default stack scanning is disabled. The normal use case is
to ask for it when a stack trace would otherwise be very
short. So, to enable it, use --unw-stack-scan-thresh=number.
This requests Valgrind to try using stack scanning to
"extend" stack traces which contain fewer than number frames.
If stack scanning does take place, it will only generate at
most the number of frames specified by
--unw-stack-scan-frames. Typically, stack scanning generates
so many garbage entries that this value is set to a low value
(5) by default. In no case will a stack trace larger than the
value specified by --num-callers be created.
I set --unw-stack-scan-thresh to 5, because all my stacks had 0 or 1 entries, and --unw-stack-scan-frames to 30 – I don’t use recursion, so that is well above what I need. After setting these options, the stacks appeared in the ms_print output. Apparently the normal stack unwinding failed – but why? Maybe it’s because of the old compiler, GCC 4.9.2. Another possible reason is that the standard stack unwinder works properly only on the x86 architecture.
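For reference, the full invocation with stack scanning enabled looked roughly like this (./my_app again stands in for the real binary, with the threshold and frame-count values mentioned above):

valgrind --tool=massif --unw-stack-scan-thresh=5 --unw-stack-scan-frames=30 ./my_app
ms_print massif.out.<pid>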
I didn’t spend much time finding out why these options were needed. If you know why, please share your thoughts in a comment.