Tag: #memory

C, C++, Linux

Valgrind on ARM

If you have a problem with memory bloat or leaks, the first-choice tool is Valgrind. Briefly, it wraps standard allocation calls (malloc, new, etc.) with its own implementation and tracks all memory allocation within your code. By default, Valgrind checks whether all memory allocated by the program is freed at the end. A more sophisticated tool is massif – it records all allocations to profile memory consumption over time. After the program run, you can inspect the results with ms_print or the massif-visualizer GUI.
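A minimal session with both tools might look like this (the binary name and the massif output file's PID suffix are illustrative):

```shell
# Default tool (memcheck): report leaks at program exit
valgrind --leak-check=full ./myapp

# Profile heap usage over time with massif
valgrind --tool=massif ./myapp      # writes massif.out.<pid>
ms_print massif.out.12345           # text report
massif-visualizer massif.out.12345  # GUI alternative, if installed
```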

At my last job, I noticed that my application’s memory usage was constantly increasing. I ran Valgrind’s massif to check why. Unfortunately, I couldn’t find the reason, since all stacks were empty. All I could see was the amount of memory consumed, without stack traces. The listing below shows the ms_print output:

    KB
556.3^                                                                       #
     |                                                                    @@@#
     |                                                                @@@@@@@#
     |                                                           @@@@@@@@@@@@#
     |                                                       @@@@@@@@@@@@@@@@#
     |                                                    @@@@@@@@@@@@@@@@@@@#
     |                                                 @@:@@@@@@@@@@@@@@@@@@@#
     |                                             :@::@ :@@@@@@@@@@@@@@@@@@@#
     |                                       @@:::::@: @ :@@@@@@@@@@@@@@@@@@@#
     |                                   @@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     |                              @@@@@@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     |                           @@@@@@ @@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     |                         ::@@@@@@ @@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     |                  @  ::::::@@@@@@ @@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     |                ::@::::: ::@@@@@@ @@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     |             :::::@: ::: ::@@@@@@ @@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     |     @@@@:::::::::@: ::: ::@@@@@@ @@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     |  :::@ @ ::: :::::@: ::: ::@@@@@@ @@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     | :: :@ @ ::: :::::@: ::: ::@@@@@@ @@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
     | :: :@ @ ::: :::::@: ::: ::@@@@@@ @@@@@@@: :::@: @ :@@@@@@@@@@@@@@@@@@@#
   0 +----------------------------------------------------------------------->Gi
     0                                                                   1.908

Number of snapshots: 88
 Detailed snapshots: [4, 5, 6, 15, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 40, 43, 45, 46, 47, 48, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87 (peak)]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1     46,537,185           69,888           60,082         9,806            0
  2     80,589,950           98,024           85,567        12,457            0
  3    122,171,391           89,984           74,107        15,877            0
  4    167,353,809          119,272           99,911        19,361            0
83.77% (99,911B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  5    199,890,853          127,048          105,155        21,893            0
82.77% (105,155B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  6    221,582,101          132,288          108,651        23,637            0
82.13% (108,651B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.

As you can see, no stacks were collected. I spent two days investigating what was going on. I won’t describe the whole process, but I eliminated the most obvious suspects like debug symbols, compiler options, etc. The last resort was reading the docs, where I found these options:


       --unw-stack-scan-thresh=<number> [default: 0] ,
       --unw-stack-scan-frames=<number> [default: 5]
           Stack-scanning support is available only on ARM targets.

           These flags enable and control stack unwinding by stack
           scanning. When the normal stack unwinding mechanisms -- usage
           of Dwarf CFI records, and frame-pointer following -- fail,
           stack scanning may be able to recover a stack trace.

           Note that stack scanning is an imprecise, heuristic mechanism
           that may give very misleading results, or none at all. It
           should be used only in emergencies, when normal unwinding
           fails, and it is important to nevertheless have stack traces.

           Stack scanning is a simple technique: the unwinder reads
           words from the stack, and tries to guess which of them might
           be return addresses, by checking to see if they point just
           after ARM or Thumb call instructions. If so, the word is
           added to the backtrace.

           The main danger occurs when a function call returns, leaving
           its return address exposed, and a new function is called, but
           the new function does not overwrite the old address. The
           result of this is that the backtrace may contain entries for
           functions which have already returned, and so be very
           confusing.

           A second limitation of this implementation is that it will
           scan only the page (4KB, normally) containing the starting
           stack pointer. If the stack frames are large, this may result
           in only a few (or not even any) being present in the trace.
           Also, if you are unlucky and have an initial stack pointer
           near the end of its containing page, the scan may miss all
           interesting frames.

           By default stack scanning is disabled. The normal use case is
           to ask for it when a stack trace would otherwise be very
           short. So, to enable it, use --unw-stack-scan-thresh=number.
           This requests Valgrind to try using stack scanning to
           "extend" stack traces which contain fewer than number frames.

           If stack scanning does take place, it will only generate at
           most the number of frames specified by
           --unw-stack-scan-frames. Typically, stack scanning generates
           so many garbage entries that this value is set to a low value
           (5) by default. In no case will a stack trace larger than
           the value specified by --num-callers be created.

I set --unw-stack-scan-thresh to 5, because all stacks had 0 or 1 elements, and --unw-stack-scan-frames to 30 – I don’t use recursion, so this value is well above my needs. After setting them, the stacks appeared in the ms_print output. Apparently, normal stack unwinding failed – why? Maybe it’s because of the old gcc (version 4.9.2). Another possible reason is that the default unwinding mechanisms are less reliable outside the x86 architecture.
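For the record, the invocation that finally produced stack traces looked roughly like this (binary name and output file suffix are illustrative):

```shell
valgrind --tool=massif \
         --unw-stack-scan-thresh=5 \
         --unw-stack-scan-frames=30 \
         ./myapp

# massif writes massif.out.<pid>; render it as text:
ms_print massif.out.12345
```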

I didn’t spend much time finding out why this option was needed. If you know why, please share your thoughts in a comment.

stdatomic.h under the hood #2

In today’s part of the series, I will find out how the code from #1 (http://olejniczak.ovh/index.php/2020/12/18/stdatomic-h-under-the-hood-1/) is compiled on the ARM architecture. I don’t want to pad this text unnecessarily, so if you want to examine the code, check #1 of this series.

ARM

Below you can find details of tested architecture and compiler:

$ arm-linux-gnueabihf-gcc -v
Using built-in specs.
...
Target: arm-linux-gnueabihf
... --disable-multilib --enable-multiarch --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard ... --enable-threads=posix --disable-libstdcxx-pch --enable-linker-build-id --enable-plugin --enable-gold --enable-c99 --enable-long-long --with-mode=thumb --disable-multilib --with-float=hard
Thread model: posix
gcc version 4.9.2 20140904 (prerelease) (crosstool-NG linaro-1.13.1-4.9-2014.09 - Linaro GCC 4.9-2014.09)

Let’s see how the compiled programs behave on this machine:

# ./non-synchronized-arm 
-22301
# ./non-synchronized-arm 
96532
# ./non-synchronized-arm 
225150
# ./non-synchronized-arm 
-66366
# ./non-synchronized-arm 
-120416
# ./non-synchronized-arm 
4340
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0

The difference stdatomic makes is obvious, so let’s check what the ARM compiler does to make it happen. The first assembly listing is generated from the non-synchronized.c file:

Thread:
	@ args = 0, pretend = 0, frame = 16
	@ frame_needed = 1, uses_anonymous_args = 0
	@ link register save eliminated.
	str	fp, [sp, #-4]!
	add	fp, sp, #0
	sub	sp, sp, #20
	str	r0, [fp, #-16]
	mov	r3, #0
	str	r3, [fp, #-8]
	b	.L4
.L7:
	ldr	r3, [fp, #-16]
	cmp	r3, #0
	beq	.L5
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	ldr	r3, [r3]
	add	r2, r3, #1
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	str	r2, [r3]
	b	.L6
.L5:
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	ldr	r3, [r3]
	sub	r2, r3, #1
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	str	r2, [r3]
.L6:
	ldr	r3, [fp, #-8]
	add	r3, r3, #1
	str	r3, [fp, #-8]
.L4:
	ldr	r2, [fp, #-8]
	movw	r3, #16959
	movt	r3, 15
	cmp	r2, r3
	ble	.L7
	mov	r3, #0
	mov	r0, r3
	sub	sp, fp, #0
	@ sp needed
	ldr	fp, [sp], #4
	bx	lr
	.size	Thread, .-Thread


The code is quite similar to x86. The most crucial parts are L7, which implements the addition, and L5, which implements the subtraction. Let’s see what they are doing:

	movw	r3, #:lower16:x /* Load the lower 16 bits of the address of x */
	movt	r3, #:upper16:x /* ... and the upper 16 bits */
	ldr	r3, [r3]        /* Load the value of x into r3 */
	add	r2, r3, #1      /* Add 1 to x (sub in L5) */
	movw	r3, #:lower16:x /* Load the address again, as before */
	movt	r3, #:upper16:x
	str	r2, [r3]        /* Store the result */

We can see that this code is longer than its x86 counterpart. Now let’s see what the arm-linux-gnueabihf stdatomic implementation looks like:

Thread:
	@ args = 0, pretend = 0, frame = 40
	@ frame_needed = 1, uses_anonymous_args = 0
	stmfd	sp!, {fp, lr}
	add	fp, sp, #4
	sub	sp, sp, #40
	str	r0, [fp, #-40]
	mov	r3, #0
	str	r3, [fp, #-8]
	b	.L4
.L11:
	ldr	r3, [fp, #-40]
	cmp	r3, #0
	beq	.L5
	mov	r3, #1
	str	r3, [fp, #-32]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-28]
.L8:
	ldr	r2, [fp, #-28]
	ldr	r3, [fp, #-32]
	add	r3, r2, r3
	str	r3, [fp, #-24]
	ldr	r3, [fp, #-24]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #28
	ldr	r0, [r2]
	dmb	sy
.L13:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L14
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L13
.L14:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L6
	str	r1, [r2]
.L6:
	cmp	r3, #0
	bne	.L7
	b	.L8
.L5:
	mov	r3, #1
	str	r3, [fp, #-20]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-16]
.L10:
	ldr	r2, [fp, #-16]
	ldr	r3, [fp, #-20]
	rsb	r3, r3, r2
	str	r3, [fp, #-12]
	ldr	r3, [fp, #-12]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #16
	ldr	r0, [r2]
	dmb	sy
.L15:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L16
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L15
.L16:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L9
	str	r1, [r2]
.L9:
	cmp	r3, #0
	bne	.L7
	b	.L10
.L7:
	ldr	r3, [fp, #-8]
	add	r3, r3, #1
	str	r3, [fp, #-8]
.L4:
	ldr	r2, [fp, #-8]
	movw	r3, #16959
	movt	r3, 15
	cmp	r2, r3
	ble	.L11
	mov	r3, #0
	mov	r0, r3
	sub	sp, fp, #4
	@ sp needed
	ldmfd	sp!, {fp, pc}
	.size	Thread, .-Thread

This time, the synchronized implementation is much more complex, but that’s not a problem for us :). After a short initialization, the code jumps to L4, where the for loop is implemented. If the loop’s exit condition is false, the code jumps to L11. L11 in turn starts by testing the arg parameter against NULL (check the code in part #1). If it’s not NULL, the L11–L6 code performs the addition; otherwise, a jump to L5 is made. If you compare the code in L11–L6 and L5–L9, you can see that they are almost the same:

	mov	r3, #1
	str	r3, [fp, #-32]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-28]
.L8:
	ldr	r2, [fp, #-28]
	ldr	r3, [fp, #-32]
	add	r3, r2, r3
	str	r3, [fp, #-24]
	ldr	r3, [fp, #-24]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #28
	ldr	r0, [r2]
	dmb	sy
.L13:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L14
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L13
.L14:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L6
	str	r1, [r2]
.L6:
	cmp	r3, #0
	bne	.L7
	b	.L8
	mov	r3, #1
	str	r3, [fp, #-20]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-16]
.L10:
	ldr	r2, [fp, #-16]
	ldr	r3, [fp, #-20]
	rsb	r3, r3, r2
	str	r3, [fp, #-12]
	ldr	r3, [fp, #-12]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #16
	ldr	r0, [r2]
	dmb	sy
.L15:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L16
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L15
.L16:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L9
	str	r1, [r2]
.L9:
	cmp	r3, #0
	bne	.L7
	b	.L10

Each block starts by saving the literal 1 to a local variable on the stack (fp-32 for the addition and fp-20 for the subtraction). Right after that, register r3 is filled with the address of the x variable. Then we see something new – the dmb sy instruction, which was not used in the non-synchronized code. According to the ARM developer guide (https://developer.arm.com/documentation/dui0489/c/arm-and-thumb-instructions/miscellaneous-instructions/dmb–dsb–and-isb):

Data Memory Barrier acts as a memory barrier. It ensures that all explicit memory accesses that appear in program order before the DMB instruction are observed before any explicit memory accesses that appear in program order after the DMB instruction.

The dmb instruction takes a second argument defining which operations must complete before the barrier is released. In this case, we wait until the whole memory subsystem finishes its job (sy). This ensures so-called sequential consistency. The barrier is issued before the ldr instruction (which, like str, should be considered asynchronous) loads the current value of x, to make sure that all stores issued earlier have completed (in other words, that we load the most recent value).

After the ldr completes (which is synchronized with the second dmb sy), we store the loaded value in a local copy on the stack (fp-28 in the addition branch and fp-16 in the subtraction branch). Keep in mind that only access to the variable shared by both threads is surrounded by dmb instructions.

The L8 and L10 blocks load the value of x into the r2 register and the value 1 into r3, then add (or subtract) them, saving the result in r3. After that, the result is stored in a new location – fp-24 in the addition branch and fp-12 in the subtraction branch. The result is also moved to the ip (r12) register. Right after that, r3 is once again filled with the address of x, and r2 with fp-28 (fp-16). Going back to the previous paragraph – these locations are our local copies of the value of x, which are loaded into the r0 register. Keep in mind that this value was not changed by our addition (subtraction) operation.

Then r1 (the value of x) is compared with r0 (the local copy taken before the operation). If these values are not equal, it means that someone changed x during our L8 (L10) execution. Let’s analyze this case – we jump to L14 (L16). This block is not intuitive at all – it synchronizes memory operations, loads 1 into r3 if the previous compare was true (not this time), and loads 0 otherwise (our case). Then it compares r3 against 0 (we just loaded 0 – true) and calls bne to L6 (L9), which in our scenario is not taken (bne is “branch if not equal”). ARM is known for conditional instructions; they are nice, but puzzling them out in such a context is not easy :). We didn’t take the jump, so we store r1 (the value of x) at the address held in r2 (our local copy). After that, the cmp + bne + b sequence moves the instruction pointer back to L8 (L10), and all the described operations are repeated. Not efficient :(.

Now let’s go back to cmp r1, r0 and assume the values didn’t change. In this case, we store our result with strex at the shared x location. strex stands for “store exclusive”, so we can expect some kind of synchronization – dependent on a preceding ldrex. It stores our changed value at the address of the shared x and returns the status in lr – a synonym of r14. If the store succeeds, it returns 0, and our execution path goes to L14 (L16). The previous paragraph puzzled out this code; in this case, we finish in the common L7 block, which prepares the next step of the for loop. If the store failed, we repeat L13 (L15) once again.

Could you write it simpler?

Yes :). I prepared some pseudocode describing this assembly listing. Here it is:

   l_1  <- 1
   ------ Memory barrier
   r3   <- x
   ------ Memory barrier
   r3   -> l_x
L10:
   r2   <- l_x
   r3   <- l_1
   r3   <- r2 +/- r3
   r3   -> l_res
   ip   <- l_res
   r0   <- l_x
   ------ Memory barrier
L15:
   r1   <- x /* exclusive */
   if (r1 != r0) goto L16
   ip   -> x /* exclusive with result to lr */
   if (lr != 0) goto L15
L16:
   ------ Memory barrier
   if (r1 != r0) goto L10

And the same flow in C-like pseudocode:

   l_x = x;
   l_1 = 1;

start:
   l_res = l_x +/- l_1;

storing:
   if (l_x != x(ex)) /* exclusive x load */
   {
      l_x = x;
      goto start;
   }
   else
   {
      /* exclusive store returning 0 on success */
      if (x =(ex) l_res)
         goto storing;
   }
The l_ prefix stands for a local variable copy. Exclusive operations (strex and ldrex) are marked with (ex). If you didn’t follow my assembly analysis, focus on the C-like pseudocode.

The instructions that do the real work in this program are ldrex and strex. This pair allows an atomic load-modify-store operation via the so-called “exclusive monitor”. The ldrex instruction initializes its state machine and tells it to wait for the following strex. If any context switch happens between these two steps, which could corrupt the atomic operation, strex will return 1 in the lr register. In that case, the ldrex + strex sequence must be repeated. If the context switch also changed the value of x, the whole calculation done before must be repeated with the current value (goto start).

Memory barrier

I found some opinions on the internet that the dmb instruction is only needed on multi-core processors. I was curious whether it is needed in our context (a TI AM3359 single-core processor) or not. To check this, I removed all memory barrier instructions from the assembly and compiled a new version:

$ cat synchronized-arm.s | grep -v dmb > synchronized-arm-no-barrier.s 
$ arm-linux-gnueabihf-gcc synchronized-arm-no-barrier.s -lpthread -o synchronized-arm-no-barrier
$ scp synchronized-arm-no-barrier root@mydevice:/root
$ ssh root@mydevice
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0

My test suggests that these barriers are redundant; however, I’m not sure whether corruption in this case is impossible or merely unlikely.

Summary

This code is complicated compared to the x86 implementation. It may be due to the old gcc version. However, I suspect that the ARM instruction set is focused on low power usage, and that is the main reason. I made some measurements of execution time. To make them more reliable, I should extend the for loop to a longer run, but I will leave that as an exercise for the reader 🙂

$ ##### x86 :
$ time ./non-synchronized
708439
real	0m0,011s
user	0m0,017s
sys	0m0,004s
$ time ./non-synchronized
-998811
real	0m0,009s
user	0m0,016s
sys	0m0,000s
$ time ./non-synchronized
986583
real	0m0,011s
user	0m0,015s
sys	0m0,004s
$ time ./synchronized
0
real	0m0,039s
user	0m0,068s
sys	0m0,004s
$ time ./synchronized
0
real	0m0,053s
user	0m0,097s
sys	0m0,008s
$ time ./synchronized
0
real	0m0,053s
user	0m0,103s
sys	0m0,000s

# ##### ARM Cortex-A8
# time ./non-synchronized-arm 
-396494
real	0m 0.07s
user	0m 0.04s
sys	0m 0.00s
# time ./non-synchronized-arm 
0
real	0m 0.08s
user	0m 0.03s
sys	0m 0.01s
# time ./non-synchronized-arm 
-531472
real	0m 0.07s
user	0m 0.04s
sys	0m 0.00s
# time ./synchronized-arm 
0
real	0m 0.56s
user	0m 0.34s
sys	0m 0.00s
# time ./synchronized-arm 
0
real	0m 0.56s
user	0m 0.33s
sys	0m 0.01s
# time ./synchronized-arm 
0
real	0m 0.54s
user	0m 0.34s
sys	0m 0.00s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.12s
user	0m 0.06s
sys	0m 0.02s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.09s
user	0m 0.07s
sys	0m 0.01s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.10s
user	0m 0.07s
sys	0m 0.01s

We can see that on x86, stdatomic made execution 3-5 times slower. The code generated by the ARM compiler slowed down 7-8 times. Surprisingly, the main bottleneck in the generated code was the dmb sy instruction. Removing it gives synchronized code that executes only 1.5-2 times slower than the non-synchronized version! The question is whether removing the memory barriers is 100% bullet-proof.

I hope you are not sleepy after reading this post :). If you have any suggestions regarding this series, or some remarks, please add a comment. Next time I will try to compile something similar on a typical embedded architecture: AVR.