stdatomic.h under the hood #2
In today’s part of the series, I will find out how the code from #1 (http://olejniczak.ovh/index.php/2020/12/18/stdatomic-h-under-the-hood-1/) is compiled for the ARM architecture. I don’t want to make this text longer than necessary, so if you want to examine the code itself, check part #1 of this series.
ARM
Below you can find details of tested architecture and compiler:
$ arm-linux-gnueabihf-gcc -v
Using built-in specs.
...
Target: arm-linux-gnueabihf
... --disable-multilib --enable-multiarch --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard ... --enable-threads=posix --disable-libstdcxx-pch --enable-linker-build-id --enable-plugin --enable-gold --enable-c99 --enable-long-long --with-mode=thumb --disable-multilib --with-float=hard
Thread model: posix
gcc version 4.9.2 20140904 (prerelease) (crosstool-NG linaro-1.13.1-4.9-2014.09 - Linaro GCC 4.9-2014.09)
Let’s see how both programs behave on this machine:
# ./non-synchronized-arm
-22301
# ./non-synchronized-arm
96532
# ./non-synchronized-arm
225150
# ./non-synchronized-arm
-66366
# ./non-synchronized-arm
-120416
# ./non-synchronized-arm
4340
# ./synchronized-arm
0
# ./synchronized-arm
0
# ./synchronized-arm
0
# ./synchronized-arm
0
# ./synchronized-arm
0
# ./synchronized-arm
0
The difference stdatomic makes is obvious, so let’s check what the ARM compiler does to make it happen. The first assembly listing is generated from the non-synchronized.c file:
Thread:
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 1, uses_anonymous_args = 0
@ link register save eliminated.
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #20
str r0, [fp, #-16]
mov r3, #0
str r3, [fp, #-8]
b .L4
.L7:
ldr r3, [fp, #-16]
cmp r3, #0
beq .L5
movw r3, #:lower16:x
movt r3, #:upper16:x
ldr r3, [r3]
add r2, r3, #1
movw r3, #:lower16:x
movt r3, #:upper16:x
str r2, [r3]
b .L6
.L5:
movw r3, #:lower16:x
movt r3, #:upper16:x
ldr r3, [r3]
sub r2, r3, #1
movw r3, #:lower16:x
movt r3, #:upper16:x
str r2, [r3]
.L6:
ldr r3, [fp, #-8]
add r3, r3, #1
str r3, [fp, #-8]
.L4:
ldr r2, [fp, #-8]
movw r3, #16959
movt r3, 15
cmp r2, r3
ble .L7
mov r3, #0
mov r0, r3
sub sp, fp, #0
@ sp needed
ldr fp, [sp], #4
bx lr
.size Thread, .-Thread
The code is quite similar to x86. The most crucial parts are L7, which implements the addition, and L5, which implements the subtraction. Let’s see what they are doing:
movw r3, #:lower16:x /* Load the less significant 16 bits of the address of x */
movt r3, #:upper16:x /* ...and the more significant 16 bits */
ldr r3, [r3] /* Load the value of x into r3 */
add r2, r3, #1 /* Add 1 (sub in L5) */
movw r3, #:lower16:x /* Put the address into the register (as before) */
movt r3, #:upper16:x
str r2, [r3] /* Store the result back to x */
We can see that this code is longer than its x86 counterpart. Now let’s see what the arm-linux-gnueabihf stdatomic implementation looks like:
Thread:
@ args = 0, pretend = 0, frame = 40
@ frame_needed = 1, uses_anonymous_args = 0
stmfd sp!, {fp, lr}
add fp, sp, #4
sub sp, sp, #40
str r0, [fp, #-40]
mov r3, #0
str r3, [fp, #-8]
b .L4
.L11:
ldr r3, [fp, #-40]
cmp r3, #0
beq .L5
mov r3, #1
str r3, [fp, #-32]
movw r3, #:lower16:x
movt r3, #:upper16:x
dmb sy
ldr r3, [r3]
dmb sy
str r3, [fp, #-28]
.L8:
ldr r2, [fp, #-28]
ldr r3, [fp, #-32]
add r3, r2, r3
str r3, [fp, #-24]
ldr r3, [fp, #-24]
mov ip, r3
movw r3, #:lower16:x
movt r3, #:upper16:x
sub r2, fp, #28
ldr r0, [r2]
dmb sy
.L13:
ldrex r1, [r3]
cmp r1, r0
bne .L14
strex lr, ip, [r3]
cmp lr, #0
bne .L13
.L14:
dmb sy
moveq r3, #1
movne r3, #0
cmp r3, #0
bne .L6
str r1, [r2]
.L6:
cmp r3, #0
bne .L7
b .L8
.L5:
mov r3, #1
str r3, [fp, #-20]
movw r3, #:lower16:x
movt r3, #:upper16:x
dmb sy
ldr r3, [r3]
dmb sy
str r3, [fp, #-16]
.L10:
ldr r2, [fp, #-16]
ldr r3, [fp, #-20]
rsb r3, r3, r2
str r3, [fp, #-12]
ldr r3, [fp, #-12]
mov ip, r3
movw r3, #:lower16:x
movt r3, #:upper16:x
sub r2, fp, #16
ldr r0, [r2]
dmb sy
.L15:
ldrex r1, [r3]
cmp r1, r0
bne .L16
strex lr, ip, [r3]
cmp lr, #0
bne .L15
.L16:
dmb sy
moveq r3, #1
movne r3, #0
cmp r3, #0
bne .L9
str r1, [r2]
.L9:
cmp r3, #0
bne .L7
b .L10
.L7:
ldr r3, [fp, #-8]
add r3, r3, #1
str r3, [fp, #-8]
.L4:
ldr r2, [fp, #-8]
movw r3, #16959
movt r3, 15
cmp r2, r3
ble .L11
mov r3, #0
mov r0, r3
sub sp, fp, #4
@ sp needed
ldmfd sp!, {fp, pc}
.size Thread, .-Thread
This time, the synchronized implementation is much more complex, but that’s not a problem for us :). After a short initialization, the code jumps to L4, where the for loop condition is implemented. If the break condition is false, the code jumps to L11. L11 in turn starts by examining the arg parameter against NULL (check the code in part #1). If it’s not NULL, the L11-L6 code performs the addition; otherwise, a jump to L5 is made. If you compare the code in L11-L6 and L5-L9, you can see that they are almost the same:
mov r3, #1
str r3, [fp, #-32]
movw r3, #:lower16:x
movt r3, #:upper16:x
dmb sy
ldr r3, [r3]
dmb sy
str r3, [fp, #-28]
.L8:
ldr r2, [fp, #-28]
ldr r3, [fp, #-32]
add r3, r2, r3
str r3, [fp, #-24]
ldr r3, [fp, #-24]
mov ip, r3
movw r3, #:lower16:x
movt r3, #:upper16:x
sub r2, fp, #28
ldr r0, [r2]
dmb sy
.L13:
ldrex r1, [r3]
cmp r1, r0
bne .L14
strex lr, ip, [r3]
cmp lr, #0
bne .L13
.L14:
dmb sy
moveq r3, #1
movne r3, #0
cmp r3, #0
bne .L6
str r1, [r2]
.L6:
cmp r3, #0
bne .L7
b .L8
mov r3, #1
str r3, [fp, #-20]
movw r3, #:lower16:x
movt r3, #:upper16:x
dmb sy
ldr r3, [r3]
dmb sy
str r3, [fp, #-16]
.L10:
ldr r2, [fp, #-16]
ldr r3, [fp, #-20]
rsb r3, r3, r2
str r3, [fp, #-12]
ldr r3, [fp, #-12]
mov ip, r3
movw r3, #:lower16:x
movt r3, #:upper16:x
sub r2, fp, #16
ldr r0, [r2]
dmb sy
.L15:
ldrex r1, [r3]
cmp r1, r0
bne .L16
strex lr, ip, [r3]
cmp lr, #0
bne .L15
.L16:
dmb sy
moveq r3, #1
movne r3, #0
cmp r3, #0
bne .L9
str r1, [r2]
.L9:
cmp r3, #0
bne .L7
b .L10
Each block starts by saving the literal 1 to a local variable on the stack (frame pointer-32 for the addition and frame pointer-20 for the subtraction). Right after that, register r3 is filled with the address of the x variable. Then we can see something new – the dmb sy instruction, which was not used in the non-synchronized code. According to the ARM developers guide (https://developer.arm.com/documentation/dui0489/c/arm-and-thumb-instructions/miscellaneous-instructions/dmb–dsb–and-isb):
Data Memory Barrier acts as a memory barrier. It ensures that all explicit memory accesses that appear in program order before the DMB instruction are observed before any explicit memory accesses that appear in program order after the DMB instruction.
The dmb instruction takes a second part defining which operations must complete before the barrier unlocks. In this case, we wait until the whole memory subsystem finishes its job (sy). This ensures so-called Sequential Consistency. The barrier is issued before the ldr instruction (which, like str, should be considered asynchronous) loads the current x value, to make sure that all stores issued before have completed – in other words, that we load the most recent value.
After the ldr operation finishes (which is synchronized by the second dmb sy), we store the loaded value into a local copy on the stack (fp-28 in the addition branch and fp-16 in the subtraction branch). Keep in mind that only access to the variable shared by both threads is surrounded by dmb instructions.
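In C11 terms, that dmb sy / ldr / dmb sy pattern is the price of a sequentially consistent load. A rough sketch (the fence-based version only approximates the same ordering; it is not an exact equivalent of a seq_cst load):

```c
#include <stdatomic.h>

atomic_int x;

/* A seq_cst load: on this ARMv7 gcc it compiles to dmb sy; ldr; dmb sy. */
int load_seq_cst(void)
{
    return atomic_load_explicit(&x, memory_order_seq_cst);
}

/* Roughly the same barriers, made explicit around a relaxed (plain ldr) load. */
int load_with_fences(void)
{
    atomic_thread_fence(memory_order_seq_cst);              /* first dmb sy  */
    int v = atomic_load_explicit(&x, memory_order_relaxed); /* plain ldr     */
    atomic_thread_fence(memory_order_seq_cst);              /* second dmb sy */
    return v;
}
```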
The L8 and L10 blocks load the value of x into the r2 register and the value 1 into r3, then add (or subtract) them, saving the result in r3. After that, the result is stored in a new place – fp-24 in the addition branch and fp-12 in the subtraction branch. The result is also moved to the ip (r12) register. Right after that, the r3 register is once again filled with the address of x, and r2 is filled with fp-28 (fp-16). Going back to the previous paragraph – these locations are our local copies of the x value, which are loaded into the r0 register. Keep in mind that this value was not changed by our addition (subtraction) operation.
Then a comparison between r1 (the value of x) and r0 (the local copy made before the operation) is done. If these values are not equal, it means that someone changed x during our L8 (L10) execution. Let’s analyze this case – we jump to L14 (L16). This block is not intuitive at all: it synchronizes memory operations, loads 1 into r3 if the previous compare was true (not this time), and 0 otherwise (our case). Then it compares r3 against 0 (we just loaded 0, so they are equal) and calls bne to L6 (L9), which in our scenario is not taken (bne is “branch if not equal”). ARM is known for its conditional instructions; they are nice, but untangling them in such a context is not easy :). We didn’t take the jump, so we store r1 (the value of x) at the address in r2 (our local copy). After that, a cmp + bne + b sequence moves our instruction pointer back to L8 (L10) and all the described operations are repeated. Not efficient :(.
Now let’s go back to cmp r1, r0 and assume these values are equal. In this case, we store our result with strex into the shared x location. strex stands for “store exclusive”, so we can expect some kind of synchronization – probably dependent on the ldrex usage. It stores our changed value at the address of the shared x and returns the result into the lr register (a synonym of r14). If the store is successful, it returns 0, and our execution path falls through to L14 (L16). This code was untangled in the previous paragraph; in our case we finish in the common L7 block, which prepares the next for loop step. If the store failed, we repeat L13 (L15) once again.
Could you write it simpler?
Yes :). I prepared some pseudocode describing this assembly listing. Here it is:
l_1 <- 1
------ Memory barrier
r3 <- x
------ Memory barrier
r3 -> l_x
L10:
r2 <- l_x
r3 <- l_1
r3 <- r2 +/- r3
r3 -> l_res
ip <- l_res
r0 <- l_x
------ Memory barrier
L15:
r1 <- x /* exclusive */
if (r1 != r0) goto L16
ip -> x /* exclusive, with result to lr */
if (lr != 0) goto L15
L16:
------ Memory barrier
if (r1 != r0) { l_x <- r1; goto L10 }
l_x = x;
l_1 = 1;
start:
l_res = l_x +/- l_1;
storing:
if (l_x != x(ex)) /* exclusive x load */
{
l_x = x;
goto start;
}
else
{
/* exclusive store returning 0 if succeed */
if (x =(ex) l_res)
goto storing;
}
The l_ prefix stands for a local variable copy. Exclusive operations (strex and ldrex) are marked with (ex). If you didn’t follow my assembly analysis, focus on the C-like pseudocode.
The instructions that do the real job in this program are ldrex and strex. This pair allows an atomic load-modify-store operation via the so-called “exclusive monitor”. The ldrex instruction initializes its state machine and tells it to wait for the following strex. If any context switch happens between these two steps, which could corrupt the atomic operation, strex returns 1 in the lr register. In this case, the ldrex + strex operation must be repeated. If the context switch also changed the value of x, the whole calculation done before must be repeated with the current value (goto start).
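This ldrex/strex retry dance is exactly what C11 exposes as a compare-exchange. Here is a hedged sketch (the function name is mine) of the fetch-add the compiler generated, written with atomic_compare_exchange_weak – the weak variant, like a single ldrex/strex pair, is allowed to fail spuriously:

```c
#include <stdatomic.h>

/* What an atomic fetch-add boils down to on ARMv7: a CAS retry loop. */
int fetch_add_cas(atomic_int *p, int val)
{
    int expected = atomic_load(p);          /* l_x = x */
    /* On failure, expected is refreshed with the current value of *p,
     * so the new sum is recomputed and the store is retried – exactly
     * the "goto start" path in the pseudocode above. */
    while (!atomic_compare_exchange_weak(p, &expected, expected + val))
        ;
    return expected;                        /* value before the add */
}
```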
Memory barrier
I found some opinions on the internet that the dmb instruction is only needed on multi-core processors. I was curious whether it is needed in our context (a TI AM3359 single-core processor) or not. To check this, I removed all memory barrier instructions from the assembly and compiled a new version:
$ cat synchronized-arm.s | grep -v dmb > synchronized-arm-no-barrier.s
$ arm-linux-gnueabihf-gcc synchronized-arm-no-barrier.s -lpthread -o synchronized-arm-no-barrier
$ scp synchronized-arm-no-barrier root@mydevice:/root
$ ssh root@mydevice
# ./synchronized-arm-no-barrier
0
# ./synchronized-arm-no-barrier
0
# ./synchronized-arm-no-barrier
0
# ./synchronized-arm-no-barrier
0
# ./synchronized-arm-no-barrier
0
My test shows that these barriers appear redundant here; however, I’m not sure whether corruption is actually impossible in this case or just unlikely.
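A side note: instead of stripping dmb from the assembly by hand, the portable way to ask for atomicity without the barriers is a relaxed memory order. A sketch, assuming the counter itself is the only shared state (true in this example, but not in general):

```c
#include <stdatomic.h>

atomic_int x;

/* Default seq_cst: the compiler surrounds the access with dmb barriers. */
void add_seq_cst(void)
{
    atomic_fetch_add(&x, 1);
}

/* Relaxed: atomicity (the ldrex/strex loop) is kept,
 * but no dmb barriers need to be emitted. */
void add_relaxed(void)
{
    atomic_fetch_add_explicit(&x, 1, memory_order_relaxed);
}
```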
Summary
This code is complicated compared to the x86 implementation. That may be due to the old gcc version; however, I suppose the ARM instruction set is focused on low power usage, and that is the main reason. I made some measurements of execution time. To make them more reliable, I should change the for loop to a longer run, but I will leave that as an exercise for the reader 🙂
$ ##### x86 :
$ time ./non-synchronized
708439
real 0m0,011s
user 0m0,017s
sys 0m0,004s
$ time ./non-synchronized
-998811
real 0m0,009s
user 0m0,016s
sys 0m0,000s
$ time ./non-synchronized
986583
real 0m0,011s
user 0m0,015s
sys 0m0,004s
$ time ./synchronized
0
real 0m0,039s
user 0m0,068s
sys 0m0,004s
$ time ./synchronized
0
real 0m0,053s
user 0m0,097s
sys 0m0,008s
$ time ./synchronized
0
real 0m0,053s
user 0m0,103s
sys 0m0,000s
# ##### ARM Cortex-A8
# time ./non-synchronized-arm
-396494
real 0m 0.07s
user 0m 0.04s
sys 0m 0.00s
# time ./non-synchronized-arm
0
real 0m 0.08s
user 0m 0.03s
sys 0m 0.01s
# time ./non-synchronized-arm
-531472
real 0m 0.07s
user 0m 0.04s
sys 0m 0.00s
# time ./synchronized-arm
0
real 0m 0.56s
user 0m 0.34s
sys 0m 0.00s
# time ./synchronized-arm
0
real 0m 0.56s
user 0m 0.33s
sys 0m 0.01s
# time ./synchronized-arm
0
real 0m 0.54s
user 0m 0.34s
sys 0m 0.00s
# time ./synchronized-arm-no-barrier
0
real 0m 0.12s
user 0m 0.06s
sys 0m 0.02s
# time ./synchronized-arm-no-barrier
0
real 0m 0.09s
user 0m 0.07s
sys 0m 0.01s
# time ./synchronized-arm-no-barrier
0
real 0m 0.10s
user 0m 0.07s
sys 0m 0.01s
We can see that on x86, stdatomic made execution 3-5 times slower. The code generated by the ARM compiler slowed down 7-8 times. Surprisingly, the main bottleneck in the generated code was the dmb sy instruction. Removing it gives synchronized code which executes only 1.5-2 times slower than the non-synchronized one! The question is whether removing memory barriers is 100% bullet-proof.
I hope you are not sleepy after reading this post :). If you have any suggestions regarding this series, or some remarks, please add a comment. Next time I will try to compile something similar on a typical embedded architecture: AVR.