C, stdatomic

stdatomic.h under the hood #2

sty 13, 2021 #arm, #barrier, #c, #cortex-a8, #data, #dmb, #lock-free, #memory, #programming, #stdatomic autorstwa Rafał

In today’s part of the series, I will find out, how the code from #1 (http://olejniczak.ovh/index.php/2020/12/18/stdatomic-h-under-the-hood-1/) is compiled on ARM architecture. I don’t want to expand this text without need, so if you want to examine code, check #1 of this series.

ARM

Below you can find details of tested architecture and compiler:

$ arm-linux-gnueabihf-gcc -v
Using built-in specs.
...
Target: arm-linux-gnueabihf
... --disable-multilib --enable-multiarch --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard ... --enable-threads=posix --disable-libstdcxx-pch --enable-linker-build-id --enable-plugin --enable-gold --enable-c99 --enable-long-long --with-mode=thumb --disable-multilib --with-float=hard
Thread model: posix
gcc version 4.9.2 20140904 (prerelease) (crosstool-NG linaro-1.13.1-4.9-2014.09 - Linaro GCC 4.9-2014.09)

Let’s find out how the compiler works on this machine:

# ./non-synchronized-arm 
-22301
# ./non-synchronized-arm 
96532
# ./non-synchronized-arm 
225150
# ./non-synchronized-arm 
-66366
# ./non-synchronized-arm 
-120416
# ./non-synchronized-arm 
4340
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0

The difference in using stdatomic is obvious, so let’s check how ARM compiler works to make it happen. The first assembly code is generated from non-synchronized.c file:

Thread:
	@ args = 0, pretend = 0, frame = 16
	@ frame_needed = 1, uses_anonymous_args = 0
	@ link register save eliminated.
	str	fp, [sp, #-4]!
	add	fp, sp, #0
	sub	sp, sp, #20
	str	r0, [fp, #-16]
	mov	r3, #0
	str	r3, [fp, #-8]
	b	.L4
.L7:
	ldr	r3, [fp, #-16]
	cmp	r3, #0
	beq	.L5
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	ldr	r3, [r3]
	add	r2, r3, #1
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	str	r2, [r3]
	b	.L6
.L5:
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	ldr	r3, [r3]
	sub	r2, r3, #1
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	str	r2, [r3]
.L6:
	ldr	r3, [fp, #-8]
	add	r3, r3, #1
	str	r3, [fp, #-8]
.L4:
	ldr	r2, [fp, #-8]
	movw	r3, #16959
	movt	r3, 15
	cmp	r2, r3
	ble	.L7
	mov	r3, #0
	mov	r0, r3
	sub	sp, fp, #0
	@ sp needed
	ldr	fp, [sp], #4
	bx	lr
	.size	Thread, .-Thread

Code is quite similar to x86. The most crucial part is L7, which implements addition, and L5, which implements subtraction. Let’s see what they are doing:

	movw	r3, #:lower16:x /* Load less significant 16 bit of x address */
	movt	r3, #:upper16:x /* ... and more significant part */
	ldr	r3, [r3]        /* Load value of x to r3 register */
	add	r2, r3, #1      /* Add 1 to x (sub in L5) */
	movw	r3, #:lower16:x /* Put address to register (like previously)  */
	movt	r3, #:upper16:x
	str	r2, [r3]        /* Store result */

We can see, that this code is longer than its x86 counterpart. Now let’s see how arm-linux-gnueabihf stdatomic implementation looks like:

Thread:
	@ args = 0, pretend = 0, frame = 40
	@ frame_needed = 1, uses_anonymous_args = 0
	stmfd	sp!, {fp, lr}
	add	fp, sp, #4
	sub	sp, sp, #40
	str	r0, [fp, #-40]
	mov	r3, #0
	str	r3, [fp, #-8]
	b	.L4
.L11:
	ldr	r3, [fp, #-40]
	cmp	r3, #0
	beq	.L5
	mov	r3, #1
	str	r3, [fp, #-32]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-28]
.L8:
	ldr	r2, [fp, #-28]
	ldr	r3, [fp, #-32]
	add	r3, r2, r3
	str	r3, [fp, #-24]
	ldr	r3, [fp, #-24]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #28
	ldr	r0, [r2]
	dmb	sy
.L13:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L14
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L13
.L14:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L6
	str	r1, [r2]
.L6:
	cmp	r3, #0
	bne	.L7
	b	.L8
.L5:
	mov	r3, #1
	str	r3, [fp, #-20]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-16]
.L10:
	ldr	r2, [fp, #-16]
	ldr	r3, [fp, #-20]
	rsb	r3, r3, r2
	str	r3, [fp, #-12]
	ldr	r3, [fp, #-12]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #16
	ldr	r0, [r2]
	dmb	sy
.L15:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L16
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L15
.L16:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L9
	str	r1, [r2]
.L9:
	cmp	r3, #0
	bne	.L7
	b	.L10
.L7:
	ldr	r3, [fp, #-8]
	add	r3, r3, #1
	str	r3, [fp, #-8]
.L4:
	ldr	r2, [fp, #-8]
	movw	r3, #16959
	movt	r3, 15
	cmp	r2, r3
	ble	.L11
	mov	r3, #0
	mov	r0, r3
	sub	sp, fp, #4
	@ sp needed
	ldmfd	sp!, {fp, pc}
	.size	Thread, .-Thread

This time, synchronized implementation is much more complex, but it’s not a problem for us :). After short initialization, the code jumps to L4 code, where for loop is implemented. If the break condition is false, the code jumps to L11. L11 in turn starts with examining arg parameter against NULL (check code in part #1). If it’s not, L11-L6 code makes the addition otherwise, the jump to L5 is made. If you compare code in L11-L6 and L5-L9, you can see, that they are almost the same:

	mov	r3, #1
	str	r3, [fp, #-32]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-28]
.L8:
	ldr	r2, [fp, #-28]
	ldr	r3, [fp, #-32]
	add	r3, r2, r3
	str	r3, [fp, #-24]
	ldr	r3, [fp, #-24]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #28
	ldr	r0, [r2]
	dmb	sy
.L13:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L14
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L13
.L14:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L6
	str	r1, [r2]
.L6:
	cmp	r3, #0
	bne	.L7
	b	.L8

	mov	r3, #1
	str	r3, [fp, #-20]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-16]
.L10:
	ldr	r2, [fp, #-16]
	ldr	r3, [fp, #-20]
	rsb	r3, r3, r2
	str	r3, [fp, #-12]
	ldr	r3, [fp, #-12]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #16
	ldr	r0, [r2]
	dmb	sy
.L15:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L16
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L15
.L16:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L9
	str	r1, [r2]
.L9:
	cmp	r3, #0
	bne	.L7
	b	.L10

Each code start from saving literal 1 to the local variable on the stack (frame pointer-32 for addition and frame pointer-20 for subtraction). Right after that register r3 is filled with x variable address. Then we can see something new – instruction dmb sy which was not used in non-synchronized code. According to the ARM developers guide (https://developer.arm.com/documentation/dui0489/c/arm-and-thumb-instructions/miscellaneous-instructions/dmb–dsb–and-isb)

Data Memory Barrier acts as a memory barrier. It ensures that all explicit memory accesses that appear in program order before the DMB instruction are observed before any explicit memory accesses that appear in program order after the DMB instruction.

The dmb instruction has the second part, defining which operation should unlock our barrier. In this case, we are locking till the whole memory subsystem finishes its job (sy). It ensures so-called Sequential Consistency. It is done before ldr instruction (which as str, should be considered as asynchronous) loads current x value, to make sure, that all stores called before were done (in other words, whether we load most actual value).

After finishing ldr operation (which is synchronized with second dmb sy), we are storing a loaded value to local copy on the stack (fp-28 in addition branch and fp-16in subtraction branch). Keep in mind, that only accessing variable shared by both threads is surrounded by dmb instructions.

L8 and L10 blocks load the value of x to r2 register, then value 1 to r3 and adds (or subtracts) them, saving the result in r3. After that result is stored in the new place – fp-24 in addition branch and fp-12 in subtraction branch. The result is also moved to ip (r12) register. Right after that, r3 register once again is filled with x address, r2 is filled with fp-28 (fp-16). Let’s go back to the previous paragraph – these locations are our local copies of x value, which are loaded to r0 register. Keep in mind that this value was not changed by our addition (subtraction) operation.

Then the comparison between r1 (value of x) and r0 (local copy before operation) is done. If these values are not equal, it means, that someone changed x during our L8(L10) execution. Let’s analyze this case – we are jumping to L14 (L15). This block is not intuitive at all – it synchronizes memory operations, loads 1 to r3 if our previous compare was true (not this time), and loads 0 otherwise (this case). Then it compares r3 against 0 (we loaded 0 just before – true) and calls bne to L6 (L9), which in our scenario is not done (bne is “branch if not equal”). ARM is known of conditional instructions, it is nice, but puzzling it in such context is not easy :). We didn’t make a jump, so we are storing r1 (x value) to address under r2 (or local copy). After that cmp + bne + b set moves our instruction pointer to L8(L10) and all described operations are repeated. Not efficient :(.

Now let’s go back to cmp r1,r0 and consider, that these values didn’t change. In this case, we are storing our result with strex in a shared x location. strex stands for store exclusive, so we can expect some kind of synchronization – probably dependent on ldrex usage. It stores our changed value in a shared x value address and returns the result to the lr – a synonym of r14 – register. If the store is successful, it returns 0, and our execution path goes to L14(L16). The previous paragraph puzzled this code, in this case, we are finishing in common L7 block, which prepares next for loop step. If the store failed, we are repeating L13 (L15) once again.

Could you write it simpler?

Yes :). I prepared some pseudocode, describing this assembly listing. Here you have:

   l_1  <- 1
   ------ Memory barrier
   r3   <- x
   ------ Memory barrier
   r3   -> l_x
L10:
   r2   <- l_x
   r3   <- l_1
   r3   <- r2 +/- r3
   r3   -> l_res
   ip   <- l_res
   r0   <- l_x
   ------ Memory barrier
L15:
   r1   <- x /* exclusive */
   if (r1 != r0) goto L16
   ip   -> x /* exclusive wit result to lr */
   if (lr != 0) goto L15
L16:
   ------ Memory barrier
   if (r1 != r0) goto L10

   l_x = x;
   l_1 = 1;



start:
   l_res = l_x +/- l_1;




storing:
   if (l_x != x(ex)) /* exclusive x load */
   {
      l_x = x;
      goto start;
   }
   else
   {
   /* exclusive store returning 0 if succeed */
      if (x =(ex) l_res)
         goto storing;
   }

l_ prefix stands for a local variable copy. Exclusive operations (strex and ldrex) are marked with (ex). If you didn’t get my assembly analysis, you should focus on C-like pseudocode.

The thing, which does the job in this program is ldrex and strex instructions. This couple allows atomic load-modify-store operation via the so-called “Exclusive monitor”. ldrex instruction initializes its state machine and tells it to wait for the following strex. If any context switch happens between these two steps, which may corrupt atomic operation, strex will return 1 into lr register. In this case, ldrex + strex operation must be repeated. If this context switch also changed the value of x, the whole calculation done before must be repeated with the current value (goto start).

Memory barrier

I found on the internet some opinions, that dmb instruction is only needed on multi-core processors. I was curious, whether it is needed in our context (TI AM3359 single-core processor) or not. To check this, I have removed all memory barrier instructions from assembly and compiled a new version:

$ cat synchronized-arm.s | grep -v dmb > synchronized-arm-no-barrier.s 
$ arm-linux-gnueabihf-gcc synchronized-arm-no-barrier.s -lpthread -o synchronized-arm-no-barrier
$ scp synchronized-arm-no-barrier root@mydevice:/root
$ ssh root@mydevice
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0

My test shows, that these barriers are redundant however, I’m not sure if corruption, in this case, is not possible or not likely.

Summary

This code is complicated, comparing to the x86 implementation. It may be due to the old gcc version. However I suppose, that ARM instruction set is focused on low power usage and that is the main reason. I made some measurements of execution time. To make it more reliable, I should change for loop to a longer run, but I will leave it as an exercise for the reader 🙂

$ ##### x86 :
$ time ./non-synchronized
708439
real	0m0,011s
user	0m0,017s
sys	0m0,004s
$ time ./non-synchronized
-998811
real	0m0,009s
user	0m0,016s
sys	0m0,000s
$ time ./non-synchronized
986583
real	0m0,011s
user	0m0,015s
sys	0m0,004s
$ time ./synchronized
0
real	0m0,039s
user	0m0,068s
sys	0m0,004s
$ time ./synchronized
0
real	0m0,053s
user	0m0,097s
sys	0m0,008s
$ time ./synchronized
0
real	0m0,053s
user	0m0,103s
sys	0m0,000s

# ##### ARM Cortex-A8
# time ./non-synchronized-arm 
-396494
real	0m 0.07s
user	0m 0.04s
sys	0m 0.00s
# time ./non-synchronized-arm 
0
real	0m 0.08s
user	0m 0.03s
sys	0m 0.01s
# time ./non-synchronized-arm 
-531472
real	0m 0.07s
user	0m 0.04s
sys	0m 0.00s
# time ./synchronized-arm 
0
real	0m 0.56s
user	0m 0.34s
sys	0m 0.00s
# time ./synchronized-arm 
0
real	0m 0.56s
user	0m 0.33s
sys	0m 0.01s
# time ./synchronized-arm 
0
real	0m 0.54s
user	0m 0.34s
sys	0m 0.00s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.12s
user	0m 0.06s
sys	0m 0.02s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.09s
user	0m 0.07s
sys	0m 0.01s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.10s
user	0m 0.07s
sys	0m 0.01s

We can see, that x86 stdatomic made execution 3-5 times slower. The code generated by ARM compiler slowed down 7-8 times. Surprisingly, the main bottleneck in the generated code was dmb sy instruction. Removing it, gives synchronized code, which executes only 1.5-2 times slower than non-synchronized! The question is if removing memory barriers is 100% bullet-proof.

I hope you are not sleepy after reading this post :). If you have any suggestions regarding this series, or you have some remarks, please add a comment. Next time I will try to compile something similar in typical embedded architecture, which is AVR.

C, stdatomic

stdatomic.h under the hood #1

gru 18, 2020 #c, #c11, #lock, #lock-free, #mutex, #non-blocking, #programming, #stdatomic, #synchronization, #xaddl autorstwa Rafał

My favorite C11 feature is the support of atomic operations. It is useful, especially on embedded systems, where using mutexes is very often overkill. At this point, in many circumstances, we can put away platform-dependent, error-prone, and not-effective synchronization API and easily use variables from different contexts. Of course, not every operation is possible. But if something is not possible, it is probably caused by poor software design. Let’s see how it works on typical architectures.

I will run 2 threads using pthread native library and show how assembly is build to make presented operations atomic. The example is typical, placed in every book about multithread-programming.

#include <stdatomic.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

int32_t x = 0;

void * Thread(void * arg);

int main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, &Thread, (void*)1);
	Thread((void*)0);
	pthread_join(t, NULL);
	printf("%d\n", x);
	return 0;
}

void * Thread(void * arg)
{
	for (int i = 0; i < 1000000; ++i)
	{
		if (arg)
			x++;
		else
			x--;
	}

	return NULL;
}

I think the code is self-explanatory. The only thing is the dirty passing of boolean value by the pointer. I know it’s not elegant, but it takes less coding. The output expected by synchronization-ignorant is 0. But of course, it is not:

$ gcc non-synchronized.c -lpthread -o non-synchronized
$ ./non-synchronized 
193678
$ ./non-synchronized 
-578251
$ ./non-synchronized 
-509428
$ ./non-synchronized 
-571838
$ ./non-synchronized 
163562

Let’s see how stdatomic handle this situation on my native compiler. I’ve made similar code with the only difference of using stdatomic library and type supported by it:

$ diff non-synchronized.c synchronized.c 
3c3
< #include <stdint.h>
---
> #include <stdatomic.h>
6c6
< int32_t x = 0;
---
> atomic_int_least32_t x = 0;
$ gcc synchronized.c -lpthread -o synchronized
$ ./synchronized 
0
$ ./synchronized 
0
$ ./synchronized 
0
$ ./synchronized 
0
$ ./synchronized 
0

Now our results are surprisingly correct – with no locks, mutexes, etc. Let’s see what kind of magic is done behind it.

x86-64

The test in this part of the article series was performed on the native architecture of my laptop (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0). It produces the following assembly output for our program.

$ gcc synchronized.c -lpthread -S -o synchronized.s
$ cat synchronized.s
	.file	"synchronized.c"
	.text
	.globl	x
	.bss
	.align 4
	.type	x, @object
	.size	x, 4
x:
	.zero	4
	.section	.rodata
...
Thread:
.LFB6:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$48, %rsp
	movq	%rdi, -40(%rbp)
	movq	%fs:40, %rax
	movq	%rax, -8(%rbp)
	xorl	%eax, %eax
	movl	$0, -12(%rbp)
	jmp	.L5
.L8:
	cmpq	$0, -40(%rbp)
	je	.L6
	movl	$1, -28(%rbp)
	movl	-28(%rbp), %eax
	lock xaddl	%eax, x(%rip)
	movl	%eax, -24(%rbp)
	jmp	.L7
.L6:
	movl	$1, -20(%rbp)
	movl	-20(%rbp), %eax
	negl	%eax
	lock xaddl	%eax, x(%rip)
	movl	%eax, -16(%rbp)
.L7:
	addl	$1, -12(%rbp)
.L5:
	cmpl	$999999, -12(%rbp)
	jle	.L8
	movl	$0, %eax
	movq	-8(%rbp), %rdx
	xorq	%fs:40, %rdx
	je	.L10
	call	__stack_chk_fail@PLT
.L10:
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
...

The most interesting is the implementation of Thread function. So I pasted only this one. It has a few parts:

.LFB6: – variables initialization, not interesting
.L5 and.L7 – this is our for loop. L7is ivariable incrementation.
.L8 and .L6 – is ifstatement and its content. .L8 after je instruction implements if (arg) branch. L6 implements else branch.

The most crucial are these two assembly snippets:

	movl	$1, -28(%rbp)         # move 32-bit value "1" to memory[%rbp-28]
	movl	-28(%rbp), %eax       # load this memory to %eax register
	lock xaddl	%eax, x(%rip) # this instruction does the job (fetch-and-add with lock prefix)

	movl	$1, -20(%rbp)         # as above memory[%rbp-20]
	movl	-20(%rbp), %eax       # as above
	negl	%eax                  # make negation of %eax to decrement variable i
	lock xaddl	%eax, x(%rip) # as above

So, the usage of stdatomic library forced the compiler to utilize lock xaddl instruction, which allows atomic incrementation. lock is x86 feature, that blocks cache line for whole instruction. Keep in mind, that this, like any kind of synchronization, has an impact on performance. Let’s see how code is built without stdatomic.h library. I will paste only the essential part:

.L8:
	cmpq	$0, -24(%rbp)
	je	.L6
	movl	x(%rip), %eax      # load
	addl	$1, %eax           # add
	movl	%eax, x(%rip)      # store
	jmp	.L7
.L6:
	movl	x(%rip), %eax      # load
	subl	$1, %eax           # subtract
	movl	%eax, x(%rip)      # store

As you can see. The code similar, but there is no lock xaddl instruction. Incrementation is done in three standard steps (load movl, increment addl, and store movl). Which are not atomic nor synchronized.

In the next chapter, I will focus on stdatomic with ARM architecture.

Kategoria: stdatomic