stdatomic.h under the hood #1
My favorite C11 feature is its support for atomic operations. It is especially useful on embedded systems, where mutexes are very often overkill. In many circumstances we can now put away platform-dependent, error-prone, and inefficient synchronization APIs and simply share variables between different contexts. Of course, not every operation is possible. But if something is not possible, it is probably caused by poor software design. Let’s see how it works on typical architectures.
I will run two threads using the native pthread library and show how the assembly is built to make the presented operations atomic. The example is a typical one, found in every book about multithreaded programming.
#include <stdatomic.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
int32_t x = 0;
void * Thread(void * arg);
int main(void)
{
pthread_t t;
pthread_create(&t, NULL, &Thread, (void*)1);
Thread((void*)0);
pthread_join(t, NULL);
printf("%d\n", x);
return 0;
}
void * Thread(void * arg)
{
for (int i = 0; i < 1000000; ++i)
{
if (arg)
x++;
else
x--;
}
return NULL;
}
I think the code is self-explanatory. The only oddity is the dirty trick of passing a boolean value through the pointer argument. I know it’s not elegant, but it takes less code. Someone unaware of synchronization would expect the output to be 0. But of course, it is not:
$ gcc non-synchronized.c -lpthread -o non-synchronized
$ ./non-synchronized
193678
$ ./non-synchronized
-578251
$ ./non-synchronized
-509428
$ ./non-synchronized
-571838
$ ./non-synchronized
163562
Let’s see how stdatomic handles this situation on my native compiler. I’ve written similar code; the only difference is that it uses the stdatomic library and a type it provides:
$ diff non-synchronized.c synchronized.c
3c3
< #include <stdint.h>
---
> #include <stdatomic.h>
6c6
< int32_t x = 0;
---
> atomic_int_least32_t x = 0;
$ gcc synchronized.c -lpthread -o synchronized
$ ./synchronized
0
$ ./synchronized
0
$ ./synchronized
0
$ ./synchronized
0
$ ./synchronized
0
Now our results are surprisingly correct – with no locks, mutexes, etc. Let’s see what kind of magic is done behind it.
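On an atomic object, x++ and x-- are atomic read-modify-write operations, equivalent to calling atomic_fetch_add and atomic_fetch_sub with the default (sequentially consistent) ordering. A minimal sketch that spells this out; the names counter, worker and run_counter_demo are made up for illustration and do not appear in the program above:

```c
#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define ITERATIONS 1000000

/* Hypothetical name; plays the role of x in the article's program. */
atomic_int_least32_t counter;

/* Increments the counter when arg is non-NULL, decrements it otherwise. */
void *worker(void *arg)
{
    for (int i = 0; i < ITERATIONS; ++i) {
        if (arg)
            atomic_fetch_add(&counter, 1);   /* same effect as counter++ */
        else
            atomic_fetch_sub(&counter, 1);   /* same effect as counter-- */
    }
    return NULL;
}

/* Runs one incrementing and one decrementing thread and returns the
 * final counter value, or INT_MIN if the thread could not be created. */
int run_counter_demo(void)
{
    pthread_t t;

    atomic_store(&counter, 0);
    if (pthread_create(&t, NULL, worker, (void *)1) != 0)
        return INT_MIN;
    worker(NULL);
    pthread_join(t, NULL);
    return (int)atomic_load(&counter);
}
```

Calling run_counter_demo() from main and printing the result should give 0 on every run, just like the x++ / x-- version (build with -lpthread as before).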
x86-64
The test in this part of the article series was performed on the native architecture of my laptop (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0). It produces the following assembly output for our program.
$ gcc synchronized.c -lpthread -S -o synchronized.s
$ cat synchronized.s
.file "synchronized.c"
.text
.globl x
.bss
.align 4
.type x, @object
.size x, 4
x:
.zero 4
.section .rodata
...
Thread:
.LFB6:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $48, %rsp
movq %rdi, -40(%rbp)
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $0, -12(%rbp)
jmp .L5
.L8:
cmpq $0, -40(%rbp)
je .L6
movl $1, -28(%rbp)
movl -28(%rbp), %eax
lock xaddl %eax, x(%rip)
movl %eax, -24(%rbp)
jmp .L7
.L6:
movl $1, -20(%rbp)
movl -20(%rbp), %eax
negl %eax
lock xaddl %eax, x(%rip)
movl %eax, -16(%rbp)
.L7:
addl $1, -12(%rbp)
.L5:
cmpl $999999, -12(%rbp)
jle .L8
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L10
call __stack_chk_fail@PLT
.L10:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
...
The most interesting part is the implementation of the Thread function, so I pasted only that one. It has a few parts:
.LFB6 – variable initialization, not interesting.
.L5 and .L7 – this is our for loop. .L7 is the increment of the i variable.
.L8 and .L6 – this is the if statement and its content. The code in .L8 after the je instruction implements the if (arg) branch. .L6 implements the else branch.
The most crucial are these two assembly snippets:
movl $1, -28(%rbp) # move 32-bit value "1" to memory[%rbp-28]
movl -28(%rbp), %eax # load this memory to %eax register
lock xaddl %eax, x(%rip) # this instruction does the job (fetch-and-add with lock prefix)
movl $1, -20(%rbp) # as above memory[%rbp-20]
movl -20(%rbp), %eax # as above
negl %eax # negate %eax (1 becomes -1) so that the add decrements variable x
lock xaddl %eax, x(%rip) # as above
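xadd is an exchange-and-add: it adds the register to the memory operand and leaves the previous memory value in the register, which is why the compiler stores %eax back to the stack afterwards. At the C level this corresponds to atomic_fetch_add, which returns the value the object held before the addition; the decrement simply adds -1. A small sketch of that behavior, with a hypothetical helper name:

```c
#include <stdatomic.h>

/* Hypothetical helper: atomically adds delta to *obj, then reports the
 * value seen before the addition and the value read back afterwards. */
void fetch_add_demo(atomic_int_least32_t *obj, int delta,
                    int *old_value, int *new_value)
{
    *old_value = (int)atomic_fetch_add(obj, delta); /* returns the old value */
    *new_value = (int)atomic_load(obj);             /* old value + delta */
}
```

With a single thread and an object holding 5, a delta of 1 reports old 5 and new 6; a following delta of -1 reports old 6 and new 5.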
So, using the stdatomic library made the compiler emit the lock xaddl instruction, which performs the increment atomically. lock is an x86 instruction prefix that makes the processor hold the affected memory location (typically its cache line) exclusively for the whole read-modify-write instruction. Keep in mind that this, like any kind of synchronization, has an impact on performance. Let’s see how the code is built without the stdatomic.h library. I will paste only the essential part:
.L8:
cmpq $0, -24(%rbp)
je .L6
movl x(%rip), %eax # load
addl $1, %eax # add
movl %eax, x(%rip) # store
jmp .L7
.L6:
movl x(%rip), %eax # load
subl $1, %eax # subtract
movl %eax, x(%rip) # store
As you can see, the code is similar, but there is no lock xaddl instruction. The increment is done in three ordinary steps (load with movl, add with addl, and store with movl), which together are neither atomic nor synchronized.
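The failure mode is a lost update: both threads load the same value, and whichever stores last overwrites the other’s result. The single-threaded sketch below replays one such unlucky interleaving by hand. The function name is hypothetical, and the two local variables merely stand in for each thread’s copy of %eax:

```c
#include <stdint.h>

/* Replays one interleaving of x++ (thread A) and x-- (thread B), each
 * split into load / modify / store. Returns the final value of x. */
int32_t replay_lost_update(int32_t x)
{
    int32_t eax_a, eax_b;

    eax_a = x;     /* A: movl x(%rip), %eax */
    eax_b = x;     /* B: movl x(%rip), %eax   (sees the same old value) */
    eax_a += 1;    /* A: addl $1, %eax */
    eax_b -= 1;    /* B: subl $1, %eax */
    x = eax_a;     /* A: movl %eax, x(%rip) */
    x = eax_b;     /* B: movl %eax, x(%rip)   (overwrites A's store) */
    return x;
}
```

Starting from 0, this returns -1 instead of 0: the increment is lost. With lock xaddl the load, add, and store cannot be separated, so no other thread can slip in between them.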
In the next part, I will focus on stdatomic on the ARM architecture.