stdatomic.h under the hood #1

Dec 18, 2020 #c, #c11, #lock, #lock-free, #mutex, #non-blocking, #programming, #stdatomic, #synchronization, #xaddl by Rafał

My favorite C11 feature is its support for atomic operations. It is especially useful on embedded systems, where using a mutex is very often overkill. In many circumstances we can now put away platform-dependent, error-prone, and inefficient synchronization APIs and simply use a variable from different contexts. Of course, not every operation can be made atomic this way. But if something is not possible, it is probably caused by poor software design. Let’s see how it works on typical architectures.

I will run two threads using the native pthread library and show how the generated assembly makes the presented operations atomic. The example is a typical one, found in nearly every book on multithreaded programming.

#include <stdatomic.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

int32_t x = 0;

void * Thread(void * arg);

int main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, &Thread, (void*)1);
	Thread((void*)0);
	pthread_join(t, NULL);
	printf("%d\n", x);
	return 0;
}

void * Thread(void * arg)
{
	for (int i = 0; i < 1000000; ++i)
	{
		if (arg)
			x++;
		else
			x--;
	}

	return NULL;
}

I think the code is self-explanatory. The only oddity is the dirty trick of passing a boolean value through the pointer argument. I know it’s not elegant, but it takes less code. Someone unaware of synchronization would expect the output to be 0. But of course, it is not:

$ gcc non-synchronized.c -lpthread -o non-synchronized
$ ./non-synchronized 
193678
$ ./non-synchronized 
-578251
$ ./non-synchronized 
-509428
$ ./non-synchronized 
-571838
$ ./non-synchronized 
163562

Let’s see how stdatomic handles this situation with my native compiler. I made similar code, the only difference being the use of the stdatomic.h header and an atomic type it provides:

$ diff non-synchronized.c synchronized.c 
3c3
< #include <stdint.h>
---
> #include <stdatomic.h>
6c6
< int32_t x = 0;
---
> atomic_int_least32_t x = 0;
$ gcc synchronized.c -lpthread -o synchronized
$ ./synchronized 
0
$ ./synchronized 
0
$ ./synchronized 
0
$ ./synchronized 
0
$ ./synchronized 
0

Now our results are surprisingly correct – with no locks, mutexes, etc. Let’s see what kind of magic is done behind it.

x86-64

The test in this part of the article series was performed on the native architecture of my laptop (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0). It produces the following assembly output for our program.

$ gcc synchronized.c -lpthread -S -o synchronized.s
$ cat synchronized.s
	.file	"synchronized.c"
	.text
	.globl	x
	.bss
	.align 4
	.type	x, @object
	.size	x, 4
x:
	.zero	4
	.section	.rodata
...
Thread:
.LFB6:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$48, %rsp
	movq	%rdi, -40(%rbp)
	movq	%fs:40, %rax
	movq	%rax, -8(%rbp)
	xorl	%eax, %eax
	movl	$0, -12(%rbp)
	jmp	.L5
.L8:
	cmpq	$0, -40(%rbp)
	je	.L6
	movl	$1, -28(%rbp)
	movl	-28(%rbp), %eax
	lock xaddl	%eax, x(%rip)
	movl	%eax, -24(%rbp)
	jmp	.L7
.L6:
	movl	$1, -20(%rbp)
	movl	-20(%rbp), %eax
	negl	%eax
	lock xaddl	%eax, x(%rip)
	movl	%eax, -16(%rbp)
.L7:
	addl	$1, -12(%rbp)
.L5:
	cmpl	$999999, -12(%rbp)
	jle	.L8
	movl	$0, %eax
	movq	-8(%rbp), %rdx
	xorq	%fs:40, %rdx
	je	.L10
	call	__stack_chk_fail@PLT
.L10:
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
...

The most interesting part is the implementation of the Thread function, so I pasted only that. It has a few parts:

.LFB6: – variable initialization, not interesting
.L5 and .L7 – this is our for loop. .L7 increments the variable i.
.L8 and .L6 – this is the if statement and its content. The code in .L8 after the je instruction implements the if (arg) branch. .L6 implements the else branch.

The most crucial are these two assembly snippets:

	movl	$1, -28(%rbp)         # move 32-bit value "1" to memory[%rbp-28]
	movl	-28(%rbp), %eax       # load this memory to %eax register
	lock xaddl	%eax, x(%rip) # this instruction does the job (fetch-and-add with lock prefix)

	movl	$1, -20(%rbp)         # as above memory[%rbp-20]
	movl	-20(%rbp), %eax       # as above
	negl	%eax                  # negate %eax (1 becomes -1) to decrement variable x
	lock xaddl	%eax, x(%rip) # as above

So, using the atomic type from stdatomic.h forced the compiler to emit the lock xaddl instruction, which performs the addition atomically. lock is an x86 instruction prefix that makes the whole read-modify-write of the memory operand atomic; on current processors this is typically done by holding the cache line exclusively for the duration of the instruction. Keep in mind that this, like any kind of synchronization, has an impact on performance. Let’s see how the code is built without stdatomic.h. I will paste only the essential part:

.L8:
	cmpq	$0, -24(%rbp)
	je	.L6
	movl	x(%rip), %eax      # load
	addl	$1, %eax           # add
	movl	%eax, x(%rip)      # store
	jmp	.L7
.L6:
	movl	x(%rip), %eax      # load
	subl	$1, %eax           # subtract
	movl	%eax, x(%rip)      # store

As you can see, the code is similar, but there is no lock xaddl instruction. The increment is done in three ordinary steps (load with movl, add with addl, and store with movl), which together are neither atomic nor synchronized.

In the next part, I will focus on stdatomic on the ARM architecture.
