Month: December 2020

stdatomic.h under the hood #1

My favorite C11 feature is its support for atomic operations. It is especially useful on embedded systems, where using mutexes is very often overkill. In many circumstances, we can put away platform-dependent, error-prone, and inefficient synchronization APIs and simply access shared variables from different contexts. Of course, not every operation is possible, but if something is not possible, it is probably a sign of poor software design. Let’s see how it works on typical architectures.

I will run two threads using the native pthread library and show how the assembly is built to make the presented operations atomic. The example is a typical one, found in every book about multithreaded programming.

#include <stdatomic.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

int32_t x = 0;

void * Thread(void * arg);

int main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, &Thread, (void*)1);
	Thread((void*)0);
	pthread_join(t, NULL);
	printf("%d\n", x);
	return 0;
}

void * Thread(void * arg)
{
	for (int i = 0; i < 1000000; ++i)
	{
		if (arg)
			x++;
		else
			x--;
	}

	return NULL;
}

I think the code is self-explanatory. The only questionable thing is the dirty passing of a boolean value through the pointer argument. I know it’s not elegant, but it takes less coding. Someone ignorant of synchronization would expect the output to be 0. But of course, it is not:

$ gcc non-synchronized.c -lpthread -o non-synchronized
$ ./non-synchronized 
193678
$ ./non-synchronized 
-578251
$ ./non-synchronized 
-509428
$ ./non-synchronized 
-571838
$ ./non-synchronized 
163562

Let’s see how stdatomic handles this situation with my native compiler. I’ve made a similar version of the code; the only difference is that it uses the stdatomic library and a type it supports:

$ diff non-synchronized.c synchronized.c 
3c3
< #include <stdint.h>
---
> #include <stdatomic.h>
6c6
< int32_t x = 0;
---
> atomic_int_least32_t x = 0;
$ gcc synchronized.c -lpthread -o synchronized
$ ./synchronized 
0
$ ./synchronized 
0
$ ./synchronized 
0
$ ./synchronized 
0
$ ./synchronized 
0

Now our results are surprisingly correct – with no locks, mutexes, etc. Let’s see what kind of magic happens behind the scenes.
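
Before diving into the generated assembly, one side note: x++ and x-- on an atomic variable are just shorthand for the explicit stdatomic calls. A minimal sketch of the same Thread function written with the explicit API could look like this (this variant is my illustration and was not benchmarked here; only the parts that change are shown, the rest of the program stays the same):

#include <stdatomic.h>

atomic_int_least32_t x = 0;

void * Thread(void * arg)
{
	for (int i = 0; i < 1000000; ++i)
	{
		if (arg)
			atomic_fetch_add(&x, 1);	/* atomically adds 1, returns the previous value */
		else
			atomic_fetch_sub(&x, 1);	/* atomically subtracts 1, returns the previous value */
	}

	return NULL;
}

Both forms use sequentially consistent ordering by default; atomic_fetch_add_explicit() and friends let you request a weaker memory order when profiling justifies it.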

x86-64

The test in this part of the article series was performed on the native architecture of my laptop (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0). It produces the following assembly output for our program.

$ gcc synchronized.c -lpthread -S -o synchronized.s
$ cat synchronized.s
	.file	"synchronized.c"
	.text
	.globl	x
	.bss
	.align 4
	.type	x, @object
	.size	x, 4
x:
	.zero	4
	.section	.rodata
...
Thread:
.LFB6:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$48, %rsp
	movq	%rdi, -40(%rbp)
	movq	%fs:40, %rax
	movq	%rax, -8(%rbp)
	xorl	%eax, %eax
	movl	$0, -12(%rbp)
	jmp	.L5
.L8:
	cmpq	$0, -40(%rbp)
	je	.L6
	movl	$1, -28(%rbp)
	movl	-28(%rbp), %eax
	lock xaddl	%eax, x(%rip)
	movl	%eax, -24(%rbp)
	jmp	.L7
.L6:
	movl	$1, -20(%rbp)
	movl	-20(%rbp), %eax
	negl	%eax
	lock xaddl	%eax, x(%rip)
	movl	%eax, -16(%rbp)
.L7:
	addl	$1, -12(%rbp)
.L5:
	cmpl	$999999, -12(%rbp)
	jle	.L8
	movl	$0, %eax
	movq	-8(%rbp), %rdx
	xorq	%fs:40, %rdx
	je	.L10
	call	__stack_chk_fail@PLT
.L10:
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
...

The most interesting part is the implementation of the Thread function, so I pasted only that one. It has a few parts:

  • .LFB6: – function prologue and variable initialization, not interesting
  • .L5 and .L7 – this is our for loop; .L7 is the incrementation of the i variable.
  • .L8 and .L6 – the if statement and its branches. The code in .L8 after the je instruction implements the if (arg) branch; .L6 implements the else branch.

The most crucial parts are these two assembly snippets:

	movl	$1, -28(%rbp)         # move 32-bit value "1" to memory[%rbp-28]
	movl	-28(%rbp), %eax       # load this memory into the %eax register
	lock xaddl	%eax, x(%rip) # this instruction does the job (fetch-and-add with lock prefix)

	movl	$1, -20(%rbp)         # as above, but at memory[%rbp-20]
	movl	-20(%rbp), %eax       # as above
	negl	%eax                  # negate %eax so that the fetch-and-add decrements x
	lock xaddl	%eax, x(%rip) # as above

So, the usage of the stdatomic library forced the compiler to emit the lock xaddl instruction, which performs an atomic fetch-and-add. lock is an x86 prefix that locks the cache line for the duration of the instruction. Keep in mind that this, like any kind of synchronization, has an impact on performance. Let’s see how the code is built without the stdatomic.h library. I will paste only the essential part:

.L8:
	cmpq	$0, -24(%rbp)
	je	.L6
	movl	x(%rip), %eax      # load
	addl	$1, %eax           # add
	movl	%eax, x(%rip)      # store
	jmp	.L7
.L6:
	movl	x(%rip), %eax      # load
	subl	$1, %eax           # subtract
	movl	%eax, x(%rip)      # store

As you can see, the code is similar, but there is no lock xaddl instruction. The incrementation is done in three ordinary steps (load with movl, add with addl, and store with movl), which are neither atomic nor synchronized.

In the next chapter, I will focus on stdatomic on the ARM architecture.

Formatting hexdump

Today I found out how useful the hexdump utility (with custom formatting) is. It took me some time to figure out how it works. Most of the tutorials and the man page are obvious only once you already know how to do it :). I will try to describe how to use it in a really clear way.

Single format

Custom formatting is activated with the -e or --format option. Right after it, you must supply the so-called format string. The pattern looks like this:

-e 'N/M "XXXXXXX"'
  • N – iteration count. It says how many consecutive times this pattern (XXXXXXX) is applied. By default, its value is 1. This option becomes clearer in more complicated patterns.
  • M – byte count. It says how many bytes this pattern “consumes” in a single iteration. The default value depends on the next part (XXXXXXX – format).
  • XXXXXXX – format. This is the string telling how the M bytes are displayed in a single iteration. The manual page says that it is fprintf-based formatting with some exceptions.

In other words, the pattern XXXXXXX will format M bytes at a time, N times in a row. Take a look at this simple example:

$ hexdump -e '1/4 " %10d\n"' file.bin
         1
         0
        93
   3279800
    131072
  65536000
   6094848

'1/4 " %10d\n"' means, in other words: take 4 bytes once and format them as a signed decimal integer, right-aligned in a ten-character field, with a newline after each iteration. This pattern is applied until the end of the file. During your tests, you may see * signs in the output. They mean that the last printed line repeats many consecutive times, like in this example:

$ hexdump -e '1/4 "%10d\n"' /dev/zero
         0
*

To avoid this effect, you can turn off squeezing with the -v or --no-squeezing option. From this example alone, the meaning of the iteration count may still be unclear. It will become clear in the next one.

Multiple formats

More advanced formatting allows combining several patterns. It’s handy when you are working with custom binary file formats. To use it, just put several format strings one after another, like in this example:

$ hexdump -e '2/4 " %10u" 2/2 " %04X" 1/0 "\n"' 20201217_095026.bin.dat
         1         0 8000 0000
         2         0 8000 0002
         3         0 8000 0002
         4         0 8000 0002
         5         0 0039 0002
         6     25000 0039 0003
         7    282000 0039 0001
         8    320000 0039 0000
         9    377000 003A 0000
        10   4619000 003A 0004
        11  10463000 003B 0004
        12  12380000 003B 0000
        13  20092000 003B 0004
        14  20503000 003C 0004

This example is more complicated. Let me first tell you about the file format used. It is called COMTRADE and is used in the power industry. Each row of such a binary file contains:

| 32-bit counter | 32-bit relative timestamp in ns | repeated 16-bit values |. The number of values per row is stored in a configuration file. In this example, there are two 16-bit values in each row.
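
To make the row layout concrete, here is a minimal C sketch that writes one such record (this is only my illustration – the output file name and the sample values are made up, and a real COMTRADE recording also needs its accompanying configuration file):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t counter = 1;                    /* 32-bit sample counter */
	uint32_t timestamp_ns = 0;               /* 32-bit relative timestamp in ns */
	uint16_t values[2] = { 0x8000, 0x0000 }; /* two 16-bit channel values */

	FILE * f = fopen("example.bin.dat", "wb");
	if (!f)
		return 1;

	/* write field by field, so struct padding cannot sneak in */
	fwrite(&counter, sizeof(counter), 1, f);
	fwrite(&timestamp_ns, sizeof(timestamp_ns), 1, f);
	fwrite(values, sizeof(values[0]), 2, f);

	fclose(f);
	return 0;
}

Note that the values land in the file in the machine’s native byte order, which matters later on, because hexdump cannot swap endianness for you.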

To format this thing in a nice, readable way, we tell hexdump to:

  1. 2/4 " %10u" – take 4 bytes and format them with the %10u directive (as in the previous example, but as an unsigned decimal). Do it twice, so the counter and the relative timestamp are handled by one format string.
  2. 2/2 " %04X" – then take 2 bytes and format them as a hexadecimal number, four digits wide, with leading zeros (%04X). As before, do it twice, because we have two 16-bit values in each row.
  3. 1/0 "\n" – this one is new. You may have noticed that there was no newline character in the previous format strings – it would break our nice row-oriented layout. This directive should make clear the difference between the iteration count and the number of bytes consumed by a format: after formatting points 1 and 2, print a newline character once, consuming 0 bytes from the input.

Special conversion strings

Last but not least, there is an option to show some extra information, for example the current file offset. Let’s see how to do it:

$ hexdump -e '1/0 "[%04_ax]" 2/4 " %10u" 2/2 " %04X" 1/0 "\n"' 20201217_095026.bin.dat
[0000]          1          0 8000 0000
[000c]          2          0 8000 0002
[0018]          3          0 8000 0002
[0024]          4          0 8000 0002
[0030]          5          0 0039 0002
[003c]          6      25000 0039 0003
[0048]          7     282000 0039 0001
[0054]          8     320000 0039 0000
[0060]          9     377000 003A 0000
[006c]         10    4619000 003A 0004
[0078]         11   10463000 003B 0004
[0084]         12   12380000 003B 0000
[0090]         13   20092000 003B 0004
[009c]         14   20503000 003C 0004

The difference between this and the previous example is 1/0 "[%04_ax]". It contains the _a input offset conversion string. The x suffix means hexadecimal formatting with lowercase letters; you can change it to decimal (d) or octal (o) form. Like the newline character, it is printed once and consumes no bytes (0).

There are a few more types of conversion strings, like ones converting bytes to readable characters. For example, let’s say we want to display the 16-bit values as pairs of printable characters. It doesn’t really make sense, but let’s pretend it does :). To do this, we will use the _p conversion string, which outputs a character from the default character set, with non-printing characters displayed as a dot:

$ hexdump -e '1/0 "[%04_ax]" 2/4 " %10u" 4/1 " %_p" 1/0 "\n"' 20201217_095026.bin.dat
[0000]          1          0 . . . .
[000c]          2          0 . . . .
[0018]          3          0 . . . .
[0024]          4          0 . . . .
[0030]          5          0 9 . . .
[003c]          6      25000 9 . . .
[0048]          7     282000 9 . . .
[0054]          8     320000 9 . . .
[0060]          9     377000 : . . .
[006c]         10    4619000 : . . .
[0078]         11   10463000 ; . . .
[0084]         12   12380000 ; . . .
[0090]         13   20092000 ; . . .
[009c]         14   20503000 < . . .

This example differs from the previous one in one thing: 2/2 " %04X" changed to 4/1 " %_p". This conversion string says: take one byte, four times, and show each byte as a character from the default set, with non-printable characters displayed as dots. The output is based on the same file, so you can compare how the numbers map to ASCII characters.

Hexdump also has an option to color the output. I don’t think it’s really useful, so I will not write about it. To use it, just add the _L suffix with color specifiers between square brackets.

Summary

I felt disappointed at some points while working on this post. I haven’t found an option to parse the same bytes twice – like in the canonical form, where each line shows the hex output followed by its ASCII representation. Maybe there is such an option, but I failed to find it.

Another thing is that hexdump has no option to change the endianness of the input data. If you parse 2-, 4-, or 8-byte values, you are stuck with the platform’s endianness.

The next missing option is support for dynamic file formats, for example parsing M bytes with M taken from the file content itself. I know it’s the 21st century and we have Python for such things, but hexdump is a utility that can be found on every Linux distribution, especially on embedded systems. It would be nice to be able to avoid installing a Python runtime.

Nevertheless, hexdump is a nice tool for working with static binary formats. I hope this article helped you understand how to use it with custom formatting.

Favourite strace options

I will try to avoid writing yet another article about the Linux developer/admin Swiss army knife, which strace obviously is. Instead, I would like to focus on its most useful options:

Follow forks -f

If you are trying to debug a multithreaded application, don’t forget about the -f option. With it, strace will track all newly created threads and child processes.

This is really useful for checking what’s going on after any sort of daemon, like sshd or getty, accepts a new client.

For example, I want to catch any file operations made by a user newly logged in over SSH:

# pgrep sshd # (1) returns sshd PID -> 256
256
# strace -p 256 -e trace=file -f
 ... a lot of spam (checking user authenticity in files like /etc/passwd...
[pid  9883] open("/root/.ash_history", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0600) = 3
[pid  9883] stat64("/bin/cat", {st_mode=S_IFREG|S_ISUID|0755, st_size=655180, ...}) = 0
Process 9894 attached
[pid  9894] execve("/bin/cat", ["cat", "/etc/network/interfaces"], [/* 20 vars */]) = 0
...ld and dynamic linking stuff...
[pid  9894] open("/etc/network/interfaces", O_RDONLY|O_LARGEFILE) = 3
[pid  9894] +++ exited with 0 +++
[pid  9883] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=9894, si_uid=0, si_status=0, si_utime=1, si_stime=3} ---
# # (4)
  1. If you don’t have pgrep, just use top or htop. pgrep might return several PIDs: the main sshd process and its children (existing sessions). The main process will probably be the first one.
  2. -e trace=file is my next favourite option. It tracks all syscalls related to file manipulation and reading.
  3. This guy executed cat /etc/network/interfaces.

Filtering out syscalls

strace has great filtering options. Embedded builds of strace quite often have some limitations here, but it’s still quite useful:

  • Track file-related syscalls: -e trace=file
  • Trace one specific syscall: -e trace=mmap2.
  • Track network stuff, but not sendto/recvfrom: -e trace=network -e trace=\!sendto -e trace=\!recvfrom – some shells need a backslash before the exclamation mark.
  • As a last resort, if those filters don’t fit, you can use grep. This example shows syscalls that fail with errno EAGAIN: strace -p XXX 2>&1 | grep EAGAIN – remember that strace writes its output to stderr, so to pipe it together with stdout you need to add 2>&1. grep is much more flexible, but there is a big efficiency cost when strace’s built-in filters are not used: each process usually makes a huge number of syscalls, and all of them go through the pipe.

Performance analysis

A really nice feature is measuring the time spent in each syscall. You can do it with the -T option and check how many seconds each call took. Remember that if your process is preempted during a syscall, the time of other jobs is accumulated into the measurement. That’s because the time is measured between entering the syscall and returning its exit code.

It is usually useful to add the -tt option, which prefixes every record with a microsecond-resolution timestamp. It lets you correlate syscalls with other logs.

Worth noting here is the possibility of tracking several processes at once. You can use the -p option multiple times and give several PIDs. I’ve used this recently to track how threads communicating through IPC preempt each other.
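
Putting these options together, a typical invocation could look something like this (the PIDs below are just placeholders for the processes you want to follow):

# strace -f -T -tt -p 1234 -p 1235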

Don’t forget…

…that every debugging tool has an impact on the system you are observing. In this case, every traced syscall generates a lot of console output, which slows down execution.

strace implementations differ. Especially on embedded systems, not all options are available.

Obviously, you need root privileges (or at least ptrace permission) to attach strace to processes that are not your own.