Tag: #arm

Following the kernel #4 – MLO part 1

We finished the last part by jumping into the reset vector of the MLO code. If you haven’t done it yet, I encourage you to clone the U-Boot repository from GitHub: https://github.com/u-boot/u-boot. I’m working on the newest master commit, a6ba59583abd4085db5ab00358d751f175e2a451. As I wrote before, U-Boot is configured via Kbuild with the default am335x configuration.

The reset vectors of the MLO are defined in a place common to U-Boot proper and the MLO (TPL and SPL) – the arch/arm/lib/vectors.S file. If you were careful enough, your debugger should have stopped on line 87:

/*
 *************************************************************************
 *
 * Exception vectors as described in ARM reference manuals
 *
 * Uses indirect branch to allow reaching handlers anywhere in memory.
 *
 *************************************************************************
 */

_start:
#ifdef CONFIG_SYS_DV_NOR_BOOT_CFG
	.word	CONFIG_SYS_DV_NOR_BOOT_CFG
#endif
	ARM_VECTORS // <--- right here
#endif /* !defined(CONFIG_ENABLE_ARM_SOC_BOOT0_HOOK) */

ARM_VECTORS is a macro defined a few lines above (lines 21–34).

/*
 * A macro to allow insertion of an ARM exception vector either
 * for the non-boot0 case or by a boot0-header.
 */
        .macro ARM_VECTORS
#ifdef CONFIG_ARCH_K3
	ldr     pc, _reset
#else
	b	reset
#endif
	ldr	pc, _undefined_instruction
	ldr	pc, _software_interrupt
	ldr	pc, _prefetch_abort
	ldr	pc, _data_abort
	ldr	pc, _not_used
	ldr	pc, _irq
	ldr	pc, _fiq
	.endm

Our architecture is OMAP2+ (CONFIG_ARCH_OMAP2PLUS – check it with make menuconfig for practice), so our build takes the b reset branch, and this is the instruction placed under our first breakpoint. We are jumping into the reset routine, which is defined in the arch/arm/cpu/armv7/start.S file and starts at line 38.

reset:
	/* Allow the board to save important registers */
	b	save_boot_params
save_boot_params_ret:
#ifdef CONFIG_ARMV7_LPAE
/*
 * check for Hypervisor support
 */
	mrc	p15, 0, r0, c0, c1, 1		@ read ID_PFR1
	and	r0, r0, #CPUID_ARM_VIRT_MASK	@ mask virtualization bits
	cmp	r0, #(1 << CPUID_ARM_VIRT_SHIFT)
	beq	switch_to_hypervisor
switch_to_hypervisor_ret:
#endif
	/*
	 * disable interrupts (FIQ and IRQ), also set the cpu to SVC32 mode,
	 * except if in HYP mode already
	 */
	mrs	r0, cpsr
	and	r1, r0, #0x1f		@ mask mode bits
	teq	r1, #0x1a		@ test for HYP mode
	bicne	r0, r0, #0x1f		@ clear all mode bits
	orrne	r0, r0, #0x13		@ set SVC mode
	orr	r0, r0, #0xc0		@ disable FIQ and IRQ
	msr	cpsr,r0

/*
 * Setup vector:
 * (OMAP4 spl TEXT_BASE is not 32 byte aligned.
 * Continue to use ROM code vector only in OMAP4 spl)
 */
#if !(defined(CONFIG_OMAP44XX) && defined(CONFIG_SPL_BUILD))
	/* Set V=0 in CP15 SCTLR register - for VBAR to point to vector */
	mrc	p15, 0, r0, c1, c0, 0	@ Read CP15 SCTLR Register
	bic	r0, #CR_V		@ V = 0
	mcr	p15, 0, r0, c1, c0, 0	@ Write CP15 SCTLR Register

#ifdef CONFIG_HAS_VBAR
	/* Set vector address in CP15 VBAR register */
	ldr	r0, =_start
	mcr	p15, 0, r0, c12, c0, 0	@Set VBAR
#endif
#endif

	/* the mask ROM code should have PLL and others stable */
#ifndef CONFIG_SKIP_LOWLEVEL_INIT
#ifdef CONFIG_CPU_V7A
	bl	cpu_init_cp15
#endif
#ifndef CONFIG_SKIP_LOWLEVEL_INIT_ONLY
	bl	cpu_init_crit
#endif
#endif

	bl	_main

As we can see, here we have more code to focus on. This code prepares the CPU for execution and, at the last stage, calls the _main function. The first thing done is a jump into the save_boot_params routine, which is defined in arch/arm/mach-omap2/lowlevel_init.S.

ENTRY(save_boot_params)
	ldr	r1, =OMAP_SRAM_SCRATCH_BOOT_PARAMS
	str	r0, [r1]
	b	save_boot_params_ret
ENDPROC(save_boot_params)

Do you remember part 1 of this series? One of the last things done by the ROM code, just before running the MLO, is passing boot device information via a pointer placed in the r0 register. As we can see, this pointer is stored under the OMAP_SRAM_SCRATCH_BOOT_PARAMS address. r0 will surely be clobbered later, so saving it is the first thing done. After that, the flow goes back to the rest of reset.

CONFIG_ARMV7_LPAE is turned off, so we proceed to line 56, where the Current Program Status Register (CPSR) is set up. If the processor is in Hypervisor mode, the mode is left unchanged; otherwise Supervisor mode is set. After that, both FIQ and IRQ interrupts are masked in the same register, no matter which mode is running. It is worth noting that Supervisor mode (SVC) is the processor state appropriate for running kernel code. If you want to learn more about ARM processor modes, examine the ARMv7 Architecture Reference Manual.

Right after that, more sophisticated things are done: the MCR and MRC instructions are used. These instructions communicate with coprocessors attached to the silicon. There can be at most 16 coprocessors – p15 is the so-called control coprocessor (CP15), described here: https://developer.arm.com/documentation/ddi0360/e/control-coprocessor-cp15/about-control-coprocessor-cp15?lang=en. The instruction mnemonics it supports are extensively described here: https://developer.arm.com/documentation/ddi0360/e/control-coprocessor-cp15/summary-of-cp15-instructions. These pages document a different core than ours – some things might not be the same as in our chip – but the mechanisms are well explained.

As you could read in the links above, the first instruction, mrc p15, 0, r0, c1, c0, 0, loads the current value of the MMU-related Control Register into the local core register r0. Then the program clears the V-bit (with the bic instruction), which means that the exception vectors are kept in the 0x00000000 address range, and writes the value back to the Control Register with mcr.

The next instruction, mcr p15, 0, r0, c12, c0, 0, sets the Vector Base Address Register (VBAR) inside the control coprocessor to the address of _start. This symbol was defined in the first listing of this article. Now all exceptions caught by the processor should be handled by the vector table defined in the MLO (described in vectors.S).

Before we jump to the long-awaited C code, there are a couple of assembly routines which must be executed. The first one is cpu_init_cp15, defined in arch/arm/cpu/armv7/start.S, line 143. As you can see, there are a lot of MCR transfers; I will try to explain the general idea behind the whole procedure without analyzing every single mnemonic.

The name of the routine tells us that, once again, coprocessor number 15 will be initialized. As the comments say, lines 143–148 invalidate all data kept in the L1 cache, which after a restart may hold stale values. These are the TLBs (Translation Lookaside Buffers – records which speed up resolving virtual addresses into physical memory), the instruction caches, which buffer adjacent blocks of code to speed up instruction fetches, and lastly the BP (Branch Prediction) arrays. Branch prediction is another mechanism for speeding up pipelined processors, letting them fetch the right instruction even when it is not the next one in the assembly code. If you want to learn more in this area, I encourage you to read the wiki article https://en.wikipedia.org/wiki/Branch_predictor.

After invalidating the caches, a Data Synchronization Barrier is executed to explicitly wait for the end of all memory operations, followed by an ISB (Instruction Synchronization Barrier), described in the documentation as flushing the prefetch buffer. As the documentation says:

The Flush Prefetch Buffer instruction flushes the pipeline in the processor, so that all instructions following the pipeline flush are fetched from memory, (including the instruction cache), after the instruction has been completed. Combined with Data Synchronization Barrier, and potentially a memory barrier, this ensures that any instructions written by the processor are executed. This guarantee is required as part of the mechanism for handling self-modifying code. The execution of a Data Synchronization Barrier instruction and the invalidation of the Instruction Cache and Branch Target Cache are also required for the handling of self-modifying code. The Flush Prefetch Buffer is guaranteed to perform this function, while alternative methods of performing the same task, such as a branch instruction, can be optimized in the hardware to avoid the pipeline flush (for example, by using a branch predictor).

The next bunch of assembly is commented as disabling MMU stuff and caches. Once again a c1, c0 operation (Read Control Register) is made; then the V-bit (once again) and the C, A, and M bits are cleared, and the A (bit 1) and Z (bit 11) bits are set. The V-bit was described previously; I don’t know why clearing it is repeated. Clearing the C, A, and M bits (https://developer.arm.com/documentation/ddi0360/e/control-coprocessor-cp15/register-descriptions/c1–control-register?lang=en) disables the data cache, alignment fault checking, and the MMU. According to the hash-defines, we can turn the instruction cache on or off in this place. In our configuration it is turned on, so the bit is set with an orr instruction.

After that, a bunch of errata handling is done. Fortunately, none of these are applied in our build besides CONFIG_ARM_CORTEX_A8_CVE_2017_5715. But before that, a few more interesting operations are done:

	mov	r5, lr			@ Store my Caller
	mrc	p15, 0, r1, c0, c0, 0	@ r1 has Read Main ID Register (MIDR)
	mov	r3, r1, lsr #20		@ get variant field
	and	r3, r3, #0xf		@ r3 has CPU variant
	and	r4, r1, #0xf		@ r4 has CPU revision
	mov	r2, r3, lsl #4		@ shift variant field for combined value
	orr	r2, r4, r2		@ r2 has combined CPU variant + revision

/* Early stack for ERRATA that needs into call C code */
#if defined(CONFIG_SPL_BUILD) && defined(CONFIG_SPL_STACK)
	ldr	r0, =(CONFIG_SPL_STACK)
#else
	ldr	r0, =(CONFIG_SYS_INIT_SP_ADDR)
#endif
	bic	r0, r0, #7	/* 8-byte alignment for ABI compliance */
	mov	sp, r0

Once again CP15 is used, this time to read our silicon ID. The CPU variant and revision are combined and stored in the r2 register. After that, according to the configuration, the stack pointer address is chosen – in our case it is 0x4030ff10 (the value of CONFIG_SYS_INIT_SP_ADDR). Check this address in part #1 of my series: you will find out that it is one of the highest accessible addresses in the internal SRAM. The address is aligned to 8 bytes by clearing the 3 least significant bits. Finally, the stack pointer is initialized with a mov opcode. Wohoo! That is the last thing done by the cpu_init_cp15 routine, so we go back to the reset routine and jump to another place, called cpu_init_crit.

ENTRY(cpu_init_crit)
	/*
	 * Jump to board specific initialization...
	 * The Mask ROM will have already initialized
	 * basic memory. Go here to bump up clock rate and handle
	 * wake up conditions.
	 */
	b	lowlevel_init		@ go setup pll,mux,memory
ENDPROC(cpu_init_crit)

This is the place where more board-specific initialization might be done. Our lowlevel_init procedure is implemented in arch/arm/cpu/armv7/lowlevel_init.S:

WEAK(s_init)
	bx	lr
ENDPROC(s_init)
.popsection

.pushsection .text.lowlevel_init, "ax"
WEAK(lowlevel_init)
	/*
	 * Setup a temporary stack. Global data is not available yet.
	 */
#if defined(CONFIG_SPL_BUILD) && defined(CONFIG_SPL_STACK)
	ldr	sp, =CONFIG_SPL_STACK
#else
	ldr	sp, =CONFIG_SYS_INIT_SP_ADDR
#endif
	bic	sp, sp, #7 /* 8-byte alignment for ABI compliance */
#ifdef CONFIG_SPL_DM
	mov	r9, #0
#else
	/*
	 * Set up global data for boards that still need it. This will be
	 * removed soon.
	 */
#ifdef CONFIG_SPL_BUILD
	ldr	r9, =gdata
#else
	sub	sp, sp, #GD_SIZE
	bic	sp, sp, #7
	mov	r9, sp
#endif
#endif
	/*
	 * Save the old lr(passed in ip) and the current lr to stack
	 */
	push	{ip, lr}

	/*
	 * Call the very early init function. This should do only the
	 * absolute bare minimum to get started. It should not:
	 *
	 * - set up DRAM
	 * - use global_data
	 * - clear BSS
	 * - try to start a console
	 *
	 * For boards with SPL this should be empty since SPL can do all of
	 * this init in the SPL board_init_f() function which is called
	 * immediately after this.
	 */
	bl	s_init
	pop	{ip, pc}
ENDPROC(lowlevel_init)

The arch/arm/cpu/armv7 directory contains a generic implementation of the low-level init. All routines there are defined as so-called weak symbols: if the linker finds another symbol with the same name, it leaves the weak one unused and chooses the other. Basically, nothing new is implemented here. Once again the stack pointer is initialized. CONFIG_SPL_DM is defined (Driver Model for SPL), so r9 is zeroed.

Our program stack is ready, so we can use it with the push instruction. The old link register (passed in ip) and the current link register are saved on the stack; after that, a branch that copies the return address into the link register is executed. We jump to s_init, but as we can see in the listing, it is empty.

Finally, after returning to the reset routine, we jump to the _main routine, which is really, really close to the C world. It is defined in arch/arm/lib/crt0.S. As we can see, it is the code initializing the C environment (clearing BSS, preparing the heap, etc.). We will cover it in the next part of the series.

stdatomic.h under the hood #2

In today’s part of the series, I will find out how the code from #1 (http://olejniczak.ovh/index.php/2020/12/18/stdatomic-h-under-the-hood-1/) compiles on the ARM architecture. I don’t want to pad this text needlessly, so if you want to examine the code itself, check part #1 of this series.

ARM

Below you can find details of tested architecture and compiler:

$ arm-linux-gnueabihf-gcc -v
Using built-in specs.
...
Target: arm-linux-gnueabihf
... --disable-multilib --enable-multiarch --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard ... --enable-threads=posix --disable-libstdcxx-pch --enable-linker-build-id --enable-plugin --enable-gold --enable-c99 --enable-long-long --with-mode=thumb --disable-multilib --with-float=hard
Thread model: posix
gcc version 4.9.2 20140904 (prerelease) (crosstool-NG linaro-1.13.1-4.9-2014.09 - Linaro GCC 4.9-2014.09)

Let’s find out how the compiler works on this machine:

# ./non-synchronized-arm 
-22301
# ./non-synchronized-arm 
96532
# ./non-synchronized-arm 
225150
# ./non-synchronized-arm 
-66366
# ./non-synchronized-arm 
-120416
# ./non-synchronized-arm 
4340
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0

The difference made by using stdatomic is obvious, so let’s check what the ARM compiler does to make it happen. The first assembly listing is generated from the non-synchronized.c file:

Thread:
	@ args = 0, pretend = 0, frame = 16
	@ frame_needed = 1, uses_anonymous_args = 0
	@ link register save eliminated.
	str	fp, [sp, #-4]!
	add	fp, sp, #0
	sub	sp, sp, #20
	str	r0, [fp, #-16]
	mov	r3, #0
	str	r3, [fp, #-8]
	b	.L4
.L7:
	ldr	r3, [fp, #-16]
	cmp	r3, #0
	beq	.L5
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	ldr	r3, [r3]
	add	r2, r3, #1
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	str	r2, [r3]
	b	.L6
.L5:
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	ldr	r3, [r3]
	sub	r2, r3, #1
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	str	r2, [r3]
.L6:
	ldr	r3, [fp, #-8]
	add	r3, r3, #1
	str	r3, [fp, #-8]
.L4:
	ldr	r2, [fp, #-8]
	movw	r3, #16959
	movt	r3, 15
	cmp	r2, r3
	ble	.L7
	mov	r3, #0
	mov	r0, r3
	sub	sp, fp, #0
	@ sp needed
	ldr	fp, [sp], #4
	bx	lr
	.size	Thread, .-Thread


The code is quite similar to the x86 version. The most crucial parts are .L7, which implements the addition, and .L5, which implements the subtraction. Let’s see what they are doing:

	movw	r3, #:lower16:x /* Load less significant 16 bit of x address */
	movt	r3, #:upper16:x /* ... and more significant part */
	ldr	r3, [r3]        /* Load value of x to r3 register */
	add	r2, r3, #1      /* Add 1 to x (sub in L5) */
	movw	r3, #:lower16:x /* Put address to register (like previously)  */
	movt	r3, #:upper16:x
	str	r2, [r3]        /* Store result */

We can see that this code is longer than its x86 counterpart. Now let’s see what the arm-linux-gnueabihf stdatomic implementation looks like:

Thread:
	@ args = 0, pretend = 0, frame = 40
	@ frame_needed = 1, uses_anonymous_args = 0
	stmfd	sp!, {fp, lr}
	add	fp, sp, #4
	sub	sp, sp, #40
	str	r0, [fp, #-40]
	mov	r3, #0
	str	r3, [fp, #-8]
	b	.L4
.L11:
	ldr	r3, [fp, #-40]
	cmp	r3, #0
	beq	.L5
	mov	r3, #1
	str	r3, [fp, #-32]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-28]
.L8:
	ldr	r2, [fp, #-28]
	ldr	r3, [fp, #-32]
	add	r3, r2, r3
	str	r3, [fp, #-24]
	ldr	r3, [fp, #-24]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #28
	ldr	r0, [r2]
	dmb	sy
.L13:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L14
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L13
.L14:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L6
	str	r1, [r2]
.L6:
	cmp	r3, #0
	bne	.L7
	b	.L8
.L5:
	mov	r3, #1
	str	r3, [fp, #-20]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-16]
.L10:
	ldr	r2, [fp, #-16]
	ldr	r3, [fp, #-20]
	rsb	r3, r3, r2
	str	r3, [fp, #-12]
	ldr	r3, [fp, #-12]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #16
	ldr	r0, [r2]
	dmb	sy
.L15:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L16
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L15
.L16:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L9
	str	r1, [r2]
.L9:
	cmp	r3, #0
	bne	.L7
	b	.L10
.L7:
	ldr	r3, [fp, #-8]
	add	r3, r3, #1
	str	r3, [fp, #-8]
.L4:
	ldr	r2, [fp, #-8]
	movw	r3, #16959
	movt	r3, 15
	cmp	r2, r3
	ble	.L11
	mov	r3, #0
	mov	r0, r3
	sub	sp, fp, #4
	@ sp needed
	ldmfd	sp!, {fp, pc}
	.size	Thread, .-Thread

This time the synchronized implementation is much more complex, but that’s not a problem for us :). After a short initialization, the code jumps to the .L4 block, where the for loop is implemented. If the exit condition is false, the code jumps to .L11. .L11 in turn starts by examining the arg parameter against NULL (check the code in part #1). If it is not NULL, the .L11–.L6 code performs the addition; otherwise, a jump to .L5 is made. If you compare the code in .L11–.L6 and .L5–.L9, you can see that they are almost the same:

	mov	r3, #1
	str	r3, [fp, #-32]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-28]
.L8:
	ldr	r2, [fp, #-28]
	ldr	r3, [fp, #-32]
	add	r3, r2, r3
	str	r3, [fp, #-24]
	ldr	r3, [fp, #-24]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #28
	ldr	r0, [r2]
	dmb	sy
.L13:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L14
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L13
.L14:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L6
	str	r1, [r2]
.L6:
	cmp	r3, #0
	bne	.L7
	b	.L8
	mov	r3, #1
	str	r3, [fp, #-20]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-16]
.L10:
	ldr	r2, [fp, #-16]
	ldr	r3, [fp, #-20]
	rsb	r3, r3, r2
	str	r3, [fp, #-12]
	ldr	r3, [fp, #-12]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #16
	ldr	r0, [r2]
	dmb	sy
.L15:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L16
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L15
.L16:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L9
	str	r1, [r2]
.L9:
	cmp	r3, #0
	bne	.L7
	b	.L10

Each branch starts by saving the literal 1 to a local variable on the stack (frame pointer −32 for the addition and frame pointer −20 for the subtraction). Right after that, the r3 register is filled with the address of the x variable. Then we see something new – the dmb sy instruction, which was not used in the non-synchronized code. According to the ARM developer guide (https://developer.arm.com/documentation/dui0489/c/arm-and-thumb-instructions/miscellaneous-instructions/dmb–dsb–and-isb):

Data Memory Barrier acts as a memory barrier. It ensures that all explicit memory accesses that appear in program order before the DMB instruction are observed before any explicit memory accesses that appear in program order after the DMB instruction.

The dmb instruction takes a second operand defining which accesses must complete before the barrier unlocks. In this case we wait until the whole memory subsystem finishes its job (sy). This is what provides so-called sequential consistency. The barrier is executed before the ldr instruction (which, like str, should be considered asynchronous) loads the current value of x, to make sure that all stores issued before have completed (in other words, that we load the most recent value).

After the ldr operation finishes (which is synchronized by the second dmb sy), the loaded value is stored to a local copy on the stack (fp−28 in the addition branch and fp−16 in the subtraction branch). Keep in mind that only accesses to the variable shared by both threads are surrounded by dmb instructions.

The .L8 and .L10 blocks load the value of x into the r2 register and the value 1 into r3, then add (or subtract) them, saving the result in r3. After that, the result is stored in a new place – fp−24 in the addition branch and fp−12 in the subtraction branch. The result is also moved to the ip (r12) register. Right after that, the r3 register is once again filled with the address of x, and r2 is filled with fp−28 (fp−16). Going back to the previous paragraph – these locations are our local copies of the x value, which are loaded into the r0 register. Keep in mind that this value was not changed by our addition (subtraction) operation.

Then a comparison between r1 (the value of x) and r0 (the local copy taken before the operation) is done. If these values are not equal, it means that someone changed x during our .L8 (.L10) execution. Let’s analyze this case – we jump to .L14 (.L16). This block is not intuitive at all: it synchronizes memory operations, loads 1 into r3 if the previous compare found the values equal (not this time), and loads 0 otherwise (our case). Then it compares r3 against 0 (we loaded 0 just before – true) and calls bne to .L6 (.L9), which in our scenario is not taken (bne is “branch if not equal”). ARM is known for its conditional instructions – they are nice, but puzzling them out in such a context is not easy :). We didn’t take the jump, so we store r1 (the value of x) to the address in r2 (our local copy). After that, the cmp + bne + b sequence moves our instruction pointer back to .L8 (.L10) and all the described operations are repeated. Not efficient :(.

Now let’s go back to cmp r1, r0 and assume that the values didn’t change. In this case, we store our result with strex into the shared x location. strex stands for “store exclusive”, so we can expect some kind of synchronization – dependent on a matching ldrex. It stores our changed value at the address of the shared x and returns the status into the lr register (a synonym of r14). If the store is successful, it returns 0, and our execution path falls through to .L14 (.L16). The previous paragraph puzzled out that code; in this case we finish in the common .L7 block, which prepares the next for loop step. If the store failed, we repeat .L13 (.L15) once again.

Could you write it simpler?

Yes :). I prepared some pseudocode describing this assembly listing. Here it is:

   l_1  <- 1
   ------ Memory barrier
   r3   <- x
   ------ Memory barrier
   r3   -> l_x
L10:
   r2   <- l_x
   r3   <- l_1
   r3   <- r2 +/- r3
   r3   -> l_res
   ip   <- l_res
   r0   <- l_x
   ------ Memory barrier
L15:
   r1   <- x /* exclusive */
   if (r1 != r0) goto L16
   ip   -> x /* exclusive, with result to lr */
   if (lr != 0) goto L15
L16:
   ------ Memory barrier
   if (r1 != r0) goto L10

The same flow, as C-like pseudocode:

   l_x = x;
   l_1 = 1;

start:
   l_res = l_x +/- l_1;

storing:
   if (l_x != x(ex)) /* exclusive x load */
   {
      l_x = x;
      goto start;
   }
   else
   {
   /* exclusive store returning 0 if succeed */
      if (x =(ex) l_res)
         goto storing;
   }

The l_ prefix stands for a local variable copy. Exclusive operations (strex and ldrex) are marked with (ex). If you didn’t follow my assembly analysis, focus on the C-like pseudocode.

The things which do the real job in this program are the ldrex and strex instructions. This pair allows an atomic load-modify-store operation via the so-called “exclusive monitor”. The ldrex instruction initializes its state machine and tells it to wait for the following strex. If any context switch happens between these two steps – which could corrupt the atomic operation – strex returns 1 into the lr register. In that case, the ldrex + strex pair must be repeated. If the context switch also changed the value of x, the whole calculation done before must be repeated with the current value (goto start).

Memory barrier

I found some opinions on the internet that the dmb instruction is only needed on multi-core processors. I was curious whether it is needed in our context (the TI AM3359 is a single-core processor) or not. To check this, I removed all memory barrier instructions from the assembly and compiled a new version:

$ cat synchronized-arm.s | grep -v dmb > synchronized-arm-no-barrier.s 
$ arm-linux-gnueabihf-gcc synchronized-arm-no-barrier.s -lpthread -o synchronized-arm-no-barrier
$ scp synchronized-arm-no-barrier root@mydevice:/root
$ ssh root@mydevice
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0

My test suggests that these barriers are redundant here; however, I’m not sure whether corruption in this case is impossible or merely unlikely.

Summary

This code is complicated compared to the x86 implementation. It may be due to the old gcc version; however, I suppose the main reason is that the ARM instruction set is focused on low power usage. I made some measurements of execution time. To make them more reliable, I should change the for loop to a longer run, but I will leave that as an exercise for the reader 🙂

$ ##### x86 :
$ time ./non-synchronized
708439
real	0m0,011s
user	0m0,017s
sys	0m0,004s
$ time ./non-synchronized
-998811
real	0m0,009s
user	0m0,016s
sys	0m0,000s
$ time ./non-synchronized
986583
real	0m0,011s
user	0m0,015s
sys	0m0,004s
$ time ./synchronized
0
real	0m0,039s
user	0m0,068s
sys	0m0,004s
$ time ./synchronized
0
real	0m0,053s
user	0m0,097s
sys	0m0,008s
$ time ./synchronized
0
real	0m0,053s
user	0m0,103s
sys	0m0,000s

# ##### ARM Cortex-A8
# time ./non-synchronized-arm 
-396494
real	0m 0.07s
user	0m 0.04s
sys	0m 0.00s
# time ./non-synchronized-arm 
0
real	0m 0.08s
user	0m 0.03s
sys	0m 0.01s
# time ./non-synchronized-arm 
-531472
real	0m 0.07s
user	0m 0.04s
sys	0m 0.00s
# time ./synchronized-arm 
0
real	0m 0.56s
user	0m 0.34s
sys	0m 0.00s
# time ./synchronized-arm 
0
real	0m 0.56s
user	0m 0.33s
sys	0m 0.01s
# time ./synchronized-arm 
0
real	0m 0.54s
user	0m 0.34s
sys	0m 0.00s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.12s
user	0m 0.06s
sys	0m 0.02s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.09s
user	0m 0.07s
sys	0m 0.01s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.10s
user	0m 0.07s
sys	0m 0.01s

We can see that on x86, stdatomic made execution 3–5 times slower. The code generated by the ARM compiler slowed down 7–8 times. Surprisingly, the main bottleneck in the generated code was the dmb sy instruction. Removing it gives synchronized code which executes only 1.5–2 times slower than the non-synchronized version! The question is whether removing the memory barriers is 100% bullet-proof.

I hope you are not sleepy after reading this post :). If you have any suggestions regarding this series, or some remarks, please add a comment. Next time I will try to compile something similar on a typical embedded architecture – AVR.