Author: Rafał

C++, Qt

Limited QTextEdit – do it the right way

Recently, during development, I ran into a problem when adding a character limit to a QTextEdit widget. I will describe the most frequently suggested solution, which in my opinion is wrong, and then another approach, which is usually suggested by more experienced developers but rarely shown as actual code.

The wrong way

Usually, you can find the following solution, which is quite intuitive in Qt:

  1. Create a textChanged signal handler.
  2. Extract the QTextEdit text and check its length.
  3. Modify it the way you want and call textEdit->setText with the modified text.
  4. Update the QTextEdit cursor (caret) position.

What is wrong with that solution? Calling setText in the textChanged handler changes the text again, so it is an easy way to create endless recursion. Qt signals and slots are a nice way to decouple things, but some scenarios, like this one, can lead to problems.

Even if this approach works in your environment, it depends heavily on the Qt implementation. In my case, the program ended with a segmentation fault.

The right way

The right way, proposed by more advanced developers on the Qt forum, looks like this:

  1. Create a new class, for example LimitedTextEdit, which derives from QTextEdit.
  2. Override the keyPressEvent and keyReleaseEvent methods. Each should check whether the text, extracted with toPlainText(), is still below the character limit. If it is, the base class method (keyPressEvent or keyReleaseEvent) should be called; otherwise, the pressed key is filtered out.
  3. Promote your QTextEdit to LimitedTextEdit in the Designer editor.

Below you can find my implementation, limitedtextedit.h:

#ifndef LIMITEDTEXTEDIT_H
#define LIMITEDTEXTEDIT_H

#include <QTextEdit>

class LimitedTextEdit : public QTextEdit
{
public :
    LimitedTextEdit(QWidget *parent = 0) : QTextEdit(parent) { }
    virtual ~LimitedTextEdit() { }

protected :
    virtual void keyPressEvent(QKeyEvent *e) override;
    virtual void keyReleaseEvent(QKeyEvent *e) override;
};

#endif // LIMITEDTEXTEDIT_H

limitedtextedit.cpp:

#include "limitedtextedit.h"

#define LIMITED_TEXT_EDIT_MAX 120

void LimitedTextEdit::keyPressEvent(QKeyEvent *e)
{
    // QString::length() already counts characters; toAscii() is deprecated since Qt 5.
    int len = QTextEdit::toPlainText().length();
    if (len < LIMITED_TEXT_EDIT_MAX)
    {
        // Below the limit: let QTextEdit handle the key as usual.
        QTextEdit::keyPressEvent(e);
    }
}

void LimitedTextEdit::keyReleaseEvent(QKeyEvent *e)
{
    int len = QTextEdit::toPlainText().length();
    if (len < LIMITED_TEXT_EDIT_MAX)
    {
        QTextEdit::keyReleaseEvent(e);
    }
}

This part is extensible. You can inspect the key passed to the event handler and, for example, let arrow keys through or block pasting; everything here is in your hands. If you want a key to be handled normally, just pass it to the QTextEdit implementation. Now you have to promote the QTextEdit object generated by the designer:

After that just click Add and Promote.

Following the kernel #2 – Kernel Build System

Multi-platform projects like Linux or U-Boot need a flexible tool to configure all the conditional parts and run the build in a concise form. Our series is still far from going into the Linux code itself. However, I will describe the Kbuild system using the Linux source code as the example. This is the original version without any modifications. If you want to follow along, I recommend cloning the repository given in the previous part. Kbuild works the same way in U-Boot, although it may differ in some details.

Makefile scripts show how complicated things can be made, but make is a standard that has been around much longer than I have, and it has done the job all these years. Its strength is that behind the enigmatic form hide simple shell invocations, which are very flexible. I expect you to have basic knowledge of this topic.

scripts/Makefile.build

The central place of Kbuild is the Makefile.build script, placed under the scripts directory. It is a smart, handy helper that is used generically for every source directory. It gathers the configuration and accepts one parameter, obj, which is the path to a directory with the source code to build.

If you want to build some kernel module with Kbuild, just call make -f scripts/Makefile.build obj=path/to/kernel/module, or, with the Kbuild helper variable, make $(build)=path/to/kernel/module. Of course, this is a simplification. I haven’t mentioned the configuration step, but this is the general idea.

# Init all relevant variables used in kbuild files so
# 1) they have correct type
# 2) they do not inherit any value from the environment
obj-y :=
obj-m :=
lib-y :=
lib-m :=
always :=
always-y :=
always-m :=
targets :=
subdir-y :=
subdir-m :=
EXTRA_AFLAGS   :=
EXTRA_CFLAGS   :=
EXTRA_CPPFLAGS :=
EXTRA_LDFLAGS  :=
asflags-y  :=
ccflags-y  :=
cppflags-y :=
ldflags-y  :=

subdir-asflags-y :=
subdir-ccflags-y :=

The first thing it does is initialize the build variables. At this point, you can see the modular nature of Linux. The variables with the suffix -y correspond to objects compiled into the kernel (in the case of U-Boot everything is compiled and linked into one binary). The suffix -m, as you may expect, corresponds to modules loaded at Linux runtime. These variables tell Makefile.build what should be compiled and whether there are any local build options. They are supplied by a developer in the local module Makefile script. I will describe it later.

Below this code, there are several inclusions of other files. Now I will briefly describe what they do.

include/config/auto.conf

This file is the generated, Makefile-readable configuration. It contains all the parameters that decide how the Linux code should be compiled. At this point I can tell you that the repository contains a lot of prepared configurations for specific boards, so you don’t have to create one on your own. Usually, you choose one of them and make small changes according to your needs.

# Read auto.conf if it exists, otherwise ignore
-include include/config/auto.conf

The - before the include directive means that there is no error if the file doesn’t exist. In the bare, cloned repository there is no include/config directory, since we haven’t specified the configuration yet. I will write more about it later in this article. Let’s assume that the configuration is already there. Below you can find a sample configuration, taken from the U-Boot repository:

#
# Automatically generated file; DO NOT EDIT.
# U-Boot 2021.04-rc1 Configuration
#
CONFIG_ENV_SUPPORT=y
CONFIG_DISPLAY_BOARDINFO=y
CONFIG_CMD_BOOTM=y
CONFIG_ENV_FAT_FILE="uboot.env"
CONFIG_CMD_EXT4=y
CONFIG_SYS_OMAP24_I2C_SPEED=100000

scripts/Kbuild.include

As you can see, the configuration is available from the very beginning of Makefile processing. Right after that comes the next inclusion. It contains Kbuild-specific Makefile helper definitions.

include scripts/Kbuild.include

If you have experience with Makefile scripts, you know that using special characters in them is problematic. The first thing this included script does is define such characters as variables:

# Convenient variables
comma   := ,
quote   := "
squote  := '
empty   :=
space   := $(empty) $(empty)
space_escape := _-_SPACE_-_
pound := \#

For example, make uses the comma as an argument separator in many functions, such as subst or call. But if it appears in a statement as the variable $(comma), make only substitutes it without interpreting it as a special character. After that, there are other helpers commonly used in Kbuild scripts.

I will briefly describe more complicated ones. The first of them is filechk. It is quite complex, but let’s puzzle it out:

define filechk
	$(Q)set -e;						\
	mkdir -p $(dir $@);					\
	trap "rm -f $(dot-target).tmp" EXIT;			\
	{ $(filechk_$(1)); } > $(dot-target).tmp;		\
	if [ ! -r $@ ] || ! cmp -s $@ $(dot-target).tmp; then	\
		$(kecho) '  UPD     $@';			\
		mv -f $(dot-target).tmp $@;			\
	fi
endef

The define directive creates a Makefile macro, which can be called in this form:

filechk_sample = echo $(KERNELRELEASE)
version.h: FORCE
	$(call filechk,sample)

Pay attention to the ; \ sequences in the filechk definition. They make everything run in one /bin/sh invocation: normally make runs every recipe line in a separate shell, but lines joined with a trailing backslash are passed to a single shell as one command. $(Q) is optional and controls whether make echoes the command to stdout. set -e causes an immediate exit on any command failure within this function. mkdir -p creates the target directory; the -p parameter suppresses the error if the directory already exists.

The trap command might be new to you. It installs an exit handler in our /bin/sh invocation. For example, if we press CTRL+C during the build, make is stopped, but before the shell exits, the command between the quotes is executed: rm -f $(dot-target).tmp.
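To see the exit handler in isolation, here is a minimal sketch in plain /bin/sh. The scratch directory and the demo.tmp file name are made up for the illustration, and the failure is simulated with an explicit exit; only the trap ... EXIT idea itself comes from filechk.

```sh
#!/bin/sh
# Hypothetical scratch directory, used only for this demonstration.
workdir=$(mktemp -d)

# Run the "recipe" in a subshell, similar to make running a recipe in its own shell.
(
    trap 'rm -f "$workdir/demo.tmp"' EXIT   # handler installed before the file exists
    echo "partial content" > "$workdir/demo.tmp"
    exit 1                                   # simulate an aborted recipe
) || status=$?

# The handler ran when the subshell exited, so the temporary file is gone.
if [ -e "$workdir/demo.tmp" ]; then leftover=yes; else leftover=no; fi
echo "subshell exit status: ${status:-0}, leftover file: $leftover"
rmdir "$workdir"
```

The handler fires on every exit of that shell, not only on CTRL+C, which is why a failing command does not leave the temporary file behind either.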

dot-target is another macro defined in the Kbuild.include file. It is quite simple, so I will not say much about it. Briefly speaking, it generates the target's file name prefixed with a dot, which filechk uses as the base name of the temporary file. In our case it is .version.h, so the temporary file is .version.h.tmp. If the build is interrupted, we should remove it to allow a clean build in a later make invocation.

The remaining lines are the most important. { $(filechk_$(1)); } > $(dot-target).tmp; executes in the recipe the commands defined before. $(1) is the argument passed in the call statement. In our example it is sample, so as a result our recipe executes the filechk_sample function (echo $(KERNELRELEASE)) and redirects its output to the temporary file mentioned before.

Right after that, there is an if statement. It uses the Makefile recipe variable $@, which is the target file (version.h in our example), to check its existence and readability (-r). If the file exists, its content is also compared with the generated temporary file. If either condition holds (the file does not exist or it has different content), the file is updated with the newly generated content (mv -f $(dot-target).tmp $@;).

So the filechk macro defines a way of checking the content of a generated file before replacing it. It is very useful. If you have ever written more complex Makefile scripts that generate files, you have probably noticed that every regeneration of such a file triggers unnecessary rebuilds of other targets that depend on it, because make looks at the file's modification timestamp, not its content.
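The same write-only-if-changed idea can be tried outside of make. The sketch below is a hypothetical shell helper, not Kbuild code: the function name update_if_changed, the "$target.tmp" naming (Kbuild uses the dot-prefixed name instead), and the version strings are assumptions made for the example.

```sh
#!/bin/sh
# update_if_changed TARGET COMMAND [ARGS...]
# Hypothetical helper mirroring filechk: run COMMAND, capture its output in a
# temporary file, and replace TARGET only when the content really differs.
update_if_changed() {
    target=$1
    shift
    tmp="$target.tmp"
    "$@" > "$tmp" || { rm -f "$tmp"; return 1; }
    if [ ! -r "$target" ] || ! cmp -s "$target" "$tmp"; then
        echo "  UPD     $target"
        mv -f "$tmp" "$target"
    else
        rm -f "$tmp"      # same content: keep the old file and its timestamp
    fi
}

dir=$(mktemp -d)
first=$(update_if_changed "$dir/version.h" echo "5.11.0")    # file is created
second=$(update_if_changed "$dir/version.h" echo "5.11.0")   # nothing to do
third=$(update_if_changed "$dir/version.h" echo "5.12.0")    # content changed
content=$(cat "$dir/version.h")
echo "first: [$first] second: [$second] third: [$third]"
rm -rf "$dir"
```

Only the first and third calls print the UPD line; the second one leaves the file untouched, so anything depending on it would not be rebuilt.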

The next function is try-run:

try-run = $(shell set -e;		\
	TMP=$(TMPOUT)/tmp;		\
	TMPO=$(TMPOUT)/tmp.o;		\
	mkdir -p $(TMPOUT);		\
	trap "rm -rf $(TMPOUT)" EXIT;	\
	if ($(1)) >/dev/null 2>&1;	\
	then echo "$(2)";		\
	else echo "$(3)";		\
	fi)

At the beginning, it sets the shell variables TMP and TMPO and creates the output directory $(TMPOUT); this variable should be set before the try-run call. You are already familiar with trap. Its purpose is the same, but in this case it removes the output directory on exit. As the first argument, the function gets the command to be executed. If the command succeeds, the second argument is printed. If it fails, the third one is.
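Stripped of the make syntax, the logic can be sketched as a shell function. This is an illustration under assumptions, not the Kbuild implementation: the name try_run is hypothetical, and the temporary directory comes from mktemp instead of a caller-provided $(TMPOUT).

```sh
#!/bin/sh
# try_run COMMAND IF_OK IF_FAIL
# Hypothetical shell counterpart of try-run: execute COMMAND silently and
# print IF_OK when it succeeds, IF_FAIL otherwise.
try_run() {
    (
        TMPOUT=$(mktemp -d) || exit 1
        trap 'rm -rf "$TMPOUT"' EXIT      # clean up the scratch directory
        TMP="$TMPOUT/tmp"
        export TMP                        # visible to the probed command
        if sh -c "$1" >/dev/null 2>&1; then
            echo "$2"
        else
            echo "$3"
        fi
    )
}

ok=$(try_run 'echo probe > "$TMP"' supported fallback)
bad=$(try_run 'exit 3' supported fallback)
echo "$ok $bad"
```

The probed command never prints anything itself; the caller only sees one of the two strings, which is what makes the construct usable inside a make variable assignment.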

It is worth noting that the TMP and TMPO variables are accessible in the wrapped command. They are valid only inside the try-run call. A good usage example is the next definition:

as-option = $(call try-run,\
	$(CC) $(KBUILD_CFLAGS) $(1) -c -x assembler /dev/null -o "$$TMP",$(1),$(2))

It runs the compiler $(CC) with the default C-related flags $(KBUILD_CFLAGS) and the assembly-related flag $(1) given as the as-option parameter. The -c option tells the compiler to stop after compilation (without linking), since we only want to check the passed option. As the name of the function says, it is related to assembly, so we tell $(CC) that the language is assembler (-x assembler). /dev/null is given as the input file (as I have mentioned, it is just a compiler check). The output file is the temporary $$TMP file defined in try-run. We must write it with a double $$ sign, because otherwise make would try to expand it as its own variable during definition (like, for example, $(CC)); $$ leaves a single $ for the shell.

A sample usage of as-option can be found in the arch/arm/Makefile script:

AFLAGS_NOWARN	:=$(call as-option,-Wa$(comma)-mno-warn-deprecated,-Wa$(comma)-W)

This statement assigns the result of the as-option call to the AFLAGS_NOWARN variable. It runs the compiler with the -Wa,-mno-warn-deprecated flag. If the flag is supported, it is assigned to AFLAGS_NOWARN. Otherwise, the fallback value -Wa,-W is assigned.

The next important part are the shorthand definitions for common build statements:

###
# Shorthand for $(Q)$(MAKE) -f scripts/Makefile.build obj=
# Usage:
# $(Q)$(MAKE) $(build)=dir
build := -f $(srctree)/scripts/Makefile.build obj

###
# Shorthand for $(Q)$(MAKE) -f scripts/Makefile.dtbinst obj=
# Usage:
# $(Q)$(MAKE) $(dtbinst)=dir
dtbinst := -f $(srctree)/scripts/Makefile.dtbinst obj

###
# Shorthand for $(Q)$(MAKE) -f scripts/Makefile.clean obj=
# Usage:
# $(Q)$(MAKE) $(clean)=dir
clean := -f $(srctree)/scripts/Makefile.clean obj

When we want to build an object (statically linked or loaded as a module), make -f scripts/Makefile.build obj=<module_dir> must be called. With the build variable wrapping it, the call is shorter: make $(build)=<module_dir>. The dtbinst (device-tree related) and clean variables serve similar purposes.

I encourage you to read all of these Makefile helpers. They are commonly used throughout Kbuild scripts, and knowing them lets you read those scripts fluently. I have described a few, so reading the others should be easier.

Local Kbuild/Makefile

After Kbuild.include helpers, the local Kbuild/Makefile script is included:

# The filename Kbuild has precedence over Makefile
kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src))
kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
include $(kbuild-file)

This is the part filled in by the module developer. It defines the files that should be built into the kernel (obj-y) or as a loadable module (obj-m). These are the most common definitions. At the beginning of Makefile.build I showed all the variables that can be supplied here.

The lib- variables list objects built as statically linked libraries. The always- variables list targets that, as the name says, are always triggered during the build of our module. Good examples are generated header files. The recipe for such a target should be described in our local Makefile.

The next variable is targets. It is related to the way Kbuild works and to the if_changed helper defined in Kbuild.include. Briefly, it handles the situation in which the command that builds something has changed in the Makefile, so this particular thing must be rebuilt. If you want to read more about it, please refer to https://www.kernel.org/doc/Documentation/kbuild/makefiles.rst chapter 3.12 – Command change detection.

The subdir- variables trigger descending into subdirectories in the Kbuild framework. I will tell you more about it later. All other variables with flags in their names let us apply specific build flags locally in our module.

A simple example of such local Makefile is placed in drivers/tty/serial/8250/Makefile :

obj-$(CONFIG_SERIAL_8250)		+= 8250.o 8250_base.o
8250-y					:= 8250_core.o
8250-$(CONFIG_SERIAL_8250_PNP)		+= 8250_pnp.o
8250_base-y				:= 8250_port.o
8250_base-$(CONFIG_SERIAL_8250_DMA)	+= 8250_dma.o
8250_base-$(CONFIG_SERIAL_8250_DWLIB)	+= 8250_dwlib.o
8250_base-$(CONFIG_SERIAL_8250_FINTEK)	+= 8250_fintek.o
obj-$(CONFIG_SERIAL_8250_GSC)		+= 8250_gsc.o
obj-$(CONFIG_SERIAL_8250_PCI)		+= 8250_pci.o
obj-$(CONFIG_SERIAL_8250_EXAR)		+= 8250_exar.o
obj-$(CONFIG_SERIAL_8250_HP300)		+= 8250_hp300.o

A common shorthand is concatenating obj- with a configuration variable such as $(CONFIG_SERIAL_8250). If CONFIG_SERIAL_8250 is set to y, the files are built into the kernel. If this tristate variable is set to m, they are built as a loadable module. If the option is disabled, the variable is not set, the objects end up in the plain obj- list, and they are not built at all.
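As a shell analogy of this routing (not Kbuild code), the sketch below sorts objects into two lists depending on a y/m/empty value. The add_obj helper and the chosen CONFIG_ values are assumptions for the example; in Kbuild the values come from auto.conf and the concatenation is done by make itself.

```sh
#!/bin/sh
# Hypothetical configuration values; in Kbuild they come from auto.conf.
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_PCI=m
CONFIG_SERIAL_8250_EXAR=      # disabled options are simply not set

obj_y=
obj_m=

# add_obj VALUE OBJECT: mimic "obj-$(VALUE) += OBJECT".
add_obj() {
    case $1 in
        y) obj_y="$obj_y $2" ;;
        m) obj_m="$obj_m $2" ;;
        *) ;;   # empty value: the object lands in "obj-", which is ignored
    esac
}

add_obj "$CONFIG_SERIAL_8250"      8250.o
add_obj "$CONFIG_SERIAL_8250"      8250_base.o
add_obj "$CONFIG_SERIAL_8250_PCI"  8250_pci.o
add_obj "$CONFIG_SERIAL_8250_EXAR" 8250_exar.o

echo "built-in:$obj_y"
echo "modules: $obj_m"
```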

Some modules are linked from several compilation units (.c files). This is the case for the 8250.o and 8250_base.o files. After placing them in the obj-y/m definition, the Makefile script defines which files must be compiled before linking into 8250.o/8250_base.o. This is done in the 8250-y and 8250_base-y variables.

scripts/Makefile.lib

The next file included by Makefile.build is the Makefile.lib script. It defines all commands that are used to build the kernel. At the beginning we can find support code for backward-compatibility variables:

asflags-y  += $(EXTRA_AFLAGS)
ccflags-y  += $(EXTRA_CFLAGS)
cppflags-y += $(EXTRA_CPPFLAGS)
ldflags-y  += $(EXTRA_LDFLAGS)

EXTRA_ variables are now deprecated. New modules should use the xxflags-y variables in their Makefile scripts (described above), but old ones must still compile properly.

# When an object is listed to be built compiled-in and modular,
# only build the compiled-in version
obj-m := $(filter-out $(obj-y),$(obj-m))

The important thing here is the rule that if a module is configured both as built into the kernel and as a loadable module, the built-in option wins.
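The effect of that filter-out call can be imitated in shell. The filter_out function below is a hypothetical, literal word-by-word stand-in (make's real filter-out also understands % patterns), and the object lists are made up for the example.

```sh
#!/bin/sh
# Hypothetical lists: 8250_pci.o was requested both built-in and as a module.
obj_y="8250.o 8250_pci.o"
obj_m="8250_pci.o 8250_exar.o"

# filter_out "EXCLUDE LIST" WORD...: print the WORDs not present in the list.
filter_out() {
    exclude=" $1 "
    shift
    kept=
    for word in "$@"; do
        case $exclude in
            *" $word "*) ;;                        # listed in obj-y: drop it
            *) kept="${kept:+$kept }$word" ;;
        esac
    done
    echo "$kept"
}

obj_m=$(filter_out "$obj_y" $obj_m)
echo "obj-m after filtering: $obj_m"
```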

# Libraries are always collected in one lib file.
# Filter out objects already built-in
lib-y := $(filter-out $(obj-y), $(sort $(lib-y) $(lib-m)))

Some objects may be built as libraries, but similarly, if they are declared as obj-y, they are filtered out from lib-y.

# Subdirectories we need to descend into
subdir-ym := $(sort $(subdir-y) $(subdir-m) \
			$(patsubst %/,%, $(filter %/, $(obj-y) $(obj-m))))

This part needs more comment. The Kbuild system is designed to traverse all configured directories, from the root folders to the more specialized ones. As we can see in the subdir-ym definition, these subdirectories can be passed from the local Makefile through subdir-y and subdir-m, but also through the obj-y and obj-m variables; entries in the latter are treated as subdirectories only when they end with a /.
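To make the filter/patsubst/sort chain more tangible, here is the same selection written as a small shell sketch. The variable contents are invented for the example; only the rule itself (keep entries ending with /, strip the slash, merge with subdir-y and subdir-m, sort and de-duplicate) follows the definition above.

```sh
#!/bin/sh
# Hypothetical variable contents, as a local Makefile might produce them.
obj_y="core.o tty/ char/"
obj_m="extra.o tty/"
subdir_y="tools"
subdir_m=""

# Keep only the entries ending with "/", strip that slash, add the explicit
# subdir entries, then sort and de-duplicate (what $(sort ...) does in make).
subdir_ym=$(
    {
        for entry in $obj_y $obj_m; do
            case $entry in
                */) echo "${entry%/}" ;;
            esac
        done
        for entry in $subdir_y $subdir_m; do
            echo "$entry"
        done
    } | sort -u | xargs
)
echo "$subdir_ym"
```

Plain object files such as core.o are ignored here, and tty appears only once even though both lists mention it.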

The repository structure reflects the logical design of the kernel. For example, device drivers are placed under the drivers folder. Under this location, we can find the tty directory (teletypewriter). Today it covers a broad area of devices, mainly based on serial ports, but virtual devices also belong to this group.

The tty directory, in turn, contains generic code for all tty devices and the serial directory, which holds more specialized device drivers like omap-serial.c. Completely separate directories are arch, which includes code dependent on the core architecture, and fs, which contains file system-related code.

The general principle is to start building from the main Makefile, placed in the root directory, and descend into more specific directories, chosen according to the configuration. An example is the Makefile in the drivers directory, which always includes the tty subsystem directory:

obj-y				+= tty/

Descending is implemented in the Makefile.build file, lines 488-503:

__build: $(if $(KBUILD_BUILTIN), $(targets-for-builtin)) \
	 $(if $(KBUILD_MODULES), $(targets-for-modules)) \
	 $(subdir-ym) $(always-y)
	@:

endif

# Descending
# ---------------------------------------------------------------------------

PHONY += $(subdir-ym)
$(subdir-ym):
	$(Q)$(MAKE) $(build)=$@ \
	$(if $(filter $@/, $(KBUILD_SINGLE_TARGETS)),single-build=) \
	need-builtin=$(if $(filter $@/built-in.a, $(subdir-builtin)),1) \
	need-modorder=$(if $(filter $@/modules.order, $(subdir-modorder)),1)

subdir-ym is one of the last prerequisites of the __build target. Its recipe handles the subdirectories one by one (the order is significant) and recursively calls make $(build)=$@ with some conditional parameters, where the current-target variable $@ is the subdirectory.
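The recursion itself can be pictured with a toy shell model. Everything in it is an assumption made for illustration: the throwaway directory tree, the made-up "subdirs" files standing in for the computed subdir-ym lists, and the descend function that only prints which directory it enters instead of invoking make.

```sh
#!/bin/sh
# Build a throwaway tree. The "subdirs" file is a made-up stand-in for the
# subdir-ym list that Kbuild computes from each directory's Makefile.
root=$(mktemp -d)
mkdir -p "$root/drivers/tty/serial" "$root/drivers/char"
echo "tty char" > "$root/drivers/subdirs"
echo "serial"   > "$root/drivers/tty/subdirs"

# descend DIR: enter DIR, then recurse into every listed subdirectory, the
# way the $(subdir-ym) rule re-invokes make with $(build)=$@.
descend() {
    echo "enter $1"
    [ -r "$1/subdirs" ] || return 0
    for sub in $(cat "$1/subdirs"); do
        descend "$1/$sub"
    done
}

order=$(cd "$root" && descend drivers)
echo "$order"
rm -rf "$root"
```

The visit order follows the order of the entries in each list, which is why the order of subdirectories in a Kbuild Makefile matters.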

It is worth noting that descending is often tied to the configuration. We may add subdir- variables conditionally, according to CONFIG_ variables.

The next part of Makefile.build that I’d like to highlight are the build rules for source code files. Kbuild supports code written in C and assembly. If you scroll through this file, you will find definitions like:

quiet_cmd_cc_s_c = CC $(quiet_modtag)  $@
      cmd_cc_s_c = $(CC) $(filter-out $(DEBUG_CFLAGS), $(c_flags)) $(DISABLE_LTO) -fverbose-asm -S -o $@ $<

$(obj)/%.s: $(src)/%.c FORCE
	$(call if_changed_dep,cc_s_c)

The name cmd_cc_s_c tells us that it is a command using the CC (C compiler) program to generate a .s file (assembly) from the C source. This is an intermediate step, which is sometimes helpful for debugging problems and seeing what the human-readable assembly of our module looks like. A similar command is cmd_cpp_i_c, which uses the C preprocessor to generate .i files (pure C after preprocessing). All other rules are written similarly.

In this example, you can see how the if_changed... helper is used. The build command is defined as cmd_cc_s_c and is passed (without the cmd_ prefix) to the helper. Besides that, quiet_cmd_cc_s_c is defined to tell Kbuild how this command should be displayed at some verbosity levels. FORCE is given as a prerequisite to let the if_changed_dep helper decide whether the file should be built or not.

The whole path from C (or assembly) source code to the object file is presented below. As you can see, Kbuild adds some extra steps. Some of them are optional; some are essential for the kernel to work. I will not describe them in detail, because that is a topic for a whole series. Maybe I will come back to them when the code analysis requires it.

define rule_cc_o_c
	$(call cmd_and_fixdep,cc_o_c)
	$(call cmd,gen_ksymdeps)
	$(call cmd,checksrc)
	$(call cmd,checkdoc)
	$(call cmd,objtool)
	$(call cmd,modversions_c)
	$(call cmd,record_mcount)
endef

define rule_as_o_S
	$(call cmd_and_fixdep,as_o_S)
	$(call cmd,gen_ksymdeps)
	$(call cmd,objtool)
	$(call cmd,modversions_S)
endef

Root Makefile

Now, after reading all these raw definitions, it’s time to put them together. This is the job of the main Makefile script in the kernel repo. It is the root caller of the Makefile.build recursion. Unfortunately, it’s long and messy, but after following each called rule we will know what is most essential.

Configuration

As I mentioned above, the CONFIG_ variables placed in auto.conf select drivers, architecture, and all the customization. At first sight, you might be afraid that the whole kernel configuration is in your hands. Fortunately, it is not that bad. There are predefined configuration files for each architecture and for specific boards. They are stored as arch/$ARCH/configs/$PLATFORM_defconfig files. For example, the defconfig dedicated to BeagleBone Black (and some other boards) is arch/arm/configs/omap2plus_defconfig. The terminology here might be misleading. OMAP is a larger SoC containing an ARM processor (for example AM335x) and a multimedia coprocessor; that is probably why Linux puts the Sitara AM335x processor under this config file.

To configure our cross-compilation we need to set up a couple of things. The first one is assigning the ARCH variable (in the terminal, before executing make). In our case, it must be set to arm (export ARCH=arm). If it is not set, the Makefile script will choose the architecture of the underlying build machine. This variable lets the Makefile decide which arch/$SRCARCH/Makefile should be included:

include arch/$(SRCARCH)/Makefile
export KBUILD_DEFCONFIG KBUILD_KCONFIG CC_VERSION_TEXT

config: outputmakefile scripts_basic FORCE
	$(Q)$(MAKE) $(build)=scripts/kconfig $@

%config: outputmakefile scripts_basic FORCE
	$(Q)$(MAKE) $(build)=scripts/kconfig $@

The next thing is telling Kbuild which toolchain should be used. By default, the build machine's gcc is used. In embedded systems, however, that is rarely the right choice. We must explicitly assign the prefix of the chosen toolchain to the CROSS_COMPILE variable. In my case, it is export CROSS_COMPILE=arm-linux-gnueabihf-. You can download such toolchains from Linaro: https://releases.linaro.org/components/toolchain/binaries/latest-5/arm-linux-gnueabihf/. Of course, after uncompressing it, you must add its bin directory to your PATH variable.

When we are done, the next step is choosing the defconfig. So we execute make omap2plus_defconfig. This runs the recipe for the %config target shown in the Makefile snippet above. It descends into scripts/kconfig and runs Makefile.build on the local Makefile script. As a result, a host program called conf is built and executed. This rule is implemented in the following part of scripts/kconfig/Makefile:

%_defconfig: $(obj)/conf
	$(Q)$< $(silent) --defconfig=arch/$(SRCARCH)/configs/$@ $(Kconfig)

The $< variable is the first prerequisite (the conf program), and it is called with the parameter --defconfig=arch/arm/configs/omap2plus_defconfig. This program gathers all the parameters supplied in omap2plus_defconfig and the environment variables declared by make.

At this step conf writes the resulting configuration to the .config file in the root of the tree. The include/config directory, with the auto.conf file mentioned before, and the C header with the CONFIG_ definitions are generated from .config by another conf run at the beginning of the build. The input data are the environment variables, the Kconfig files from all Linux subdirectories (they declare the available options with their types, defaults, and dependencies), and, most importantly, the defconfig file, in our case omap2plus_defconfig.
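Because auto.conf is a plain list of KEY=value lines, it is easy to inspect with standard tools. The sketch below works on a temporary sample file rather than a real tree; its entries are taken from the U-Boot sample shown earlier, plus one made-up =m line (CONFIG_EXAMPLE_DRIVER is hypothetical) to show the module case.

```sh
#!/bin/sh
# A tiny auto.conf-style file created only for this demonstration.
conf=$(mktemp)
cat > "$conf" <<'EOF'
CONFIG_ENV_SUPPORT=y
CONFIG_CMD_BOOTM=y
CONFIG_ENV_FAT_FILE="uboot.env"
CONFIG_SYS_OMAP24_I2C_SPEED=100000
CONFIG_EXAMPLE_DRIVER=m
EOF

builtin_count=$(grep -c '=y$' "$conf")    # options compiled in
module_count=$(grep -c '=m$' "$conf")     # options built as modules
# A single option can be read back with sed.
i2c_speed=$(sed -n 's/^CONFIG_SYS_OMAP24_I2C_SPEED=//p' "$conf")

echo "built-in: $builtin_count, modules: $module_count, I2C speed: $i2c_speed"
rm -f "$conf"
```

On a configured tree the same grep and sed calls can be pointed at include/config/auto.conf.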

Running the build

When the configuration is generated, we are almost there. The only thing left is to run our long-lasting kernel build. If the ARCH and CROSS_COMPILE variables are still set in the terminal session, we can just execute make. The default target is all, which triggers the vmlinux target.
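Put together, the whole sequence from the terminal might look like the sketch below. The two exported values and the defconfig name come from the text; the guard around the make calls and the build_started variable are my additions, so that the script does nothing when it is started outside a kernel tree or without the cross compiler in PATH.

```sh
#!/bin/sh
# Values from the text; adjust them to your board and toolchain.
export ARCH=arm
export CROSS_COMPILE=arm-linux-gnueabihf-

build_started=no
# Only start the build when we are inside a kernel tree and the cross
# compiler is reachable through PATH.
if [ -f scripts/Makefile.build ] && command -v "${CROSS_COMPILE}gcc" >/dev/null 2>&1; then
    # First select the configuration, then run the default target.
    make omap2plus_defconfig && make && build_started=yes
else
    echo "not in a kernel tree, or ${CROSS_COMPILE}gcc is not in PATH" >&2
fi
echo "build started: $build_started"
```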

# Final link of vmlinux with optional arch pass after final link
cmd_link-vmlinux =                                                 \
	$(CONFIG_SHELL) $< "$(LD)" "$(KBUILD_LDFLAGS)" "$(LDFLAGS_vmlinux)";    \
	$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) $@, true)

vmlinux: scripts/link-vmlinux.sh autoksyms_recursive $(vmlinux-deps) FORCE
	+$(call if_changed,link-vmlinux)

We will focus on the most important parts – vmlinux-deps are all the objects linked into the kernel image, in a single variable. Let’s see how it’s defined.

$ cat Makefile | grep -w "vmlinux-deps .="
vmlinux-deps := $(KBUILD_LDS) $(KBUILD_VMLINUX_OBJS) $(KBUILD_VMLINUX_LIBS)

$ cat Makefile | grep -w "KBUILD_LDS \|KBUILD_VMLINUX_OBJS \|KBUILD_VMLINUX_LIBS "
KBUILD_VMLINUX_OBJS := $(head-y) $(patsubst %/,%/built-in.a, $(core-y))
KBUILD_VMLINUX_OBJS += $(addsuffix built-in.a, $(filter %/, $(libs-y)))
KBUILD_VMLINUX_OBJS += $(patsubst %/, %/lib.a, $(filter %/, $(libs-y)))
KBUILD_VMLINUX_LIBS := $(filter-out %/, $(libs-y))
KBUILD_VMLINUX_LIBS := $(patsubst %/,%/lib.a, $(libs-y))
KBUILD_VMLINUX_OBJS += $(patsubst %/,%/built-in.a, $(drivers-y))
export KBUILD_LDS          := arch/$(SRCARCH)/kernel/vmlinux.lds

$ cat Makefile | grep -w "core-y\|libs-y\|drivers-y"
core-y		:= init/ usr/
drivers-y	:= drivers/ sound/
drivers-y	+= net/ virt/
libs-y		:= lib/
core-y		+= kernel/ certs/ mm/ fs/ ipc/ security/ crypto/ block/

$ cat arch/arm/Makefile | grep -w "head-y"
head-y		:= arch/arm/kernel/head$(MMUEXT).o

This is the glue joining all the things described previously. After preparing the configuration (auto.conf and generated headers), the main Makefile descends into the main source directories (init, drivers, etc.). After building all configured objects, it links the built-in.a archives (each contains the kernel objects of its directory) using the scripts/link-vmlinux.sh script. The Makefile target that performs the descending is called descend:

vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, \
		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
		     $(libs-y) $(libs-m)))
...
build-dirs	:= $(vmlinux-dirs)
...
# Handle descending into subdirectories listed in $(build-dirs)
# Preset locale variables to speed up the build process. Limit locale
# tweaks to this spot to avoid wrong language settings when running
# make menuconfig etc.
# Error messages still appears in the original language
PHONY += descend $(build-dirs)
descend: $(build-dirs)
$(build-dirs): prepare
	$(Q)$(MAKE) $(build)=$@ \
	single-build=$(if $(filter-out $@/, $(filter $@/%, $(KBUILD_SINGLE_TARGETS))),1) \
	need-builtin=1 need-modorder=1

Summary

The Kbuild system is a really broad topic. I have described its basic skeleton, which hopefully makes the more complicated parts easier to understand.

In the next chapter, I will describe debug environment setup. And finally, we will dive into some interesting stuff.

Who uses this port?

Today I will give you a handy one-liner that lists the programs using the port passed as a parameter.

# for i in $(fuser 22/tcp); do cat /proc/$i/cmdline; echo ""; done

The main part is the fuser command, which retrieves the PIDs. There may be many processes using this port, for example as clients, so we need to loop over them with for. The cmdline content does not end with a newline character, so we add one with echo "". Sample output follows:

# for i in $(fuser 22/tcp); do cat /proc/$i/cmdline; echo ""; done
/usr/sbin/sshd
sshd: root@pts/0

The same approach can be used to check who uses a file or a UDP port.
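If you use it often, the one-liner can be wrapped in two small functions. This is only a sketch: the names cmdline_of and users_of are made up, the port and file in the comments are examples, and it assumes a Linux /proc file system. Arguments in /proc/PID/cmdline are separated by NUL bytes, so the sketch converts them to spaces to keep multi-argument commands readable.

```sh
#!/bin/sh
# cmdline_of PID: print the command line of a process on one line.
cmdline_of() {
    [ -r "/proc/$1/cmdline" ] || return 0
    tr '\0' ' ' < "/proc/$1/cmdline"
    echo
}

# users_of NAME...: list the commands of all processes using a port or file.
# fuser prints only the PIDs on stdout; the rest of its output goes to stderr.
users_of() {
    for pid in $(fuser "$@" 2>/dev/null); do
        cmdline_of "$pid"
    done
}

# Examples (run as root to see processes of other users):
#   users_of 22/tcp
#   users_of 53/udp
#   users_of /var/log/syslog
self=$(cmdline_of $$)
echo "this shell: $self"
```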

Following the kernel #1 – ROM Code

With this article I am starting a large series explaining precisely how the Linux kernel works. My readers and I will investigate the kernel code line by line, from the very beginning to a fully operable system. Hopefully, it gives us a strong foundation of Linux knowledge. I expect you to know C programming and the basics of computer architecture; however, I will try to simplify the more complicated statements to keep less experienced readers on board. As my articles describe kernel code, I will frequently refer to the git repository content. My suggestion is to clone the whole repo from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git and check out commit da2968ff879b9e74688cdc658f646971991d2c56 (the one I’m working on).

The kernel has many ports to different architectures. Explaining kernel startup is hard without describing the whole booting process. To keep the article simple and free of abstractions that cannot be verified, I will tell you about the BeagleBone Black booting sequence. It is an open-source board (the schematics are right here: https://github.com/beagleboard/beaglebone-black/blob/master/BBB_SCH.pdf) which contains an ARM-based microprocessor from the TI AM335x family, the AM3358. We can easily find its reference manual (https://www.ti.com/lit/ug/spruh73q/spruh73q.pdf). I will refer to these documents frequently. We have everything ready, so let’s get started.

Powering up

The story begins with powering up the device. When the voltage is turned on, the PRCM (Power, Reset, and Clock Management) module detects it. This module is the central unit of power management. It decides whether to turn the voltage of the power domains on or off and whether to start or stop clocking them. All the details are described in Chapter 8 of the Reference Manual (PRCM). It includes a lot of information, since this module handles functionalities such as low-power modes, different types of reset, etc. We will not cover all of them, since that is not the topic of this article. Our ARM Cortex-A8 core belongs to the MPU (Microprocessor Unit) subsystem, which is quite complex and covers several power domains; details are described in Chapter 3. For some reason, it is split between the MPU power domain (Cortex-A8 core, cache memories, and some other modules such as IceCrusher) and the Core power domain (which includes the Interrupt Controller).

ROM Code

I will not cover the electronic details, since they are not the essence of this article. The key point is that only the most important parts of our processor are powered up and supplied with the clock signal. The reset exception redirects the CPU to non-erasable on-chip memory, written by the silicon vendor (TI). As a consequence, the so-called ROM code is started by our micro. It is the only part of the boot sequence executed by code that is not open source, but we can find a huge amount of information about it in the Reference Manual and on the Internet.

Boot ROM memory offsets

Boot ROM memory is directly addressable by our micro at offset 0x4000 0000. All the addresses presented in the picture above should be prefixed with 4 as the most significant digit, like in the picture below. Why are they not? During early startup, ROM may be accessed through an alias address, without the leading 4. So it may operate in the 0x0000 0000 to 0x0002 BFFF space.[1]

Boot ROM entry in main memory map

Boot ROM has some advanced features, like peripheral booting or checking the image signature, so it needs some working memory. At this early stage the only memory available is the internal static RAM:

Internal SRAM memory map

How much of it is usable depends on the type of device. It might be GP (General-Purpose), with the secure boot features disabled, or HS (High-Secure), which does not allow untrusted code to boot. The latter of course needs more working memory during ROM Code execution.

The first 1 kB is reserved on both types and cannot be used. The next 109 kB (on GP) or 46 kB (on HS) is the place where ROM Code copies the next part of the boot chain (SPL). The last 18 kB is general-purpose working memory used at this stage (more info on the diagram below). For the sake of simplicity, I will cover the GP device boot.
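
The GP layout can be checked with a little arithmetic. The sketch below assumes that the internal SRAM starts at 0x402F0000, which I derive from the 0x402F0400 download address quoted later in this article minus the reserved 1 kB; the macro and function names are mine, not TI's.

```c
#include <assert.h>
#include <stdint.h>

#define KB(n)              ((uint32_t)(n) * 1024u)
#define SRAM_BASE          0x402F0000u /* assumed: 0x402F0400 minus the reserved 1 kB */
#define SRAM_RESERVED_SIZE KB(1)       /* not usable on GP or HS */
#define SRAM_GP_IMAGE_SIZE KB(109)     /* SPL download area on GP devices */
#define SRAM_ROM_WORK_SIZE KB(18)      /* working memory of ROM Code */

/* Address where ROM Code places the SPL image on a GP device. */
uint32_t sram_image_start(void)
{
    return SRAM_BASE + SRAM_RESERVED_SIZE;
}

/* First address past the SPL download area. */
uint32_t sram_image_end(void)
{
    return sram_image_start() + SRAM_GP_IMAGE_SIZE;
}

/* First address past the internal SRAM; the three regions add up to 128 kB. */
uint32_t sram_end(void)
{
    uint32_t end = sram_image_end() + SRAM_ROM_WORK_SIZE;
    assert(end - SRAM_BASE == KB(128));
    return end;
}

/* Nonzero if an SPL image of image_size bytes fits into the GP download area. */
int spl_fits_gp(uint32_t image_size)
{
    return image_size <= SRAM_GP_IMAGE_SIZE;
}
```

With these numbers the download area spans 0x402F0400 to 0x4030B800, so an SPL larger than 109 kB cannot be loaded by ROM Code on a GP device.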

Let’s get back to where we began. When our micro is powered up, it starts execution from the Secure Boot ROM, presumably from address 0x0000 0000. There is not much information about this process, only that it uses the ARM TrustZone architecture to obstruct reverse engineering. We can deduce that the layout of the Secure Boot ROM is similar to that of the Public Boot ROM (its own exception vectors, CRC, code, and version). We will not focus on it. The only valuable piece of information is that the first 128 kB of ROM is reserved and is executed at the earliest stage of startup.

ROM Code Exception Vector – reset entry

The first public information is that after Secure Boot ROM execution, the code jumps to the public exception vector, which in turn redirects execution to the Public ROM Code (0x4002 0100, aliased with 0x0002 0100). In the initialization phase, the stack is set up in internal SRAM, and the ROM Code CRC is calculated and checked against the value stored at address 0x4002 0020 to detect possible memory issues. The watchdog WDT1 is started with a three-minute timeout. The exception vector base is redirected to the Public Boot ROM, so any exceptions are now handled by it. Then comes the first configurable step: setting up the clocks.

Clock configuration

The clocks are configured to their default values. To do this, the board designer must tell the Boot ROM code which crystal is used. This is done in hardware, using the SYSBOOT pins.

Supported crystals and SYSBOOT pins to configure the right one

Now let’s check which crystal is placed on the BeagleBone Black schematics and how SYSBOOT is configured.

OSC0 wiring
Boot configuration pins on BeagleBone Black

As we can see, the main oscillator (OSC0) is connected to a 24 MHz crystal and, as the manual requires, the SYSBOOT pins are wired to logical 1 (SYSBOOT 14) and logical 0 (SYSBOOT 15). You may wonder why both a pull-up and a pull-down appear in the schematics on each line. I think it is to make later changes easier. The unused resistors carry a DNI annotation, which most likely stands for Do Not Install. So the R80 resistor pulls the SYSBOOT 15 pin down and R56 pulls SYSBOOT 14 up, while R55 and R81 should not be soldered on the board.
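
The crystal selection can be expressed as a tiny lookup. This is only a sketch: the frequency table is how I read the SYSBOOT[15:14] encoding in the Reference Manual, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Crystal frequency in kHz for SYSBOOT[15:14] = 00b, 01b, 10b, 11b
 * (values as I read them from the Reference Manual). */
static const uint32_t sysboot_xtal_khz[] = { 19200u, 24000u, 25000u, 26000u };

/* Decodes the crystal frequency from the 16 sampled SYSBOOT pins. */
uint32_t sysboot_crystal_khz(uint16_t sysboot)
{
    size_t sel = (size_t)((sysboot >> 14) & 0x3u);
    assert(sel < sizeof sysboot_xtal_khz / sizeof sysboot_xtal_khz[0]);
    return sysboot_xtal_khz[sel];
}
```

On BeagleBone Black, SYSBOOT 15 is low and SYSBOOT 14 is high, so `sysboot_crystal_khz(0x4000)` yields 24000, matching the 24 MHz crystal on OSC0.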

According to these settings, the Public ROM Code configures the default clock rates of the essential devices. You can find them on the diagram below:

The list of peripheral clocks is not accidental: it contains all the devices that can supply the next part of the boot sequence (SPI memory, MMC card, UART, USB). These are the possible sources of the Secondary Program Loader, which will be described in one of the next parts. (MPU_CLK sets the core clock, L3F is one of the internal silicon buses, and I2C is, I suppose, used to check the voltage conditions. I think the EMAC clock should also be listed here.)

Boot chaining

Now the Boot ROM gets to the essence of its existence. It starts searching for the next element of the boot chain, the Secondary Program Loader (SPL). It will be described in one of the next parts of this series. For now, let’s look at how ROM Code searches for it. SPL is the first program executed by the chip that comes from outside the AM335x processor. The board designer must choose the right boot device list (using the SYSBOOT pins) to tell ROM Code where to look for SPL. This process is similar to setting up the boot sequence in a PC BIOS. If one device does not contain it, the next one is taken from the list.

AM335x allows booting from memory (like MMC or internal NAND) or peripherals (UART, Ethernet, or USB). If all of the boot methods in the sequence fail, ROM Code enters one of the dead loops (offset 0x20080). All possible SYSBOOT combinations are described in Table 26-7 of the Reference Manual. It is too big to paste here; below you can find the entries used on BeagleBone Black. In this series, I will use and describe booting from an external SD card (MMC0), which makes it easy to update the firmware and follow the kernel on it :). If you want to read more about other types of boot devices, please check chapter 26 of the Reference Manual.
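
The fallback logic is easy to model. The following is a minimal sketch, not ROM Code itself: the probe callbacks are hypothetical stand-ins for the real device drivers, and the order of the last two devices in the example list is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe: returns true if a valid SPL image was found on the device. */
typedef bool (*probe_fn)(void);

typedef struct {
    const char *name;
    probe_fn    probe;
} boot_device;

/* Walks the list in order; returns the index of the first device that
 * delivers an image, or -1 when every device failed (dead loop in ROM). */
int select_boot_device(const boot_device *list, size_t count)
{
    assert(list != NULL);
    for (size_t i = 0; i < count; i++) {
        if (list[i].probe != NULL && list[i].probe())
            return (int)i;
    }
    return -1;
}

/* Stand-ins for real device drivers. */
bool probe_empty(void)      { return false; }
bool probe_with_image(void) { return true; }

/* Example list: SPI memory is not populated, the SD card on MMC0 holds the image. */
const boot_device example_boot_list[] = {
    { "SPI0",  probe_empty },
    { "MMC0",  probe_with_image },
    { "USB0",  probe_empty },      /* order of the last two is an assumption */
    { "UART0", probe_empty },
};
```

Calling `select_boot_device(example_boot_list, 4)` returns 1, the MMC0 entry; a list in which every probe fails returns -1, which corresponds to the dead loop.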

BeagleBone Black boot options

The first combination starts booting from MMC1 (the soldered 4 GB embedded MMC memory). If there is no image written on the eMMC, the MMC0 interface is examined; it is wired to the micro SD slot, which will hold our SD card:

If there is no image on the eMMC, both scenarios are almost the same: we will boot from the SD card. However, if the eMMC is not empty, we must hold the uSD BOOT (S2) button while powering up. It pulls down the voltage on SYSBOOT 2 and forces the second boot scenario: examining SPI memory (which is not attached, so it fails) and then the MMC0 interface. The last two options (UART/USB) are not considered, since we will supply the board with a correct image on the SD card.

MMCSD

The MMCSD controller in our AM335x processor is flexible enough to handle communication with a micro SD card. There is a lot of information about MMCSD controller communication in chapter 18 of the Reference Manual. The protocol used with SD cards is quite simple. There is a clock signal (MMC0_CLK), which clocks the transfer over 4 data lines (MMC0_DAT0-3). Data transfers are controlled by commands, sent serially over the MMC0_CMD line. MMC0_CD is the card detection line. As you can see on the schematics, it is pulled up to 3.3 V. If a card is inserted, a mechanical switch connects this line to the grounded housing and drives the pin low. According to the documentation, Boot ROM Code does not use it, but sends a command and waits for a response instead. This is reasonable, since the polarity of this signal may vary between boards.

MMC initialization is quite complicated because it covers different storage standards. The MMCSD controller supports MMC memories (8 data lines) and SD cards (4 or 1 DAT lines, or differential transfer on UHS-II) of different sizes and transfer rates. The transfer might use Single Data Rate or Double Data Rate, where data is clocked on both the rising and the falling edge. All these details must be negotiated on the command line. The protocol used there is compatible with all standards. Requests are sent serially as 48-bit packets. The first two bits (01) form the start sequence, and the next 6 are the command number. Right after that, a 4-byte argument is passed. The packet ends with a 7-bit CRC and a stop bit. Commands are described in manuals as CMDXX, where XX is the 6-bit command number. Sometimes they are written as ACMDXX, which stands for Application-Specific Command. The response might be 48 or 136 bits long, depending on the request type.
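
To make the 48-bit layout concrete, here is a small sketch that assembles a command packet. The CRC7 polynomial (x^7 + x^3 + 1) comes from the SD specification rather than from this article, and the function names are mine. In real hardware the MMCSD controller builds these frames itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC7 with polynomial x^7 + x^3 + 1, computed bit by bit, MSB first. */
uint8_t sd_crc7(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t d = data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc <<= 1;
            if ((d ^ crc) & 0x80u)
                crc ^= 0x09u;
            d <<= 1;
        }
    }
    return crc & 0x7Fu;
}

/* Fills frame[0..5]: start bits 01 + 6-bit index, 32-bit argument
 * (most significant byte first), CRC7 and the stop bit. */
void sd_build_command(uint8_t index, uint32_t arg, uint8_t frame[6])
{
    assert(index < 64);
    frame[0] = (uint8_t)(0x40u | index);
    frame[1] = (uint8_t)(arg >> 24);
    frame[2] = (uint8_t)(arg >> 16);
    frame[3] = (uint8_t)(arg >> 8);
    frame[4] = (uint8_t)arg;
    frame[5] = (uint8_t)((sd_crc7(frame, 5) << 1) | 0x01u);
}
```

For CMD0 with a zero argument this produces the bytes 40 00 00 00 00 95, and for CMD8 with argument 0x1AA it produces 48 00 00 01 AA 87.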

MMC initialization probably starts with CMD0, the card reset. I haven’t checked the sequence; however, I assume it’s safer to make sure that the card is in the idle state. In the idle state, commands and responses are clocked at the default 400 kHz rate, which should be supported by all standards. The next command strongly depends on the card type, so I will focus on the SD standard. Here CMD8 is sent to determine whether the current voltage conditions are OK for the card. It also tells whether the card supports version 2.0 of the SD standard (where this command was introduced). If there is no response to it, the controller knows that the card follows an older standard.

Let’s assume that the response was correct, with the same value in the VHS field and the same check pattern. The next command is ACMD41. It is an Application-Specific command (preceded by CMD55), which starts the initialization process. The host sends the suggested configuration bits (HCS, SDXC Power Control, S18R) and checks whether these settings are supported by the card. If the settings are not supported, the card goes into the Inactive state, and the whole initialization must be repeated. Otherwise, the card sends a response with the OCR (Operational Conditions Register) value. It contains a busy bit that tells whether the card’s initialization has completed or is still ongoing. In the latter case, ACMD41 must be repeated to poll the initialization. All configuration bits are presented on the diagram below. With HCS, the host declares its support for High Capacity or eXtended Capacity cards. If the SDXC standard is supported, the card might be put into power saving or fast mode with XPC. The UHS-I standard also allows switching the logic voltage to 1.8 V (shorter edges and faster data transfers). This can be requested with the S18R bit and activated with CMD11 later on.
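
The handshake can be sketched as two helpers: one builds the ACMD41 argument and one inspects the returned OCR. The bit positions are how I read them in the SD specification [3], not something taken from ROM Code, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ACMD41_HCS     (1u << 30)   /* host supports SDHC/SDXC */
#define ACMD41_XPC     (1u << 28)   /* SDXC power control */
#define ACMD41_S18R    (1u << 24)   /* request switch to 1.8 V signalling */
#define OCR_VDD_WINDOW 0x00FF8000u  /* bits 23:15, 2.7 V to 3.6 V */
#define OCR_INIT_DONE  (1u << 31)   /* the "busy" bit: set once init finished */
#define OCR_CCS        (1u << 30)   /* card capacity status: set for SDHC/SDXC */

/* Builds the 32-bit ACMD41 argument from the host capabilities. */
uint32_t acmd41_argument(bool hcs, bool xpc, bool s18r, uint32_t vdd_window)
{
    assert((vdd_window & ~OCR_VDD_WINDOW) == 0);
    uint32_t arg = vdd_window;
    if (hcs)  arg |= ACMD41_HCS;
    if (xpc)  arg |= ACMD41_XPC;
    if (s18r) arg |= ACMD41_S18R;
    return arg;
}

/* True when the card reports that its initialization has completed;
 * while this is false the host keeps repeating CMD55 + ACMD41. */
bool ocr_init_done(uint32_t ocr)
{
    return (ocr & OCR_INIT_DONE) != 0;
}

/* Only meaningful once ocr_init_done() is true. */
bool ocr_high_capacity(uint32_t ocr)
{
    return (ocr & OCR_CCS) != 0;
}
```

A host that supports high-capacity cards on a 3.3 V bus would send `acmd41_argument(true, false, false, OCR_VDD_WINDOW)`, which is 0x40FF8000, and poll until `ocr_init_done()` reports true.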

The card responds with the configuration bits and the current Operational Conditions Register value in this format:

After this handshake, the host and the card know which standard will be used during later communication (Data Transfer Mode). The next phase is the Card Identification Process. All previous commands were broadcast; now the host must assign an address to each card connected to the bus. To do that, it issues CMD2, and the card responds with its Card Identification number (CID). This puts the card into the Identification state. Next, the host sends CMD3. In response, the card suggests a shorter Relative Card Address (RCA). If the host accepts it, the card will be identified by this RCA in later communication. If the host does not accept it, it must repeat CMD3.

As you may notice, a card address assignment is needed. It implies that this interface supports connecting many cards to one bus. We will not go too deep into it. The important thing is that the RCA is used during Data Transfer Mode, which starts right after CMD3. Now we can request a data read, a write, or a card erasure, using the obtained RCA.

To do that, we must move the card from the Stand-by state (in Data Transfer Mode) to the Transfer state, using CMD7. Of course, we must supply the RCA as the argument of this request. Any other card on the bus that is in the Transfer state is moved back to the Stand-by state. The whole state diagram of the Data Transfer Mode is shown below.

After the CMD7 command our card is in the Transfer state. By default, data transfers use the single DAT0 line. To widen the bus, we send an ACMD6 request with the argument 10b.
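
The way the RCA travels between CMD3, CMD7 and ACMD6 can be sketched as follows. The placement of the RCA in the upper 16 bits is how I read the SD specification; the helper names are mine.

```c
#include <assert.h>
#include <stdint.h>

#define ACMD6_BUS_WIDTH_1BIT 0x0u  /* 00b: only DAT0 */
#define ACMD6_BUS_WIDTH_4BIT 0x2u  /* 10b: DAT0-3, as used in the text */

/* The CMD3 response carries the proposed RCA in its upper 16 bits. */
uint16_t rca_from_cmd3_response(uint32_t response)
{
    return (uint16_t)(response >> 16);
}

/* CMD7 (select card) and CMD55 (prefix of every ACMD) take the RCA
 * in the upper 16 bits of their argument. */
uint32_t select_card_argument(uint16_t rca)
{
    assert(rca != 0); /* RCA 0 would deselect every card on the bus */
    return (uint32_t)rca << 16;
}
```

A host would therefore send CMD3, keep `rca_from_cmd3_response()` of the answer, pass `select_card_argument(rca)` to CMD7, and then issue CMD55 with the same argument followed by ACMD6 with `ACMD6_BUS_WIDTH_4BIT`.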

At this point a distinction must be made between SDSC (Standard Capacity) and SDHC/XC (High Capacity/eXtended Capacity) cards. The first group can change the block length (data is sent in blocks) and addresses data in bytes, but has less capacity. The latter have extended capacity, but are always addressed by block number, and the block size is always 512 bytes. After this short setup, we can access our data with CMD17 (single block read) or CMD18 (multiple block read). A CMD18 transfer is finished with CMD12 (stop transmission).
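
The addressing difference boils down to one multiplication. A minimal sketch, assuming the SDSC block length is left at 512 bytes; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SD_BLOCK_SIZE 512u

/* Argument of CMD17/CMD18 for the given 512-byte block:
 * SDHC/SDXC cards take the block number, SDSC cards take a byte address. */
uint32_t sd_read_argument(bool high_capacity, uint32_t block)
{
    if (high_capacity)
        return block;
    assert(block <= UINT32_MAX / SD_BLOCK_SIZE); /* byte address must fit in 32 bits */
    return block * SD_BLOCK_SIZE;
}
```

Reading block 8 therefore uses the argument 8 on an SDHC card and 4096 on an SDSC card.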

This state machine and the read process are managed by the MMCSD driver inside ROM Code. On top of that, we need some logical data structure to locate the SPL. There are several ways to put the SPL (from now on called MLO, for MMC Loader) on the SD card. The first is Raw mode: MLO is written directly to the SD card in four copies (at offsets 0x0, 0x20000, 0x40000, and 0x60000), without using any file system. There is also the option of writing a file called MLO to the active FAT partition. This is the way I will cover. In this case, the FAT module handles the logical memory structure.

The Reference Manual presents the layered structure of ROM Code. On top of the MMCSD driver, the FAT module is used to access data on a formatted SD card if we use this file system.

MBR

The card can be formatted with the so-called Master Boot Record (which allows putting several partitions on the card), or the whole memory may be formatted with FAT. The first approach will be used, so the first sector of our memory is a Master Boot Record (MBR). This is a logical structure describing the partitions present on the memory and specifying their details, such as type or usage flags (active/not active, boot partition). The job of ROM Code is to find an active FAT12/16/32 partition with the MLO file in its root folder. The structure of the MBR is presented below.

The first thing to do is to check whether the sector is indeed an MBR. To do this, the signature at the end is used; it must be equal to 0xAA55. Right after that, the partition entries are examined to see whether any of them contains a FAT file system and is active. Note that there is an obvious error in the Partition End Head field on the diagram below: it is 1 byte long, not 16.

This structure describes the placement of the partition on the SD card. It can be expressed in two ways: as 3-byte CHS (Cylinder Head Sector) addresses (start at offset 1, end at offset 5), or using 4-byte LBA (Logical Block Addressing) values, with the start at offset 8 and the size of the partition at offset 12. Using these data, we can check that the MBR entry is not malformed (for example, that it does not extend past the available memory) and move further.
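
The MBR checks described above fit in a few lines. This is a sketch of the idea rather than the ROM Code implementation: the partition table offset (446), the active flag value (0x80) and the list of FAT type IDs are standard MBR conventions that I am assuming ROM Code follows, and all names are mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MBR_PART_TABLE 446 /* four 16-byte entries start here */

typedef struct {
    uint8_t  type;
    uint32_t lba_start;
    uint32_t sector_count;
} mbr_partition;

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Partition type IDs commonly used for FAT12/16/32 (assumed accepted set). */
static bool is_fat_type(uint8_t type)
{
    return type == 0x01 || type == 0x04 || type == 0x06 ||
           type == 0x0E || type == 0x0B || type == 0x0C;
}

/* Finds the first active FAT partition that lies inside the card. */
bool mbr_find_active_fat(const uint8_t sector[512], uint32_t card_sectors,
                         mbr_partition *out)
{
    assert(sector != NULL && out != NULL);
    if (sector[510] != 0x55 || sector[511] != 0xAA) /* 0xAA55, little-endian */
        return false;
    for (int i = 0; i < 4; i++) {
        const uint8_t *e = sector + MBR_PART_TABLE + 16 * i;
        uint32_t start = le32(e + 8), count = le32(e + 12);
        if (e[0] != 0x80 || !is_fat_type(e[4]))
            continue;
        if (count == 0 || start >= card_sectors || count > card_sectors - start)
            continue; /* malformed: reaches outside the available memory */
        out->type = e[4];
        out->lba_start = start;
        out->sector_count = count;
        return true;
    }
    return false;
}
```

The function only uses the LBA fields; the CHS fields are ignored in this sketch.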

FAT File System

FAT stands for File Allocation Table. It is the second logical part of this file system: it comes after the Boot Sector, which includes the BIOS Parameter Block, and before the Root Folder and the Data Area. The Boot Sector is placed in the first sector occupied by the partition. What matters to us is that it contains a lot of information about the file system structure:

  • Bytes per sector: on flash media it should match the block size, since a block is the smallest erasable piece of memory. Usually, it is equal to 512.
  • Sectors per cluster: FAT divides the whole Data Area into small parts called clusters. A cluster is the smallest allocation unit. Each cluster is assigned to a single file, and a single file may occupy several clusters scattered across the whole Data Area.
  • Position of the Root directory: it is not written directly in the BIOS Parameter Block, but it can be calculated from parameters given there:
    • Number of sectors per Boot Sector
    • Number of FAT copies (FATs are usually duplicated to protect against data corruption)
    • Number of sectors per FAT
    • Absolute position of the FAT partition in flash memory space
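
The root directory calculation can be sketched like this. It covers FAT12/16 only (FAT32 keeps its root directory in the Data Area), the field offsets follow the usual BIOS Parameter Block layout, and the names are mine.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t bytes_per_sector;    /* offset 11 */
    uint8_t  sectors_per_cluster; /* offset 13 */
    uint16_t reserved_sectors;    /* offset 14, includes the Boot Sector */
    uint8_t  fat_count;           /* offset 16 */
    uint16_t root_entries;        /* offset 17 */
    uint16_t sectors_per_fat;     /* offset 22 */
} fat16_bpb;

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Extracts the fields we need from the first sector of the partition. */
void fat16_parse_bpb(const uint8_t *boot_sector, fat16_bpb *bpb)
{
    assert(boot_sector != 0 && bpb != 0);
    bpb->bytes_per_sector    = le16(boot_sector + 11);
    bpb->sectors_per_cluster = boot_sector[13];
    bpb->reserved_sectors    = le16(boot_sector + 14);
    bpb->fat_count           = boot_sector[16];
    bpb->root_entries        = le16(boot_sector + 17);
    bpb->sectors_per_fat     = le16(boot_sector + 22);
}

/* Root Folder starts right after the reserved sectors and all FAT copies. */
uint32_t fat16_root_dir_lba(const fat16_bpb *bpb, uint32_t partition_lba)
{
    return partition_lba + bpb->reserved_sectors +
           (uint32_t)bpb->fat_count * bpb->sectors_per_fat;
}

/* Data Area starts after the Root Folder (32 bytes per directory entry). */
uint32_t fat16_data_lba(const fat16_bpb *bpb, uint32_t partition_lba)
{
    assert(bpb->bytes_per_sector != 0);
    uint32_t root_bytes = (uint32_t)bpb->root_entries * 32u;
    uint32_t root_sectors =
        (root_bytes + bpb->bytes_per_sector - 1u) / bpb->bytes_per_sector;
    return fat16_root_dir_lba(bpb, partition_lba) + root_sectors;
}
```

For a partition at sector 2048 with 1 reserved sector, 2 FATs of 32 sectors each and 512 root entries, the Root Folder starts at sector 2113 and the Data Area at sector 2145.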

After the Boot Sector come the File Allocation Tables. A FAT is a register of all clusters used by the file system. The offset of a FAT cell tells which cluster in the Data Area it is assigned to (the offset between the start of the FAT and the cell corresponds to the offset between the start of the Data Area and the cluster). FAT cells form singly linked lists. Each file is assigned the HEAD of such a list, which is the offset of the first FAT cell used by this file. This FAT cell is assigned to the cluster at the beginning of the file, and it contains the offset of the next FAT cell used by the file. If the file is small enough to fit into one cluster, the list consists of one cell with the value 0xFFFF. If the file is bigger than one cluster, the cell value is the offset of the next one (for example 0x0010). If the cell at offset 0x0010 contains 0xFFFF, the file occupies two clusters.
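
Walking such a list is a short loop. The sketch below assumes a FAT16 table already loaded into memory and treats every value from 0xFFF8 up to 0xFFFF as an end-of-chain marker, which is the FAT16 convention as I know it; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAT16_END_OF_CHAIN 0xFFF8u /* 0xFFF8..0xFFFF all terminate a chain */

/* Counts the clusters of the file whose first FAT cell is `head`.
 * Returns 0 if the chain is broken (points outside the table or loops). */
size_t fat16_chain_length(const uint16_t *fat, size_t fat_entries, uint16_t head)
{
    assert(fat != NULL);
    size_t count = 0;
    uint16_t cell = head;
    /* cells 0 and 1 are reserved, so a valid chain never visits them */
    while (cell >= 2 && cell < fat_entries && count < fat_entries) {
        uint16_t next = fat[cell];
        count++;
        if (next >= FAT16_END_OF_CHAIN)
            return count;
        cell = next;
    }
    return 0;
}
```

With `fat[2] = 0x0010` and `fat[0x10] = 0xFFFF`, the example from the text, a file whose head is cell 2 occupies two clusters.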

It’s quite simple, but where is the information about the head cells assigned to files? The heads of files placed in the root directory are stored in the Root Folder. As I mentioned, its location is fixed and can be calculated from the BIOS Parameter Block. Our MLO file must be placed there, so I will not describe the structure of subdirectories. The Root Folder contains up to 511 entries, whose structure is described below:

Boot ROM Code checks whether the file is called MLO and has correct attributes (it must not be a hidden file). With this FAT Directory Entry and a File Allocation Table look-up, we can easily access the MLO file data. This is done in the next step.
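
A directory entry check might look like the sketch below. It uses the classic 32-byte FAT directory entry layout (8.3 name at offset 0, attributes at offset 11, first cluster at offset 26, size at offset 28). Rejecting directories and volume labels in addition to hidden files is my assumption, not something the documentation states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FAT_ATTR_HIDDEN    0x02u
#define FAT_ATTR_VOLUME_ID 0x08u
#define FAT_ATTR_DIRECTORY 0x10u

/* True if the 32-byte entry describes a regular, visible file named MLO.
 * The name is stored as 8 + 3 characters padded with spaces, without a dot. */
bool fat_entry_is_mlo(const uint8_t entry[32])
{
    assert(entry != NULL);
    if (memcmp(entry, "MLO     " "   ", 11) != 0)
        return false;
    return (entry[11] &
            (FAT_ATTR_HIDDEN | FAT_ATTR_VOLUME_ID | FAT_ATTR_DIRECTORY)) == 0;
}

/* Head of the FAT chain (FAT12/16: 16-bit value at offset 26). */
uint16_t fat_entry_first_cluster(const uint8_t entry[32])
{
    return (uint16_t)(entry[26] | (entry[27] << 8));
}

/* File size in bytes (32-bit little-endian value at offset 28). */
uint32_t fat_entry_size(const uint8_t entry[32])
{
    return (uint32_t)entry[28] | ((uint32_t)entry[29] << 8) |
           ((uint32_t)entry[30] << 16) | ((uint32_t)entry[31] << 24);
}
```

Once `fat_entry_is_mlo()` succeeds, the first cluster is the HEAD of the FAT list described earlier and the size tells how many bytes of the last cluster belong to the file.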

Using MMC and the FAT file system implies that we must shadow the MLO code. Shadowing means copying data to another place (RAM) from which it can be easily executed. AM335x also allows the use of XIP (eXecute In Place) memories, which avoid this step, but I mention it only as a curiosity.

Running the code

Using the FAT and MMC layers, the MLO file is parsed and the image it contains is copied to address 0x402F0400 (located in the internal static RAM). For Secure devices this address is different and the available area is smaller.

It took me some time to work out the MLO file structure. It is a generic file format, which is not fully used here. At the beginning, we can find two 32-byte so-called Table Of Contents (TOC) entries. The first word of an entry is the offset of the described item, and the second is its size (as in Table 26-38). The next 12 bytes are not used by us. At the end of the TOC entry, we have the section name, which is CHSETTINGS. The last TOC entry must be filled with FF, which is why we have FFs at addresses 0x20-0x40.

Hex dump of the MLO file: 0x00-0x20 is the first TOC entry, 0x20-0x40 the second one. At 0x200 you can find the size, the destination address, and the first instruction of the SPL code (a branch to the reset routine).
Disassembly of the SPL code (the 0xea00000f instruction can easily be found at offset 0x208 in the screenshot above).

According to the documentation, the TOC is required when booting from MMC/SD in Raw mode. However, the MLO I have also contains this preamble. TI documentation doesn’t say much about its purpose. We know from the MLO code (more on that in the next chapters) that the first TOC entry points to a settings structure, which looks like this:

CHSETTINGS structure

The TOC entries and their content take up the first 512 bytes of MLO. At offset 0x200 the GP Device Image format starts; its structure is presented below (on an HS device it looks different).

At offset 0x200 of MLO, there is the size of the image to be shadowed (copied). I’m not sure why the destination address is supplied (offset 0x204), because the image is always copied into the same area, which varies only between GP and HS devices. Maybe it is a result of unifying the image format across TI platforms.

The last and most important part is the Secondary Program Loader code. It is copied directly to internal SRAM. In the screenshots above, you have seen the first instruction of the code, which is placed right there (offset 0x208 of MLO): a branch to the reset routine (0xea00000f). Only this part of MLO is copied to address 0x402F0400. After a successful image load, the Program Counter is set to that address. ROM Code leaves some information about the boot device and the reset reason in the structure presented below. A pointer to this structure is passed in the R0 register.
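
The MLO layout described in this section can be summarized in code. This is a sketch based only on the offsets mentioned above (TOC name at 0x14, size at 0x200, destination at 0x204, code at 0x208) and assumes little-endian fields; the type and function names are mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MLO_TOC_NAME_OFFSET 0x14u  /* offset (4) + size (4) + 12 unused bytes */
#define MLO_IMAGE_OFFSET    0x200u /* GP image header follows the TOC area */

typedef struct {
    uint32_t       size;      /* word at 0x200 */
    uint32_t       load_addr; /* word at 0x204 */
    const uint8_t *code;      /* first SPL instruction, at 0x208 */
} gp_image;

uint32_t mlo_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* True if the first TOC entry is named CHSETTINGS. */
bool mlo_has_chsettings(const uint8_t *mlo, size_t len)
{
    assert(mlo != NULL);
    return len >= MLO_TOC_NAME_OFFSET + 10u &&
           memcmp(mlo + MLO_TOC_NAME_OFFSET, "CHSETTINGS", 10) == 0;
}

/* Reads the GP image header; fails if the file is too short to hold it. */
bool mlo_parse_gp_image(const uint8_t *mlo, size_t len, gp_image *out)
{
    assert(mlo != NULL && out != NULL);
    if (len < MLO_IMAGE_OFFSET + 8u)
        return false;
    out->size      = mlo_le32(mlo + MLO_IMAGE_OFFSET);
    out->load_addr = mlo_le32(mlo + MLO_IMAGE_OFFSET + 4u);
    out->code      = mlo + MLO_IMAGE_OFFSET + 8u;
    return true;
}
```

For the MLO shown in the screenshots, `load_addr` would be 0x402F0400 and the first word at `code` would be 0xea00000f.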

Summary

When I started this article, I had a completely different concept. I thought that one article would be sufficient to describe Boot ROM Code, MLO, U-Boot, and the head of the Linux kernel. It turned out that the topic is much deeper than I thought, and the first part alone filled this long article. I’m really happy about that because I found a lot of new information, which will hopefully be new to the readers as well.

We could start the next chapter with the topic I already mentioned, the MLO code; however, I think a better idea is to focus on the Kernel Build System first. It is shared by U-Boot, Linux, and some other projects, and it mainly uses Makefile scripts. Knowing it will give us strong foundations before diving into the code.

I hope it will be a shorter article, because the one after it will be much more interesting.

[1] – https://e2e.ti.com/support/processors/f/791/t/308183?AM335x-boot-ROM-memory-map
[2] – http://www.staroceans.org/e-book/AM335x_U-Boot_User’s_Guide.html
[3] – http://academy.cba.mit.edu/classes/networking_communications/SD/SD.pdf

stdatomic.h under the hood #2

In today’s part of the series, I will find out how the code from #1 (http://olejniczak.ovh/index.php/2020/12/18/stdatomic-h-under-the-hood-1/) is compiled on the ARM architecture. I don’t want to make this text longer than necessary, so if you want to examine the code, check #1 of this series.

ARM

Below you can find details of tested architecture and compiler:

$ arm-linux-gnueabihf-gcc -v
Using built-in specs.
...
Target: arm-linux-gnueabihf
... --disable-multilib --enable-multiarch --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard ... --enable-threads=posix --disable-libstdcxx-pch --enable-linker-build-id --enable-plugin --enable-gold --enable-c99 --enable-long-long --with-mode=thumb --disable-multilib --with-float=hard
Thread model: posix
gcc version 4.9.2 20140904 (prerelease) (crosstool-NG linaro-1.13.1-4.9-2014.09 - Linaro GCC 4.9-2014.09)

Let’s see how the compiled programs behave on this machine:

# ./non-synchronized-arm 
-22301
# ./non-synchronized-arm 
96532
# ./non-synchronized-arm 
225150
# ./non-synchronized-arm 
-66366
# ./non-synchronized-arm 
-120416
# ./non-synchronized-arm 
4340
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0
# ./synchronized-arm 
0

The difference made by stdatomic is obvious, so let’s check what the ARM compiler does to make it happen. The first assembly listing is generated from the non-synchronized.c file:

Thread:
	@ args = 0, pretend = 0, frame = 16
	@ frame_needed = 1, uses_anonymous_args = 0
	@ link register save eliminated.
	str	fp, [sp, #-4]!
	add	fp, sp, #0
	sub	sp, sp, #20
	str	r0, [fp, #-16]
	mov	r3, #0
	str	r3, [fp, #-8]
	b	.L4
.L7:
	ldr	r3, [fp, #-16]
	cmp	r3, #0
	beq	.L5
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	ldr	r3, [r3]
	add	r2, r3, #1
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	str	r2, [r3]
	b	.L6
.L5:
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	ldr	r3, [r3]
	sub	r2, r3, #1
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	str	r2, [r3]
.L6:
	ldr	r3, [fp, #-8]
	add	r3, r3, #1
	str	r3, [fp, #-8]
.L4:
	ldr	r2, [fp, #-8]
	movw	r3, #16959
	movt	r3, 15
	cmp	r2, r3
	ble	.L7
	mov	r3, #0
	mov	r0, r3
	sub	sp, fp, #0
	@ sp needed
	ldr	fp, [sp], #4
	bx	lr
	.size	Thread, .-Thread


The code is quite similar to the x86 version. The most crucial parts are L7, which implements the addition, and L5, which implements the subtraction. Let’s see what they do:

	movw	r3, #:lower16:x /* Load less significant 16 bit of x address */
	movt	r3, #:upper16:x /* ... and more significant part */
	ldr	r3, [r3]        /* Load value of x to r3 register */
	add	r2, r3, #1      /* Add 1 to x (sub in L5) */
	movw	r3, #:lower16:x /* Put address to register (like previously)  */
	movt	r3, #:upper16:x
	str	r2, [r3]        /* Store result */
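
Why this sequence loses updates can be shown without any threads. The sketch below replays the three steps (load, add, store) for two imaginary threads in one unlucky order; it is a model of the interleaving, not the code from part #1, and all names are mine.

```c
#include <assert.h>

/* One register of an imaginary thread, like r3/r2 in the listing. */
typedef struct {
    int reg;
} cpu_thread;

void step_load(cpu_thread *t, const int *x)  { t->reg = *x; }     /* ldr */
void step_add(cpu_thread *t, int delta)      { t->reg += delta; } /* add */
void step_store(const cpu_thread *t, int *x) { *x = t->reg; }     /* str */

/* Two increments of x, where thread B loads x before thread A stores it. */
int lost_update_demo(void)
{
    int x = 0;
    cpu_thread a = { 0 }, b = { 0 };

    step_load(&a, &x);
    step_load(&b, &x);   /* B is scheduled here and still sees 0 */
    step_add(&a, 1);
    step_store(&a, &x);  /* x becomes 1 */
    step_add(&b, 1);
    step_store(&b, &x);  /* x becomes 1 again, A's update is lost */

    assert(x != 2);
    return x;
}
```

Two increments ran, yet `lost_update_demo()` returns 1. With one thread adding and the other subtracting, the same effect produces the random results seen in the non-synchronized runs.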

We can see that this code is longer than its x86 counterpart. Now let’s see what the arm-linux-gnueabihf stdatomic implementation looks like:

Thread:
	@ args = 0, pretend = 0, frame = 40
	@ frame_needed = 1, uses_anonymous_args = 0
	stmfd	sp!, {fp, lr}
	add	fp, sp, #4
	sub	sp, sp, #40
	str	r0, [fp, #-40]
	mov	r3, #0
	str	r3, [fp, #-8]
	b	.L4
.L11:
	ldr	r3, [fp, #-40]
	cmp	r3, #0
	beq	.L5
	mov	r3, #1
	str	r3, [fp, #-32]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-28]
.L8:
	ldr	r2, [fp, #-28]
	ldr	r3, [fp, #-32]
	add	r3, r2, r3
	str	r3, [fp, #-24]
	ldr	r3, [fp, #-24]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #28
	ldr	r0, [r2]
	dmb	sy
.L13:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L14
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L13
.L14:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L6
	str	r1, [r2]
.L6:
	cmp	r3, #0
	bne	.L7
	b	.L8
.L5:
	mov	r3, #1
	str	r3, [fp, #-20]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-16]
.L10:
	ldr	r2, [fp, #-16]
	ldr	r3, [fp, #-20]
	rsb	r3, r3, r2
	str	r3, [fp, #-12]
	ldr	r3, [fp, #-12]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #16
	ldr	r0, [r2]
	dmb	sy
.L15:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L16
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L15
.L16:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L9
	str	r1, [r2]
.L9:
	cmp	r3, #0
	bne	.L7
	b	.L10
.L7:
	ldr	r3, [fp, #-8]
	add	r3, r3, #1
	str	r3, [fp, #-8]
.L4:
	ldr	r2, [fp, #-8]
	movw	r3, #16959
	movt	r3, 15
	cmp	r2, r3
	ble	.L11
	mov	r3, #0
	mov	r0, r3
	sub	sp, fp, #4
	@ sp needed
	ldmfd	sp!, {fp, pc}
	.size	Thread, .-Thread

This time, the synchronized implementation is much more complex, but that’s not a problem for us :). After a short initialization, the code jumps to L4, where the for loop is implemented. If the break condition is false, the code jumps to L11. L11 in turn starts by checking the arg parameter against NULL (check the code in part #1). If it is not NULL, the L11-L6 code performs the addition; otherwise, a jump to L5 is made. If you compare the code in L11-L6 and L5-L9, you can see that they are almost the same:

	/* Addition branch (L11-L6) */
	mov	r3, #1
	str	r3, [fp, #-32]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-28]
.L8:
	ldr	r2, [fp, #-28]
	ldr	r3, [fp, #-32]
	add	r3, r2, r3
	str	r3, [fp, #-24]
	ldr	r3, [fp, #-24]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #28
	ldr	r0, [r2]
	dmb	sy
.L13:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L14
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L13
.L14:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L6
	str	r1, [r2]
.L6:
	cmp	r3, #0
	bne	.L7
	b	.L8

	/* Subtraction branch (L5-L9) */
	mov	r3, #1
	str	r3, [fp, #-20]
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	dmb	sy
	ldr	r3, [r3]
	dmb	sy
	str	r3, [fp, #-16]
.L10:
	ldr	r2, [fp, #-16]
	ldr	r3, [fp, #-20]
	rsb	r3, r3, r2
	str	r3, [fp, #-12]
	ldr	r3, [fp, #-12]
	mov	ip, r3
	movw	r3, #:lower16:x
	movt	r3, #:upper16:x
	sub	r2, fp, #16
	ldr	r0, [r2]
	dmb	sy
.L15:
	ldrex	r1, [r3]
	cmp	r1, r0
	bne	.L16
	strex	lr, ip, [r3]
	cmp	lr, #0
	bne	.L15
.L16:
	dmb	sy
	moveq	r3, #1
	movne	r3, #0
	cmp	r3, #0
	bne	.L9
	str	r1, [r2]
.L9:
	cmp	r3, #0
	bne	.L7
	b	.L10

Each block starts by saving the literal 1 to a local variable on the stack (frame pointer-32 for addition and frame pointer-20 for subtraction). Right after that, register r3 is filled with the address of the x variable. Then we can see something new: the instruction dmb sy, which was not used in the non-synchronized code. According to the ARM developers guide (https://developer.arm.com/documentation/dui0489/c/arm-and-thumb-instructions/miscellaneous-instructions/dmb--dsb--and-isb):

Data Memory Barrier acts as a memory barrier. It ensures that all explicit memory accesses that appear in program order before the DMB instruction are observed before any explicit memory accesses that appear in program order after the DMB instruction.

The dmb instruction takes an option that defines which accesses the barrier waits for. In this case (sy), we wait until the whole memory system finishes its job. It ensures so-called Sequential Consistency. The barrier is placed before the ldr instruction that loads the current value of x (ldr, like str, should be considered asynchronous) to make sure that all stores issued before it have completed (in other words, that we load the most recent value).

After the ldr operation finishes (which is synchronized by the second dmb sy), we store the loaded value in a local copy on the stack (fp-28 in the addition branch and fp-16 in the subtraction branch). Keep in mind that only the access to the variable shared by both threads is surrounded by dmb instructions.

The L8 and L10 blocks load the local copy of x into the r2 register and the value 1 into r3, then add (or subtract) them, saving the result in r3. After that, the result is stored in a new place: fp-24 in the addition branch and fp-12 in the subtraction branch. The result is also moved to the ip (r12) register. Right after that, the r3 register is once again filled with the address of x, and r2 is filled with the address fp-28 (fp-16). As described in the previous paragraph, these locations hold our local copies of the x value, which are now loaded into the r0 register. Keep in mind that this value was not changed by our addition (subtraction).

Then, in L13 (L15), ldrex loads the current value of x into r1 and the comparison between r1 (value of x) and r0 (local copy from before the operation) is done. If these values are not equal, it means that someone changed x during our L8 (L10) execution. Let’s analyze this case: we jump to L14 (L16). This block is not intuitive at all. It synchronizes memory operations, loads 1 into r3 if our previous comparison found the values equal (not this time), and loads 0 otherwise (this case). Then it compares r3 against 0 (we have just loaded 0, so they are equal) and executes bne to L6 (L9), which in our scenario is not taken (bne is “branch if not equal”). ARM is known for its conditional instructions. They are nice, but working them out in such a context is not easy :). We didn’t jump, so we store r1 (the value of x) at the address held in r2 (our local copy). After that, the cmp + bne + b sequence moves our instruction pointer to L8 (L10) and all the described operations are repeated. Not efficient :(.

Now let’s go back to cmp r1, r0 and assume that the value has not changed. In this case, we store our result with strex in the shared x location. strex stands for store exclusive, so we can expect some kind of synchronization, probably dependent on the earlier ldrex. It stores our changed value at the address of the shared x and returns a status in the lr register (a synonym of r14). If the store is successful, it returns 0, and our execution path goes to L14 (L16). The previous paragraph worked through this block; this time the flags say “equal”, r3 is set to 1, and we finish in the common L7 block, which prepares the next step of the for loop. If the store failed, we repeat L13 (L15) once again.

Could you write it simpler?

Yes :). I prepared some pseudocode describing this assembly listing. Here it is:

   l_1  <- 1
   ------ Memory barrier
   r3   <- x
   ------ Memory barrier
   r3   -> l_x
L10:
   r2   <- l_x
   r3   <- l_1
   r3   <- r2 +/- r3
   r3   -> l_res
   ip   <- l_res
   r0   <- l_x
   ------ Memory barrier
L15:
   r1   <- x /* exclusive */
   if (r1 != r0) goto L16
   ip   -> x /* exclusive, with result to lr */
   if (lr != 0) goto L15
L16:
   ------ Memory barrier
   if (r1 != r0) { r1 -> l_x; goto L10 }

   l_x = x;
   l_1 = 1;

start:
   l_res = l_x +/- l_1;

storing:
   if (l_x != x(ex)) /* exclusive x load */
   {
      l_x = x;
      goto start;
   }
   else
   {
      /* exclusive store, yields 0 on success */
      if (x =(ex) l_res)
         goto storing;
   }

The l_ prefix stands for a local variable copy. Exclusive operations (strex and ldrex) are marked with (ex). If you didn’t follow my assembly analysis, focus on the C-like pseudocode.
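
The same loop can be written in portable C11 with a compare-and-exchange. This is a sketch of the pattern, not the code from part #1; I reuse the l_x and l_res names from the pseudocode, and cas_add is a hypothetical helper.

```c
#include <assert.h>
#include <stdatomic.h>

/* Adds delta to *x atomically and returns the new value.
 * atomic_compare_exchange_strong stores l_res only if *x still equals l_x;
 * on failure it refreshes l_x with the current value, like "l_x = x; goto start". */
int cas_add(atomic_int *x, int delta)
{
    assert(x != 0);
    int l_x = atomic_load(x); /* l_x = x */
    int l_res;
    do {
        l_res = l_x + delta;  /* start: l_res = l_x +/- l_1 */
    } while (!atomic_compare_exchange_strong(x, &l_x, l_res));
    return l_res;
}
```

On ARMv7 the compare-and-exchange is what typically turns into the ldrex/strex pair; the retry of a failed strex happens inside it, while the retry after a changed x is the visible do/while loop.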

The ldrex and strex instructions are what does the job in this program. This pair allows an atomic load-modify-store operation via the so-called “Exclusive monitor”. The ldrex instruction initializes its state machine and tells it to wait for the following strex. If a context switch that may corrupt the atomic operation happens between these two steps, strex returns 1 in the lr register. In this case, the ldrex + strex operation must be repeated. If this context switch also changed the value of x, the whole calculation done before must be repeated with the current value (goto start).

Memory barrier

I found some opinions on the internet that the dmb instruction is only needed on multi-core processors. I was curious whether it is needed in our context (the single-core TI AM3359 processor) or not. To check this, I removed all memory barrier instructions from the assembly and compiled a new version:

$ cat synchronized-arm.s | grep -v dmb > synchronized-arm-no-barrier.s 
$ arm-linux-gnueabihf-gcc synchronized-arm-no-barrier.s -lpthread -o synchronized-arm-no-barrier
$ scp synchronized-arm-no-barrier root@mydevice:/root
$ ssh root@mydevice
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0
# ./synchronized-arm-no-barrier 
0

My test suggests that these barriers are redundant here; however, I’m not sure whether corruption is impossible in this case or merely unlikely.
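
C11 offers a way to ask for this without editing the assembly: memory_order_relaxed keeps the read-modify-write atomic but drops the ordering guarantees that the barriers provide. The sketch below only shows the API; I have not checked whether the GCC 4.9 toolchain used here really omits dmb for it, and the function names are mine.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomic increment/decrement with no ordering requirements on other memory. */
void counter_inc_relaxed(atomic_int *x)
{
    atomic_fetch_add_explicit(x, 1, memory_order_relaxed);
}

void counter_dec_relaxed(atomic_int *x)
{
    atomic_fetch_sub_explicit(x, 1, memory_order_relaxed);
}

/* Single-threaded sanity check: equal numbers of increments and
 * decrements must leave the counter unchanged. */
int run_balanced(int iterations)
{
    assert(iterations >= 0);
    atomic_int x = 0;
    for (int i = 0; i < iterations; i++) {
        counter_inc_relaxed(&x);
        counter_dec_relaxed(&x);
    }
    return atomic_load_explicit(&x, memory_order_relaxed);
}
```

Relaxed ordering is enough for a plain counter like ours, because no other data is published through it. If the counter guarded other memory, the default sequentially consistent operations (and their barriers) would be the safe choice.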

Summary

This code is complicated compared to the x86 implementation. It may be due to the old gcc version. However, I suppose that the ARM instruction set is focused on low power usage, and that is the main reason. I made some measurements of execution time. To make them more reliable, I should change the for loop to a longer run, but I will leave that as an exercise for the reader 🙂

$ ##### x86 :
$ time ./non-synchronized
708439
real	0m0,011s
user	0m0,017s
sys	0m0,004s
$ time ./non-synchronized
-998811
real	0m0,009s
user	0m0,016s
sys	0m0,000s
$ time ./non-synchronized
986583
real	0m0,011s
user	0m0,015s
sys	0m0,004s
$ time ./synchronized
0
real	0m0,039s
user	0m0,068s
sys	0m0,004s
$ time ./synchronized
0
real	0m0,053s
user	0m0,097s
sys	0m0,008s
$ time ./synchronized
0
real	0m0,053s
user	0m0,103s
sys	0m0,000s

# ##### ARM Cortex-A8
# time ./non-synchronized-arm 
-396494
real	0m 0.07s
user	0m 0.04s
sys	0m 0.00s
# time ./non-synchronized-arm 
0
real	0m 0.08s
user	0m 0.03s
sys	0m 0.01s
# time ./non-synchronized-arm 
-531472
real	0m 0.07s
user	0m 0.04s
sys	0m 0.00s
# time ./synchronized-arm 
0
real	0m 0.56s
user	0m 0.34s
sys	0m 0.00s
# time ./synchronized-arm 
0
real	0m 0.56s
user	0m 0.33s
sys	0m 0.01s
# time ./synchronized-arm 
0
real	0m 0.54s
user	0m 0.34s
sys	0m 0.00s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.12s
user	0m 0.06s
sys	0m 0.02s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.09s
user	0m 0.07s
sys	0m 0.01s
# time ./synchronized-arm-no-barrier 
0
real	0m 0.10s
user	0m 0.07s
sys	0m 0.01s

We can see that stdatomic made execution 3-5 times slower on x86. The code generated by the ARM compiler slowed down 7-8 times. Surprisingly, the main bottleneck in the generated code was the dmb sy instruction. Removing it gives synchronized code that executes only 1.5-2 times slower than the non-synchronized version! The question is whether removing memory barriers is 100% bullet-proof.

I hope you are not sleepy after reading this post :). If you have any suggestions or remarks regarding this series, please add a comment. Next time I will try to compile something similar for a typical embedded architecture, AVR.