Computer memory from the ground up (part 2)

In my previous post I explained how physical memory works and how it is connected to the CPU. In this post I will talk about how a program file residing in permanent storage is loaded into memory and executed.

For the purpose of this post we will refer to a generic binary format and a generic operating system. I will only cover native executable files, i.e. I will not talk about scripts or about formats such as those used by the JVM. Scripting language interpreters vary greatly, and there is little resemblance between how a native executable is treated and how an interpreter handles a script, although the interpreter itself is usually a native executable. Files for the JVM are more similar to native executables (they are native to the JVM), but the JVM is a larger topic so I will not include them here.

The code snippets in this post are written in C.

The binary file

A program is usually stored as a binary file. There are many formats for executable files. Regardless of their differences, executable files have the following three sections:

  1. Code to be executed
  2. Constants, literals and initialized global data
  3. Uninitialized global data

The code to be executed is the set of machine instructions that the CPU will run.

Constants, literals and initialized global data are needed for the program to run, for example:

unsigned int PAGE_SIZE = 64 * 1024;

The value for PAGE_SIZE will be stored in the binary file so that the value is preserved and the variable can be correctly initialized when the program is run.

Uninitialized global data is global data that is not given an explicit initial value in the source code, so there is no value that needs to be stored in the binary file.

Loading an executable file for execution

After a request has been issued to run a program, the OS needs to open the corresponding binary file and load it into memory.

The OS parses the binary file and then allocates pages of memory into which it loads the code, the constants, the literals and the initialized global data. If you do not know what a memory page is, I recommend reading my previous post https://blog.carlosware.com/2021/01/04/computer-memory-from-the-ground-up-part-1/.

Finally the OS sets up the uninitialized global data. This data is usually not stored in the binary file at all; only its size is recorded, so the OS reserves memory of that size and typically fills it with zeros.

Memory layout versus file layout

It seems natural to think that the layout of the file on disk would be the same as in memory. This would make loading easier, since the OS would just need to copy the pages from disk directly into memory.

In reality the layout of a file on disk and its layout in memory are very different. A file on disk is optimized for size in order to make it easier to copy and transfer. The file layout saves space by omitting regions that are empty or zero initialized, and instead carries a description of the memory layout so the OS is able to correctly load the file into memory.

The following code excerpt shows what is called a linker script, linker map or simply a linker description. It is the linker script used to describe the memory layout of CoLilo (https://www.nevis.columbia.edu/~chi/NCC/n3c/bootloader/colilo-HOWTO.html), a bootloader used to load the Linux kernel on a ColdFire processor.

MEMORY {
	flash	: ORIGIN = 0x30020000, LENGTH = 0x00004000
	ram 	: ORIGIN = 0x003c0000, LENGTH = 0x0003ffff
}

SECTIONS {

        .text : {
		_stext = . ;
        	*(.text)
		*(.rodata)
		_etext = . ;
        } > flash

        .data : AT (ADDR(.text) + SIZEOF(.text)) {
		_sdata = . ;
        	*(.data)
		_edata = . ;
        } > ram

        .bss : AT (ADDR(.text) + SIZEOF(.text) + SIZEOF(.data)) {
		_sbss = . ;
		*(.bss)
		*(COMMON)
		_ebss = . ;
	} > ram
}

The file starts by describing the memory layout and then describes the sections that need to be mapped into memory:

  • .text The code to be executed, together with the read-only data (.rodata)
  • .data The initialized global data
  • .bss The uninitialized global data

Since this is a bootloader, the layout is very simple. In fact, the layout of the file is almost the same as the layout in memory. The only difference is the uninitialized global data section: nothing is stored for it in the file, and the script only records where it starts and ends in memory.

One interesting detail in the script is that the code section resides in a region of memory called flash. This is very common in embedded systems, in which the bootloader is run directly from ROM or flash memory.

The code section of a program is normally loaded into its own pages and does not share pages with the other sections. The reason is that the code has to be protected from writing, and protection is applied per page: the pages holding code are marked as eXecute (and not writable), while the data pages are marked as ReadOnly or ReadWrite.

The tricky bits

As soon as a program makes use of external libraries, the layout becomes more complex. In fact, even the simplest program requires external libraries:

#include <stdio.h>

int main(int argc, char **argv) {
  printf("%s", "Hello world!");
  return 0;
}

The hello world program makes use of the printf function, which is defined in the C Library. Therefore the C Library needs to be included in our memory layout. There are two main ways to do this:

  1. Static linking: the code for the function or functions is included directly in our code
  2. Dynamic linking: the code for the function is not included, only a reference that needs to be satisfied at runtime by the OS.

If the build process used static linking, then the code for the function and its dependencies will be directly included in our file. This of course increases the size of the binary file, but it makes our program more portable since we do not need to worry about the host OS having the given library. This is similar to a fat jar in the JVM, all dependencies are packed in the fat jar so we do not need to worry about missing dependencies.

If the build process used dynamic linking, then our binary file will include a table with the required libraries, and probably the symbols that are required from each library. Continuing with our example, our binary will include a reference to printf from the C library.

The library code will be mapped into our process space by the dynamic loader, which on Linux is ld.so. The C library is one of the first libraries to be loaded, and it is usually resident in memory already because other processes use it. The OS will modify the page tables of our process to reference the pages of the C library, so that our process can access the copy that is already in memory. For more information about page tables and page mapping read my previous post https://blog.carlosware.com/2021/01/04/computer-memory-from-the-ground-up-part-1/.

There are several ways to reference a function in a library, and again it is dependent on the format of the binaries and the platform. The simplest way to refer to a symbol is by including the offset from the beginning of the code in the library:

printf = CLibraryStart + 0x00F00000

When our binary file is loaded, the OS will see there is a reference to the symbol printf which is located at an offset from the start of the C library code. The loader will then make sure that all references in the code are updated with the corresponding values, so that when printf is called it is directed to the right section of the code.

This way of resolving symbols is very fragile. If the layout of the C library changes, for example due to a different version in the host system versus the build system, then the reference might point to a wrong region of the code with potentially catastrophic results.

A slightly better way to refer to symbols is to refer to an index in a symbol table. This way our build system and our host system do not need to be identical. At link time our program will include a reference to the printf symbol as the corresponding index in the symbol table for the C library. The loader will then read that value and update the references with the value obtained from the symbol table in the C library.

printf = CLibrary[42]

This method is not foolproof either, but it is an improvement. As long as both versions of the C library are binary compatible, our program will run without problems. Binary compatibility here means that both versions have the same indexes for the existing symbols, and that new symbols are added only at the end of the symbol table so they do not interfere with the old ones.

The hidden bits

Have you ever wondered how the main function receives its parameters and why we can return from it?

int main(int argc, char **argv) {
  return 0;
}

Our code is not the first code to run. Before main is called, some startup code is executed; this code is responsible for setting up the C runtime, known as the CRT.

The CRT startup code runs after the symbols have been resolved. It is responsible for setting up the stack and recovering the command line parameters from the OS. Once that is done, it prepares the first stack frame and calls the main function.

Once main is done, the control returns to the CRT which will make sure that the OS receives the exit code from our process.

What’s next

My next post will be about the management of dynamic memory in a process. In the meantime you can help me by sharing this post with your friends.

Blog post from https://blog.carlosware.com: Loading and running a binary file

Published by carlosware

Busy dad of three with a passion for fly fishing and computers.
