Binary Loading Basics
Ever wondered how an executable file is loaded by the operating system, even though it looks like a random blob of data when opened in a text editor?
This post gives you all the answers.
Prerequisites
- Basic knowledge of the C++ programming language and command line tools.
- A Windows C/C++ development environment, preferably Visual Studio.
- The nasm, gcc, and objdump tools. They can be installed on any Linux distro, including those running on top of WSL.
What's inside an executable file
An executable file is a representation of a computer program.
As learned before, a program contains two parts: code and data. These two components are also what's inside an executable file.
Executable files also contain instructions that tell the operating system how to prepare the code and data. Much of this information is described by "section headers", which you can explore with the objdump tool:
trung@DESKTOP-5OCA2N2:~$ objdump -h /bin/bash
/bin/bash: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .interp 0000001c 0000000000000318 0000000000000318 00000318 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .note.gnu.property 00000030 0000000000000338 0000000000000338 00000338 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 .note.gnu.build-id 00000024 0000000000000368 0000000000000368 00000368 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .note.ABI-tag 00000020 000000000000038c 000000000000038c 0000038c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 .gnu.hash 00004cb8 00000000000003b0 00000000000003b0 000003b0 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
5 .dynsym 0000f168 0000000000005068 0000000000005068 00005068 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
6 .dynstr 0000a143 00000000000141d0 00000000000141d0 000141d0 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
7 .gnu.version 0000141e 000000000001e314 000000000001e314 0001e314 2**1
CONTENTS, ALLOC, LOAD, READONLY, DATA
8 .gnu.version_r 000000f0 000000000001f738 000000000001f738 0001f738 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
9 .rela.dyn 0000e2e0 000000000001f828 000000000001f828 0001f828 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
10 .rela.plt 000014b8 000000000002db08 000000000002db08 0002db08 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
11 .init 0000001b 000000000002f000 000000000002f000 0002f000 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
12 .plt 00000de0 000000000002f020 000000000002f020 0002f020 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
13 .plt.got 00000030 000000000002fe00 000000000002fe00 0002fe00 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
14 .plt.sec 00000dd0 000000000002fe30 000000000002fe30 0002fe30 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
15 .text 000e573a 0000000000030c00 0000000000030c00 00030c00 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
16 .fini 0000000d 000000000011633c 000000000011633c 0011633c 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
17 .rodata 00019df0 0000000000117000 0000000000117000 00117000 2**5
CONTENTS, ALLOC, LOAD, READONLY, DATA
18 .eh_frame_hdr 0000492c 0000000000130df0 0000000000130df0 00130df0 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
19 .eh_frame 0001c358 0000000000135720 0000000000135720 00135720 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
20 .init_array 00000008 00000000001529b0 00000000001529b0 001529b0 2**3
CONTENTS, ALLOC, LOAD, DATA
21 .fini_array 00000008 00000000001529b8 00000000001529b8 001529b8 2**3
CONTENTS, ALLOC, LOAD, DATA
22 .data.rel.ro 00002b20 00000000001529c0 00000000001529c0 001529c0 2**5
CONTENTS, ALLOC, LOAD, DATA
23 .dynamic 00000200 00000000001554e0 00000000001554e0 001554e0 2**3
CONTENTS, ALLOC, LOAD, DATA
24 .got 00000920 00000000001556e0 00000000001556e0 001556e0 2**3
CONTENTS, ALLOC, LOAD, DATA
25 .data 000087b0 0000000000156000 0000000000156000 00156000 2**5
CONTENTS, ALLOC, LOAD, DATA
26 .bss 0000afb0 000000000015e7c0 000000000015e7c0 0015e7b0 2**5
ALLOC
27 .gnu_debuglink 00000034 0000000000000000 0000000000000000 0015e7b0 2**2
CONTENTS, READONLY
Or, using the tool on a Windows binary:
trung@DESKTOP-5OCA2N2:~$ objdump -h /mnt/c/Windows/System32/cmd.exe
/mnt/c/Windows/System32/cmd.exe: file format pei-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0003463d 0000000140001000 0000000140001000 00001000 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .rdata 00009602 0000000140036000 0000000140036000 00036000 2**4
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 .data 00001000 0000000140040000 0000000140040000 00040000 2**4
CONTENTS, ALLOC, LOAD, DATA
3 .pdata 00002568 000000014005c000 000000014005c000 00041000 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 .didat 000000a8 000000014005f000 000000014005f000 00044000 2**2
CONTENTS, ALLOC, LOAD, DATA
5 .rsrc 000084f8 0000000140060000 0000000140060000 00045000 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
6 .reloc 000001d0 0000000140069000 0000000140069000 0004e000 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
Notice some labels attached to the sections:
- CODE: The section contains machine code.
- DATA: The section contains data.
- READONLY: The section is meant to be readonly. This is applied to most code sections and some data sections holding constants.
Also notice some common section names:
- .text: A readonly section, usually code.
- .data: A section containing data, usually writable. The section stores the initial values of variables.
- .rodata or .rdata: A section containing constant data.
The ELF format
Looking closely at the /bin/bash example above, you can see that the file format is specified as elf64-x86-64. ELF (the Executable and Linkable Format) is the standard executable format on Linux and most other Unix-like operating systems. The notable exceptions are Windows (which uses the Portable Executable format) and macOS (which uses Mach-O).
While the format seems complex with a lot of information to support different scenarios, to build our first loader, we only need to care about these parts:
File header
Contains general information about the executable. Located right at the beginning of the file.
- e_ident[EI_MAG0] through e_ident[EI_MAG3]: The ELF magic number: 0x7F, 'E', 'L', 'F'. These 4 bytes are used to verify whether the file in question is actually an ELF file.
- e_entry: The entry point. This is the virtual address of the binary's starting function.
- e_phoff: The offset of the program headers, relative to the start of the file.
- e_phnum: The number of program headers.
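The fields above can be read with a few lines of C++. This is only a minimal sketch: the names Elf64Header and parse_elf_header are made up for illustration, the struct is a hand-written copy of the 64-bit ELF header layout (Windows toolchains do not ship elf.h), and the code assumes a 64-bit ELF file read on a little-endian machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hand-written mirror of the 64-bit ELF file header (Elf64_Ehdr in the
// Linux <elf.h> header). The struct name is an assumption of this sketch.
struct Elf64Header {
    unsigned char e_ident[16]; // e_ident[0..3] hold the magic number
    std::uint16_t e_type;
    std::uint16_t e_machine;
    std::uint32_t e_version;
    std::uint64_t e_entry;     // virtual address of the entry point
    std::uint64_t e_phoff;     // file offset of the program header table
    std::uint64_t e_shoff;
    std::uint32_t e_flags;
    std::uint16_t e_ehsize;
    std::uint16_t e_phentsize;
    std::uint16_t e_phnum;     // number of program headers
    std::uint16_t e_shentsize;
    std::uint16_t e_shnum;
    std::uint16_t e_shstrndx;
};
static_assert(sizeof(Elf64Header) == 64, "Elf64Header must match the on-disk layout");

// Copies the file header out of `data` and returns true if the buffer is
// large enough and starts with the ELF magic number; returns false otherwise.
// Assumes a 64-bit, little-endian ELF file (the class and byte-order bytes
// in e_ident are not checked here).
bool parse_elf_header(const unsigned char* data, std::size_t size, Elf64Header& out) {
    assert(data != nullptr);
    static const unsigned char kMagic[4] = {0x7F, 'E', 'L', 'F'};
    if (size < sizeof(Elf64Header)) return false;
    if (std::memcmp(data, kMagic, sizeof(kMagic)) != 0) return false;
    std::memcpy(&out, data, sizeof(out)); // memcpy avoids unaligned reads
    return true;
}
```

Given the bytes of a file in memory, a call such as parse_elf_header(bytes, length, header) fills header.e_entry, header.e_phoff, and header.e_phnum, which are the three values the rest of a simple loader needs.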
Program header
Contains more information about the program included in the executable.
- p_type: The header type. As a simple loader, we only need to take care of PT_LOAD headers.
- p_flags: Memory protection flags for the segment. Will be discussed more below.
- p_offset: Offset of the data this header represents, relative to the start of the file.
- p_vaddr: The virtual address at which this segment should be loaded in memory.
- p_filesz: Size in bytes of the data contained in the original file. These bytes have to be mapped into memory by the loader.
- p_memsz: Size in bytes of the segment once mapped into memory. This size can be greater than p_filesz, in which case the OS usually fills the excess part with zeroes.
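To make the program header fields more concrete, here is a rough sketch of how a loader might walk the program header table and keep only the PT_LOAD entries. The names Elf64ProgramHeader, collect_load_segments, and zero_fill_size are hypothetical; the struct is a hand-written copy of the 64-bit layout, and a little-endian host is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hand-written mirror of the 64-bit ELF program header (Elf64_Phdr in the
// Linux <elf.h> header). The struct name is an assumption of this sketch.
struct Elf64ProgramHeader {
    std::uint32_t p_type;   // header type; only PT_LOAD matters here
    std::uint32_t p_flags;  // memory protection flags
    std::uint64_t p_offset; // file offset of the segment data
    std::uint64_t p_vaddr;  // virtual address to load the segment at
    std::uint64_t p_paddr;
    std::uint64_t p_filesz; // bytes present in the file
    std::uint64_t p_memsz;  // bytes occupied in memory
    std::uint64_t p_align;
};
static_assert(sizeof(Elf64ProgramHeader) == 56, "must match the on-disk layout");

const std::uint32_t kPtLoad = 1; // value of PT_LOAD in the ELF specification

// Reads up to `phnum` program headers starting at file offset `phoff` and
// returns the PT_LOAD ones. Stops early if the table runs past the buffer.
std::vector<Elf64ProgramHeader> collect_load_segments(
        const unsigned char* data, std::size_t size,
        std::uint64_t phoff, std::uint16_t phnum) {
    assert(data != nullptr);
    std::vector<Elf64ProgramHeader> segments;
    if (phoff > size) return segments;
    const std::uint64_t available = size - phoff;
    for (std::uint64_t i = 0; i < phnum; ++i) {
        if ((i + 1) * sizeof(Elf64ProgramHeader) > available) break; // truncated table
        Elf64ProgramHeader ph;
        const std::size_t at = static_cast<std::size_t>(phoff + i * sizeof(ph));
        std::memcpy(&ph, data + at, sizeof(ph));
        if (ph.p_type == kPtLoad) segments.push_back(ph);
    }
    return segments;
}

// Number of trailing bytes the loader must zero: the part of the segment
// that exists in memory (p_memsz) but not in the file (p_filesz).
std::uint64_t zero_fill_size(const Elf64ProgramHeader& ph) {
    return ph.p_memsz > ph.p_filesz ? ph.p_memsz - ph.p_filesz : 0;
}
```

For a segment with p_filesz of 0x10 and p_memsz of 0x30, zero_fill_size returns 0x20: the loader copies 0x10 bytes from the file and zeroes the remaining 0x20.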
Don't worry, all of this will be clear once we start writing our loader.
Memory mappings
Usually, the operating system runs multiple processes sharing the same executable file, such as when the user opens two different cmd.exe windows. In this case, reading the executable file and copying the data into memory allocated by normal C++ routines would result in the binary being written multiple times in RAM.
To solve this problem, operating systems have a mechanism to map the same physical pages (memory regions residing in RAM) to different virtual pages (memory regions in each process's separate virtual address space). The physical pages can either be set to mirror a file on disk, or represent "anonymous" memory not associated with any file.
To map files to memory on Windows, we use the MapViewOfFile function (on a file mapping object created with CreateFileMapping). On other OSes, the equivalent function is mmap.
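As a rough illustration of the two APIs side by side, the sketch below maps a whole file read-only on either platform. MappedFile, map_file_readonly, and unmap_file are hypothetical helper names, error handling is kept to a minimum, and empty files are simply treated as a failure.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

// Hypothetical helper type: a read-only view of a whole file.
struct MappedFile {
    const unsigned char* data = nullptr; // nullptr means the mapping failed
    std::size_t size = 0;
};

MappedFile map_file_readonly(const char* path) {
    assert(path != nullptr);
    MappedFile result;
#ifdef _WIN32
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return result;
    LARGE_INTEGER file_size;
    if (GetFileSizeEx(file, &file_size) && file_size.QuadPart > 0) {
        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        if (mapping != nullptr) {
            void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
            if (view != nullptr) {
                result.data = static_cast<const unsigned char*>(view);
                result.size = static_cast<std::size_t>(file_size.QuadPart);
            }
            CloseHandle(mapping); // an open view keeps the mapping object alive
        }
    }
    CloseHandle(file);
#else
    int fd = open(path, O_RDONLY);
    if (fd < 0) return result;
    struct stat info;
    if (fstat(fd, &info) == 0 && info.st_size > 0) {
        const std::size_t length = static_cast<std::size_t>(info.st_size);
        void* view = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
        if (view != MAP_FAILED) {
            result.data = static_cast<const unsigned char*>(view);
            result.size = length;
        }
    }
    close(fd); // the mapping stays valid after the descriptor is closed
#endif
    return result;
}

void unmap_file(MappedFile& mapped) {
    if (mapped.data == nullptr) return;
#ifdef _WIN32
    UnmapViewOfFile(mapped.data);
#else
    munmap(const_cast<unsigned char*>(mapped.data), mapped.size);
#endif
    mapped = MappedFile();
}
```

Because the view is backed by the file itself rather than by a private copy, the operating system is free to share the underlying physical pages between every process that maps the same file.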
To protect important data from being unintentionally corrupted by programming errors, each virtual page is usually "protected" with a combination of the following three flags:
- READ: The page can be read.
- WRITE: The page can be written.
- EXECUTE: The page has machine code that can be executed by the CPU.
Attempting to access a virtual page (reading, writing, or executing code) without the appropriate flag set results in an access violation exception (which normally crashes the program).
The p_flags value in the ELF Program Header mentioned above instructs the loader to set the corresponding protection values to the pages containing the mapped data after loading. On Windows, we can use the VirtualProtect function for this (mprotect on most other OSes).
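One way to do that translation is sketched below. The function name page_protection_from_flags and the k-prefixed constants are made up for this example; their numeric values mirror the PF_* bits from the ELF specification and the PAGE_* constants from windows.h, so the snippet compiles without any Windows header. Windows has no write-only protection, so this sketch treats a writable segment as readable too.

```cpp
#include <cstdint>

// ELF segment permission bits (PF_X, PF_W, PF_R in the ELF specification).
constexpr std::uint32_t kPfExecute = 0x1;
constexpr std::uint32_t kPfWrite   = 0x2;
constexpr std::uint32_t kPfRead    = 0x4;

// Numeric values of the matching PAGE_* constants from <windows.h>,
// repeated under different names so they cannot clash with the real macros.
constexpr std::uint32_t kPageNoAccess         = 0x01; // PAGE_NOACCESS
constexpr std::uint32_t kPageReadOnly         = 0x02; // PAGE_READONLY
constexpr std::uint32_t kPageReadWrite        = 0x04; // PAGE_READWRITE
constexpr std::uint32_t kPageExecute          = 0x10; // PAGE_EXECUTE
constexpr std::uint32_t kPageExecuteRead      = 0x20; // PAGE_EXECUTE_READ
constexpr std::uint32_t kPageExecuteReadWrite = 0x40; // PAGE_EXECUTE_READWRITE

// Translates ELF p_flags into a Windows page protection value.
// Written as a single expression so it stays a valid C++11 constexpr function.
constexpr std::uint32_t page_protection_from_flags(std::uint32_t p_flags) {
    return (p_flags & kPfExecute)
        ? ((p_flags & kPfWrite) ? kPageExecuteReadWrite   // executable and writable
           : (p_flags & kPfRead) ? kPageExecuteRead       // executable and readable
           : kPageExecute)                                // executable only
        : ((p_flags & kPfWrite) ? kPageReadWrite          // writable (implies readable)
           : (p_flags & kPfRead) ? kPageReadOnly          // readable only
           : kPageNoAccess);                              // no permission bits set
}
```

The returned value is what would be passed as the new-protection argument of VirtualProtect once the segment's bytes are in place. A typical code segment carries read and execute flags and therefore maps to the PAGE_EXECUTE_READ value.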
Creating our loader
To understand the theory above, let us get our hands dirty and create a new ELF loader from scratch!
Since there are already so many ELF loaders for Linux and other UNIX OSes, let's do something a bit different: an ELF loader for Windows.
Creating the project
Launch Visual Studio, click "Create a new project" in the options on the right, and create a new C/C++ console application.
Make sure the selected language is C++.