The Assembler
Part of Programming Fundamentals
An assembler is a program that translates human-readable assembly language into machine code the CPU can execute.
Why This Matters
Writing programs by encoding opcodes as hexadecimal numbers by hand is possible but brutally tedious and error-prone. Get a single byte wrong, and an instruction becomes something entirely different. An assembler solves this problem: you write readable symbolic code, such as MOV A, B instead of 0x78 (its encoding on the Intel 8080), and the assembler converts it to the correct binary bytes.
For a civilization rebuilding computation, the assembler is the first tool that pays for its own existence. Once you have a working assembler, every subsequent program is easier to write, easier to read, and easier to fix. The assembler bridges the gap between human thinking and machine execution, and it is small enough to be written by hand in a few hundred to a few thousand lines of code.
Understanding how an assembler works demystifies the entire toolchain. When you know that a compiler eventually produces assembly, and assembly is fed to an assembler, and the assembler produces binary that the CPU runs, you understand the complete stack from thought to electrons.
What an Assembler Does
An assembler reads a source file containing assembly language text and writes an output file containing machine code bytes. The translation is largely mechanical: each assembly instruction corresponds to a specific opcode byte or sequence of bytes defined by the CPU architecture.
The simplest assembler — a single-pass assembler — reads each line, looks up the mnemonic in a table, and emits the corresponding bytes. For MOV A, 42, it emits the opcode for “load immediate value into register A” followed by the byte 42.
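The table-lookup idea can be sketched in a few lines of Python. This is a minimal illustration rather than a complete assembler: it uses Z80 mnemonics (LD rather than MOV), the two tables cover only a handful of instructions, and the names FIXED_OPCODES, IMMEDIATE_OPCODES, normalize, and assemble_single_pass are hypothetical.

```python
# Hypothetical opcode tables covering only a few Z80 instructions.
# Instructions with no variable operand map directly to their bytes.
FIXED_OPCODES = {
    "NOP": [0x00],
    "INC HL": [0x23],
    "LD A,B": [0x78],
    "LD A,(HL)": [0x7E],
    "HALT": [0x76],
}

# Instructions that end in one 8-bit immediate operand: text prefix -> opcode.
IMMEDIATE_OPCODES = {
    "LD A,": 0x3E,
    "LD B,": 0x06,
}


def normalize(line):
    """Strip the comment, uppercase, and remove whitespace inside the operands."""
    code = line.split(";", 1)[0].strip().upper()
    mnemonic, _, operands = code.partition(" ")
    operands = "".join(operands.split())
    return f"{mnemonic} {operands}".strip()


def assemble_single_pass(source):
    """Translate each line independently. Labels are not supported."""
    output = bytearray()
    for line in source.splitlines():
        text = normalize(line)
        if not text:
            continue  # blank line or comment-only line
        if text in FIXED_OPCODES:
            output.extend(FIXED_OPCODES[text])
            continue
        for prefix, opcode in IMMEDIATE_OPCODES.items():
            if text.startswith(prefix):
                value = int(text[len(prefix):], 0)
                if not 0 <= value <= 0xFF:
                    raise ValueError(f"immediate out of range: {line!r}")
                output.extend([opcode, value])
                break
        else:
            raise ValueError(f"unknown instruction: {line!r}")
    return bytes(output)
```

Calling assemble_single_pass("LD A, 42") returns the two bytes 0x3E and 0x2A (42 in hexadecimal). Because each line is translated in isolation, this design cannot handle a reference to a label that has not been seen yet.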
A two-pass assembler handles forward references: situations where code refers to a label defined later in the file. On the first pass, it scans the entire source file and records the address of every label. On the second pass, it emits the actual machine code, now able to resolve any label reference to its known address.
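The two passes can be sketched as follows. The instruction table, the parse helper, and the function name assemble_two_pass are assumptions made for illustration; the only instruction taking an operand is the Z80 absolute jump JP, whose 16-bit address is stored low byte first.

```python
# Hypothetical instruction set: mnemonic -> (opcode, size in bytes).
INSTRUCTIONS = {
    "NOP": (0x00, 1),
    "HALT": (0x76, 1),
    "JP": (0xC3, 3),  # opcode followed by a 16-bit address, low byte first
}


def parse(line):
    """Return (label, mnemonic, operand); any of them may be None."""
    code = line.split(";", 1)[0].strip()
    label = None
    if ":" in code:
        label, code = code.split(":", 1)
        label = label.strip()
        code = code.strip()
    parts = code.split()
    mnemonic = parts[0].upper() if parts else None
    operand = parts[1] if len(parts) > 1 else None
    return label, mnemonic, operand


def assemble_two_pass(source, origin=0):
    lines = [parse(line) for line in source.splitlines()]

    # Pass 1: track the address of every instruction and record each label.
    symbols = {}
    address = origin
    for label, mnemonic, _ in lines:
        if label:
            if label in symbols:
                raise ValueError(f"duplicate label: {label}")
            symbols[label] = address
        if mnemonic:
            address += INSTRUCTIONS[mnemonic][1]

    # Pass 2: emit bytes, resolving label operands from the symbol table.
    output = bytearray()
    for _, mnemonic, operand in lines:
        if not mnemonic:
            continue
        opcode, size = INSTRUCTIONS[mnemonic]
        output.append(opcode)
        if size == 3:
            target = symbols[operand]
            output.extend([target & 0xFF, (target >> 8) & 0xFF])
    return bytes(output), symbols
```

Given the three lines "JP END", "NOP", and "END: HALT", the first pass records END at address 4, so the second pass can encode the jump as C3 04 00 even though END appears after the instruction that uses it.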
The output is typically a binary file containing raw machine code ready to be loaded into RAM and executed, or an object file that a linker will combine with other object files to produce a final executable.
Building a Simple Assembler
A minimal assembler needs:
1. A lexer (tokenizer): Reads the source text character by character and groups characters into tokens such as mnemonics, register names, numbers, labels, and comments. A token is the smallest meaningful unit. The line MOV A, 42 ; load value produces four tokens (MOV, A, a comma, and 42), with the comment stripped.
2. An opcode table: Maps mnemonic strings to their binary encodings. For a simple 8-bit CPU this might be a two-column table with 50-200 entries. Store it as a sorted array and use binary search for lookup, or as a hash table if you have implemented one.
3. A symbol table: Maps label names to their addresses. Populated during the first pass, consulted during the second pass when labels appear as operands.
4. An expression evaluator: Handles numeric operands that might be constants, labels, or simple arithmetic like BUFFER + 10. At minimum, evaluate decimal and hexadecimal literals and label references.
5. Code emission: Writes bytes to the output file. Keep a location counter tracking the current output address so you know where each instruction will land in memory.
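Two of the components above, the lexer and the expression evaluator, are small enough to sketch together. This version leans on Python's re module instead of reading character by character, which is a convenience a hand-built assembler on the target machine would not have. The token names and the functions tokenize and evaluate are hypothetical, and the evaluator supports only + and - between numbers and label names.

```python
import re

# Token patterns, tried in order. The kind names are illustrative only.
TOKEN_SPEC = [
    ("COMMENT", r";.*"),
    ("NUMBER", r"0[xX][0-9A-Fa-f]+|\d+"),
    ("NAME", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("PUNCT", r"[,:()+\-]"),
    ("SPACE", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC)
)


def tokenize(line):
    """Split one source line into (kind, text) pairs, dropping comments and spaces."""
    tokens = []
    position = 0
    while position < len(line):
        match = TOKEN_RE.match(line, position)
        if match is None:
            raise ValueError(f"unexpected character {line[position]!r}")
        if match.lastgroup not in ("COMMENT", "SPACE"):
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


def parse_number(text):
    """Accept decimal literals and 0x-prefixed hexadecimal literals."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text, 10)


def evaluate(tokens, symbols):
    """Evaluate terms joined by + or -, for example BUFFER + 10."""
    total = 0
    sign = 1
    expect_term = True
    for kind, text in tokens:
        if expect_term:
            if kind == "NUMBER":
                value = parse_number(text)
            elif kind == "NAME":
                value = symbols[text]
            else:
                raise ValueError(f"expected a number or label, got {text!r}")
            total += sign * value
            expect_term = False
        else:
            if text not in ("+", "-"):
                raise ValueError(f"expected + or -, got {text!r}")
            sign = 1 if text == "+" else -1
            expect_term = True
    if expect_term:
        raise ValueError("expression is empty or ends with an operator")
    return total
```

With a symbol table of {"BUFFER": 0x8000}, evaluate(tokenize("BUFFER + 10"), ...) yields 0x800A. Keeping the evaluator separate from the lexer means the same routine can serve instruction operands and directive arguments alike.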
A concrete example for the Z80 CPU: when the assembler sees LD A, (HL), it looks up LD with operands A and (HL) in its opcode table and finds the encoding 0x7E, a single byte. When it sees LD A, n where n is a literal value like 0x3F, it emits two bytes: 0x3E (the opcode for loading A with an immediate value) followed by 0x3F.
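One way to organize such a lookup is to key the table on the mnemonic together with an operand pattern, where the placeholder "n" stands for any 8-bit immediate. The sketch below is an assumption about layout, not a prescribed design; it covers four Z80 encodings (0x7E for LD A,(HL), 0x78 for LD A,B, 0x3E for LD A,n, and 0x06 for LD B,n), and the names OPCODES, classify, and encode are hypothetical.

```python
# Partial Z80 table keyed by (mnemonic, destination, source).
# The pattern "n" stands for an 8-bit immediate value.
OPCODES = {
    ("LD", "A", "(HL)"): 0x7E,
    ("LD", "A", "B"): 0x78,
    ("LD", "A", "n"): 0x3E,
    ("LD", "B", "n"): 0x06,
}

REGISTER_OPERANDS = {"A", "B", "C", "D", "E", "H", "L", "(HL)"}


def classify(operand):
    """Map an operand to its table pattern: a register name, or "n"."""
    text = operand.replace(" ", "").upper()
    return text if text in REGISTER_OPERANDS else "n"


def encode(mnemonic, destination, source):
    """Return the machine-code bytes for one two-operand instruction."""
    key = (mnemonic.upper(), classify(destination), classify(source))
    if key not in OPCODES:
        raise ValueError(f"no encoding for {mnemonic} {destination}, {source}")
    encoded = [OPCODES[key]]
    if key[2] == "n":
        value = int(source, 0)  # accepts decimal or 0x-prefixed hex
        if not 0 <= value <= 0xFF:
            raise ValueError(f"immediate out of range: {source}")
        encoded.append(value)
    return bytes(encoded)
```

Here encode("LD", "A", "(HL)") returns the single byte 0x7E, while encode("LD", "A", "0x3F") returns 0x3E followed by 0x3F.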
Handling Labels
Labels are the assembler’s most important contribution beyond simple opcode translation. A label marks a position in the code — the start of a routine, the target of a jump, the location of a data buffer.
In the source file, a label appears as a name followed by a colon at the start of a line:
LOOP:
    LD A, (HL)
    INC HL
    DJNZ LOOP
LOOP is a label. The instruction DJNZ LOOP means “decrement register B, jump to LOOP if not zero.” During the first pass, the assembler records LOOP = <current address>. During the second pass, when it processes DJNZ LOOP, it looks up LOOP in the symbol table, computes the relative offset to that address, and encodes that offset in the jump instruction.
For absolute jumps, the label’s address is encoded directly. For relative jumps (common on 8-bit processors to save space), the offset is the difference between the target address and the address of the instruction following the jump — a signed byte, typically, allowing jumps of -128 to +127 bytes.
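The offset arithmetic can be made concrete with a short sketch. It assumes the Z80 DJNZ instruction, which is two bytes long (opcode 0x10 plus a one-byte displacement); the function names relative_offset and encode_djnz are hypothetical.

```python
def relative_offset(instruction_address, target_address, instruction_size=2):
    """Return the displacement byte for a relative jump.

    The offset is measured from the address of the instruction that
    follows the jump, and must fit in a signed byte.
    """
    next_address = instruction_address + instruction_size
    offset = target_address - next_address
    if not -128 <= offset <= 127:
        raise ValueError(f"relative jump out of range: {offset}")
    return offset & 0xFF  # two's-complement encoding of the signed byte


def encode_djnz(instruction_address, target_address):
    """Encode Z80 DJNZ: opcode 0x10 followed by the displacement byte."""
    return bytes([0x10, relative_offset(instruction_address, target_address)])
```

For the loop shown earlier, suppose LOOP sits at 0x0100. LD A, (HL) and INC HL occupy one byte each, so DJNZ starts at 0x0102 and the next instruction at 0x0104. The offset is 0x0100 - 0x0104 = -4, which is stored as 0xFC, so encode_djnz(0x0102, 0x0100) gives the bytes 10 FC.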
Forward references, such as a jump to a label defined later in the file, cannot be resolved the moment they are read, which is why a simple assembler makes two passes. The first pass discovers all label addresses; the second pass fills in all label references.
Directives
Beyond instructions, assemblers recognize directives — commands to the assembler itself rather than to the CPU. Common directives:
ORG address - set the location counter to a specific address. Tells the assembler where this code will be loaded in memory. Every label address is computed from it, so a wrong ORG makes every absolute address wrong.
DB value, value, ... - define byte(s). Embeds literal bytes into the output. Used for data tables, strings, constants.
DW value, value, ... - define word(s). Embeds 16-bit values.
DS count - define storage. Reserves a block of uninitialized memory.
name EQU value - equate. Creates a symbolic constant without allocating memory. Lets you write SCREEN_WIDTH EQU 80 and then use SCREEN_WIDTH throughout the code instead of the magic number 80.
INCLUDE filename - include another source file. Allows splitting large programs across multiple files.
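A sketch of how an assembler might act on these directives follows. Several choices here are assumptions rather than fixed rules: lines arrive already parsed as (label, directive, arguments) tuples, DW stores the low byte first (the Z80 convention), DS fills reserved space with zeros because a flat binary file has to contain something, ORG appears only before any bytes are emitted, and INCLUDE is left out. The names to_int and run_directives are hypothetical.

```python
def to_int(text, symbols):
    """Resolve an argument that is either a known symbol or a numeric literal."""
    text = text.strip()
    if text in symbols:
        return symbols[text]
    return int(text, 0)  # decimal or 0x-prefixed hexadecimal


def run_directives(lines):
    """Process (label, directive, args) tuples; return (output bytes, symbols)."""
    address = 0
    output = bytearray()
    symbols = {}
    for label, directive, args in lines:
        if directive == "EQU":
            # A constant: defines a symbol but emits nothing.
            symbols[label] = to_int(args[0], symbols)
            continue
        if directive == "ORG":
            # Only moves the location counter; assumed to precede all output.
            address = to_int(args[0], symbols)
            continue
        if label:
            symbols[label] = address
        if directive == "DB":
            data = bytes(to_int(arg, symbols) & 0xFF for arg in args)
        elif directive == "DW":
            data = bytearray()
            for arg in args:
                value = to_int(arg, symbols)
                data.extend([value & 0xFF, (value >> 8) & 0xFF])
        elif directive == "DS":
            data = bytes(to_int(args[0], symbols))  # zero-filled block
        else:
            raise ValueError(f"unknown directive: {directive}")
        output.extend(data)
        address += len(data)
    return bytes(output), symbols
```

Note that EQU and ORG change only the assembler's internal state, while DB, DW, and DS both emit bytes and advance the location counter.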
Cross-Assemblers
If you have an existing computer available, you can write an assembler for your target CPU on that machine. This is called cross-assembly: assembling code for one architecture on a different architecture.
This approach was used throughout the history of embedded systems. Engineers at companies with mainframes wrote assemblers for the new 8-bit microprocessors on those mainframes, then loaded the output onto the new chip via paper tape or ROM programmer. For rebuilders who have access to a working computer of any kind, writing a cross-assembler is far easier than bootstrapping an assembler on the target machine itself.
A cross-assembler is just a regular program that happens to know the opcode table of a different CPU. Write it in any available language — C, Python, BASIC. Its input is assembly source text; its output is a binary file that can be transferred to the target machine.
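The outer shell of such a program is short. In the sketch below, assemble is a stand-in that understands only NOP and HALT (Z80 opcodes 0x00 and 0x76); a real cross-assembler would replace it with the full translation logic. The function names and the program name xasm in the usage message are hypothetical.

```python
import sys
from pathlib import Path


def assemble(source_text):
    """Placeholder translator: every non-blank line must be NOP or HALT."""
    table = {"NOP": 0x00, "HALT": 0x76}
    output = bytearray()
    for line in source_text.splitlines():
        word = line.split(";", 1)[0].strip().upper()
        if word:
            output.append(table[word])
    return bytes(output)


def main(argv):
    """Read assembly text from one file and write raw machine code to another."""
    if len(argv) != 3:
        print("usage: xasm SOURCE OUTPUT", file=sys.stderr)
        return 1
    source_path, output_path = argv[1], argv[2]
    machine_code = assemble(Path(source_path).read_text())
    Path(output_path).write_bytes(machine_code)
    return 0
```

Calling main(["xasm", "prog.asm", "prog.bin"]) reads prog.asm and leaves the raw bytes in prog.bin, ready to be moved to the target machine by whatever transfer method is available.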
Bootstrap Problem
The deepest challenge in assembler construction is the bootstrap problem: to write an assembler for a new machine, you need to program that machine, but programming without an assembler is painful. The traditional solution is to hand-assemble a tiny assembler — perhaps only a few hundred bytes — directly in machine code. This minimal assembler handles only the most basic instructions and directives. You use it to assemble a more capable assembler written in assembly language. That assembler can then assemble itself plus any other program you write.
This self-hosting capability — the assembler assembling itself — is both a practical milestone and a philosophical landmark. The tool has become capable enough to reproduce itself.
Practical Notes for Rebuilders
Start with the simplest possible assembler for your CPU: no macros, no conditional assembly, no fancy expressions. Get instruction encoding and label resolution working first. Expand features as you discover you need them.
Write the opcode table from the CPU’s data sheet, checking each entry against the reference as you transcribe it. Errors in the opcode table will produce subtly wrong code that is very hard to debug.
Keep the assembler source code itself in assembly language once the bootstrap is complete. An assembler written in its own target language is the most portable documentation of that CPU’s capabilities.
Test every instruction against known correct machine code. Encode a short test program by hand, then assemble the same program and compare. Any discrepancy means a bug in the opcode table or operand encoding logic.
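This comparison is easy to automate. The sketch below assumes the assembler is available as a function that takes source text and returns bytes; check_assembler, CASES, and toy_assemble are hypothetical names, and toy_assemble contains a deliberate table error so the check has something to find.

```python
def check_assembler(assemble, cases):
    """Return (source, expected hex, actual hex) for every case that fails."""
    failures = []
    for source, expected in cases:
        actual = assemble(source)
        if actual != expected:
            failures.append((source, expected.hex(), actual.hex()))
    return failures


# Reference bytes encoded by hand from the Z80 opcode table.
CASES = [
    ("LD A,(HL)", bytes([0x7E])),
    ("INC HL", bytes([0x23])),
    ("LD A,B", bytes([0x78])),
]


def toy_assemble(source):
    """Stand-in assembler with one deliberately wrong table entry."""
    table = {
        "LD A,(HL)": 0x7E,
        "INC HL": 0x23,
        "LD A,B": 0x79,  # deliberate bug: the correct opcode is 0x78
    }
    return bytes([table[source]])
```

Running check_assembler(toy_assemble, CASES) reports the single mismatch for LD A,B, showing the expected byte 78 against the produced byte 79. Growing the case list alongside the opcode table catches transcription errors at the point they are introduced.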
The assembler is not glamorous, but everything that follows depends on it working correctly. Invest the time to make it solid.