Compilers
Part of Programming Fundamentals
A compiler translates a high-level language program into machine code, enabling programmers to write fast, efficient software without thinking in binary.
Why This Matters
The assembler solved the problem of writing programs without memorizing opcode numbers. But assembly language still requires thinking about registers, memory addresses, and CPU-specific instructions. Building a large system in assembly is like constructing a building one brick at a time β possible, but slow and exhausting.
A compiler solves this problem at the next level. You write in a language that resembles mathematics or plain description β assign a variable, compute a formula, call a function by name β and the compiler handles the translation to efficient machine code. Programs that would take months in assembly take weeks in a compiled language.
For a rebuilding civilization, the compiler represents a major productivity threshold. Once you have a working compiler for even a simple language, the pace of software development increases dramatically. A team of programmers can build operating systems, databases, scientific calculation tools, and communication software in years rather than decades.
What a Compiler Does
A compiler is a program that takes source code written in a high-level language and produces machine code (or assembly language) that performs the same computation. Unlike an interpreter β which reads and executes source code one statement at a time β the compiler processes the entire program before execution, producing a standalone binary that runs without the compiler present.
The produced code runs at full hardware speed. An interpreter adds overhead on every operation because it is simultaneously reading the source, parsing it, and dispatching to the right operation. Compiled code has none of this overhead: the translation happened once at compile time, and the resulting binary executes directly.
This speed advantage matters for anything time-sensitive: control systems, sensor processing, real-time communication, numerical simulation.
The Compilation Pipeline
Compilation happens in stages, each transforming the program representation into something closer to machine code.
Lexical analysis (lexing): The compiler reads the source text and breaks it into tokens β the smallest meaningful units. Keywords (IF, WHILE, RETURN), identifiers (variable and function names), literals (numbers and strings), and operators (+, =, <) are all tokens. Comments and whitespace are discarded. The output is a stream of tokens.
Parsing (syntax analysis): The token stream is organized into a parse tree (or abstract syntax tree, AST) according to the grammar of the language. The grammar defines what constitutes a valid expression, statement, and program. For example, an assignment statement is: identifier = expression ;. The parser checks that the token sequence follows these rules and builds a tree where the structure reflects the logical nesting of the program.
If the source violates the grammar β mismatched parentheses, missing semicolons, undefined syntax β the parser reports a syntax error with location information.
Semantic analysis: The parse tree is checked for meaning, not just structure. Are all variables declared before use? Are function call arguments the correct number and type? Does the condition in an IF statement produce a boolean value? These checks catch programming errors that are syntactically valid but meaningless.
Intermediate code generation: The verified parse tree is converted to an intermediate representation (IR) β a simple, lower-level form that is easier to analyze and optimize than the source language but not yet tied to a specific CPU. Common IR forms include three-address code (every operation has at most three operands: a = b + c) and static single assignment (SSA) form.
Optimization: The intermediate code is analyzed and transformed to be more efficient. Dead code (code that can never execute) is removed. Constant expressions are evaluated at compile time. Loops are analyzed for opportunities to move invariant calculations outside the loop. These transformations preserve the programβs behavior while improving speed or reducing code size.
Code generation: The intermediate code is translated to the target CPUβs instruction set. Each IR operation maps to one or more machine instructions. The code generator assigns variables to registers (register allocation) and determines the exact instruction sequence.
Assembly or binary output: The final stage emits either assembly language (which is then processed by an assembler) or raw machine code directly.
Building a Simple Compiler
For a rebuilding civilization, the most practical starting point is a compiler for a subset of C or a simple expression language targeting your available CPU.
Start with expressions only: handle numeric literals, variables, the four arithmetic operations, and parentheses. Write a recursive descent parser β a family of functions where each function handles one grammar construct and calls other functions for sub-constructs.
For an expression grammar:
expression = term (('+' | '-') term)*
term = factor (('*' | '/') factor)*
factor = NUMBER | IDENTIFIER | '(' expression ')'
The parser for expression calls term(), then checks for + or -, calls term() again if found, and generates a machine code sequence that adds or subtracts the results.
This recursive descent approach produces compact, readable compiler code and extends naturally to handle statements, functions, and more complex constructs. Early C compilers on minicomputers were written this way and ran in 64 KB of memory.
Compiler vs. Interpreter Trade-offs
Interpreters are simpler to write and provide better error reporting because the source code is present during execution. BASIC interpreters on early microcomputers took a few kilobytes of code. An interpreter is the right starting point for a new computing environment.
Compilers produce faster programs and are better for deployed software that will be run many times. Once the interpreter is working and the community has programming experience, building a compiler for the most-used language is a worthwhile investment.
A practical progression for a rebuilding civilization:
- Hand-assemble a minimal assembler
- Use the assembler to write a more capable assembler in assembly language
- Write a BASIC interpreter in assembly language
- Use BASIC to prototype programs and establish programming culture
- Implement a simple compiled language (a subset of C is ideal) once resources allow
- Self-host the compiler: compile the compiler with itself
Bootstrapping
The bootstrap problem for compilers is the same as for assemblers: to compile a new language, you need a compiler, but you need the language to write the compiler. The traditional solution is to write the first version of the compiler in assembly language or an existing language. Once it works well enough to compile a program, you rewrite it in the new language, compile the new version with the old compiler, and use the result to compile itself.
This bootstrap process was followed by nearly every significant programming language. It is a meaningful milestone: a compiler written in its own language, compiling itself correctly.
Practical Notes for Rebuilders
A compiler is a large project β expect months of concentrated work by a skilled programmer. Do not attempt it before you have a working assembler, a functional higher-level language (like BASIC), and a reasonably stable hardware platform.
The most important compiler optimization for resource-constrained systems is register allocation: keeping frequently used variables in registers rather than loading and storing them from memory on every access. Even a simple register allocator dramatically improves generated code quality.
Document your intermediate representation thoroughly. The IR is the heart of the compiler, and any future optimization work or retargeting to a new CPU proceeds from there.
Consider targeting assembly language output rather than binary. This lets you inspect and verify the generated code, insert hand-optimized assembly for critical routines, and reuse your existing assembler for the final translation step.