# Compilers: How To Make a Programming Language
# Classical compiler construction
Other compiler-related topics out of scope:
Modern compilers are generally "multi-pass", meaning that they go through phases, each of which takes in some representation of the program and generates a new one. A common architecture for this would be:
- What are programs anyway (Forth, LISP)
- Computation: Turing machines, lambda calculus
Read file byte streams
- Compiler optimization
- GC (motivated by LISP)
Lexing tokens
Parsing AST
Output executable or interpret
## Lexer
- stay motivated; stick close to a C-like, imperative, procedural model because that's what people are used to
motivation / goals / what questions are we trying to answer:
A lexer is a phase of compilation which converts raw text into a stream of tokens (basically words for programming languages); from there we can convert that stream into something with more semantic meaning. A simple C-like lexer boils down to three steps: skip non-token characters (generally whitespace), classify the token based on its first character, and follow the rule for that class until it no longer applies. At that point you have one token. Keep running through the entire text stream and you've got a tokenized file. In C-like lexers whitespace is "insignificant": it only acts as a separator between tokens, so `a>=b` is identical to `a >= b`, and a token can start as soon as the previous one ends. An example of lexing for C would look like this:
- I want to be able to make a toy procedural language.
- like C, Algol, JS, Lua, etc.
// Input source file
- possible user motivations:
bool foo(int a) {
- I want to make a game scripting language
return a > 0;
}
- I want to make a DSL for my job (and I want syntax highlighting!)
- I want to do simple static analysis of my projects
ok what do we want to cover
// Token stream
- classical compiler structure (lexer -> parser -> codegen)
'bool' 'foo' '(' 'int' 'a' ')' '{' 'return' 'a' '>' '0' ';' '}'
These separate tokens are generally classified into different types. In a C-like language you might have:
* Identifiers - these start with a letter or underscore and can then be followed by letters, digits and underscores. Usually these are used for naming variables within the language. Some examples are `apple60`, `_abc`, `normal_stuff`, while `16u` is not valid because it starts with a digit.
* Literals - these are constant values such as strings or numerical values. Some examples are `6.0`, `24`, `"Hello, World!\n"`.
* Keywords - these are special identifiers that the compiler may use for special behavior like builtin operations. In C some example keywords are `int`, `void`, `char`, and none of these can be used in place of normal identifiers because they are reserved for a specific purpose.
* Punctuators - these are the operator and delimiter symbols you see sprinkled through the code, for example: `?`, `(`, `)`, `%`, `+=`
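The skip, classify, and consume loop described above can be sketched in C as follows. This is a minimal sketch rather than a complete C lexer: `struct Token`, `next_token`, the `TOK_*` names, the keyword table, and the list of two-character punctuators are all assumptions made for illustration, and number handling is deliberately loose (it would accept `1.2.3`).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

// Hypothetical token classes, mirroring the list above.
enum TokenKind { TOK_EOF, TOK_IDENT, TOK_KEYWORD, TOK_NUMBER, TOK_STRING, TOK_PUNCT };

struct Token {
    enum TokenKind kind;
    const char* start; // points into the source text, not a copy
    size_t length;
};

// A tiny, deliberately incomplete keyword table (assumption).
static const char* const keywords[] = { "bool", "int", "void", "char", "return" };

static int is_keyword(const char* s, size_t n) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strlen(keywords[i]) == n && memcmp(keywords[i], s, n) == 0) return 1;
    }
    return 0;
}

// Reads one token starting at *cursor and advances *cursor past it.
struct Token next_token(const char** cursor) {
    const char* p = *cursor;

    // 1. Skip non-token characters.
    while (isspace((unsigned char)*p)) p++;

    struct Token t = { TOK_EOF, p, 0 };
    if (*p == '\0') { *cursor = p; return t; }

    // 2. Classify by the first character, 3. consume while the rule applies.
    if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_') p++;
        t.kind = is_keyword(t.start, (size_t)(p - t.start)) ? TOK_KEYWORD : TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p) || *p == '.') p++;
        t.kind = TOK_NUMBER;
    } else if (*p == '"') {
        p++;
        while (*p != '\0' && *p != '"') {
            if (*p == '\\' && p[1] != '\0') p++; // skip the escaped character
            p++;
        }
        if (*p == '"') p++;
        t.kind = TOK_STRING;
    } else {
        // Punctuator: prefer a known two-character operator over a single character.
        static const char* const two_char[] = { ">=", "<=", "==", "!=", "+=", "-=" };
        size_t length = 1;
        for (size_t i = 0; i < sizeof two_char / sizeof two_char[0]; i++) {
            if (p[0] == two_char[i][0] && p[1] == two_char[i][1]) length = 2;
        }
        p += length;
        t.kind = TOK_PUNCT;
    }

    t.length = (size_t)(p - t.start);
    *cursor = p;
    return t;
}
```

To tokenize a whole file, point a `const char*` at the start of the text and call `next_token(&p)` in a loop until it returns `TOK_EOF`. Because each token stores a pointer and a length instead of a copy, the source text has to outlive the tokens.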
## Parser
- modern compiler structures (IR / SSA, optimization)
Parsing is generally the job of converting tokens into a structured format, usually called an AST (abstract syntax tree). Abstract syntax trees are considered abstract because they represent higher-level constructs rather than the raw tokens; one example is opening and closing braces.
For example, the block `{ foo(); bar(); }` might be representable in a tree as:
(compound (call foo)
(call bar))
As you can see in the diagram, there's no node to represent the closing brace; instead that information is used to know when the compound statement is complete. It's a terminator, but unlike termination in a token stream, where it only splits tokens in a flat sense, parsing can produce nested structures such as:
{ (compound
a = 5; (assign a 5)
b = 3; (assign b 3)
{ (compound
a += b; (assign a (add a b))))
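One way such a tree could be held in memory is sketched below. The names `struct Node` and `new_node` are borrowed from the parser fragment in this document; the node kinds, the fixed-size child array, and the `make_*` and `to_sexpr` helpers are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical node kinds covering the trees shown above.
enum NodeKind { NODE_COMPOUND, NODE_ASSIGN, NODE_ADD, NODE_NAME, NODE_INT };

struct Node {
    enum NodeKind kind;
    const char* name;     // used by NODE_NAME
    int value;            // used by NODE_INT
    struct Node* kids[8]; // fixed capacity keeps the sketch short
    int kid_count;
};

struct Node* new_node(enum NodeKind kind) {
    struct Node* n = calloc(1, sizeof *n);
    if (!n) abort();
    n->kind = kind;
    return n;
}

// Appends kid to parent and returns parent so calls can be chained.
struct Node* add_kid(struct Node* parent, struct Node* kid) {
    if (parent->kid_count >= 8) abort();
    parent->kids[parent->kid_count++] = kid;
    return parent;
}

struct Node* make_name(const char* s) {
    struct Node* n = new_node(NODE_NAME);
    n->name = s;
    return n;
}

struct Node* make_int(int v) {
    struct Node* n = new_node(NODE_INT);
    n->value = v;
    return n;
}

struct Node* make_binary(enum NodeKind kind, struct Node* left, struct Node* right) {
    return add_kid(add_kid(new_node(kind), left), right);
}

// Appends s to the NUL-terminated buffer out without exceeding cap bytes.
static void emit(char* out, size_t cap, const char* s) {
    size_t used = strlen(out);
    if (used + 1 < cap) strncat(out, s, cap - used - 1);
}

// Appends the S-expression form of n to out, e.g. "(assign a 5)".
void to_sexpr(const struct Node* n, char* out, size_t cap) {
    static const char* const labels[] = { "compound", "assign", "add" };
    char number[32];
    switch (n->kind) {
    case NODE_NAME:
        emit(out, cap, n->name);
        return;
    case NODE_INT:
        snprintf(number, sizeof number, "%d", n->value);
        emit(out, cap, number);
        return;
    default: // interior node: "(label kid kid ...)"
        emit(out, cap, "(");
        emit(out, cap, labels[n->kind]);
        for (int i = 0; i < n->kid_count; i++) {
            emit(out, cap, " ");
            to_sexpr(n->kids[i], out, cap);
        }
        emit(out, cap, ")");
    }
}
```

Note that no node kind exists for braces or semicolons: the nesting of `kids` carries that information. Building the nested example by hand with `make_binary` and `add_kid`, then calling `to_sexpr` on an empty buffer, reproduces the parenthesized form on a single line.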
A simple recursive descent parser can be written as a set of functions, each of which reads a token and does some work based on it. That work may include calling more functions, which also read tokens and do work. Each function is *generally* responsible for creating a specific type of node based on the grammar.
struct Node* parse_statement(void) {
// in this example, try_eat means if it sees a '{' it'll go to the next token
// and return true, otherwise it stays in the same place and returns false.
if (try_eat('{')) {
// create a new compound statement and eat up more statements
struct Node* n = new_node(NODE_STMT);
- interpreters vs. JITs vs. AOTs vs. "transpilers"
return n;
} else if (try_eat(TOKEN_VAR)) {
} else {
- executables and linkers
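To make the recursive descent idea concrete, here is a small self-contained sketch. The names `try_eat`, `new_node`, and `parse_statement` follow the fragment above; everything else is an assumption made for illustration. The grammar is a hypothetical subset (blocks, `name = expr;`, and `+`), names are single letters, and `try_eat` works directly on characters so that no separate lexer is needed.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

enum NodeKind { NODE_COMPOUND, NODE_ASSIGN, NODE_ADD, NODE_NAME, NODE_INT };

struct Node {
    enum NodeKind kind;
    char name;            // NODE_NAME: single-letter names keep the sketch small
    int value;            // NODE_INT
    struct Node* kids[8]; // fixed capacity (assumption)
    int kid_count;
};

static const char* cursor; // current position in the source text

static void syntax_error(const char* message) {
    fprintf(stderr, "syntax error: %s\n", message);
    exit(1); // a real parser would recover or report a location
}

static struct Node* new_node(enum NodeKind kind) {
    struct Node* n = calloc(1, sizeof *n);
    if (!n) abort();
    n->kind = kind;
    return n;
}

static void add_kid(struct Node* parent, struct Node* kid) {
    if (parent->kid_count >= 8) syntax_error("too many children");
    parent->kids[parent->kid_count++] = kid;
}

// If the next non-space character is c, consume it and return 1;
// otherwise leave it in place and return 0.
static int try_eat(char c) {
    while (isspace((unsigned char)*cursor)) cursor++;
    if (*cursor != c) return 0;
    cursor++;
    return 1;
}

static void expect(char c) {
    if (!try_eat(c)) syntax_error("unexpected character");
}

// primary := digit+ | letter
static struct Node* parse_primary(void) {
    struct Node* n = NULL;
    while (isspace((unsigned char)*cursor)) cursor++;
    if (isdigit((unsigned char)*cursor)) {
        n = new_node(NODE_INT);
        while (isdigit((unsigned char)*cursor)) n->value = n->value * 10 + (*cursor++ - '0');
    } else if (isalpha((unsigned char)*cursor)) {
        n = new_node(NODE_NAME);
        n->name = *cursor++;
    } else {
        syntax_error("expected a name or a number");
    }
    return n;
}

// expr := primary ('+' primary)*
static struct Node* parse_expr(void) {
    struct Node* left = parse_primary();
    while (try_eat('+')) {
        struct Node* add = new_node(NODE_ADD);
        add_kid(add, left);
        add_kid(add, parse_primary());
        left = add; // left-associative: (a + b) + c
    }
    return left;
}

// statement := '{' statement* '}' | name '=' expr ';'
struct Node* parse_statement(void) {
    if (try_eat('{')) {
        // The closing brace ends the loop; it never becomes a node.
        struct Node* block = new_node(NODE_COMPOUND);
        while (!try_eat('}')) add_kid(block, parse_statement());
        return block;
    }
    struct Node* assign = new_node(NODE_ASSIGN);
    struct Node* target = parse_primary();
    if (target->kind != NODE_NAME) syntax_error("can only assign to a name");
    add_kid(assign, target);
    expect('=');
    add_kid(assign, parse_expr());
    expect(';');
    return assign;
}

struct Node* parse(const char* source) {
    cursor = source;
    return parse_statement();
}
```

Calling `parse("{ a = 5; b = 3; { a = a + b; } }")` returns a `NODE_COMPOUND` with three children, the last of which is itself a `NODE_COMPOUND`. Each grammar rule maps to one function, and nesting in the input becomes recursion between `parse_statement` calls. Compound assignment such as `+=` is left out of this sketch.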
## Type systems
## Semantics / Type checking
- Simple expression interpreter (parse and evaluate)
(TODO caveat about how unidirectional type checking isn't the only way) This stage boils down to walking the parse tree/AST and figuring out how all the expressions fit with each other. Getting started here is not terribly complicated: it is mostly a matter of writing a visitor or a switch statement and recursively walking the tree to grab types from nodes, which may in turn derive their types from other nodes, and so on.
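A minimal sketch of such a walk is shown here. The names `type_check_expr`, `NODE_INT`, and `TYPE_INT` follow this document's own fragment; the other node tags, the `lhs`/`rhs` fields, the promotion rule, and the use of `NULL` to signal a type error are assumptions made for illustration.

```c
#include <stddef.h>

// Hypothetical expression node: a tag plus up to two operands.
enum NodeTag { NODE_INT, NODE_FLOAT, NODE_ADD, NODE_LESS };

struct Node {
    enum NodeTag tag;
    struct Node* lhs;
    struct Node* rhs;
};

struct Type { const char* name; };

// One shared object per builtin type, so types can be compared by pointer.
static struct Type type_int = { "int" };
static struct Type type_float = { "float" };
static struct Type type_bool = { "bool" };
#define TYPE_INT (&type_int)
#define TYPE_FLOAT (&type_float)
#define TYPE_BOOL (&type_bool)

// Returns the type of the expression, or NULL if its operands do not fit together.
struct Type* type_check_expr(struct Node* n) {
    switch (n->tag) {
    case NODE_INT:
        return TYPE_INT;
    case NODE_FLOAT:
        return TYPE_FLOAT;
    case NODE_ADD: {
        // The type of an addition is derived from the types of its operands.
        struct Type* l = type_check_expr(n->lhs);
        struct Type* r = type_check_expr(n->rhs);
        if (!l || !r) return NULL;                           // error in a sub-expression
        if (l == TYPE_BOOL || r == TYPE_BOOL) return NULL;   // no arithmetic on bools here
        return (l == TYPE_FLOAT || r == TYPE_FLOAT) ? TYPE_FLOAT : TYPE_INT;
    }
    case NODE_LESS: {
        struct Type* l = type_check_expr(n->lhs);
        struct Type* r = type_check_expr(n->rhs);
        if (!l || !r) return NULL;
        if (l == TYPE_BOOL || r == TYPE_BOOL) return NULL;
        return TYPE_BOOL; // comparing two numbers yields a bool
    }
    }
    return NULL; // unknown tag
}
```

In this sketch an `int` added to a `float` is typed as `float`, loosely following C's promotion rules, and a comparison result cannot be used as an arithmetic operand. A real checker would report where the mismatch happened instead of returning a bare `NULL`.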
- Classical compiler construction (lex -> parse -> output), semantic analysis / type checking
struct Type* type_check_expr(struct Node* n) {
switch (n->tag) {
case NODE_INT: {
return TYPE_INT;
