
Classical compiler construction

Modern compilers are generally "multi-pass", meaning they work in phases: each phase takes in some representation of the program and produces a new one. A common architecture looks like this:

	Read file         byte streams
	VVV
	Lexing            tokens
	VVV
	Parsing           AST
	VVV
	Output            executable or interpret

Lexer

A lexer is the phase of compilation that converts raw text into a stream of tokens (basically the "words" of a programming language), which later phases can turn into something with more semantic meaning. A simple C-like lexer boils down to: skip non-token characters (generally whitespace), classify a token based on its first character, then follow the rule for that class until it no longer applies, and there you have one token. Keep running through the entire text stream and you've got a tokenized file. Due to the nature of C-like lexers, whitespace is "insignificant": it only acts as a separator between tokens, meaning that a>=b is identical to a >= b, and a new token can start as soon as the last one ends. An example of lexing for C would look like this:

// Input source file
bool foo(int a) {
    return a > 0;
}

// Token stream
'bool' 'foo' '(' 'int' 'a' ')' '{' 'return' 'a' '>' '0' ';' '}'

These separate tokens are generally classified into different types. In a C-like language you might have:

  • Identifiers - these start with a letter or underscore and can then be followed with letters, numbers and underscores. Usually these are used for naming variables within the language. Some examples are apple60, _abc, normal_stuff while 16u is not valid because it starts with a number.
  • Literals - these are constant values such as strings or numerical values. Some examples are 6.0, 24, "Hello, World!\n".
  • Keywords - these are special identifiers that the compiler may use for special behavior like builtin operations. In C some example keywords are int, void, char and none of these can be used in place of normal identifiers as they are reserved for a specific purpose.
  • Punctuators - these are the symbols sprinkled throughout the source, for example: ?, (, ), %, +=

A good place to start with lexers is Bisqwit's series on writing a compiler:

Here's an excerpt about the lexer:

Another good resource which is more focused on broad compiler architecture is:

Parser

Parsing is generally the job of converting tokens into a structured format, usually called an AST (abstract syntax tree). Abstract syntax trees are considered "abstract" because of how they represent certain higher-level constructs from the raw tokens; one example is opening and closing braces.

{
    foo();
    bar();
}

might be representable in a tree as:

(compound (call foo)
          (call bar))

as you can see in the diagram, there's no node representing the closing brace; instead that information is used to know when the compound statement is complete. It acts as a terminator, but unlike a terminator in a token stream, which only splits tokens in a flat sense, parsing can produce nested structures such as:

{                   (compound
    a = 5;              (assign a 5)
    b = 3;              (assign b 3)
    {                   (compound
        a += b;             (assign a (add a b))))
    }
}

A simple recursive descent parser can be written as a set of functions: each reads a token and does some work based on it, which may include calling more functions that also read tokens and do work, with each function generally responsible for creating a specific type of node based on the grammar.

struct Node* parse_statement(void) {
	// in this example, try_eat means: if the next token is '{' it advances
	// to the next token and returns true, otherwise it stays in the same
	// place and returns false.
	if (try_eat('{')) {
		// create a new compound statement and eat up more statements
		struct Node* n = new_node(NODE_STMT);
		
		// keep going until we see the closing '}' (ideally we'd also
		// check for end of file and throw an error accordingly)
		while (!peek('}')) {
			add_kid(n, parse_statement());
		}
		try_eat('}'); // consume the closing brace
		
		return n;
	} else if (try_eat(TOKEN_VAR)) {
		...
	} else {
		...
	}
}

TODO

Type systems

Before we continue with the next phase of compilation, we should go over what types and a type system are. In its most abstract form, a type system simply describes which operations can be applied to which values and what those operations mean. When you write a + b in C, the operation may mean different things, or may not be allowed at all, depending on the types of a and b. In C this is knowable at compile time, which is known as static typing, while languages like Python, JS or Lua sit in the dynamically typed category, as types are only fully known at the time of execution of any specific operation. One operation on types you'll see quite a bit is the "type cast" or type conversion: changing a value from one type to another. Languages usually place rules on which casts are allowed and in which cases, if at all. Some common type casts include:

  • Down casting - this converts a type into a narrower definition. Depending on the language this might be an unsafe operation or just lead to dynamic type errors (since we can't, in a vacuum, prove the operation safe statically). This encompasses things like converting a base class in Java to its derived class, or truncating an int in C to a char.
  • Up casting - this converts a value to the same value represented in a "broader" type, and because of that we know it's safe: the type we cast to must be able to represent a superset of the original type. Usually this is converting an integer to a bigger integer type, or a derived class to a base class.

Semantics / Type checking

(TODO: caveat about how unidirectional type checking isn't the only way) This stage boils down to walking the parse tree/AST and figuring out how all the expressions fit together. It's not terribly complicated to get started here: it's mostly about writing a visitor, or some switch statement, and recursively walking the tree to grab types from nodes, which in turn derive their types from their own children, and so on.

struct Type* type_check_expr(struct Node* n) {
	switch (n->tag) {
		case NODE_INT: {
			// integer literals are trivially typed
			return TYPE_INT;
		}
		
		case NODE_ADD:
		case NODE_SUB:
		case NODE_MUL:
		case NODE_DIV: {
			// binary arithmetic: type-check both operands,
			// then combine their types into the result type
			struct Type* l = type_check_expr(n->operands[0]);
			struct Type* r = type_check_expr(n->operands[1]);
			
			return type_promotion(l, r);
		}
		
		default: assert(0);
	}
}

Terms we might wanna throw into the acronym lister: AST, CST, DAG, DFS, CFG, AOT, SSA, JIT