Update 'selected/compilers.md'

This commit is contained in:
RealNeGate 2022-07-13 23:14:48 +00:00
parent b237334f52
commit 7400ab92da
1 changed files with 111 additions and 176 deletions

@ -1,196 +1,131 @@
# Compilers: How To Make a Programming Language
# Classical compiler construction
Other compiler-related topics out of scope:
- What are programs anyway (Forth, LISP)
- Computation: Turing machines, lambda calculus
- Compiler optimization
- GC (motivated by LISP)
Modern compilers are generally "multi-pass", meaning that they go through phases, each of which takes in some representation of the program and generates a new one. A common architecture looks like this:
```
Read file byte streams
VVV
Lexing tokens
VVV
Parsing AST
VVV
Output executable or interpret
```
aesthetics:
- stay motivated; stick close to a c-like, imperative, procedural model because that's what people are used to
## Lexer
motivation / goals / what questions are we trying to answer:
- I want to be able to make a toy procedural language.
- like C, Algol, JS, Lua, etc.
- possible user motivations:
- I want to make a game scripting language
- I want to make a DSL for my job (and I want syntax highlighting!)
- I want to do simple static analysis of my projects
A lexer is the phase of compilation that converts raw text into a stream of tokens (basically the words of a programming language); later phases convert those tokens into something with more semantic meaning. A simple C-like lexer boils down to three steps: skip non-token characters (generally whitespace), classify the token based on its first character, and keep consuming characters while that token's rule still applies. That yields one token; repeat over the entire text stream and you have a tokenized file. In C-like lexers whitespace is "insignificant": it only acts as a separator between tokens, so `a>=b` is identical to `a >= b`, and a token can start as soon as the previous one ends. An example of lexing for C would look like this:
```c
// Input source file
bool foo(int a) {
    return a > 0;
}
ok what do we want to cover
- classical compiler structure (lexer -> parser -> codegen)
// Token stream
'bool' 'foo' '(' 'int' 'a' ')' '{' 'return' 'a' '>' '0' ';' '}'
```
These separate tokens are generally classified into different types. In a C-like language you might have:
* Identifiers - these start with a letter or underscore and can then be followed by letters, digits and underscores. Usually these are used for naming things such as variables within the language. Some examples are `apple60`, `_abc`, `normal_stuff`, while `16u` is not valid because it starts with a digit.
* Literals - these are constant values such as strings or numerical values. Some examples are `6.0`, `24`, `"Hello, World!\n"`.
* Keywords - these are special identifiers that the compiler reserves for special behavior like builtin operations. In C some example keywords are `int`, `void`, `char`, and none of these can be used in place of normal identifiers as they are reserved for a specific purpose.
* Punctuators - these are the symbols and operators you see sprinkled through the code, for example: `?`, `(`, `)`, `%`, `+=`
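
The skip/classify/consume loop described above can be sketched in a few lines of C. This is only a minimal illustration, not a full C lexer: the names `TokenKind`, `Token` and `next_token` are made up for this example, literals are limited to plain integers, and keywords would be handled afterwards by comparing identifiers against a table.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

// Hypothetical token categories for this sketch.
enum TokenKind { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_PUNCT };

struct Token {
    enum TokenKind kind;
    const char*    start;   // first character of the token
    size_t         length;  // number of characters in the token
};

// Reads one token starting at *cursor and advances *cursor past it.
struct Token next_token(const char** cursor) {
    const char* p = *cursor;

    // 1. skip non-token characters
    while (isspace((unsigned char)*p)) p++;

    struct Token t = { TOK_EOF, p, 0 };

    // 2. classify on the first character, 3. consume while the rule applies
    if (*p == '\0') {
        // end of input: leave the kind as TOK_EOF with length 0
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        t.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_') p++;
    } else if (isdigit((unsigned char)*p)) {
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p)) p++;
    } else {
        t.kind = TOK_PUNCT;
        p++;
        // two-character punctuators such as >= or +=
        if (*p == '=' && strchr("<>=!+-*/%", p[-1])) p++;
    }

    t.length = (size_t)(p - t.start);
    *cursor = p;
    return t;
}
```

Calling `next_token` repeatedly on `"a>=b"` and on `"a >= b"` produces the same three tokens, which is the "insignificant whitespace" property mentioned earlier.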
|Star| Title | Page |
|----|---------------------------------------|------|
| | Phases of a Compiler | https://www.geeksforgeeks.org/phases-of-a-compiler/ |
|* | Compiler Architecture | https://cs.lmu.edu/~ray/notes/compilerarchitecture/ |
| | Compiler Design | https://www.tutorialspoint.com/compiler_design/compiler_design_phases_of_compiler.htm |
| | Structure of a Compiler | https://www.csd.uwo.ca/~mmorenom/CS447/Lectures/Introduction.html/node10.html |
| | The Structure of a Compiler | https://www.brainkart.com/article/The-Structure-of-a-Compiler_8121/ |
| | Wikipedia: Compiler | https://en.wikipedia.org/wiki/Compiler |
| | The Structure of a Compiler (slides) | https://pages.cs.wisc.edu/~fischer/cs536.s08/lectures/Lecture04.4up.pdf |
| | Wikipedia: Code Generation | https://en.wikipedia.org/wiki/Code_generation_(compiler) |
| | Intro to Code Generation | https://cs.lmu.edu/~ray/notes/codegen/ |
| | Code Generation | https://www.tutorialspoint.com/compiler_design/compiler_design_code_generation.htm |
|*? | Phases of a Compiler | https://www.guru99.com/compiler-design-phases-of-compiler.html |
|*? | Writing a C Compiler (Pt. 1) | https://norasandler.com/2017/11/29/Write-a-Compiler.html |
| | V: Phases of a Compiler | https://www.youtube.com/watch?v=jE7f3sGLGVk |
|* | V: Different Phases of Comp | https://www.youtube.com/watch?v=TApMNhQPaCM |
|* | V: Parser and Lexer (Pt. 1) | https://www.youtube.com/watch?v=eF9qWbuQLuw |
A good place to start with lexers is Bisqwit's series on writing a compiler:
- https://www.youtube.com/watch?v=eF9qWbuQLuw
- semantic analysis / type checking
Here's an excerpt about the lexer:
- https://www.youtube.com/embed/eF9qWbuQLuw?start=166&end=299
|Star| Title | Page |
|----|-------------------------------|------|
| | Wikipedia | https://en.wikipedia.org/wiki/Compiler#Front_end |
| | Compiler Design - SA | https://www.tutorialspoint.com/compiler_design/compiler_design_semantic_analysis.htm |
|* | What is Semantic Analysis? | https://home.adelphi.edu/~siegfried/cs372/372l8.pdf |
|*? | SA in Compiler Design | https://iq.opengenus.org/semantic-analysis-in-compiler-design/ |
|*? | Implementation of SA | https://pgrandinetti.github.io/compilers/page/implementation-semantic-analysis/ |
|*? | What is SA in a Compiler? | https://pgrandinetti.github.io/compilers/page/what-is-semantic-analysis-in-compilers/ |
| | SA (Slides) | https://www.computing.dcu.ie/~davids/courses/CA4003/CA4003_Semantic_Analysis_2p.pdf |
| | V: The Semantic Analysis! | https://www.youtube.com/watch?v=j172YWmBk5A |
| | V: Intro to Semantic Analysis | https://www.youtube.com/watch?v=cC8YRnDGMwI |
| | V: Compiler Design SA | https://www.youtube.com/watch?v=57U6pQRnSJA |
| | V: Semantic Analysis: Intro | https://www.youtube.com/watch?v=7pHmBEkeIdQ |
Another good resource which is more focused on broad compiler architecture is:
- https://cs.lmu.edu/~ray/notes/compilerarchitecture/
|Star| Title | Page |
|----|----------------------------------|------|
| | Type Checking in Compiler Design | https://www.geeksforgeeks.org/type-checking-in-compiler-design/ |
| | Type Checking (Slides) | https://www.slideshare.net/dipongkersen81/type-checkingcompilier-design |
| | What is Static Type Checking? | https://www.tutorialspoint.com/what-is-static-type-checking |
| | What is Dynamic Type Checking? | https://www.tutorialspoint.com/what-is-dynamic-type-checking |
| | Type Systems | https://www.csd.uwo.ca/~mmorenom/CS447/Lectures/TypeChecking.html/node1.html |
| | V: Type Checking | https://www.youtube.com/watch?v=-TQVAKby6oI |
## Parser
- modern compiler structures (IR / SSA, optimization)
Parsing is generally the job of converting tokens into a structured format, usually called an AST (abstract syntax tree). Abstract syntax trees are considered abstract because they represent the higher-level constructs rather than every raw token; opening and closing braces are one example. This block:
```
{
foo();
bar();
}
```
might be representable in a tree as:
```
(compound (call foo)
(call bar))
```
As you can see in the diagram, there's no node to represent the closing brace; instead that information is used to know when the compound statement is complete. It's a terminator, but unlike termination in a token stream, where it splits tokens in a flat sense, parsing can produce nested structures such as:
```
{                  (compound
    a = 5;           (assign a 5)
    b = 3;           (assign b 3)
    {                (compound
        a += b;        (assign a (add a b))))
    }
}
```
A simple recursive descent parser can be written as just a function which reads a token and does some work based on it. That work may include calling more functions which also read tokens and do work, with each function *generally* being responsible for creating a specific type of node based on the grammar.
```c
struct Node* parse_statement(void) {
    // in this example, try_eat means if it sees a '{' it'll go to the next token
    // and return true, otherwise it stays in the same place and returns false.
    if (try_eat('{')) {
        // create a new compound statement and eat up more statements
        struct Node* n = new_node(NODE_STMT);
|Star| Title | Page |
|----|--------------------------------------------|------|
| | Modern Compiler Design (Book) | https://www.cs.usfca.edu/~galles/compilerdesign/cimplementation.pdf |
| | Modern Compiler Design (Different Book) |http://160592857366.free.fr/joe/ebooks/ShareData/Modern%20Compiler%20Design%202e.pdf|
| | Wikipedia (IR) | https://en.wikipedia.org/wiki/Intermediate_representation |
|* | Intermediate Representations | https://cs.lmu.edu/~ray/notes/ir/ |
| | Intermediate Representations in Comp Design| https://iq.opengenus.org/intermediate-representations-in-compiler-design/ |
|*? | Intermediate Representation (Slides) | https://www.cs.princeton.edu/courses/archive/spring03/cs320/notes/IR-trans1.pdf |
| | Static Single Assignment (Slides) | https://www.cs.cmu.edu/~fp/courses/15411-f08/lectures/09-ssa.pdf |
| | Wikipedia (SSA) | https://en.wikipedia.org/wiki/Static_single_assignment_form |
| | Understanding SSA Forms | https://blog.yossarian.net/2020/10/23/Understanding-static-single-assignment-forms |
|* | V: Anders Hejlsberg on Modern Comp. Construction | https://www.youtube.com/watch?v=wSdV1M7n4gQ |
        // keep parsing statements until the closing } is eaten (ideally we
        // also check for end of file and throw an error accordingly)
        while (!try_eat('}')) {
            add_kid(n, parse_statement());
        }
- interpreters vs. JITs vs. AOTs vs. "transpilers"
        return n;
    } else if (try_eat(TOKEN_VAR)) {
        ...
    } else {
        ...
    }
}
```
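
To see the "one function per grammar rule" idea run end to end, here is a small self-contained sketch for arithmetic expressions. It is an illustration under some assumptions: it reads characters directly instead of tokens, it evaluates the expression instead of building AST nodes (to stay short), and the grammar and the names `parse`, `parse_expr`, `parse_term`, `parse_factor` and `try_eat_char` are invented for this example.

```c
#include <ctype.h>

// Grammar (hypothetical, for illustration only):
//   expr   := term   (('+' | '-') term)*
//   term   := factor ('*' factor)*
//   factor := NUMBER | '(' expr ')'

static const char* cur; // current position in the input

static void skip_spaces(void) {
    while (isspace((unsigned char)*cur)) cur++;
}

// Same idea as try_eat in the example above, but over characters:
// consume c and return 1 if it is next, otherwise stay put and return 0.
static int try_eat_char(char c) {
    skip_spaces();
    if (*cur != c) return 0;
    cur++;
    return 1;
}

static int parse_expr(void);

static int parse_factor(void) {
    if (try_eat_char('(')) {
        int v = parse_expr();
        try_eat_char(')'); // a real parser would report an error if this fails
        return v;
    }
    // try_eat_char already skipped leading spaces, so digits start here
    int v = 0;
    while (isdigit((unsigned char)*cur)) v = v * 10 + (*cur++ - '0');
    return v;
}

static int parse_term(void) {
    int v = parse_factor();
    while (try_eat_char('*')) v *= parse_factor();
    return v;
}

static int parse_expr(void) {
    int v = parse_term();
    for (;;) {
        if (try_eat_char('+'))      v += parse_term();
        else if (try_eat_char('-')) v -= parse_term();
        else return v;
    }
}

// Entry point: parse and evaluate a whole expression string.
int parse(const char* src) {
    cur = src;
    return parse_expr();
}
```

Operator precedence falls out of the call structure: `parse_expr` only ever sees whole terms, so in `1 + 2 * 3` the multiplication is grouped first. Swapping the arithmetic for `new_node` calls would turn the same shape into an AST builder.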
|Star| Title | Page |
|----|--------------------------------------------|------|
| | Wikipedia (Compiler) | https://en.wikipedia.org/wiki/Compiler |
| | Wikipedia (Interpreter) | https://en.wikipedia.org/wiki/Interpreter_(computing) |
| | Interpreters vs. Compilers | https://www.programiz.com/article/difference-compiler-interpreter |
|*? | Compiler vs. Interpreter, What's the Difference? | https://www.guru99.com/difference-compiler-vs-interpreter.html |
| | Wikipedia (Transpiler [Source-to-source compiler]) | https://en.wikipedia.org/wiki/Source-to-source_compiler |
| | Compiling vs. Transpiling (Stack Overflow) | https://stackoverflow.com/questions/44931479/compiling-vs-transpiling |
| | What does a JIT do? (Stack Overflow) | https://stackoverflow.com/questions/95635/what-does-a-just-in-time-jit-compiler-do |
| | JIT Compilation Explained | https://www.freecodecamp.org/news/just-in-time-compilation-explained/ |
| | Wikipedia (JIT Compilation) | https://en.wikipedia.org/wiki/Just-in-time_compilation |
TODO
- executables and linkers
## Type systems
|Star| Title | Page |
|----|--------------------------------------------|------|
| | Wikipedia (Linker) | https://en.wikipedia.org/wiki/Linker_(computing) |
| | Intro to compiler, linker, and libraries (C++) | https://www.learncpp.com/cpp-tutorial/introduction-to-the-compiler-linker-and-libraries/ |
| | Differences Between Compilers and Linkers (Stack Overflow) | https://stackoverflow.com/questions/3831312/what-are-the-differences-between-a-compiler-and-a-linker |
|* | Beginner's Guide to Linkers | https://www.lurklurk.org/linkers/linkers.html |
|* | V: Compiling, Assembling, and Linking | https://www.youtube.com/watch?v=N2y6csonII4 |
| | V: How the Linker Combines Object Files | https://www.youtube.com/watch?v=oXk87NRTL1Y |
| | V: Assembler, Linker, and Loader (C) | https://www.youtube.com/watch?v=cJDRShqtTbk |
| | What is an executable file? | https://www.computerhope.com/jargon/e/execfile.htm |
| | Wikipedia (Executable) | https://en.wikipedia.org/wiki/Executable |
|* | V: What are Executables? | https://www.youtube.com/watch?v=WnqOhgI_8wA |
|* | V: What is an EXE file? | https://www.youtube.com/watch?v=r5ldP1P1Rzc |
Before we continue with the next phase of compilation, we should go over what types and type systems are. In the most abstract form, a type system describes which operations can be applied to which values and what those operations mean. When you write `a + b` in C, the operation may mean different things, or may not be allowed at all, depending on the types of `a` and `b`. In C this is knowable at compile time, which is called static typing, while languages like Python, JS and Lua fall into the dynamically typed category, where types are only fully known when a specific operation executes. One operation on types you will see quite often is a "type cast" or type conversion, which changes a value from one type to another; languages usually place rules on which casts are allowed and in which cases, if at all. Some common type casts include:
- regular expressions?
- regular languages / grammars / language structure / automata
- terminology?
* Down casting - this converts a type into a narrower one. Depending on the language, this might be an unsafe operation or just lead to dynamic type errors (since we can't, in a vacuum, prove the operation safe statically). This encompasses things like converting a base class in Java to its derived class or truncating an `int` in C to a `char`.
* Up casting - this converts a value to the same value represented in a "broader" type. We know it is safe because the type we cast to must be able to represent a superset of the original type; usually this means converting an integer to a bigger integer type or a derived class to a base class.
*
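
The two directions can be seen in plain C with integer types. This is just a small sketch: the function names `widen` and `narrow` are made up, `unsigned char` is used instead of `char` because out-of-range conversion to a signed type is implementation-defined, and the comment on `narrow` assumes 8-bit chars.

```c
// Up cast: every int value fits in a long long, so nothing is lost.
long long widen(int x) {
    return (long long)x;
}

// Down cast: only the low bits survive. With 8-bit chars the result is
// x modulo 256, so the original value cannot always be recovered.
unsigned char narrow(int x) {
    return (unsigned char)x;
}
```

For instance, `narrow(65)` keeps its value, while `narrow(300)` wraps around to a different number on a platform with 8-bit chars; `widen` never changes the value it is given.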
the c compilation process
- translation units
- preprocessing -> (the whole compilation process) -> object files -> linking
- c compilation model is not in favor any more, don't like compiling all these files separately
- ABIs and FFI
- should maybe be in separate article
experts / consultants:
- Bill
- NeGate
## The actual progression
## Semantics / Type checking
- Simple expression interpreter (parse and evaluate)
- Classical compiler construction (lex -> parse -> output), semantic analysis / type checking
- motivation: complex structures! recursion! etc.
- many of these resources exist and cover different aspects of the process in different ways
- Grammars and language structure
- Types of output (interpreter vs. AOT vs. JIT, etc.)
- We can probably find resources on specific ones of these
- Modern phases (IR / SSA)
- Mention WASM?
- The terrors of the real world
- Executables, linkers, and debug info
- Also debug info
- The C ABI and FFI
- Debug info
- Codegen
- Specifically: machine code generation, of reasonable quality
- Note: not necessary for all "compilers"
- Topics: register allocation, instruction selection, instruction scheduling
- Some examples of optimization passes
- There are not a lot of resources for this. Place a public TODO here?
- Appendix:
- Grammar basics (BNF, EBNF)
- Need not go into exhaustive detail on categories of grammars
- C is not the only language
- Brief summaries and examples of different language approaches:
- LISP
- Forth
- Languages in the ML family
- This could be a whole topic maybe
(TODO caveat about how unidirectional type checking isn't the only way) This stage boils down to walking the parse tree/AST and figuring out how all the expressions fit with each other. It's not terribly complicated to get started here: it's mostly just about writing a visitor or a switch statement and recursively walking the tree to grab types from nodes, which may in turn derive their types from their operands, and so on.
## Link dump
```c
struct Type* type_check_expr(struct Node* n) {
    switch (n->tag) {
        case NODE_INT: {
            return TYPE_INT;
        }
### Books
        case NODE_ADD:
        case NODE_SUB:
        case NODE_MUL:
        case NODE_DIV: {
            struct Type* l = type_check_expr(n->operands[0]);
            struct Type* r = type_check_expr(n->operands[1]);
- Engineering a Compiler:
[Well liked]
http://www.r-5.org/files/books/computers/compilers/writing/Keith_Cooper_Linda_Torczon-Engineering_a_Compiler-EN.pdf
- Compiler Design in C:
[May have a full implementation inside]
https://holub.com/goodies/compiler/compilerDesignInC.pdf
- Dragon Book:
[Potentially outdated -- mixed reviews]
http://ce.sharif.edu/courses/94-95/1/ce414-2/resources/root/Text%20Books/Compiler%20Design/Alfred%20V.%20Aho,%20Monica%20S.%20Lam,%20Ravi%20Sethi,%20Jeffrey%20D.%20Ullman-Compilers%20-%20Principles,%20Techniques,%20and%20Tools-Pearson_Addison%20Wesley%20(2006).pdf
            return type_promotion(l, r);
        }
### Webpages
        default: assert(0);
    }
}
```
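
The example above calls `type_promotion` without showing it. One possible shape for it, purely as a sketch, is to rank the arithmetic types from narrow to wide and return the wider of the two operands. Everything here is an assumption for illustration: the `TypeKind` ranking, the `Type` layout and the `TYPE_*` macros are invented, and real C promotion rules (the usual arithmetic conversions) have more cases than this.

```c
// Hypothetical type kinds, ordered from narrowest to widest.
enum TypeKind { KIND_INT, KIND_LONG, KIND_DOUBLE };

struct Type {
    enum TypeKind kind;
};

// One shared instance per builtin type, so types can be compared by pointer.
struct Type type_int    = { KIND_INT };
struct Type type_long   = { KIND_LONG };
struct Type type_double = { KIND_DOUBLE };

#define TYPE_INT    (&type_int)
#define TYPE_LONG   (&type_long)
#define TYPE_DOUBLE (&type_double)

// Result type of a binary arithmetic operation: the wider operand wins.
struct Type* type_promotion(struct Type* l, struct Type* r) {
    return (l->kind >= r->kind) ? l : r;
}
```

Under this scheme an `int + double` expression is typed as `double`, which matches the up casting idea from the type systems section: the narrower operand is implicitly converted to the broader type.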
- lua grammar: http://lua-users.org/wiki/LuaGrammar
- pascal railroad diagrams: https://www.cs.utexas.edu/users/novak/grammar.html
- tons of links: https://github.com/aalhour/awesome-compilers
- expression parsing examples:
- pratt parsing and recursive descent: https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
- dunno, not recursive descent: https://www.cs.rochester.edu/u/nelson/courses/csc_173/grammars/parsing.html
- gary bernhardt's compiler from scratch: https://www.destroyallsoftware.com/screencasts/catalog/a-compiler-from-scratch
- lambda calculus interpreter: https://justine.lol/lambda/
- chibicc (full, readable C compiler): https://github.com/rui314/chibicc
- A Compiler Writing Journey (has many pages/topics): https://github.com/DoctorWkt/acwj
#### From NeGate
- Near-Optimal Instruction Selection on DAGs: https://llvm.org/pubs/2008-CGO-DagISel.pdf
- The Design and Implementation of Gnu Compiler Generation Framework: https://www.cse.iitb.ac.in/~uday/courses/cs715-10/cs715-gcc-intro-handout.pdf
- Lecture Notes on Static Single Assignment Form: https://www.cs.cmu.edu/~rjsimmon/15411-f15/lec/10-ssa.pdf
- Simple and Efficient Construction of Static Single Assignment Form: https://pp.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf
- LLVM Greedy Register Allocator Improving Region Split Decisions: https://llvm.org/devmtg/2018-04/slides/Yatsina-LLVM%20Greedy%20Register%20Allocator.pdf
- NULLSTONE Optimization Categories: http://www.nullstone.com/htmls/category.htm
## Articles that we need to write
- Codegen (needs more details)
- Debug info
Terms we might wanna throw into the acronym lister:
AST, CST, DAG, DFS, CFG, AOT
[SSA](https://blog.yossarian.net/2020/10/23/Understanding-static-single-assignment-forms)
[JIT](https://en.wikipedia.org/wiki/Just-in-time_compilation)