|author||Ori Bernstein <email@example.com>||2012-08-18 20:32:44 -0400|
|committer||Ori Bernstein <firstname.lastname@example.org>||2012-08-18 20:32:44 -0400|
Add start of compiler internals documentation.
Diffstat (limited to 'doc/compiler.txt')
1 files changed, 131 insertions, 0 deletions
diff --git a/doc/compiler.txt b/doc/compiler.txt
new file mode 100644
@@ -0,0 +1,131 @@
+ Structure of the Myrddin Compiler
+ Aug 2012
+ Ori Bernstein
+TABLE OF CONTENTS:
+ 1. OVERVIEW
+ 2. PARSING
+ 2.1. Lexing
+ 2.2. Parsing
+ 2.3. Type checking
+ 3. FLATTENING
+ 4. OPTIMIZATION
+ 5. CODE GENERATION
+ 6. TUTORIAL: ADDING A STATEMENT
+ The Myrddin compiler suite consists of a set of binaries, written in C,
+ which translate Myrddin source code to the assembly most appropriate for
+ the target platform, and subsequently invoke the native assembler on it.
+ The linker is not invoked by the compiler, and the final output is an
+ object file for the target platform.
+ The compilers are named with a single character for the target platform,
+ with a single character for the language being compiled. A table of the
+ compilers and their names is below:
+ Compiler Platform
+ 6m x86-64
+ The compilation is divided into a small number of phases. The first phase
+ is parsing. The first phase is parsing, where the source code is first
+ tokenized, parsed, and semantically checked. The second phase is the
+ machine dependent tree flattening. In this phase, the tree is decomposed
+ function by function into simple operations that are relatively close to
+ the machine. Sizes are fixed, and all loops, if statements, etc are
+ replaced with gotos. The next phase is a machine independent optimizer,
+ which currenty does nothing other than simply folding trees. In the final
+ phase, the instructions are selected and the registers are allocated.
+ So, to recap, the phases are as follows:
+ parse Tokenize, parse and analyze the source.
+ flatten Rewrite the complex nodes into simpe ones
+ opt Optimize the flattened source trees
+ gen Generate the assembly code
+ This phase takes in a source file, and outputs a tree that is guaranteed
+ to be valid.
+ 2.1. Lexing:
+ Lexing occurs in parse/tok.c. Because we desire to use this lexer from
+ within yacc, the entry point to this code is in 'yylex()'. As required
+ by yacc, 'yylex()' returns an integer defining the token type, and
+ sets the 'tok' member of yylval to the token that was taken from the
+ input stream. In addition, to allow for better error messages, the
+ global variable 'curtok' is set to the value of 'yylval.tok'. This
+ allows yyerror to print the last token that was seen.
+ The tokens that are allowable are generated by Yacc from the '%token'
+ definiitions in parse/gram.y, and are placed into the file
+ 'parse/gram.h'. The lexer and parser code is the only code that
+ depends on these token constants.
+ The lexer is initalized through 'tokinit(char *file)'. This function
+ will open the file passed in, read all the data from it in one go
+ and set up the internal data for the tokenizer. The tokenizing is then
+ done while the whole file is in memory, which means that this code
+ will work poorly on files that are larger than the address space
+ available to the compiler. If this is a problem, you deserve all the
+ pain that is caused.
+ The file data is stored in the three global variables 'fidx', 'fbuf',
+ and 'fbufsz'. The actual tokenization happens through 'toknext()' and
+ its callees, which operate on these data structures character by
+ character, matching the values read, and shoving them into the 'Tok'
+ data structure.
+ 2.2. Parsing:
+ The parser used is a traditional Yacc based parser. It is generated
+ from the source in parse/gram.y. The starting production is 'file',
+ which fills in a global 'file' tree node. This 'file' tree node must
+ be initialized before yyparse() is called.
+ 2.3. Type Checking:
+ Type checking is done through unification of types. It's implemented
+ in parse/infer.c
+ 2.4. Speciaization:
+ 2.4. Serialization:
+ 2.5. Use Files:
+ This phase is invoked repeatedly on each top level declaration that we
+ want to generate code for.
+ 4.1. Constant Folding:
+5. CODE GENERATION:
+ 5.1. Instruction Selection:
+ 5.2. Register Allocation:
+6: TUTORIAL: ADDING A STATEMENT:
+ 6.1. Stubbing in the node types:
+ 6.2. Parsing:
+ 6.3. Flattening:
+ 6.4. Optimization:
+ 6.5. Instruction Selection: