author    | Ori Bernstein <ori@eigenstate.org> | 2017-02-01 01:06:24 -0800
committer | Ori Bernstein <ori@eigenstate.org> | 2017-02-01 01:06:24 -0800
commit    | d811a3c7a8fab7269e0798ee884d4d1363d913df (patch)
tree      | 1bde627a1258e8335a62abd947ae5d894e2e3672 /doc
parent    | 6368976a9e8e4121c5867ec81842e1d4bac55351 (diff)
download  | mc-d811a3c7a8fab7269e0798ee884d4d1363d913df.tar.gz
Not worth the work of finishing or maintaining the compiler doc.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/compiler.txt | 336 |
1 files changed, 0 insertions, 336 deletions
diff --git a/doc/compiler.txt b/doc/compiler.txt
deleted file mode 100644
index 05c5981..0000000
--- a/doc/compiler.txt
+++ /dev/null
@@ -1,336 +0,0 @@

                 Structure of the Myrddin Compiler
                             Aug 2012
                          Ori Bernstein

TABLE OF CONTENTS:

    1. OVERVIEW
        1.1. Tree Structure
    2. PARSING
        2.1. Lexing
        2.2. AST Creation
        2.3. Type Checking
        2.4. Generic Specialization
        2.5. Serialization
        2.6. Usefiles
    3. FLATTENING
        3.1. Control Flow
        3.2. Complex Expressions
    4. OPTIMIZATION
        4.1. Constant Folding
    5. CODE GENERATION
        5.1. Instruction Selection
        5.2. Register Allocation
    6. TUTORIAL: ADDING A STATEMENT
        6.1. Stubbing in the node types
        6.2. Parsing
        6.3. Flattening
        6.4. Optimization
        6.5. Instruction Selection

1. OVERVIEW:

    The Myrddin compiler suite consists of a set of binaries, written in C,
    which translate Myrddin source code to the assembly most appropriate for
    the target platform, and subsequently invoke the native assembler on it.
    The linker is not invoked by the compiler, and the final output is an
    object file for the target platform.

    The compilers are named with a single character for the target platform,
    followed by a single character for the language being compiled. A table
    of the compilers and their names is below:

        Compiler    Platform
        -------------------------
        6m          x86-64

    The compilation is divided into a small number of phases. The first
    phase is parsing, where the source code is tokenized, the abstract
    syntax tree (AST) is generated, and the tree is semantically checked.
    The second phase is the machine-dependent tree flattening. In this
    phase, the tree is decomposed function by function into simple
    operations that are relatively close to the machine. Sizes are fixed,
    and all loops, if statements, etc. are replaced with gotos. The next
    phase is a machine-independent optimizer, which currently does nothing
    other than constant folding on the trees. In the final phase, the
    instructions are selected and the registers are allocated.

    So, to recap, the phases are as follows:

        parse       Tokenize, parse, and analyze the source
        flatten     Rewrite the complex nodes into simple ones
        opt         Optimize the flattened source trees
        gen         Generate the assembly code

    1.1. Tree Structure:

        File nodes (n->type == Nfile) represent the files being compiled.
        The current node is held in a global variable called,
        unsurprisingly, 'file'. The global symbol table, export table,
        uses, and other compilation-specific information is stored in this
        node. This implies that the compiler can only process one file at
        a time.

        Name nodes (n->type == Nname) are simply names, possibly with a
        namespace attached. They are left as leaf nodes in the tree,
        specifying variable names, union tags, and just about anything
        else with a name.

        Use nodes (n->type == Nuse) simply tell the compiler that a
        usefile by the name stored in this node will be loaded.

        Expression nodes (n->type == Nexpr) represent expressions. They
        consist of an operator, a type, a few flags, possibly a
        declaration ID, and a list of arguments.

        Operators are defined in parse/ops.def, and start with an 'O' by
        convention; eg: Oadd, Osub, etc.

        The declaration id (n->expr.did) is only valid on expressions
        representing a single variable (n->expr.op == Ovar). The DID is a
        unique identifier representing the declaration node that the
        variable refers to. This is used for a variety of things, from
        fast comparisons to allowing us to put the node into a bit set
        easily.
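
        As a rough illustration of the fields described above, a heavily
        cut-down expression node might look like the sketch below. This is
        not the real struct Node from parse/parse.h; the field and helper
        names that are not mentioned in the text (args, nargs, isvarlive)
        are invented for the example.

            /* Simplified sketch of a tree node; the real struct Node in
             * parse/parse.h carries many more fields and node kinds. */
            #include <stddef.h>

            typedef enum { Nfile, Nname, Nuse, Nexpr, Nlit, Nblock } Ntype;
            typedef enum { Oadd, Osub, Ovar } Op;

            typedef struct Node Node;
            struct Node {
                Ntype type;         /* which kind of node this is */
                struct {
                    Op op;          /* operator, from parse/ops.def */
                    size_t did;     /* declaration id; valid when op == Ovar */
                    Node **args;    /* operand list */
                    size_t nargs;
                } expr;             /* valid when type == Nexpr */
            };

            /* Because every declaration gets a unique DID, membership of a
             * variable in a pass-local set can be a simple bit test. */
            int isvarlive(unsigned char *bits, Node *n)
            {
                return n->type == Nexpr && n->expr.op == Ovar &&
                    ((bits[n->expr.did / 8] >> (n->expr.did % 8)) & 1);
            }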
        Literal nodes (n->type == Nlit) hold a literal value. The kind of
        value held is stored in littype; the literal types are defined in
        parse/lits.def.

        The various statement nodes (Nloopstmt, Nifstmt, Nmatchstmt,
        Nblock, Nlbl) are all statements that may appear within a block
        node (Nblock).

        Declaration nodes declare a name in a symbol table. TODO: MORE
        DETAIL.

        Uelt nodes declare a union element. TODO: MORE DETAIL.

        Func nodes declare a function. TODO: MORE DETAIL.

2. PARSING:

    This phase takes in a source file, and outputs a tree that is
    guaranteed to be valid. The tree nodes are defined in parse/parse.h in
    struct Node, and have one of the types defined in parse/nodetypes.def.
    Node types start with 'N' by convention; eg: Nfile, Nifstmt, etc.

    2.1. Lexing:

        Lexing occurs in parse/tok.c. Because we want to use this lexer
        from within yacc, the entry point to this code is in 'yylex()'. As
        required by yacc, 'yylex()' returns an integer defining the token
        type, and sets the 'tok' member of yylval to the token that was
        taken from the input stream. In addition, to allow for better
        error messages, the global variable 'curtok' is set to the value
        of 'yylval.tok'. This allows yyerror to print the last token that
        was seen.

        The tokens that are allowable are generated by Yacc from the
        '%token' definitions in parse/gram.y, and are placed into the file
        'parse/gram.h'. The lexer and parser code is the only code that
        depends on these token constants.

        The lexer is initialized through 'tokinit(char *file)'. This
        function will open the file passed in, read all the data from it
        in one go, and set up the internal data for the tokenizer. The
        tokenizing is then done while the whole file is in memory, which
        means that this code will work poorly on files that are larger
        than the address space available to the compiler. If this is a
        problem, you deserve all the pain that is caused.

        The file data is stored in the three global variables 'fidx',
        'fbuf', and 'fbufsz'. The actual tokenization happens through
        'toknext()' and its callees, which operate on these data
        structures character by character, matching the values read, and
        shoving them into the 'Tok' data structure.

    2.2. AST Creation:

        The parser used is a traditional Yacc-based parser. It is
        generated from the source in parse/gram.y. The starting production
        is 'file', which fills in a global 'file' tree node. This 'file'
        tree node must be initialized before yyparse() is called.

    2.3. Type Checking:

        Type checking is done through unification of types. It's
        implemented in parse/infer.c. It proceeds through a simple
        unification algorithm, which is documented in lang.txt. As a
        result, only the internal details of this algorithm will be
        discussed here.

        The first step is loading and resolving usefiles. This is deferred
        to the type checking phase for two reasons. First, we do not want
        to force tools that use this parser to have all dependencies
        compiled, even though full type checking is impossible until all
        usefiles are loaded. And second, this is when the information is
        actually needed.

        Next, the types declared in the package section are merged with
        the exported types, allowing us to start off with our type
        information as complete as possible, and making sure that the
        types of globals actually match up with the exported types.

        The next step is the actual type inference. We do a bottom-up walk
        of the tree, unifying types as we go.
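
        In rough outline, unifying two types means chasing each side down
        through any mappings already recorded for its type variables, then
        either recording a new mapping (when one side is still a free
        variable) or requiring the two types to agree and unifying their
        constituent types. The sketch below is schematic and
        self-contained; the names in it (Type, Tykind, tf, unify, link)
        are invented for the example and are not the ones used in
        parse/infer.c.

            /* Schematic unification over a cut-down type representation.
             * The real algorithm also handles traits, defaults, and the
             * other details discussed in lang.txt. */
            #include <stdio.h>
            #include <stdlib.h>

            typedef enum { Tyvar, Tyint, Typtr, Tyfunc } Tykind;
            typedef struct Type Type;
            struct Type {
                Tykind kind;
                Type *link;     /* mapping for a type variable, or NULL */
                Type **sub;     /* constituent types (args, base, ...) */
                size_t nsub;
            };

            /* Chase a mapping like $t0 -> $t1 -> int to its current end. */
            static Type *tf(Type *t)
            {
                while (t->kind == Tyvar && t->link)
                    t = t->link;
                return t;
            }

            static void unify(Type *a, Type *b)
            {
                size_t i;

                a = tf(a);
                b = tf(b);
                if (a == b)
                    return;
                if (a->kind == Tyvar) { a->link = b; return; }
                if (b->kind == Tyvar) { b->link = a; return; }
                if (a->kind != b->kind || a->nsub != b->nsub) {
                    fprintf(stderr, "type mismatch\n");
                    exit(1);
                }
                for (i = 0; i < a->nsub; i++)
                    unify(a->sub[i], b->sub[i]);
            }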
        There are subtleties with the member operator, however. First,
        because the '.' operator is used for both member lookups and
        namespace lookups, before we descend into a node that has operator
        Omemb, we need to check if it's a namespaced name or an actual
        member reference. If it is a namespaced name, we replace the
        expression with an Ovar expression. This check happens in the
        'checkns()' function. Second, because we need to know the LHS of a
        member expression before we can check if the RHS is valid, and we
        are not guaranteed to know this the first time we see it, the
        expression is assumed to be valid, and this assumption is verified
        in a post-processing pass. Casts are similarly validated in a
        deferred manner.

        Generic variables are added to a list of generic callsites to
        specialize when they are seen as an Ovar leaf node.

        The type inference, to this point, has only built up a mapping
        of types. So, for example, if we were to infer types for the
        following set of statements:

            var a
            var b
            var c
            a = b
            c = b + 1

        We would have the mappings:

            $t0 -> $t1
            $t1 -> $t2
            $t2 -> int

        So, in the 'typesub()' function, we iterate over the entire tree,
        replacing every instance of a non-concrete type with the final
        mapped type. If a type does not map to a fully concrete type,
        this is where we flag an error.

        FIXME: DESCRIBE HOW YOU FIXED GENERICS ONCE YOU FIX GENERICS.

    2.4. Generic Specialization:

        After type inference (well, technically, as the final step of it),
        we walk through the list of callsites that need instantiations
        of generics, and create a specialized generic instance for each of
        them. This specialization is done, unsurprisingly, in specialize.c,
        by the simple algorithm of cloning the entire tree that needs to
        be specialized, and walking over all nodes, substituting the types
        that replace the type parameters.

    2.5. Serialization:

        Trees of all sorts can be serialized to and deserialized from
        files, as long as they are fully typed. Trees containing type
        variables (ie, uninferred types) cannot be serialized, as type
        variables cannot be deserialized meaningfully.

        The format for this is only documented in the source, and is a
        straightforward dump of the in-memory trees. It is constantly
        shifting, and will not reliably work between compiler versions.

    2.6. Usefiles:

        Usefiles are more or less files that consist of a single-character
        tag that tells us what type of tree to deserialize. Because
        serialized trees are compiler version dependent, so are usefiles.

3. FLATTENING:

    This phase is invoked repeatedly, once for each top-level declaration
    that we want to generate code for. There is a good chance that this
    flattening phase should be made machine-independent, and passed a
    machine description (known integer and pointer sizes, among other
    machine attributes) as a parameter. However, for now, it is
    machine-dependent, and lives in 6/simp.c.

    The goal of flattening a tree is to take semantically involved
    constructs such as looping and simplify them into something that is
    easy to generate code for, as well as easier to analyze for
    optimization.

    3.1. Control Flow:

        All if statements, loops, and other complex constructs are
        simplified to jumps and conditional jumps.
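
        For instance, an if statement becomes a conditional jump to one of
        two labels, followed by an unconditional jump past the other arm.
        The toy sketch below only prints the flattened shape rather than
        building the node lists that 6/simp.c builds, and every name in it
        is invented for the example.

            /* Toy illustration of lowering an if statement to labels and
             * jumps; it prints pseudo-assembly instead of building trees. */
            #include <stdio.h>

            static int nlbl;

            static int genlbl(void)
            {
                return nlbl++;
            }

            static void flattenif(const char *cond, const char *iftrue,
                                  const char *iffalse)
            {
                int ltrue, lfalse, lend;

                ltrue = genlbl();
                lfalse = genlbl();
                lend = genlbl();
                printf("\tcjmp %s .L%d .L%d\n", cond, ltrue, lfalse);
                printf(".L%d:\n\t%s\n\tjmp .L%d\n", ltrue, iftrue, lend);
                printf(".L%d:\n\t%s\n", lfalse, iffalse);
                printf(".L%d:\n", lend);
            }

            int main(void)
            {
                flattenif("x != 0", "y = 1", "y = 2");
                return 0;
            }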
        Loops are generally simplified from something that would look
        like this:

            loop
                init
                cond
                inc
                body

        To something that would look like this:

                init
                jmp cond
            .loop:
                body
                inc
            .cond:
                cjmp cond .loop .end
            .end:

        Boolean expressions are simplified to a location to jump to, as
        described in section 8.4 of the Dragon book[1].

    3.2. Complex Expressions:

        Complex expressions such as copying types larger than a single
        machine word, pulling members out of structures, emulating
        multiplication and division for larger integer sizes, and similar
        operations are reduced to trees that are expressible in terms of
        simple machine operations.

        By the end of the simplification pass, the following operators
        should not be present in the trees:

            Obad Oret Opreinc Opostinc Opredec Opostdec Olor Oland Oaddeq
            Osubeq Omuleq Odiveq Omodeq Oboreq Obandeq Obxoreq Obsleq
            Obsreq Omemb Oslice Oidx Osize Numops Oucon Ouget Otup Oarr
            Oslbase Osllen Ocast

4. OPTIMIZATION:

    Currently, there is virtually no optimization done on the trees after
    flattening. The only optimization that is done is constant folding.

    4.1. Constant Folding:

        Expressions with constant operands are simplified algebraically.
        For example, the expression 'x*1' is simplified to 'x', '0/n' is
        simplified to '0', and so on.

5. CODE GENERATION:

    5.1. Instruction Selection:

        Instruction selection is done via a simple handwritten bottom-up
        pass over the tree. Common patterns such as scaled or offset
        indexing are recognized by this pass, but no attempt is made at
        finding an optimal tiling.

    5.2. Register Allocation:

        Register allocation is done via the algorithm described in
        "Iterated Register Coalescing" by Appel and George. As of the time
        of this writing, the register allocator does not yet implement
        overlapping register classes. This will be done as described in
        "A generalized algorithm for graph-coloring register allocation"
        by Smith, Ramsey, and Holloway.

6. TUTORIAL: ADDING A STATEMENT:

    6.1. Stubbing in the node types:

    6.2. Parsing:

    6.3. Flattening:

    6.4. Optimization:

    6.5. Instruction Selection:

[1] Aho, Sethi, Ullman: Compilers: Principles, Techniques, and Tools, 1988.
    ISBN 0-201-10088-6