diff options
author | Ori Bernstein <ori@eigenstate.org> | 2014-06-18 01:10:30 -0400 |
---|---|---|
committer | Ori Bernstein <ori@eigenstate.org> | 2014-06-18 01:10:30 -0400 |
commit | f6e53709407611de940092c9d73e63926723252f (patch) | |
tree | 320c6d9474ba1eb51f2dc88640e9edd0d1672a55 /doc | |
download | mc-f6e53709407611de940092c9d73e63926723252f.tar.gz |
Merge pull request #1 from akoshibe/master
added sockproto values
Diffstat (limited to 'doc')
-rw-r--r-- | doc/Makefile | 21 | ||||
-rw-r--r-- | doc/compiler.txt | 336 | ||||
-rw-r--r-- | doc/lang.txt | 832 | ||||
-rw-r--r-- | doc/mc.1 | 84 | ||||
-rw-r--r-- | doc/muse.1 | 95 | ||||
-rw-r--r-- | doc/myrbuild.1 | 66 |
6 files changed, 1434 insertions, 0 deletions
diff --git a/doc/Makefile b/doc/Makefile new file mode 100644 index 0000000..9ee91a7 --- /dev/null +++ b/doc/Makefile @@ -0,0 +1,21 @@ +MAN=mc.1 \ + muse.1 \ + myrbuild.1 \ + +include ../config.mk + +all: + +install: + @echo install -m 644 $(MAN) $(abspath $(DESTDIR)/$(INST_ROOT)/share/man/man1); \ + mkdir -p $(abspath $(DESTDIR)/$(INST_ROOT)/share/man/man1); \ + install -m 644 $(MAN) $(abspath $(DESTDIR)/$(INST_ROOT)/share/man/man1); \ + +uninstall: $(MAN) + @for i in $^; do \ + echo rm -f $(abspath $(DESTDIR)/$(INST_ROOT)/share/man/man1/$$i); \ + rm -f $(abspath $(DESTDIR)/$(INST_ROOT)/share/man/man1/$$i); \ + done + +clean: + diff --git a/doc/compiler.txt b/doc/compiler.txt new file mode 100644 index 0000000..e296c0e --- /dev/null +++ b/doc/compiler.txt @@ -0,0 +1,336 @@ + Structure of the Myrddin Compiler + Aug 2012 + Ori Bernstein + +TABLE OF CONTENTS: + + 1. OVERVIEW + 1.1. Tree Structure + 2. PARSING + 2.1. Lexing + 2.2. AST Creation + 2.3. Type checking + 2.4. Generic Specialization + 2.5. Serialization + 2.6. Usefiles + 3. FLATTENING + 3.1. Control Flow + 3.2. Complex Expressions + 4. OPTIMIZATION + 4.1. Constant Folding + 5. CODE GENERATION + 5.1. Instruction Selection + 5.2. Register Allocation + 6. TUTORIAL: ADDING A STATEMENT + 6.1. Stubbing in the node types + 6.2. Parsing + 6.3. Flattening + 6.4. Optimization + 6.5. Instruction Selection + +1. OVERVIEW: + + The Myrddin compiler suite consists of a set of binaries, written in C, + which translate Myrddin source code to the assembly most appropriate for + the target platform, and subsequently invoke the native assembler on it. + The linker is not invoked by the compiler, and the final output is an + object file for the target platform. + + The compilers are named with a single character for the target platform, + with a single character for the language being compiled. A table of the + compilers and their names is below: + + Compiler Platform + ------------------------- + 6m x86-64 + + + The compilation is divided into a small number of phases. The first phase + is parsing, where the source code is first tokenized, the abstract syntax + tree (AST) is generated, and semantically checked. The second phase is the + machine-dependent tree flattening. In this phase, the tree is decomposed + function by function into simple operations that are relatively close to + the machine. Sizes are fixed, and all loops, if statements, etc. are + replaced with gotos. The next phase is a machine-independent optimizer, + which currenty does nothing other than simply folding trees. In the final + phase, the instructions are selected and the registers are allocated. + + So, to recap, the phases are as follows: + + parse Tokenize, parse, and analyze the source + flatten Rewrite the complex nodes into simpe ones + opt Optimize the flattened source trees + gen Generate the assembly code + + 1.1. Tree Structure: + + File nodes (n->type == Nfile) represent the files being compiled. The current + node is held in a global variable called, unsurprisingly, 'file'. The + global symbol table, export table, uses, and other compilation-specific + information is stored in this node. This implies that the compiler can + only process one file at a time. + + Name nodes (n->type == Nname) are simply names, possibly with a namespace + attached. They are left as leaf nodes in the tree, specifying variable + names, union tags, and just about anything else with a name. + + Use nodes (n->type == Nuse) simply tell the compiler that a usefile by the + name stored in this node will be loaded. + + Expression nodes (n->type == Nexpr) represent expressions. They consist of + an operator, a type, a few flags, possibly a declaration ID, and a list of + arguments. + + Operators are defined in parse/ops.def, and start with an 'O' by + convention; eg: Oadd, Osub, etc. + + The declaration id (n->expr.did) is only valid on expressions representing + a single variable (n->expr.op == Ovar). The DID is a unique identifier + representing the declaration node that the variable refers to. This is used + for a variety of things, from fast comparisons to allowing us to put the + node into a bit set easily. + + Literal nodes (n->type == Nlit) hold a literal value. The type held is + stored in littype, which are defined in parse/lits.def. + + The various statement nodes (Nloopstmt, Nifstmt, Nmatchstmt, Nblock, + Nlbl) are all statements that may appear within a block node (Nblock). + + Declaration nodes declare a name in a symbol table. TODO: MORE DETAIL. + + Uelt nodes declare a union element. TODO: MORE DETAIL. + + Func nodes declare a function. TODO: MORE DETAIL. + + + +2. PARSING: + + This phase takes in a source file, and outputs a tree that is guaranteed + to be valid. The tree nodes are defined in parse/parse.h in struct Node, + and have one of the types defined in parse/nodetypes.def. Node types + start with 'N' by convention; eg: Nfile, Nifstmt, etc. + + 2.1. Lexing: + + Lexing occurs in parse/tok.c. Because we want to use this lexer from + within yacc, the entry point to this code is in 'yylex()'. As required + by yacc, 'yylex()' returns an integer defining the token type, and + sets the 'tok' member of yylval to the token that was taken from the + input stream. In addition, to allow for better error messages, the + global variable 'curtok' is set to the value of 'yylval.tok'. This + allows yyerror to print the last token that was seen. + + The tokens that are allowable are generated by Yacc from the '%token' + definitions in parse/gram.y, and are placed into the file + 'parse/gram.h'. The lexer and parser code is the only code that + depends on these token constants. + + The lexer is initalized through 'tokinit(char *file)'. This function + will open the file passed in, read all the data from it in one go + and set up the internal data for the tokenizer. The tokenizing is then + done while the whole file is in memory, which means that this code + will work poorly on files that are larger than the address space + available to the compiler. If this is a problem, you deserve all the + pain that is caused. + + The file data is stored in the three global variables 'fidx', 'fbuf', + and 'fbufsz'. The actual tokenization happens through 'toknext()' and + its callees, which operate on these data structures character by + character, matching the values read, and shoving them into the 'Tok' + data structure. + + 2.2. AST Creation: + + The parser used is a traditional Yacc-based parser. It is generated + from the source in parse/gram.y. The starting production is 'file', + which fills in a global 'file' tree node. This 'file' tree node must + be initialized before yyparse() is called. + + + 2.3. Type Checking: + + Type checking is done through unification of types. It's implemented + in parse/infer.c. It proceeds through a simple unification algorithm, + which is documented in lang.txt. As a result, only the internal + details of this algorithm will be discussed here. + + The first step done is loading and resolving use files. This is + deferred to the type checking phase for two reasons. First, we + do not want to force tools to have all dependencies compiled if they + use this parser, even though type full type checking is impossible + until all usefiles are loaded. And second, this is when the + information is actually needed. + + Next, the types declared in the package section are merged with the + exported types, allowing us to start off with our type information as + complete as possible, and making sure that the types of globals + actually match up with the exported types. + + The next step is the actual type inference. We do a bottom-up walk of + the tree, unifying types as we go. There are subtleties with the + member operator, however. Because the '.' operator is used for both + member lookups and namespace lookups, before we descend into a node + that has operator Omemb, we need to check if it's a namespaced name, + or an actual member reference. If it is a namespaced name, we replace + the expression with an Ovar expression. This check happens in the + 'checkns()' function. Second, because we need to know the LHS of a + member expression before we can check if the RHS is valid, and we + are not guaranteed to know this at the first time that we see it, the + expression is assumed to be valid, and this asumption is verified in + a post-processing pass. Casts are validated in a deferred manner + similarly. + + Generic variables are added to a list of generic callsites to + specialize when they are seen in as a leaf of an Ovar node. + + The type inference, to this point, has only built up a mapping + of types. So, for example, if we were to have the inferred types + for the following set of statements: + + var a + var b + var c + a = b + c = b + 1 + + We would have the mappings: + + $t0 -> $t1 + $t1 -> $t2 + $t2 -> int + + So, in the 'typesub()' function, we iterate over the entire tree, + replacing every instance of a non-concrete type with the final + mapped type. If a type does not map to a fully concrete type, + this is where we flag an error. + + FIXME: DESCRIBE HOW YOU FIXED GENERICS ONCE YOU FIX GENERICS. + + 2.4. Generic Specialization: + + After type inference (well, technially, as the final step of it), + we walk through the list of callsites that need instantiations + of generics, and create a specialized generic instance for each of + them. This specialization is done, unsurprisingly, in specialize.c, + by the simple algorithm of cloning the entire tree that needs to + be specialized, and walking over all nodes substituting the types + that are replacing the type parameters. + + 2.5. Serialization: + + Trees of all sorts can be serialized and deserialized from files, + as long as they are fully typed. Trees containing type variables (ie, + uninferred types) cannot be serialized, as type variables cannot be + deserialized meaningfully. + + The format for this is only documented in the source, and is a + straighforward dump of the trees to memory. It is constantly shifting, + and will not reliably work between compiler versions. + + 2.6. Usefiles: + + Usefiles are more or less files that consist of a single character tag + that tells us what type of tree to deserialize. Because serialized + trees are compiler version dependent, so are usefiles. + +3. FLATTENING: + + This phase is invoked repeatedly on each top-level declaration that we + want to generate code for. There is a good chance that this flattening + phase should be made machine-independent, and passed as a parameter + a machine description describing known integer and pointer sizes, among + other machine attributes. However, for now, it is machine-dependent, + and lives in 6/simp.c. + + The goal of flattening a tree is to take semantically involved constructs + such as looping, and simplify things into something that is easy to + generate code for, as well as something that is easier to analyze for + optimization. + + 3.1. Control Flow: + + All if statements, loops, and other complex constructs are simplified + to jumps and conditional jumps. Loops are generally simplified from + something that would look like this: + + loop + init + cond + inc + body + + To something that would look like this: + + init + jmp cond + .loop: + body + inc + .cond: + cjmp cond .loop .end + .end: + + Boolean expressions are simplified to a location to jump to, as + described in section 8.4 of the Dragon book[1]. + + 3.2. Complex Expressions: + + Complex expressions such as copying types larger than a single machine + word, pulling members out of structures, emulating multiplication and + division for larger integers sizes, and similar operations are reduced + to trees that are expressible in terms of simple machine operations. + + By the end of the simplification pass, the following operators should + not be present in the trees: + + Obad Oret Opreinc Opostinc Opredec Opostdec Olor Oland Oaddeq + Osubeq Omuleq Odiveq Omodeq Oboreq Obandeq Obxoreq Obsleq + Obsreq Omemb Oslice Oidx Osize Numops Oucon Ouget Otup Oarr + Oslbase Osllen Ocast + + +4. OPTIMIZATION: + + Currently, there is virtually no optimization done on the trees after + flattening. The only optimization that is done is constant folding. + + 4.1. Constant Folding: + + Expressions with constant values are simplified algebraically. For + example, the expression 'x*1' is simplified to 'x', '0/n' is + simplified to '0', and so on. + + +5. CODE GENERATION: + + 5.1. Instruction Selection: + + Instruction selection is done via a simple handwritten bottom-up pass + over the tree. Common patterns such as scaled or offset indexing are + recognized by the patterns, but no attempts are made at finding an + optimal tiling. + + 5.2. Register Allocation: + + Register allocation is done via the algorithm described in "Iterated + Regster Coalescing" by Appel and George. As of the time of this + writing, the register allocator does not yet implement overlapping + register classes. This will be done as described in "A generalized + algorithm for graph-coloring register allocation" by Smith, Ramsey, + and Holloway. + +6: TUTORIAL: ADDING A STATEMENT: + + 6.1. Stubbing in the node types: + + 6.2. Parsing: + + 6.3. Flattening: + + 6.4. Optimization: + + 6.5. Instruction Selection: + +[1] Aho, Sethi, Ullman: Compilers: Principles, Techniques, and Tools, 1988. + ISBN 0-201-10088-6 diff --git a/doc/lang.txt b/doc/lang.txt new file mode 100644 index 0000000..008c95a --- /dev/null +++ b/doc/lang.txt @@ -0,0 +1,832 @@ + The Myrddin Programming Language + Jul 2012 + Ori Bernstein + +TABLE OF CONTENTS: + + 1. ABOUT + 2. LEXICAL CONVENTIONS + 3. SYNTAX + 3.1. Declarations + 3.2. Literal Values + 3.3. Control Constructs and Blocks + 3.4. Expressions + 3.5. Data Types + 3.6. Packages and Uses + 4. TOOLCHAIN + 5. EXAMPLES + 6. STYLE GUIDE + 7. STANDARD LIBRARY + 8. GRAMMAR + 9. FUTURE DIRECTIONS + +1. ABOUT: + + Myrddin is designed to be a simple, low-level programming + language. It is designed to provide the programmer with + predictable behavior and a transparent compilation model, + while at the same time providing the benefits of strong + type checking, generics, type inference, and similar. + Myrddin is not a language designed to explore the forefront + of type theory or compiler technology. It is not a language + that is focused on guaranteeing perfect safety. Its focus + is on being a practical, small, fairly well defined, and + easy to understand language for work that needs to be close + to the hardware. + + Myrddin is a computer language influenced strongly by C + and ML, with ideas from Rust, Go, C++, and numerous other + sources and resources. + + +2. LEXICAL CONVENTIONS: + + The language is composed of several classes of tokens. There + are comments, identifiers, keywords, punctuation, and whitespace. + + Comments begin with "/*" and end with "*/". They may nest. + + /* this is a comment /* with another inside */ */ + + Identifiers begin with any alphabetic character or underscore, + and continue with any number of alphanumeric characters or + underscores. Currently the compiler places a limit of 1024 + bytes on the length of the identifier. + + some_id_234__ + + Keywords are a special class of identifier that is reserved + by the language and given a special meaning. The set of + keywords in Myrddin are as follows: + + castto match + const pkg + default protect + elif sizeof + else struct + export trait + extern true + false type + for union + generic use + goto var + if while + + + At the current stage of development, not all of these keywords are + implemented within the language.[1] + + Literals are a direct representation of a data object within the source of + the program. There are several literals implemented within the language. + These are fully described in section 3.2 of this manual. + + In the compiler, single semicolons (';') and newline (\x10) + characters are treated identically, and are therefore interchangable. + They will both be referred to "endline"s thoughout this manual. + + +3. SYNTAX OVERVIEW: + + 3.1. Declarations: + + A declaration consists of a declaration class (i.e., one + of 'const', 'var', or 'generic'), followed by a declaration + name, optionally followed by a type and assignment. One thing + you may note is that unlike most other languages, there is no + special function declaration syntax. Instead, a function is + declared like any other value: by assigning its name to a + constant or variable. + + const: Declares a constant value, which may not be + modified at run time. Constants must have + initializers defined. + var: Declares a variable value. This value may be + assigned to, copied from, and modified. + generic: Declares a specializable value. This value + has the same restricitions as a const, but + taking its address is not defined. The type + parameters for a generic must be explicitly + named in the declaration in order for their + substitution to be allowed. + + In addition, there is one modifier allowed on declarations: + 'extern'. Extern declarations are used to declare symbols from + another module which cannot be provided via the 'use' mechanism. + Typical uses would be to expose a function written in assembly. They + can also be used as a workaround for external dependencies. + + Examples: + + Declare a constant with a value 123. The type is not defined, + and will be inferred: + + const x = 123 + + Declare a variable with no value and no type defined. The + value can be assigned later (and must be assigned before use), + and the type will be inferred. + + var y + + Declare a generic with type '@a', and assigns it the value + 'blah'. Every place that 'z' is used, it will be specialized, + and the type parameter '@a' will be substituted. + + generic z : @a = blah + + Declare a function f with and without type inference. Both + forms are equivalent. 'f' takes two parameters, both of type + int, and returns their sum as an int + + const f = {a, b + var c : int = 42 + -> a + b + c + } + + const f : (a : int, b : int -> int) = {a : int, b : int -> int + var c : int = 42 + -> a + b + c + } + + 3.2. Literal Values + + Integers literals are a sequence of digits, beginning with a + digit and possibly separated by underscores. They are of a + generic type, and can be used where any numeric type is + expected. They may be prefixed with "0x" to indicate that the + following number is a hexadecimal value, or 0b to indicate a + binary value. Decimal values are not prefixed, and octal values + are not supported. + + eg: 0x123_fff, 0b1111, 1234 + + Floating-point literals are also a sequence of digits beginning with + a digit and possibly separated by underscores. They are also of a + generic type, and may be used whenever a floating-point type is + expected. Floating point literals are always in decimal, and + as of this writing, exponential notation is not supported[2] + + eg: 123.456 + + String literals represent a compact method of representing a byte + array. Any byte values are allowed in a string literal, and will be + spit out again by the compiler unmodified, with the exception of + escape sequences. + + There are a number of escape sequences supported for both character + and string literals: + \n newline + \r carriage return + \t tab + \b backspace + \" double quote + \' single quote + \v vertical tab + \\ single slash + \0 nul character + \xDD single byte value, where DD are two hex digits. + + String literals begin with a ", and continue to the next + unescaped ". + + eg: "foo\"bar" + + Character literals represent a single codepoint in the character + set. A character starts with a single quote, contains a single + codepoint worth of text, encoded either as an escape sequence + or in the input character set for the compiler (generally UTF8). + They share the same set of escape sequences as string literals. + + eg: 'א', '\n', '\u1234'[3] + + Boolean literals are either the keyword "true" or the keyword + "false". + + eg: true, false + + Funciton literals describe a function. They begin with a '{', + followed by a newline-terminated argument list, followed by a + body and closing '}'. They will be described in more detail + later in this manual. + + eg: {a : int, b + -> a + b + } + + Sequence literals describe either an array or a structure + literal. They begin with a '[', followed by an initializer + sequence and closing ']'. For array literals, the initializer + sequence is either an indexed initializer sequence[4], or an + unindexed initializer sequence. For struct literals, the + initializer sequence is always a named initializer sequence. + + An unindexed initializer sequence is simply a comma separated + list of values. An indexed initializer sequence contains a + '#number=value' comma separated sequence, which indicates the + index of the array into which the value is inserted. A named + initializer sequence contains a comma separated list of + '.name=value' pairs. + + eg: [1,2,3], [#2=3, #1=2, #0=1], [.a = 42, .b="str"] + + A tuple literal is a parentheses separated list of values. + A single element tuple contains a trailing comma. + + eg: (1,), (1,'b',"three") + + Finally, while strictly not a literal, it's not a control + flow construct either. Labels are identifiers preceded by + colons. + + eg: :my_label + + They can be used as targets for gotos, as follows: + + goto my_label + + the ':' is not part of the label name. + + 3.3. Control Constructs and Blocks: + + if for + while match + goto + + The control statements in Myrddin are similar to those in many other + popular languages, and with the exception of 'match', there should + be no surprises to a user of any of the Algol derived languages. + Where a truth value is required, any type with the builtin trait + 'tctest' can be used in all of these. + + Blocks are the "carriers of code" in Myrddin programs. They consist + of series of expressions, typically ending with a ';;', although the + function-level block ends at the function's '}', and in if + statemments, an 'elif' may terminate a block. They can contain any + number of declarations, expressions, control constructs, and empty + lines. Every control statement example below will (and, in fact, + must) have a block attached to the control statement. + + If statements branch one way or the other depending on the truth + value of their argument. The truth statement is separated from the + block body + + if true + std.put("The program always get here") + elif elephant != mouse + std.put("...eh.") + else + std.put("The program never gets here") + ;; + + For statements begin with an initializer, followed by a test + condition, followed by an increment action. For statements run the + initializer once before the loop is run, the test each on each + iteration through the loop before the body, and the increment on + each iteration after the body. If the loop is broken out of early + (for example, by a goto), the final increment will not be run. The + syntax is as follows: + + for init; test; increment + blockbody() + ;; + + While loops are equivalent to for loops with empty initializers + and increments. They run the test on every iteration of the loop, + and exit only if it returns false. + + Match statements do pattern matching on values. They take as an + argument a value of type 't', and match it against a list of other + values of the same type. The patterns matched against can also contain + free names, which will be bound to the sub-value matched against. The + patterns are checked in order, and the first matching pattern has its + body executed, after which no other patterns will be matched. This + implies that if you have specific patterns mixed with by more general + ones, the specific patterns must come first. + + Match patterns can be one of the following: + + - Union patterns + + These look like union constructors, only they define + a value to match against. + + - Literal patterns + + Any literal value can be matched against. + + - Constant patterns + + Any constant value can be matched against. + + More types of pattern to match will be added over time. + + Match statements consist of the keyord 'match', followed by + the expression to match against the patterns, followed by a + newline. The body of the match statement consists of a list + of pattern clauses. A patterned clause is a pattern, followed + by a ':', followed by a block body. + + An example of the syntax follows: + + const Val234 = `Val 234 /* set up a constant value */ + var v = `Val 123 /* set up variable to match */ + match v + /* pattern clauses */ + `Val 123: + std.put("Matched literal union pat\n");; + Val234: + std.put("Matched const value pat\n") + ;; + `Val a: + std.put("Matched pattern with capture\n") + std.put("Captured value: a = %i\n", a) + ;; + a + std.put("A top level bind matches anything.");; + `Val 111 + std.put("Unreachable block.") + ;; + ;; + + + 3.4. Expressions: + + Myrddin expressions are relatively similar to expressions in C. The + operators are listed below in order of precedence, and a short + summary of what they do is listed given. For the sake of clarity, + 'x' will stand in for any expression composed entirely of + subexpressions with higher precedence than the current current + operator. 'e' will stand in for any expression. Unless marked + otherwise, expressions are left associative. + + BUG: There are too many precedence levels. + + + Precedence 14: (*ok, not really operators) + (,,,) Tuple Construction + (e) Grouping + name Bare names + literal Values + + Precedence 13: + x.name Member lookup + x++ Postincrement + x-- Postdecrement + x[e] Index + x[from,to] Slice + + Precedence 12: + ++x Preincrement + --x Predecrement + *x Dereference + &x Address + !x Logical negation + ~x Bitwise negation + +x Positive (no operation) + -x Negate x + + Precedence 11: + x << x Shift left + x >> x Shift right + + Precedence 10: + x * x Multiply + x / x Divide + x % x Modulo + + Precedence 9: + x + x Add + x - x Subtract + + Precedence 8: + x & y Bitwise and + + Precedence 7: + x | y Bitwise or + x ^ y Bitwise xor + + Precedence 6: + `Name x Union construction + + Precedence 5: + x castto(type) Cast expression + + Precedence 4: + x == x Equality + x != x Inequality + x > x Greater than + x >= x Greater than or equal to + x < x Less than + x <= x Less than or equal to + + Precedence 3: + x && x Logical and + + Precedence 2: + x || x Logical or + + Precedence 1: + x = x Assign Right assoc + x += x Fused add/assign Right assoc + x -= x Fused sub/assign Right assoc + x *= x Fused mul/assign Right assoc + x /= x Fused div/assign Right assoc + x %= x Fused mod/assign Right assoc + x |= x Fused or/assign Right assoc + x ^= x Fused xor/assign Right assoc + x &= x Fused and/assign Right assoc + x <<= x Fused shl/assign Right assoc + x >>= x Fused shr/assign Right assoc + + Precedence 0: + -> x Return expression + + All expressions on integers act on two's complement values which wrap + on overflow. Right shift expressions fill with the sign bit on + signed types, and fill with zeros on unsigned types. + + 3.5. Data Types: + + The language defines a number of built in primitive types. These + are not keywords, and in fact live in a separate namespace from + the variable names. Yes, this does mean that you could, if you want, + define a variable named 'int'. + + There are no implicit conversions within the language. All types + must be explicitly cast if you want to convert, and the casts must + be of compatible types, as will be described later. + + 3.5.1. Primitive types: + + void + bool char + int8 uint8 + int16 uint16 + int32 uint32 + int64 uint64 + int uint + long ulong + float32 float64 + + These types are as you would expect. 'void' represents a + lack of type, although for the sake of genericity, you can + assign between void types, return values of void, and so on. + This allows generics to not have to somehow work around void + being a toxic type. + + bool is a type that can only hold true and false. It can be + assigned, tested for equality, and used in the various boolean + operators. + + char is a 32 bit integer type, and is guaranteed to be able + to hold exactly one codepoint. It can be assigned integer + literals, tested against, compared, and all the other usual + numeric types. + + The various [u]intXX types hold, as expected, signed and + unsigned integers of the named sizes respectively. + Similarly, floats hold floating point types with the + indicated precision. + + var x : int declare x as an int + var y : float32 declare y as a 32 bit float + + + 3.5.2. Composite types: + + pointer + slice array + + Pointers are, as expected, values that hold the address of + the pointed to value. They are declared by appending a '*' + to the type. Pointer arithmetic is not allowed. They are + declared by appending a '*' to the base type + + Arrays are a group of N values, where N is part of the type. + Arrays of different sizes are incompatible. Arrays in + Myrddin, unlike many other languages, are passed by value. + They are declared by appending a '[SIZE]' to the base type. + + Slices are similar to arrays in many contemporary languages. + They are reference types that store the length of their + contents. They are declared by appending a '[,]' to the base + type. + + foo* type: pointer to foo + foo[123] type: array of 123 foo + foo[,] type: slice of foo + + 3.5.3. Aggregate types: + + tuple struct + union + + Tuples are the traditional product type. They are declared + by putting the comma separated list of types within square + brackets. + + Structs are aggregations of types with named members. They + are declared by putting the word 'struct' before a block of + declaration cores (ie, declarations without the storage type + specifier). + + Unions are the traditional sum type. They consist of a tag + (a keyword prefixed with a '`' (backtick)) indicating their + current contents, and a type to hold. They are declared by + placing the keyword 'union' before a list of tag-type pairs. + They may also omit the type, in which case, the tag is + suficient to determine which option was selected. + + [int, int, char] a tuple of 2 ints and a char + + struct a struct containing an int named + a : int 'a', and a char named 'b'. + b : char + ;; + + union a union containing one of + `Thing int int or char. The values are not + `Other float32 named, but they are tagged. + ;; + + + 3.5.4. Magic types: + + tyvar typaram + tyname + + A tyname is a named type, similar to a typedef in C, however + it genuinely creates a new type, and not an alias. There are + no implicit conversions, but a tyname will inherit all + constraints of its underlying type. + + A typaram is a parametric type. It is used in generics as + a placeholder for a type that will be substituted in later. + It is an identifier prefixed with '@'. These are only valid + within generic contexts, and may not appear elsewhere. + + A tyvar is an internal implementation detail that currently + leaks in error messages out during type inference, and is a + major cause of confusing error messages. It should not be in + this manual, except that the current incarnation of the + compiler will make you aware of it. It looks like '@$type', + and is a variable that holds an incompletely inferred type. + + type mine = int creates a tyname named + 'mine', equivalent to int. + + + @foo creates a type parameter + named '@foo'. + + + 3.6. + + The myrddin type system is a system similar to the Hindley Milner + system, however, types are not implicitly generalized. Instead, type + schemes (type parameters, in Myrddin lingo) must be explicitly provided + in the declarations. For purposes of brevity, instead of specifying type + rules for every operator, we group operators which behave identically + from the type system perspective into a small set of classes. and define + the constraints that they require. + + Type inference in Myrddin operates as a bottom up tree walk, + applying the type equations for the operator to its arguments. + It begins by initializing all leaf nodes with the most specific + known type for them as follows: + + 3.6.1 Types for leaf nodes: + + Variable Type + ---------------------- + var foo $t + + A type variable is the most specific type for a declaration + or function without any specified type + + var foo : t t + + If a type is specified, that type is taken for the + declaration. + + "asdf" byte[:] + + String literals are byte arrays. + + + 'a' char + + Char literals are of type 'char' + + true bool + false bool + + true/false are boolean literals + + 123 $t::(tcint,tcnum,tctest) + + Integer literals get a fresh type variable of type with + the constraints for int-like types. + + 123.1 $t::(tcfloat,tcnum,tctest) + + Float literals get a fresh type variable of type with + the constraints for float-like types. + + {a,b:t; } ($a,t -> $b) + + Function literals get the most specific type that can + be determined by their signature. + + + num-binop: + + + - * / % + += -= *= /= % + + Number binops require the constraint 'tcnum' for both the + + num-unary: + - + + Number binops require the constraint 'tcnum'. + + int-binop: + | & ^ << >> + |= &= ^= <<= >> + int-unary: + ~ ++ -- + + bool-binop: + || && == != + < <= > >= + + + 3.7. Packages and Uses: + + pkg use + + There are two keywords for module system. 'use' is the simpler + of the two, and has two cases: + + use syspkg + use "localfile" + + The unquoted form searches all system include paths for 'syspkg' + and imports it into the namespace. By convention, the namespace + defined by 'syspkg' is 'syspkg', and is unique and unmerged. This + is not enforced, however. Typical usage of unquoted names is to + import a library that already exists. + + The quoted form searches the local directory for "localpkg". By + convention, the package it imports does not match the name + "localpkg", but instead is used as partial of the definition of the + importer's package. This is a confusing description. + + A typical use of a quoted import is to allow splitting one package + into multiple files. In order to support this behavior, if a package + is defined in the current file, and a use statements imports a + package with the same namespace, the two namespaces are merged. + + The 'pkg' keyword allows you to define a (partial) package by + listing the symbols and types for export. For example, + + pkg mypkg = + type mytype + + const Myconst : int = 42 + const myfunc : (v : int -> bool) + ;; + + declares a package "mypkg", which defines three exports, "mytype", + "Myconst", and "myfunc". The definitions of the values may be + defined in the 'pkg' specification, but it is preferred to implement + them in the body of the code for readability. Scanning the export + list is desirable from a readability perspective. + +4. TOOLCHAIN: + + The toolchain used is inspired by the Plan 9 toolchain in name. There + is currently one compiler for x64, called '6m'. This compiler outputs + standard elf .o files, and supports these options: + + 6m [-h] [-o outfile] [-d[dbgopts]] inputs + -I path Add 'path' to use search path + -o Output to outfile + +5. EXAMPLES: + + 5.1. Hello World: + + use std + const main = { + std.put("Hello World!\n") + -> 0 + } + + TODO: DESCRIBE CONSTRUCTS. + + 5.2. Conditions + + use std + const intmax = {a, b + if a > b + -> a + else + -> b + ;; + } + + const main = { + var x = 123 + var y = 456 + std.put("The max of %i, %i is %i\n", x, y, max(x, y)) + } + + TODO: DESCRIBE CONSTRUCTS. + + 5.3. Looping + + use std + const innerprod = {a, b + var i + var sum + for i = 0; i < a.len; i++ + sum += a[i]*b[i] + ;; + } + + const main = { + std.put("The inner product is %i\n", innerprod([1,2,3], [4,5,6])) + } + + TODO: DESCRIBE CONSTRUCTS. + +6. STYLE GUIDE: + + 6.1. Brevity: + + Myrddin is a simple language which aims to strip away abstraction when + possible, and it is not well served by overly abstract or bulky code. + The code written should be a readable description of an algorithm, + aimed at conveying the essential operations in a linear and + straightforward fasion. + + Write for humans, not machines. Write linearly, so that an algorithm + can be understood with minimal function-chasing. + + 6.2. Naming: + + Names should be brief and evocative. A good name serves as a reminder + to what the function does. For functions, a single verb is ideal. For + local variables, a single character might suffice. Compact notation + is simpler to read, typographically. + + Variables names should describe the value contained, and function + names should describe the value returned. + + Good: spawn(myfunc) + Bad: create_new_thread_starting_at_function(myfunc) + + The identifiers used for constant values are put in Initialcase. + Functions and types are in singleword style, although underscores are + occasionally necessary to specify additional information within + functions, due to the lack of overloading. + + Good: + type mytype = int + var myvar : mytype + const Myconst = 42 + union + `Tagone int + ;; + + Bad: + type MyType = int /* types are 'singleword' */ + const my_func = {;...} /* function names should avoid _ */ + const myconst /* constants start with Uppercase */ + union + `sometag /* tags start with uppercase */ + ;; + + Acceptable: + const length_mm = {;...} /* '_' disambiguates returned values. */ + cosnt length_cm = {;...} + + 6.3. Collections: + + + +7. STANDARD LIBRARY: + +8. GRAMMAR: + +9. FUTURE DIRECTIONS: + +BUGS: + +[1] TODO: trait, default, protect, +[2] TODO: exponential notation. +[3] TODO: \uDDDD escape sequences not yet recognized +[4] TODO: currently the only sequence literal implemented is the + unindexed one + diff --git a/doc/mc.1 b/doc/mc.1 new file mode 100644 index 0000000..471d013 --- /dev/null +++ b/doc/mc.1 @@ -0,0 +1,84 @@ +.TH MC 1 +.SH NAME +6m +.SH SYNOPSIS +.B 6m +.I -[hioS] +.I [file...] +.br +.SH DESCRIPTION +.PP +The ?m family of compilers compile Myrddin source into object files +for the corresponding architecture. There is one compiler for each +architecture supported, with a unique name. By default, if the input +file is named +.I filename.myr +then the the object file that is generated will be named +.I filename.o. +If the filename does not end with the suffix +.I .myr +then the suffix +.I .o +will simply be appended to it. + +.PP +The following architectures are currently supported: +.TP +6m +x86-64 + +.PP +The compiler options are: + +.TP +.B -d [flTri] +Print debugging dumps. Additional options may be given to give more +debugging information for specific intermediate states of the compilation. + +.TP +.B -h +Print a summary of the available options. + +.TP +.B -I path +Add 'path' to the search path for unquoted use statments. This option +does not affect the search path for local usefiles, which are always +searched relative to the compiler's current working directory. Without +any options, the search path defaults to /usr/include/myr. + +.TP +.B -o output-file +Specify that the generated code should be placed in + +.TP +.B -S +Generate assembly code instead of an object file. + +.SH EXAMPLE +.EX + 6m foo.myr + 6m bar.myr + ld -o foobar foo.o bar.o +.EE + +.SH FILES +The source code for this compiler is available from +.B git://git.eigenstate.org/git/ori/mc.git + +.SH SEE ALSO +.IR muse(1) +.IR ld(1) +.IR as(1) + +.SH BUGS +.PP +The language is not yet complete, and the compilers often omit useful +constructs in spite of their desirability. +.PP +There are virtually no optimizations done, and the generated source is +often very poorly performing. +.PP +The current calling convention is stack-based and not register-based, even +on architectures where it should be register-based. +.PP +The calling convention is not compatible with C. diff --git a/doc/muse.1 b/doc/muse.1 new file mode 100644 index 0000000..d56d072 --- /dev/null +++ b/doc/muse.1 @@ -0,0 +1,95 @@ +.TH MUSE 1 +.SH NAME +muse +.SH SYNOPSIS +.B muse +.I -[hmidos] +.I [file...] +.br +.SH DESCRIPTION +.PP +The 'muse' tool takes as input a Myrddin source file and generates +a usefile from it. A usefile collects definitions exported from the +package specifications in Myrddin source code, and makes them available +for other programs to include with a 'use' statement. +.PP +It can also merge together a number of usefiles into one larger usefile +including all of the exported symbols. If an output file name is not given, +and we are not merging usefiles, then an input file named +.I filename.myr +will generate a usefile named +.I filename.use +\&. + +If the filename does not end with the suffix +.I .myr +then the suffix +.I .o +will simply be appended to it. + +.PP +The output of muse is architecture-independent. However, the format of the +generated file is not stable, and is not guaranteed to work across +different compiler versions. + +.PP +The muse options are: + +.TP +.B -d [flTri] +Print debugging dumps. Additional options may be given to give more +debugging information for specific intermediate states of the compilation. + +.TP +.B -h +Print a summary of the available options. + +.TP +.B -I path +Add 'path' to the search path for unquoted use statments. This option +does not affect the search path for local usefiles, which are always +searched relative to the compiler's current working directory. Without +any options, the search path defaults to /usr/include/myr. + +.TP +.B -o output-file +Specify that the generated usefile should be named +.I output-file + +.TP +.B -s +Print a summary of the symbols exported from the usefile that is specified. + +.SH EXAMPLE +.EX + muse foo.myr + muse -o bar.use bar-system-version.myr + muse -mo library foo.use bar.use +.EE + +.SH FILES +The source for muse is available from +.B git://git.eigenstate.org/git/ori/mc.git +and lives in the +.I muse/ +directory within the source tree. + +.SH SEE ALSO +.IR mc(1) +.IR ld(1) +.IR as(1) + +.SH BUGS +.PP +There is insufficient checking and validation done on usefiles. +.PP +The file format is in flux, and in current iterations, it is not at all compact. +.PP +There is no versioning on the usefiles as it stands. If you use the wrong +version with the wrong compiler, a mysterious error or even segfault is +likely. +.PP +This utility should not exist. Instead, the compiler should put the +exported symbol information into the object file or library directly. +.PP +The file format is not closed under concatentation. diff --git a/doc/myrbuild.1 b/doc/myrbuild.1 new file mode 100644 index 0000000..7eb2693 --- /dev/null +++ b/doc/myrbuild.1 @@ -0,0 +1,66 @@ +.TH MUSE 1 +.SH NAME +myrbuild +.SH SYNOPSIS +.B myrbuild +.I -[hblI] +.I [file...] +.br +.SH DESCRIPTION +.PP +The 'myrbuild' tool takes as input a list of Myrddin or assembly sources, +and compiles them in the correct dependency order into either a library or +an executable. If the source files are myrddin sources, then the appropriate +usefiles will be created as well. It expects Myrddin source to be in '.myr' +files. + +.PP +Myrbuild will default to building for the current architecture. + +.PP +The myrbuild options are: + +.TP +.B -h +Print a summary of the available options. + +.TP +.B -b name +Compile source into a binary named 'name'. If neither this option nor +the '-l' option are given, myrbuild will create a binary called 'a.out'. + +.TP +.B -l 'name' +Compile source given into a library called 'lib<name>.a', and a matching +usefile called 'name'. Only static libraries are currently supported. + +.TP +.B -I path +Add 'path' to the search path for unquoted use statments. This option +does not affect the search path for local usefiles, which are always +searched relative to the compiler's current working directory. Without +any options, the search path defaults to /usr/include/myr. + +.SH EXAMPLE +.EX + myrbuild -b foo foo.myr + myrbuild -l foo bar.myr baz.myr + muse -mo library foo.use bar.use +.EE + +.SH FILES +The source for muse is available from +.B git://git.eigenstate.org/git/ori/mc.git +and lives in the +.I myrbuild/ +directory within the source tree. + +.SH SEE ALSO +.IR mc(1) +.IR muse(1) +.IR ld(1) +.IR as(1) + +.SH BUGS +.PP +None known. |