1. 20

Cross-posting this “challenge” – I would like to see more examples of compilers that are understandable in a small number of lines.

  2. 8

    Best I have ever seen was Manfred von Thun’s joy0

    http://www.kevinalbrecht.com/code/joy-mirror/jp-joyjoy.html

    joy0  == 
        [ [ [ joy0          body            joy0     ] 
            [ []                                     ] 
            [ pop           pop             pop      ] 
            [ cons          pop             cons     ] 
            [ opcase        pop             opcase   ] 
            [ body          pop             body     ] 
            [ i             pop             joy0     ] 
            [ step          pop [joy0] cons step     ] 
            [               [] cons         i        ] ] 
          opcase 
          i ] 
        step
    
    1. 3

      I dunno, what about John McCarthy’s Lisp from 1959? :-)

      http://www.michaelnielsen.org/ddi/lisp-as-the-maxwells-equations-of-software/

      EDIT: Also I should note that this thread is not about 100-line compilers. It’s about compilers where the overall structure is described in 100 lines. My compiler is 8000 lines.

      1. 4

        McCarthy’s book was fantastic, an absolute eye-opener for me when I bought a copy way back in the 1980s.

        But joy0 beats Lisp for brevity. The heart of that brevity is that Joy overloads list concatenation with function composition.

        I’m still undecided whether that is a curious accidental shorthand or a fundamental advantage.

        Every time I reread http://www.nsl.com/papers/rewritejoy.html or http://www.kevinalbrecht.com/code/joy-mirror/j04alg.html I lean towards thinking it might be a fundamental advantage.
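
        To see why it might be, here is a rough model in Rust rather than Joy (the encoding is mine, nothing from von Thun): if a program is just a list of stack operations, then appending two program lists denotes composing the functions they compute.

        // Toy model of a concatenative language: a program is a list of
        // stack operations, and list concatenation acts as function composition.
        type Op = fn(Vec<i64>) -> Vec<i64>;

        fn run(program: &[Op], stack: Vec<i64>) -> Vec<i64> {
            program.iter().fold(stack, |s, op| op(s))
        }

        fn dup(mut s: Vec<i64>) -> Vec<i64> { let t = *s.last().unwrap(); s.push(t); s }
        fn mul(mut s: Vec<i64>) -> Vec<i64> { let a = s.pop().unwrap(); let b = s.pop().unwrap(); s.push(a * b); s }

        fn main() {
            let square: Vec<Op> = vec![dup, mul];                  // like Joy's [dup *]
            let quad: Vec<Op> = [square.clone(), square].concat(); // concatenating programs...
            assert_eq!(run(&quad, vec![2]), vec![16]);             // ...composes their functions
        }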

        1. 3

          re EDIT: Yeah, that should be in the original question. I totally thought you meant 100 LOC; it almost always means that. Your question is actually kind of more interesting now that you’re talking about 100 lines of structure or description. Different paradigms might come up with some interesting structures we don’t normally see.

      2. 4

        I have been working on a blog post about this topic. The compiler in Warren’s “Logic programming and compiler writing” is about 200 lines IIRC. I think it could be interesting to do more in this vein.

        1. 4

          EDIT: Oops, I think you misunderstood the challenge like a lot of other people in this thread. It’s not about a compiler that’s 100 lines. It’s about reading a 100-line function that shows the structure of an 8,000-line (or 80K-line, or 800K-line) compiler.

          If you click through to the post I think that’s clear, although I can understand why the title was confusing.

          You can barely write a tokenizer for Python, JavaScript, or C in 100 lines. The “production” tokenizers are 500-2000 lines by themselves.

          100-line compilers are interesting from a pedagogical standpoint, but they don’t pass the test of “Somebody who doesn’t care about compilers built something useful with this.”


          OK awesome, I’m looking forward to it! I wanted to do a blog post about it too, but I likely won’t have time before my trip this summer [1]. To do a good job I would want to look at more existing compilers. I want to hear someone else’s take and see more examples.

          Part of the theme of my blog is “theory vs practice”, and in theory everyone thinks of their compiler as a series of transformations, and that’s how you draw it on the blackboard.

          But in practice, you rarely see the source code reflect this!

          The structure that this compiler had was almost unbelievable until I refactored it. Multiple inheritance was used instead of if statements!!! There were objects that you had to call methods on in a certain order, with hidden state, rather than the state being explicit on the stack.

          [1] http://www.oilshell.org/blog/2018/03/25.html

          1. 2

            I am confused. Are you talking about a 100-line top-level function of a compiler? So you are looking for a sufficiently compact yet expressive piece of compiler poetry.

            1. 2

              > 100-line compilers are interesting from a pedagogical standpoint, but they don’t pass the test of “Somebody who doesn’t care about compilers built something useful with this.”

              Fennel’s compiler is exactly 100 lines, and I’d argue it’s genuinely useful: https://github.com/bakpakin/Fennel/blob/master/fennel.lua#L600

              It doesn’t do as much as most compilers, but that’s kind of the point. It does exactly one thing (bring a new, more flexible syntax to an existing runtime without any new semantics) but it does it very well.

              1. 1

                A lot of people are interpreting this question as 100 lines TOTAL, which is what I was referring to. Fennel appears to be more than that.

                But this structure is not what I’m looking for. I’m looking for the stages of the compiler all in one function – see my example. This compile1() function looks like a single stage that takes the AST as input.

                1. 1

                  > This compile1() function looks like a single stage that takes the AST as input.

                  In a lisp, the reader is exposed as its own independent first-class feature and isn’t considered part of the compiler. Many lisps have another representation between what the reader returns and the fully-annotated AST used by the compiler, but Fennel doesn’t.

          2. 3

            I’m working on a toy compiler in Rust right now, and making a very concise, understandable main compiler driver is one of my goals for it. One problem I’ve had is coming up with a concise data structure for a sequence of passes, where each pass takes input of one type and outputs data of a different type that is in turn the input type of the next stage. This isn’t something that the Rust type system supports formalizing easily, as far as I can tell, and since I want it to be possible to extend the compiler with arbitrary passes, I don’t want to just encode the concrete input and output types of each pass directly in the data structure.
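
            The closest I’ve come is type-erasing each pass behind std::any::Any (a sketch with entirely made-up pass names, not my real code): the pipeline becomes a uniform Vec of stages, and a type mismatch between adjacent passes becomes a runtime error instead of a compile error.

            use std::any::Any;

            // A stage maps one boxed intermediate representation to the next.
            type DynPass = Box<dyn Fn(Box<dyn Any>) -> Result<Box<dyn Any>, String>>;

            // Wrap a strongly typed pass `I -> O` as a type-erased stage.
            fn erase<I: Any, O: Any>(name: &'static str, f: impl Fn(I) -> O + 'static) -> DynPass {
                Box::new(move |input| {
                    let input = input
                        .downcast::<I>()
                        .map_err(|_| format!("pass `{}`: unexpected input type", name))?;
                    Ok(Box::new(f(*input)) as Box<dyn Any>)
                })
            }

            fn main() {
                // Hypothetical stages: source text -> token count -> report string.
                let passes: Vec<DynPass> = vec![
                    erase("tokenize", |src: String| src.split_whitespace().count()),
                    erase("report", |n: usize| format!("{} tokens", n)),
                ];
                let mut ir: Box<dyn Any> = Box::new(String::from("let x = 1"));
                for pass in &passes {
                    ir = pass(ir).expect("pipeline type error");
                }
                println!("{}", ir.downcast_ref::<String>().unwrap()); // prints "4 tokens"
            }

            The obvious downside is that the compiler no longer proves at compile time that adjacent passes line up, which is exactly the guarantee I was hoping to keep.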

            1. 1

              Hm but why isn’t the driver just a function with local variables? Why do you need a special data structure?

              Is it because you want it configurable at runtime rather than at compile time?

              That is, if you want to “extend the compiler with arbitrary passes”, you just insert a couple lines of code in the 100-line function. Or are you saying this won’t support turning passes on and off with command line flags?

              I think it will at least for passes that are truly optional. If a pass is optional, then its input and output types should be the same.
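
              Concretely, the shape I have in mind is something like this minimal sketch (all the stage names are made up, not from any real compiler); note how the optional pass is IR -> IR, so the flag can skip it without disturbing anything downstream:

              struct Opts { optimize: bool }

              fn tokenize(src: &str) -> Vec<String> {
                  src.split_whitespace().map(str::to_string).collect()
              }
              fn parse(tokens: Vec<String>) -> Vec<String> { tokens }  // stand-in "AST"
              fn lower(ast: Vec<String>) -> Vec<String> { ast }        // stand-in "IR"
              fn optimize(ir: Vec<String>) -> Vec<String> {            // optional: IR -> IR
                  ir.into_iter().filter(|op| op.as_str() != "nop").collect()
              }
              fn emit(ir: Vec<String>) -> String { ir.join(";") }

              // The whole compiler, readable top to bottom: every stage is a
              // local variable, and optional passes are gated by flags.
              fn compile(src: &str, opts: &Opts) -> String {
                  let tokens = tokenize(src);
                  let ast = parse(tokens);
                  let ir = lower(ast);
                  let ir = if opts.optimize { optimize(ir) } else { ir };
                  emit(ir)
              }

              fn main() {
                  println!("{}", compile("push nop add", &Opts { optimize: true })); // push;add
              }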

            2. 3

              How about fewer than 30 lines?

              .SYNTAX PROGRAM
              
              OUT1 = '*1' .OUT('GN1')
                   / '*2' .OUT('GN2')
                   / '*' .OUT('CI')
                   / .STRING .OUT('CL '*)
                   .,
              
              OUTPUT = ('.OUT' '(' $OUT1 ')' / '.LABEL' .OUT('LB') OUT1) .OUT('OUT') .,
              
              EX3 = .ID .OUT('CLL '*)
                  / .STRING .OUT('TST '*)
                  / '.ID' .OUT('ID')
                  / '.NUMBER' .OUT('NUM')
                  / '.STRING' .OUT('SR')
                  / '(' EX1 ')'
                  / '.EMPTY' .OUT('SET')
                  / '$' .LABEL *1 EX3 .OUT('BT ' *1) .OUT('SET')
                  .,
              
              EX2 = (EX3 .OUT('BF ' *1) / OUTPUT) $(EX3 .OUT('BE') / OUTPUT) .LABEL *1 .,
              
              EX1 = EX2 $('/' .OUT('BT ' *1) EX2 ) .LABEL *1 .,
              
              ST = .ID .LABEL * '=' EX1 '.,' .OUT('R') .,
              
              PROGRAM = '.SYNTAX' .ID .OUT('ADR ' *) $ ST '.END' .OUT('END').,
              
              .END
              

              http://www.ibm-1401.info/Meta-II-schorre.pdf

              1. 2

                And my implementation of a bytecode runtime and assembler for META comes in under 500 lines:

                % wc -l meta.c meta.h metas.c
                     189 meta.c
                      61 meta.h
                     225 metas.c
                     475 total
                
                1. 2

                  Heh. Mine came in at exactly 500 lines in a single C file.

                2. 2

                  You beat me to it, haha. If he can use a scripting language, then other contenders should be able to use meta-languages, since they’re actually easier to implement than even Python. They, plus the source for Python, might be easier to implement than Python itself in C.

                  1. 1

                    Hm what language does this recognize? The META II language itself?

                    Is that a Turing complete language or a syntax description language?

                    Technically that is a compiler, but I’m not sure it generalizes to say a language like Python or C.

                    I have seen it before in posts on PEGs, but I don’t quite get what it does and how it works. PEGs also have metagrammar – a PEG that describes the syntax of PEGs. Is that what this is?

                    Also this paper seems somewhat related: http://www.vpri.org/pdf/tr2010003_PEG.pdf

                    Although to be honest I am not as interested in it because of the Scheme “restriction”.


                    EDIT: Also I should note that this thread is not about 100-line compilers. It’s about compilers where the overall structure is described in 100 lines. My compiler is 8000 lines.

                    1. 3

                      > Hm what language does this recognize? The META II language itself?

                      yes

                      > Is that a Turing complete language or a syntax description language?

                      yes

                      > Technically that is a compiler

                      The best kind of compiler ;)

                      > generalizes to say a language like Python or C.

                      Schorre’s paper also includes an Algol-like language recognized by the META compiler.

                      1. 2

                        After a brief look at the paper you linked, yeah there’s a direct lineage from Schorre’s work to Viewpoints, although META predates PEGs by about 40 years. My link is citation #9 in yours.

                    2. 3

                      I don’t know if this page of the rustc guide fits what you’re talking about or not https://rust-lang-nursery.github.io/rustc-guide/high-level-overview.html

                      Otherwise, consider this post the opposite: we have a whole book these days!

                      1. 2

                        Right that’s describing the information I want in words – but I want the code to look like the words! The code should be a high level outline of the compiler, IMO (links in the Reddit post).

                        I mentioned I would like to write a blog post about this, which would require looking at open source compilers, and rustc would be a good one. If it has the “skeleton” or “driver” in one place, I’d love to see it.

                        I did look at the old OCaml Rust compiler, and as far as I remember it had fe/, me/, be/, but I don’t know if it glued the whole thing together in 100 lines.

                        I do recall that Clang has a “pass manager”, although that might just be for the codegen portions.

                        1. 2

                          Yeah, totally.

                          The call chain is basically main → rustc_driver::main, so https://github.com/rust-lang/rust/blob/master/src/librustc_driver/ is the driver. It’s not super structured in that way, but that’s where you’d look.

                          1. 1

                            Thanks! This is what I was looking for:

                            https://github.com/rust-lang/rust/blob/master/src/librustc_driver/driver.rs

                            It’s not a single function, but it’s all in a single file, which is better than most. The big list of imports at the top is the giveaway. You can see the internal structure of the compiler; it’s not hidden behind a big compile() function or module as in many other examples.

                            One “wart” is that you don’t see the tokenizer; it appears to go directly from the Input string/file to parsing. But otherwise it looks like it shows all the parts in one place.

                        1. 2

                          Also http://andykeep.com/pubs/dissertation.pdf

                          I’ve been using this for my senior project for compilers, and it’s been quite nice. I really like how it does verification of your IR at each pass (really helps debugging), and how declarative each transformation is.

                        2. 2

                          Well, I think you could say there’s a top-most 100 lines of code in every compiler, somewhere. I do think some are better than others, but I’d guess familiarity has something to do with it.

                          Are we counting LISP interpreters? I especially like 1) kanaka’s Make a Lisp (written in many languages, like CoffeeScript), 2) the Elm compiler, which is in Haskell, and, why not, 3) the SpiderMonkey bytecode compiler, which is in C++, so it’s not going to be 99 LOC, but give it some respect.

                          1. 1

                            No, as I said, if it has a pyramid structure, and not a flat structure, it doesn’t count. The pyramid means you have to search around the code to see what the compiler does; flat means you can see it all at once.

                            (And yes I made up the rules to this game according to my own preferences.)
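
                            To illustrate with invented toy stages (a sketch, not any real compiler):

                            // "Pyramid": the top level tells you nothing, and each layer
                            // hides the next, so you have to drill down to find the passes.
                            fn compile(src: &str) -> String {
                                frontend(src)
                            }
                            fn frontend(src: &str) -> String {
                                backend(src.trim().to_string())
                            }
                            fn backend(ir: String) -> String {
                                ir.to_uppercase()
                            }

                            // "Flat": the same work, but the top-level function reads as a
                            // table of contents for the whole compiler.
                            fn compile_flat(src: &str) -> String {
                                let ir = src.trim().to_string(); // front end
                                ir.to_uppercase()                // back end
                            }

                            fn main() {
                                assert_eq!(compile(" add "), compile_flat(" add "));
                                println!("{}", compile_flat(" add ")); // ADD
                            }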

                          2. 2

                            I’m a bit confused by what you mean by the structure of the compiler. From the source you’ve linked to, I’d offer this, which is 18 lines of Python.

                            To go from text to Python AST node,

                            1. Match the extended boot grammar against the boot tree to make the extended boot tree.
                            2. Match the Python grammar against the extended boot tree to make the Python (parse) tree.
                            3. Match the text against the Python parse tree to get the Python AST for that text.

                            (It’s intended to show that all three steps are “symmetric” in some sense.)

                            Or if it has to compile to some kind of bytecode instead of a Python AST, then the last 5 lines of this:

                            t1 = list(simple_wrap_tree(boot_tree.tree))
                            t2 = match(t1, boot_grammar.bootstrap + boot_grammar.extra + boot_grammar.diff)
                            lang_tree = match(t2, grammar + boot_grammar.extra)
                            for filename in sys.argv[1:]:
                                write_suite(to_forth(simplify(parse(open(filename).read()))))
                            

                            But I guess it’s really just that last line: write_suite(to_forth(simplify(parse(open(filename).read())))).

                            1. Transform the text into a tree using parse
                            2. Simplify the tree from 1
                            3. Turn each node into corresponding bytecode
                            4. Write the resulting bytecode as text

                            It’s all PEG-based and (eventually) traces back to the META II that’s already been posted.

                            Hope this helps!

                            1. 1

                              Hm I think this counts, although it’s a pretty unusual interpreter structure!

                              FWIW OPy has a grammar interpreter (pgen2), but not a meta grammar. The grammar is parsed with Python code, using the same tokenizer as Python.

                              I’d be interested to read about the Python-to-forth conversion!

                              1. 2

                                > FWIW OPy has a grammar interpreter (pgen2), but not a meta grammar. The grammar is parsed with Python code, using the same tokenizer as Python.

                                Interesting.

                                > I’d be interested to read about the Python-to-forth conversion!

                                I actually don’t talk about that part much because

                                1. it’s intended to be translated from Python to Flpc, and the Flpc version would be the one worth documenting and improving on
                                2. it’s (in my opinion) an ugly handwritten mess. I think each step is only one or two functions calling each other recursively, but it could probably be something more uniform. In some places globals are used to track things. Everything but the first step (parse) is included in that file (300 LoC total).

                                I know I can rewrite it in the PEG/OMeta way. Then it would just be one or two matches for each step, plus the “boot” match (so something like 8-9 calls to match total, with the right grammar as input in each case, of course). I have a half-baked Python-to-C compiler done that way (and a working Python pretty-printer (= Python-to-Python compiler) that just calls match multiple times).

                                Feel free to ask if you have anything specific about the FlpcPython to FlpcForth conversion though. I might still write about it eventually.

                            2. 2

                              I don’t know if this counts, but in case it does, what about Binary Lambda Calculus, which won the 2012 IOCCC category Most Functional (there are a few really good examples on that page)?

                              This Haskell one used to be found on Wikipedia:

                              import System.Environment(getArgs) 
                              import Control.Monad.State 
                              import Control.Monad.Writer
                              import Control.Applicative hiding ((<|>),many)
                              import Text.ParserCombinators.Parsec 
                              
                              putc = do ( _, _,b, _) <- get; tell [b]
                              getc = do ( left, right,_,b:input) <- get; put ( left, right, b,input)
                              prev = do (l:left, right,b, input) <- get; put ( left,b:right, l,input)
                              next = do ( left,r:right,b, input) <- get; put (b:left, right, r,input) 
                              decr = do ( left, right,b, input) <- get; put ( left, right,pred b,input) 
                              incr = do ( left, right,b, input) <- get; put ( left, right,succ b,input)
                              loop body = do (_,_,b,_) <- get; when (b /= '\0') (body >> loop body) 
                              parseInstr = liftM loop (between (char '[') (char ']') parseInstrs) 
                                  <|> prev <$ char '<' 
                                  <|> next <$ char '>' 
                                  <|> decr <$ char '-' 
                                  <|> incr <$ char '+' 
                                  <|> putc <$ char '.' 
                                  <|> getc <$ char ',' 
                                  <|> return () <$ noneOf "]"
                              parseInstrs = liftM sequence_ (many parseInstr) 
                              main = do [name] <- getArgs
                                        source <- readFile name
                                        input <- getContents
                                        let init = ("", repeat '\0', '\0', input)
                                        putStrLn $ either show (execWriter . (`evalStateT` init)) (parse parseInstrs name source)
                              

                              which I guess could be written more nicely, maybe in a more pseudocode fashion, and then still be understood in under 100 lines.

                              Nope: I was on my phone and didn’t read the code. This is a translation of the BLC code for brainf**k, not BLC itself.

                              1. 2

                                I was playing around with a toy language I made in Rust. Here’s the program entry function, 50 lines. How readable is it?

                                fn run_program<P: AsRef<Path>, Q: AsRef<Path>>(
                                    path: P,
                                    dump: bool,
                                    optimize: bool,
                                    compile_only: bool,
                                    search_dirs: &[Q],
                                ) -> Result<()> {
                                    let mut ast_tree = ImportAST::new();
                                    ast_tree.import_ast(search_dirs, path.as_ref())?;
                                    let ir_compiler = CompileIR::new(ast_tree);
                                    let compiler = CompileBytes::new(ir_compiler.compile()?);
                                    let fun_table = {
                                        let fun_table = compiler.compile().chain_err(|| "Compile error")?;
                                        // run optimizations
                                        if optimize {
                                            // TODO : Optimization groups
                                            OptimizePipeline::new(fun_table).optimize()
                                        } else {
                                            fun_table
                                        }
                                    };
                                    if dump {
                                        for f in fun_table.iter().filter_map(
                                            |(_, f)| if let &internal::Fun::UserFun(ref f) =
                                                f as &sbl::bc::Fun
                                            {
                                                Some(f)
                                            } else {
                                                None
                                            },
                                        )
                                        {
                                            eprintln!("- {} {}", &f.name, "-".repeat(69 - f.name.len()));
                                            f.dump();
                                        }
                                    }
                                    if !compile_only {
                                        let mut vm = VM::new(fun_table);
                                        let res = vm.run();
                                        // Dump the VM state on error if we're dumping code as well
                                        if res.is_err() && dump {
                                            eprintln!("- Begin VM state ---------------------------------------------------------------");
                                            vm.dump_state();
                                            eprintln!("--End VM state -----------------------------------------------------------------");
                                        }
                                        res
                                    } else {
                                        Ok(())
                                    }
                                }
                                
                                1. 2

                                  That looks close. I guess it depends on whether compile() has “passes”. What I’m saying is that I would like all the passes “pulled” to the top level, in a flat structure, so I can just read them from top to bottom. That’s of course my preference, and you can organize your compiler how you want :-)

                                  1. 1

                                    It’s been a while since I’ve looked at this project, but I do believe compile() is a single “pass”, depending on your definition. Things get a little hairy in the AST stage because I designed the language so that functions and imports are order-agnostic, so there are probably multiple passes when that gets built in the beginning.

                                    The more I think about it, another feature the language has, called “bake” blocks, allows for code to be run and return values before the actual main function gets run. Without getting too in-depth on the details, this means that the final compilation may depend on previous compilations. And of course, bake blocks may have more bake blocks embedded inside of them (because hey, if it’s worth doing, it’s worth overdoing, right?) so those pre-compilation stages depend on pre-pre-compilation stages. So that final compile() call may be more complex than I originally thought.
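
                                    To give a flavor of why that gets recursive, here is a toy sketch with invented syntax (a sums-only “language”, nothing like my real one): the compiler has to find the innermost bake block, compile and run it, splice the value back into the source, and repeat.

                                    // Toy staged compilation: `bake{...}` is evaluated at
                                    // compile time and its value spliced into the source.
                                    // The "language" is just sums of integers, e.g. "1+2".
                                    fn eval_sum(src: &str) -> i64 {
                                        src.split('+').map(|t| t.trim().parse::<i64>().unwrap()).sum()
                                    }

                                    fn compile(src: &str) -> String {
                                        // The *last* "bake{" is innermost, so it has no bake inside it.
                                        if let Some(start) = src.rfind("bake{") {
                                            let body_start = start + "bake{".len();
                                            let end = body_start + src[body_start..].find('}').expect("unclosed bake");
                                            // Pre-compile, then *run* the inner block before the
                                            // outer code can be compiled at all.
                                            let value = eval_sum(&compile(&src[body_start..end]));
                                            let spliced = format!("{}{}{}", &src[..start], value, &src[end + 1..]);
                                            compile(&spliced)
                                        } else {
                                            src.to_string()
                                        }
                                    }

                                    fn main() {
                                        // The inner bake runs first, then the outer one.
                                        assert_eq!(compile("1 + bake{2 + bake{3+4}}"), "1 + 9");
                                    }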

                                2. 2

                                  SECL may only be an interpreter, but it might be an interesting entry too.

                                  The basic structure is a pipeline from the lexer through several phases in the parser, each of which is rather simple. The core is about 600 lines, most of which is spent on definitions and struct functions, mostly in the lexer and Phase 1 of the parser.

                                  I don’t think any part of it is particularly complex; each phase is designed to be simple to understand (at least by my measure), since I wanted to keep the entire construct simple.

                                  1. 1

                                    Is it cheating if they use a compiler-oriented 4GL like a term-rewriting language? ;)