I’ll add another recommendation into the mix: The PL Zoo. I legitimately think that building compilers / interpreters in OCaml is substantially easier than in other languages, and the implementations in this repo are very small and manageable as learning references.
EDIT: Changed “compilers” -> “compilers / interpreters”
I love Andrej’s code! I should update the article to include this. It doesn’t precisely fit as it’s not a tutorial, but his examples are just beautifully written. I also like ML languages, although I can’t say that Racket is any worse. I think one might look at the “poetry” of Andrej’s code (only what is necessary, no more and no less) and draw conclusions about ML which are perhaps more due to the author.
(warning: I was going to write a blog post about several of these topics, but instead it turned into this comment :-/)
I would kinda split the difference on OCaml for programming languages. I went down that path in 2014, and definitely learned some things, but I’m no longer that interested in using OCaml [1]. It seems like people who have written a lot of compiler code, like @matklad, are not super interested in OCaml either.
From a couple months ago: TypeScript is Surprisingly OK for Compilers
I picked up on this experiment, and tried it. At first, I thought TypeScript was a bit clunky and verbose, but then I started to appreciate it.
First, the integrated Deno toolchain did help me. (I’ve never used TypeScript before, but I have used JavaScript, and the experience was pretty easy/obvious.)
More interestingly, I noticed that TypeScript supports the same expressive static type relationships that I’ve been using with Zephyr ASDL in Oils.
Namely, product types can be used as variants, without wrapping. Some elaboration here, with an example related to location info:
https://old.reddit.com/r/ProgrammingLanguages/comments/15aie5u/bitstealing_made_legal_compilation_for_custom/
Follow-up yesterday about concrete representations, and dynamic tags: https://lobste.rs/s/suafzq/bit_stealing_made_legal_compilation_for#c_wjbr2z
I believe you can express this with GADTs in OCaml, but people consistently warn against the downsides of GADTs too: https://lobste.rs/s/gkxg9p/leaving_haskell_behind#c_r0juu3
You shouldn’t need an advanced feature to describe basic type relationships.
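Here’s a hypothetical TypeScript sketch of that point (the names are illustrative, not taken from any real codebase): each node is a plain product type carrying its own location info, and the unions just list those types, so an evaluator can return literal nodes without unwrapping or rewrapping:

```typescript
// Product types used directly as variants -- no constructor wrappers.
// Each node carries its own location, so errors can always point somewhere.
interface Bool { tag: 'Bool'; b: boolean; loc: number }
interface Num  { tag: 'Num';  n: number;  loc: number }
interface Name { tag: 'Name'; name: string; loc: number }

type Expr  = Bool | Num | Name;   // ...plus If, Binary, etc. in a real front end
type Value = Bool | Num;          // the same product types, reused without wrapping

function evalExpr(e: Expr, env: Map<string, Value>): Value {
  switch (e.tag) {
    case 'Bool': return e;        // no unwrap/rewrap, and loc travels for free
    case 'Num':  return e;
    case 'Name': {
      const v = env.get(e.name);
      if (v === undefined) throw new Error(`${e.loc}: undefined name ${e.name}`);
      return v;
    }
  }
}
```

Because `Value` is a subset of `Expr`, a literal is already a value; there is no `MInt`/`MBool` wrapping step at all.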
As a specific example, I now view this as a “smell” in the MiniML code:

type expr = ...
  ...
  | Int of int          (* Non-negative integer constant *)
  | Bool of bool        (* Boolean constant *)

(** Machine values. *)
type mvalue =
  | MInt of int         (** Integer *)
  | MBool of bool       (** Boolean value *)
That is, Int and Bool should be product types with location info. When you start writing bigger languages, you will run into this issue a lot, along with many others. (I also learned the hard way that wrapping causes GC pressure.)
For example I wrote this tiny Lisp-y front end with location info in both the type checker and the evaluator:

(if true 42 false)
 ^^
yaks:1:2: Type error: if branches must have same type: got Num and Bool

(/ 42 0)
 ^
yaks:1:2: Runtime error: Divide by zero
The schema I use is Bool and Num as top-level types that stand alone, and then these dirt-simple union types:

export type Expr = Bool | Num | Name | If | Binary | Error;
export type Value = Bool | Num;
Also:

// [Expr] -> Type Checker -> Map<Expr, Type>
export type Type = 'Bool' | 'Num';
https://github.com/oilshell/yaks/blob/main/header.ts
I wanted to turn this code into a “tutorial”, but it’s not there yet.
It doesn’t do very much. But it is actually solid code in that I precisely specify the errors at each stage.
This is a good way to think about languages – what strings/programs are you rejecting at each stage? You can literally count the errors. The idea is to sort of “distill” the Oils architecture, but it’s not done.
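As a hypothetical TypeScript sketch of that idea (the names are mine, not from yaks): the checker fills in a Map from nodes to types and pushes every error it finds into a list, rather than throwing at the first one, so the errors rejected at this stage can literally be counted:

```typescript
interface Num  { tag: 'Num';  n: number;  loc: number }
interface Bool { tag: 'Bool'; b: boolean; loc: number }
interface If   { tag: 'If'; cond: Expr; then: Expr; else_: Expr; loc: number }
type Expr = Num | Bool | If;
type Type = 'Bool' | 'Num';

interface TypeErr { loc: number; msg: string }

// Returns the node's type; records a type for every node and collects
// *all* errors instead of stopping at the first one.
function check(e: Expr, types: Map<Expr, Type>, errors: TypeErr[]): Type {
  switch (e.tag) {
    case 'Num':  types.set(e, 'Num');  return 'Num';
    case 'Bool': types.set(e, 'Bool'); return 'Bool';
    case 'If': {
      const tc = check(e.cond, types, errors);
      if (tc !== 'Bool') errors.push({ loc: e.cond.loc, msg: 'if condition must be Bool' });
      const t1 = check(e.then, types, errors);
      const t2 = check(e.else_, types, errors);
      if (t1 !== t2) {
        errors.push({ loc: e.loc, msg: `if branches must have same type: got ${t1} and ${t2}` });
      }
      types.set(e, t1);
      return t1;
    }
  }
}
```

Run on the `(if true 42 false)` example above, this records types for all four nodes and collects exactly one error.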
So yeah I’m making a pretty bald assertion – basic OCaml or Standard ML have too weak a type system to naturally express languages.
TypeScript looks uglier and less concise, but it does better.
Zephyr ASDL with the Oils improvements also does better. As I wrote yesterday:

in general we have indirection between the dynamic tag/discriminant, and the static type. I’m beginning to view it as a mistake when any language makes a 1:1 correspondence, because varying them independently is so useful.

Basically we can encode surprisingly rich static type relationships in ASDL now, which compiles down to C++ multiple inheritance, without inheritance of fields. It feels like hidden static expressiveness in C++, which maybe is not too surprising, since the type system is Turing-complete.
What I don’t like about TypeScript: it doesn’t use types to compile to fast code. It’s not fast enough to express the TypeScript compiler itself and tools like bundlers, which is why people are writing esbuild in Go and other tools in Rust.
Same thing with Python tools – they are too slow because they’re written in Python (e.g. the linter and formatter we use; MyPy is sped up with mypyc).
Hence the long (but fun) diversion with Oils and mycpp – I learned the hard way that we need C++ speed, but I’m not really willing to write and maintain 200K-300K lines of C++. (Rust is definitely better, and arguably has a “monopoly” on this space, but I’m still skeptical of dealing with 200K-300K lines of Rust.)
[1] Specifically I went through Real World OCaml, since Yaron Minsky had written some advocacy for OCaml back then
OCaml for the Masses (2011)
Also this post was going around: Why ML/OCaml are good for writing compilers (1998)
It’s actually funny to read the end of that post, with some hindsight:

Is ML the perfect language? Lord, no. It has many flaws, like every other language. The syntax is weird, some parts of it are hard to learn, some parts are hard to use. A simple thing like Printf can be quite strange to express.
printf still bugs me !!! Serialization is still hard in statically typed languages!
I took this post to be about learning how to implement programming languages. It sounds like you’re more focused on compiler engineering, which is its own thing, and I wouldn’t recommend the PL Zoo for that.
For learning though, I still stand by my recommendation of OCaml, and doubly so for using parser-generators. PLs are difficult enough as it is, when learning I prefer to have the quickest path to an end-to-end product as possible, and then build out from there.
Hm well I don’t really draw that big a distinction – implementing requires learning, and learning requires implementing.
It’s a big circle … I like to have things I can play with, do “unexpected” things with, and that motivates more learning and theory
As I mentioned, I do think the metalanguage “infects” the language more than most people think, so:
The main issue with ML lex and yacc is that lexing isn’t regular, and parsing isn’t context-free. The better framing is non-recursive and recursive:
https://lobste.rs/s/ndkycy/why_split_lexing_parsing_into_two#c_fbklgz
This is a perfect example of the metalanguage infecting the language. If you use ML lex/yacc, you can only “think certain thoughts”, but those thoughts don’t match how real languages work, and what people want.
Oils uses regular languages and grammars as much as possible, but it’s far from a complete story. It’s a tool and not a framework.
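To make “lexing isn’t regular” concrete, here’s a hypothetical TypeScript sketch (the toy syntax is mine): string interpolation like `"a${x}b"` forces the lexer to keep a stack of modes, because strings and code can nest arbitrarily deep, and a finite automaton (i.e., a regular language) cannot track unbounded nesting.

```typescript
type Mode = 'Code' | 'String';

// A mode-stack lexer for a toy language with nestable string
// interpolation. The stack is exactly what makes this strictly more
// powerful than a regular (finite-state) lexer.
function lex(src: string): string[] {
  const tokens: string[] = [];
  const modes: Mode[] = ['Code'];
  let i = 0;
  while (i < src.length) {
    const mode = modes[modes.length - 1];
    const c = src[i];
    if (mode === 'Code') {
      if (c === '"') { modes.push('String'); tokens.push('STR_START'); i++; }
      else if (c === '}' && modes.length > 1) { modes.pop(); tokens.push('INTERP_END'); i++; }
      else if (c === ' ') { i++; }                      // skip whitespace
      else { tokens.push('ID:' + c); i++; }             // 1-char "identifiers" keep this tiny
    } else {                                            // String mode
      if (c === '"') { modes.pop(); tokens.push('STR_END'); i++; }
      else if (c === '$' && src[i + 1] === '{') { modes.push('Code'); tokens.push('INTERP_START'); i += 2; }
      else { tokens.push('CHAR:' + c); i++; }
    }
  }
  return tokens;
}
```

Note that a string can contain code that contains another string, to any depth; that’s the part no single-mode regular lexer can express.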
Just like anything else, anyone who is determined to make programming languages (orthogonal to compilers) will almost certainly make more than one language, and will definitely use more than a few dozen languages and be inspired by those experiences. The only point I have here is that for someone really sure they will make programming languages, a detour into OCaml land for a while isn’t straying too far from their path.

I’m torn on OCaml. It’s pretty great for many kinds of quick toy languages once you get the hang of it, and toy languages are great for learning language design (again, not the same as compiler design/implementation) for someone trying to grow skills and experience in syntax and semantics. You are 100% right that there are things that will be harder specifically in OCaml.

With dt I’m coming to a similar conclusion on retaining source locations, although I haven’t decided exactly what I want to do long-term for things like tracing, debugging, and more helpful error logging.

Really been enjoying Zig for its precision (types), performance, and hackability (comptime and anytype duck-typing).

Drive-by comments!

The other point I was going to make about the PLZoo examples is that I think the use of ml lex and yacc puts off beginners.
There is a meme that many people learning compilers get “stuck” on lexing/parsing, and never get to the later parts, I think that’s true. (I actually think you see the same phenomenon in the Dragon Book itself – the parts on type checking and GC are weak.)
I use a parser generator all the time, and I still think the “no magic” parsing approach of Crafting Interpreters is better for beginners.
A related thing I’ve noticed is that the metalanguage heavily “infects” people’s understanding of the language they’re implementing.
I actually think Crafting Interpreters is genius in that respect – it has complete implementations of the same language in both Java and C, with different architectures.
In Oils, after a bunch of experience, I likewise think it’s a big benefit to have working implementations in both Python and C++. It really does make it like an executable spec: it’s divorced from the metalanguage; it’s more real.
On the subject of parser generators, I am torn. On one hand, I understand the inclination that many developers have to learn everything from first principles. That’s how I learned and there’s a lot of value in knowing the theory of parsing.
On the other hand, learning how to code a lexer, for example, especially as someone new to PLs, can make them miss the forest for the trees. The code that you are writing is not the specification. The EBNF is what’s important, and any deviation from it is a mistake. If you just hack together a lexer (perhaps skipping EBNF or any other formal specification altogether) you might write something which seems to work but deviates from the spec in certain conditions.
So I am fine with people writing parsers from scratch so long as they realize that formally specifying the grammars is not something which you can skip. It’s the only way you can be sure that your PL will actually work with every program which someone might throw at it.
Yeah I would basically say go for both, as mentioned in this comment - https://lobste.rs/s/tpe028/on_learning_compilers_creating#c_r2yw79
Just like Crafting Interpreters does – you have to learn something in two different ways to really understand it. Starting with OCaml is perfectly valid, but it definitely won’t be the end of the journey!
Also huge thread from 3 years ago about which parsing approach - https://lobste.rs/s/9pcqys/which_parsing_approach
I think the tl;dr is to learn both grammars and hand-written parsers :) Use grammars to check your design (or parts of it).
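As a hypothetical illustration of “learn both” (toy grammar and names are mine): write the grammar down first, then hand-write a recursive-descent parser whose functions mirror the grammar rule-for-rule, so the grammar checks the design while the code stays fully under your control:

```typescript
// Grammar being implemented, one function per rule:
//
//   expr ::= term (('+' | '-') term)*
//   term ::= NUMBER | '(' expr ')'
//
// For brevity this evaluates as it parses, over a pre-split token list.
function parseExpr(toks: string[], pos: { i: number }): number {
  let v = parseTerm(toks, pos);
  while (toks[pos.i] === '+' || toks[pos.i] === '-') {
    const op = toks[pos.i++];
    const rhs = parseTerm(toks, pos);
    v = op === '+' ? v + rhs : v - rhs;
  }
  return v;
}

function parseTerm(toks: string[], pos: { i: number }): number {
  if (toks[pos.i] === '(') {
    pos.i++;                               // consume '('
    const v = parseExpr(toks, pos);
    if (toks[pos.i++] !== ')') throw new Error('expected )');
    return v;
  }
  return Number(toks[pos.i++]);            // NUMBER
}
```

The point is the shape: each production becomes one function, so a deviation from the grammar is visible as a deviation in the code.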
The EBNF is what’s important, and any deviation from that is a mistake

But I would also say this is wrong. Same comment as above – grammars are a tool, not a framework. It’s a confusion of metalanguage and language. It’s confusing models and reality.
They can limit the thoughts you can think. Parsing isn’t context-free; lexing isn’t regular.
Additional data point: neither OCaml nor Haskell has a context-free grammar or a regular lexer.
My advice: you don’t need to build up a portfolio of projects before contributing to your favorite existing programming language.
Chances are exceedingly high that if you open up the issue tracker for your favorite compiler/interpreter/type checker/etc., there will be plenty of good first issues to ease you into the codebase, and you can learn from there.
Like the article points out, there are plenty of resources you can dig into if you really want to dive deep on language or compiler design. But you absolutely do not need to read three textbooks and have built a compiler from scratch before you’re ready to debug a good first issue on an existing compiler.
Pick an issue and have fun!
I agree. I felt like the list of books was a digression from the main point of the article, but it’s also the path I took. I think there are probably two separate articles mashed together here. One is “don’t do ad-hoc language design while writing your first compiler,” the second is some resources you can use if you want to learn more. I think I probably would have done better to make those two separate articles.
Excellent advice! That’s also how I got into hacking on CHICKEN. I started as a user and noticed things that didn’t work or could stand improvement and I just dove in.
Nothing gives you laser focus like working on fixing a specific problem. This goes for any code base, not just compilers. If you have a bug to fix, you are forced to learn at least something about how the subsystems work that you will need to modify, and you’ll usually also glean a bit about adjacent subsystems.
This is also useful when onboarding technical team members in a business setting. You don’t handwave and vaguely tell them to “just get familiar with the code” - you assign them somewhat easy tasks so that they have to get familiar with some parts.
This is a great resource! Excited to dive in.
I’d also recommend Thorsten Ball’s Writing An Interpreter In Go and its sequel, Writing a Compiler in Go. I found them exceedingly approachable and great in any language (I did mine in TypeScript).
How does it compare to Crafting Interpreters?
I imagine they’re similar.
I haven’t read Crafting Interpreters yet, but I was less enthusiastic about everything being written in Java and C++. The diagrams are stunning though and I’ll get to it at some point!
The second interpreter is written in C, not C++.
Ah, I misremembered. Been a minute since I’ve looked at it.
Community is a huge part of learning anything, so if you’re wanting to talk with others who have experience making languages I can’t recommend enough the r/ProgrammingLanguages and their Discord server. Lots of talented developers there!
Just in case anyone is interested in a community built around “Learning Compilers and Creating Programming Languages”, such a thing already exists and is meeting TOMORROW (or today depending where you are, or yesterday, if you got here late, etc):
–
The next PLTea is Wednesday, 25 October 2pm Eastern time (US) Zoom link: https://cornell.zoom.us/meeting/register/tJEvf--qqDIjHtHQqfoki5HsUzCu2EArQ8jm
PLTeaM https://pltea.github.io/ https://twitter.com/PLteaforplt https://mastodon.social/@PLTea
Calendar: https://calendar.google.com/calendar/u/0?cid=NTcxNzFjOWY2OTkzNzgwZTE0MzJiODJjZmRlYjA3YTZiMDRlM2M2ZmRiNmUzNzQ2MTZlMTU1N2Q2MjliY2NjN0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t https://calendar.google.com/calendar/ical/57171c9f6993780e1432b82cfdeb07a6b04e3c6fdb6e374616e1557d629bccc7%40group.calendar.google.com/public/basic.ics
Date/time: https://www.timeanddate.com/worldclock/converter.html?iso=20231025T180000&p1=1440&p2=tz_et&p3=224&p4=tz_mt&p5=43&p6=136&p7=204&p8=107
I think it’s better to view compiler design and programming language design as inextricably coupled. Treating them as separate endeavors can lead to compilers that are impossible to make fast, because the design isn’t implementable in a way that can be fast on hardware that exists today.
Two extreme examples come to mind: having refinement types in the language design means using something like an SMT solver in the implementation, practically guaranteeing slowness, whereas at the other end of the spectrum, Pascal could be implemented in a one-pass compiler (namely Turbo Pascal) because the language design did not require multiple passes to implement correctly.
This is true, but from a learning POV, they aren’t necessarily coupled that way. I am in the “I know in principle how all these parts work but need to actually build a bunch” phase and at that point in the journey the distinction offered here is super helpful so far. Then, once one actually has a good handle on both parts, it is easier to fit them together.
I’d also recommend Van Roy and Harper’s PFPL for learning programming language design. Both build up more complex languages by starting with simpler languages and gluing features onto them. Van Roy does so from a paradigmatic point of view, while Harper does so from a formal point of view.
Generally, I’d say Van Roy stands on its own and is simply worth reading for anyone interested in this space. PFPL is a decent stand-in for Pierce’s TAPL, but I tend to prefer it due to Harper being more opinionated and directed in his presentation. TAPL’s strength is being even-handed and comprehensive; Harper feels like he just dives in headfirst.
These are both great suggestions. I’ve read both of them and will update the article to include them. I’m a little worried that this is becoming “Awesome Compilers” to the detriment of my original point, which was “don’t make up a PL while writing your first compiler.” But I enjoyed both books, so….
Thanks for sharing. I always hope that I’ll have time to dig deep on specific topics like this!