I hope no one actually takes this article seriously, because:
I can’t think of ANY language that was implemented this way (writing 3 compilers, the second in C, the 3rd being bootstrapped). Bootstrapping is something I’ve researched [1], and I’m pretty confident that the number of projects that have done this is exactly zero.
It doesn’t appear that the author has done anything close to this himself. It’s basically hand-waving. (Though I definitely respect the other things he’s done; I’ve been using parts of sourcehut recently and it’s great.)
A partial list of what’s missing (in my experience, which I don’t claim to be representative):
Underestimating how much more realistic it is to write 2 compilers than 3 (I actually made somewhat of the same mistake early in Oil, i.e. thinking I could improve a Python compiler more than was feasible in a reasonable amount of time)
Underestimating how bad a language C is for writing compilers. C++ is not great, but it’s better than C in this domain.
Not mentioning the requirement to make your language useful in its nascent state, because you can’t do this without help, and people probably won’t help you without something to run and play with
If you want to bootstrap, which is not a given (and that’s another flaw with the article), then Rust’s strategy is pretty decent and proven to have worked: OCaml then Rust.
Go has a similar strategy, except they didn’t start from scratch (that is important!), and they used a few bespoke tools to convert the compiler in C to the compiler in Go.
I can’t think of ANY language that was implemented this way (writing 3 compilers, the second in C, the 3rd being bootstrapped). Bootstrapping is something I’ve researched [1], and I’m pretty confident that the number of projects that have done this is exactly zero.
The first compiler is just a prototyping and language design tool, not a practical compiler. I can guarantee you that many of the projects on your page started with small experiments to test out the ideas they were working on before they went and wrote the compiler that they eventually shipped. Odds are that hardly anyone outside of your direct group of collaborators would use or even see your first compiler.
It doesn’t appear that the author has done anything close to this himself.
I have done exactly what this article describes, as a matter of fact. I wrote it based on my own experiences.
Underestimating how much more realistic it is to write 2 compilers than 3 (I actually made somewhat of the same mistake early in Oil, i.e. thinking I could improve a Python compiler more than was feasible in a reasonable amount of time)
Underestimating how bad a language C is for writing compilers. C++ is not great, but it’s better than C in this domain.
If these are problems for you, your language is probably too complicated. Remember: the first implementation can be (and inevitably will be) a great big mess. It doesn’t have to work well. It can have crippling bugs and limitations. It doesn’t necessarily have to implement the entire language. Its sole purpose is to reinforce your language design with proof that the ideas can be put into practice.
C is a great language for writing compilers.
Not mentioning the requirement to make your language useful in its nascent state, because you can’t do this without help, and people probably won’t help you without something to run and play with
Not true on any of these counts. It helps to have people interested, and it helps gather interest if you have something they can pick up and mess around with, but you can (1) build it without help, it just takes longer, and (2) get people interested without having something readily available to play with.
You should be able to elevator pitch your programming language to people you want to work with, and get them on board. If you can’t, your pitch (or your idea (or your associates)) needs work. And if you prioritize getting it useful ASAP then you’re likely to rush it and lay a poor foundation upon which the rest of your language can be built.
This is a guide for serious projects which want to design a robust, self-consistent programming language. If you’re just making a toy, then this level of rigor is not called for. A lot of what you’re arguing for is a reduction in rigor because, well, this is difficult. But difficult though it may be, the demanded rigor will create a significantly better result. Writing a sacrificial compiler as you develop the language design proves your ideas as you work. Throwing it away and starting over lets you create a fresh codebase using only proven ideas, and shrug off the tech debt that your first compiler has in spades. Doing it in C instead of C++ (or whatever else, you were nonspecific) ensures a broadly compatible bootstrap route; you called out Rust as doing this well, but I suspect you’ve never actually bootstrapped Rust if you think it works well.
So great that even gcc moved to C++ :-). Snark aside, the ML family is probably the best for writing compilers, they’ve been designed for decades for that very purpose. SML is barely younger than C, has a standard, multiple implementations, and is very simple; I’d say SML or OCaml are great for writing a bootstrapping compiler or even the main compiler.
Also, if you want to bootstrap, look at what OCaml itself does: rather than writing the whole compiler in C (😱), the compiler is written in itself, but can produce either native code, or bytecode for a quite simple virtual machine that is implemented in C. If you want to bootstrap on a new architecture you just have to port the virtual machine. If the only goal is to bootstrap (i.e. run literally one program before you get the main compiler), any form of simple interpreter should suffice.
One of the great tragedies of our time. Let’s not pretend that GCC is a shining example of good design and compiler philosophy, though.
Also, if you want to bootstrap, look at what OCaml itself does
The OCaml approach is pretty interesting, and definitely better than some approaches (cough Rust cough). I would prefer to be able to bootstrap compilers without cross-compiling them, however, and C is nice for that because it’s already universally supported. If there are any issues during the bootstrapping process, you don’t have that extra layer of indirection to debug through that the VM design introduces.
I had thought the same thing when Mesa3D started allowing C++, too, but later in life I realized that C and C++ have different angles of repose. There are real size limits on what can be built with C while maintaining an acceptable defect rate. C++ isn’t perfect, just not as bad.
C is nice for [some purpose] because it’s already universally supported.
This is modern-day vaxocentrism; just like how GNU/Linux is not the only production runtime, and the world’s not a VAX, C is not universally supported. For example, since C is not capability-safe or memory-safe, any systems which provably don’t have capability or memory errors will have to emulate parts of the C virtual machine if they want to host C programs; there will usually be a more direct and native language which is supported first, with C support being an extra feature which might not manifest.
C currently compiles to more systems than any other language. For bootstrapping, the compiler doesn’t need to run well, it has to run at all. C gives the most chances to do that on an arbitrary system.
The bootstrapping compiler doesn’t even need to run on an arbitrary system; it merely needs to run on the bootstrapping compiler author’s system. I personally have used Python as my bootstrapping environment, and I’ve seen other folks use D, Haskell, OCaml, Prolog, and Scheme.
That’s how you get trust problems. I trust the C compiler on my system. I can check the source of the C programs and compile them with my trusted compiler to get trusted binaries. If you have a C bootstrapping compiler, I can compile it, and with it compile the final compiler, without breaking the trust chain and while only needing to compile two things. If you need Python or some other language to bootstrap, now I need yet another thing to compile and, potentially, port to my system. Or, I could do as you suggest and break the trust chain by using the compiled binary from the developer. Now, with anything that compiler produces, I am innately trusting the developer who provided me that binary.
Have you published the source to your compilers? It would have been helpful to mention them in the article, and refer to those experiences.
I disagree with most of the above, but open source has a good way of resolving these arguments (“show me the code”). That said, some short comments:
I agree with the need to prototype the language in any way possible at first
Throwing out the code and starting again IMO is a sign that the metalanguage was too inflexible (e.g. it can be hard to quickly refactor C code, in part because memory management is a concern at every function boundary; the language encourages long functions for the same reason)
There are multiple dimensions of rigor, and they can be at odds: efficiency is one, but correctness is another. C is better for the former; ML dialects are better for the latter. (If you don’t believe this, let me test your compiler.)
I think you implicitly understand the need to get feedback, but what you wrote is at odds with that. Why did you release sourcehut alpha? Why not just wait until it’s finished? That principle doesn’t change just because you’re writing a language.
Bootstrapping means two different things: building a compiler from source without binaries, and writing the language in itself. Ironically the latter bootstrapping makes the former much harder!
There are other options for bootstrapping, which are left out of the article: not writing the language in itself (Swift); making the language popular enough that somebody else will write an independent compiler (somebody started an independent C++ implementation of the Rust compiler).
Aside: I think that translating to C++ can be a good bootstrapping method for certain kinds of languages; I think Oil’s metalanguage might do that
Go is one of the smallest languages that’s useful and deployed. Even so, their compiler was big enough that they didn’t throw it out when bootstrapping (IIRC it was something like 60-80K lines of C, i.e. not trivial to rewrite in Go).
They also didn’t start from scratch. They did zero complete rewrites, not 2, and they were a team paid to work on the compiler. They also bootstrapped after ~9 years, IIRC. The time frame is missing from the blog post; you’re describing 15-20 years of work IMO, if you have the requirement “people use this language for real work”.
Conjecture: languages suffer more harshly from the “illusion of being close to done” than other software projects. One example is that I remember watching early videos about Jai by Jon Blow, where he claimed that “making a language is less work than an AAA game”. That may or may not be true, and by all accounts he’s a productive programmer, but I’d be curious how he feels about that now …
Part of the reason is that a language relies on a huge ecosystem, whereas a game doesn’t. Also, I think there was some evidence that library code is 3x more expensive to develop than application code, and I believe that. Generality, documentation, testing, edge cases, backward compatibility, etc. matter in library code.
Anyway I wish you luck on your project and I hope to see blog posts with concrete experiences!
I don’t understand. You’re pretty confident that the number of bootstrapped programming language projects is exactly zero… then link to a list you wrote of a bunch of bootstrapped programming language projects? Am I misunderstanding something?
The third compiler is self-hosting and bootstrapped from the second compiler
This is indeed a very rare collection of circumstances, because it implies that C is the preferred high-level prototyping language, which hasn’t been true for decades and in fact might have never been true. Combined with the original article’s advice, we might be misled to imagine that the second compiler is going to be thrown away immediately.
As we’ve bootstrapped the Monte programming language, we’ve basically had three compilers. The first compiler was written in Python, generated more Python, and was barely able to run basic Monte expressions; we used it to rewrite the parser and AST handling in Monte. It existed for around two years in toto, not including time spent on pre-Monte research. The second compiler, montec, is written in pure Monte and sends Monte source code to binary bytecode. This compiler has existed for about half a decade and probably will continue existing for another few years. Our third and final compiler will send pre-linked bytecode to C or assembly and use GCC or LLVM to do native compilation.
So, while it might seem like we followed the outlined pattern, our second compiler is written in pure Monte and uses standard Monte modules for linkage. Only the runtime interpreter has any C, and that particular codebase continues to shrink over time; it’s almost entirely Python except for libsodium and libuv.
If the second compiler were to be written in C, then the project would be moribund. Indeed, my partner in writing Monte wrote a C runtime for E as part of their pre-Monte research, and it turns out that even implementing a small slice of E’s relatively demanding runtime is a tall task for C. It is generally accepted knowledge in the E community that compiling E to native code requires a garbage collector, and at some point it is too difficult to implement capability-safety and memory-safety simultaneously in one single runtime.
The first compiler is the prototype compiler, written in whichever language. The article is pretty clear that the first implementation is to be thrown away. Further, the second compiler is written in C, but the second compiler isn’t a prototype; the article is pretty clear that it’s a form of reference implementation, which should be kept up to date as the formal specification evolves. I don’t see anything which demands C to be used as a prototyping language, or anything which leads one to believe that the second compiler is to be thrown away.
Though the rest of what you’re saying makes sense, and my original misunderstanding was that I interpreted “Bootstrapping is something I’ve researched [1], and I’m pretty confident that the number of projects that have done this is exactly zero” to mean that “the number of projects which have bootstrapped is exactly zero”. I realize now that they meant that “the number of projects which have gone through all three steps is exactly zero”.
As @andyc points out, I think it’s bad advice to tell beginner programming language creators to start with C/C++ and Lex and Yacc. The words “extraneous cognitive load” rush to mind immediately, especially for the first goal of a “prototype implementation”. When I first started working on compilers in high school, I only knew C and some C++, and I wasted so much time writing the most minute algorithms and transformations, messing with pointers and memory allocation and reference cycles and segmentation faults. If you want to design a brand new language, these should be the least of your worries. Dealing with these issues just distracts you from designing your actual language.
It was not until I took a programming languages course in university that I realized just how much easier it is to write programming languages in something like Haskell or OCaml: I can write an imperative language with functions, closures, different data types and much more in about an hour; it would still take me days to make the same work in C++, and although I’m no professional C++ developer, I doubt that it’s purely a matter of skill. The ability to rapidly prototype and explore ideas for your language is a huge positive, since it drastically lowers the cost of making an incorrect design decision. Even Rust, a very much imperative language, had OCaml as its first compiler implementation language.
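To give a sense of scale, a throwaway evaluator of the kind described above really can fit in a screenful. Here is a sketch in Python (standing in for Haskell or OCaml; the tuple-based AST encoding and node names are invented for this illustration):

```python
# A deliberately tiny tree-walking evaluator, to show the scale of a
# "sacrificial" first implementation. ASTs are plain tuples; the node
# names ("num", "var", "add", "lam", "app") are made up for this sketch.

def eval_expr(expr, env):
    kind = expr[0]
    if kind == "num":                  # ("num", 3) -> literal
        return expr[1]
    if kind == "var":                  # ("var", "x") -> environment lookup
        return env[expr[1]]
    if kind == "add":                  # ("add", e1, e2)
        return eval_expr(expr[1], env) + eval_expr(expr[2], env)
    if kind == "lam":                  # ("lam", "x", body) -> closure
        param, body = expr[1], expr[2]
        return lambda arg: eval_expr(body, {**env, param: arg})
    if kind == "app":                  # ("app", f, e) -> function call
        return eval_expr(expr[1], env)(eval_expr(expr[2], env))
    raise ValueError(f"unknown node: {kind!r}")

# ((lambda x. x + 1) 41)
prog = ("app", ("lam", "x", ("add", ("var", "x"), ("num", 1))), ("num", 41))
print(eval_expr(prog, {}))  # 42
```

Closures here fall out of the host language’s closures for free, which is exactly the kind of shortcut a prototype should take; none of this needs to survive into the real compiler.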
I wonder if the article’s author has ever tried functional languages, especially as metalanguages, and if they did, I wonder what they thought of them.
I agree that C/C++ and Lex/Yacc is a bad idea. It reeks of cargo culting: there is generally no reason to write any new code for a general-purpose machine in C/C++. Sure, you can LARP the ’70s, but it detracts massively from the article.
Also, writing a self-hosting compiler is neither a requirement nor particularly useful; it certainly doesn’t deserve its own bullet point.
I sincerely hope people avoid doing the things that are recommended in the article. Perhaps “(1990)” could be added to the title, such that people see at a glance that the advice is 30 years out of date?
Author here. First of all, C and C++ are completely different programming languages. Please do not conflate them. I didn’t mention C++ at all in my article, and I will reply to you from the C perspective only.
I think it’s bad advice to tell beginner programming language creators to start with C/C++ and Lex and Yacc. The words “extraneous cognitive load” rush to mind immediately, especially for the first goal of a “prototype implementation”. When I first started working on compilers in high school
Quotes from my article:
“Designing and implementing a new programming language from scratch is one of the most challenging tasks a programmer can undertake.”
“This post is targeted at motivated programmers who want to make a serious attempt at designing a useful programming language.”
“You [the reader] already know a few programming languages[..]”
This is not geared towards “beginners”. This is geared towards someone who is making a serious attempt at a language which they believe will be practical and useful beyond their own personal needs as a learning project. I don’t expect such a programmer to get bogged down in learning C, because I expect such a programmer to have already learned C, perhaps to an expert level.
The ability to rapidly prototype and explore ideas for your language is a huge positive, since it drastically lowers the cost of making an incorrect design decision.
I spoke to much of the same advantages in the article when talking about the sacrificial implementation. You can write this in whatever you want, and OCaml is a fine choice. I only mention C for the second compiler, which would evolve to become the bootstrap compiler for your future, self-hosted toolchain. Having a bootstrap compiler written in C has a lot of practical benefits, which I could elaborate on in a future article, but at this point you are no longer in the prototyping phase, where the advantage of rapid design iteration is most necessary. Hell, I also stated that you should be writing a specification at this point.
I wonder if the article’s author has ever tried functional languages, especially as metalanguages, and if they did, I wonder what they thought of them.
This is not geared towards “beginners”. This is geared towards someone who is making a serious attempt at a language which they believe will be practical and useful beyond their own personal needs as a learning project. I don’t expect such a programmer to get bogged down in learning C, because I expect such a programmer to have already learned C
Knowing languages doesn’t make you any better at designing programming languages. You are, for all intents and purposes, still a “beginner” when it comes to compiler implementation until you’ve written a compiler. So, you are still talking to people who are “beginners”, just not beginner programmers. Furthermore, it’s not about “not knowing C”. You won’t get bogged down because you don’t know the language; you will get bogged down because the language is made to require far more precision than is needed for implementing a language.
You can write this in whatever you want, and OCaml is a fine choice. I only mention C for the second compiler . . .
I assumed that you would want the sacrificial implementation written in C because you mentioned Yacc, which is made for the C and C++ languages.
Yes, I have. I hate them passionately.
May I ask why? And also, may I ask why you think C is good for writing languages?
Knowing languages doesn’t make you any better at designing programming languages.
But not knowing languages does make you worse at this.
You are, for all intents and purposes, still a “beginner” when it comes to compiler implementation until you’ve written a compiler.
This is why you should be prepared to throw your first one away. It has nothing to do with your C skills.
You won’t get bogged down because you don’t know the language; you will get bogged down because the language is made to require far more precision than is needed for implementing a language.
I don’t agree that this is a symptom of using C.
I assumed that you would want the sacrificial implementation written in C because you mentioned Yacc, which is made for the C and C++ languages.
I happened to write my sacrificial compiler in C, using yacc, but any parser generator would do. My article only mentions it as one possible option for this use-case. I understand how you could interpret it differently, though. To be clear: I don’t think it matters what language your first compiler is written in. You’re going to throw it away.
May I ask why [you don’t like functional programming languages]?
This is a deeper question than I have the patience to answer in an unrelated discussion.
May I ask why you think C is good for writing languages?
Again, it doesn’t really matter what language you use for the sacrificial compiler. But I do think that C has some advantages for a second compiler. It is universally supported, which makes your language easy to bootstrap. It’s also quite fast, so as long as your algorithms are fast then your compiler will be fast. Writing parsers, representing an AST, representing your type systems: this is all fairly straightforward to do in C. It does have its limitations, and it would get annoying if you tried to solve every problem in the language design space with C, but that doesn’t really matter for a simple compiler. Simple is all this compiler will ever need to be, because its primary purpose is to build your third compiler. Your third compiler should be written in your own language, and at that point it’s on you to make your language suitable for any level of advanced compiler development.
May I ask why [you don’t like functional programming languages]?
This is a deeper question than I have the patience to answer in an unrelated discussion.
As a regular (and happy) OCaml user, I’d be interested in your opinion as well. If you could write something about it on your blog some day, that would be great.
I hate [“functional languages, especially as metalanguages”] passionately.
Can I ask for more information on this perspective? I’ve found functional programming protects me from so many bugs when mutating state, and I’m a little surprised that you are not on the hype train slash bandwagon.
Admittedly I’m a Lisp-head, but this advice makes so little sense to me it’s almost funny. It makes more sense to me to prototype in a high-level language, why involve C at any step, unless your explicit goal is to be compatible with C (and even then…)? Most of the historical 70s and 80s AI research which included a lot of language experimentation was done in Lisp, for good reason:
By leveraging the Lisp reader, you get to punt initially on one of the trickiest parts of language implementation: parsing. This gives you the freedom to focus on the substance (semantics) rather than the looks (syntax). Hopefully, if you’re working on PL research, you have better ideas than just new syntax!
By leaning heavily on the host runtime, you get garbage collection and a nice runtime for free. Same as above. You can later gradually fill out the runtime so it can become self-hosted.
AST manipulation is quite easy in a language optimized for list/tree operations.
This is a great combination to get you to the juicy part, which is the semantics. You can figure out if your language works at all, and it’s easy to swap out different parsers or change aspects of the compiler radically, which is exactly what you’d want to be able to do when you’re exploring the design space. Of course, nowadays many of the above advantages are offered by other high level languages besides Lisp. Other commenters have already mentioned Haskell, OCaml and Python, all of which are fine ideas. But this post seems to suggest doing the second (and maybe also first; yacc is mentioned) implementation in C which to me sounds awfully backwards.
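As a rough illustration of how little “parser” you need when you lean on s-expressions, here is a minimal reader sketched in Python (no strings, comments, or quote shorthand; just enough to turn text into nested lists and get on with semantics):

```python
# A minimal s-expression reader: with this, "parsing" is done and you can
# move straight to semantics. A real Lisp gives you this for free via read.

def tokenize(src):
    # Pad parentheses with spaces so that split() does all the lexing.
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        form = []
        while tokens[0] != ")":
            form.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return form
    try:
        return int(tok)  # numeric literal
    except ValueError:
        return tok       # otherwise, a symbol

print(parse(tokenize("(define z (add x y))")))
# ['define', 'z', ['add', 'x', 'y']]
```

Compare that to the weeks a hand-written parser for a novel syntax can eat before you’ve tested a single semantic idea.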
By leveraging the Lisp reader, you get to punt initially on one of the trickiest parts of language implementation: parsing. This gives you the freedom to focus on the substance (semantics) rather than the looks (syntax). Hopefully, if you’re working on PL research, you have better ideas than just new syntax!
Syntax heavily influences semantics. Lisp cannot parse all syntaxes. For example, let’s say I’m writing a shader language. Those are often represented as graphs. I could for example structure my programs like this:
 _________       _________
| input x |     | input y |
 ---------       ---------
     |               |
     |     _____     |
     '--> | add | <--'
           -----
             |
             v
        __________
       | output z |
        ----------
A Lisp parser definitely can’t make anything useful out of that. You’ll need to change the syntax, and with it the semantics (e.g. the pipes now need to be named, or you cannot reuse an output twice).
C is only mentioned here as a language to create a bootstrapping compiler, the main purpose of which is to make it easy to get your main compiler (the third one, written in the language itself) onto your system without trusting someone else’s binaries. C is quite a ubiquitous language, so you probably already have a C compiler for your system, and therefore only need to compile the second compiler in order to build the third, main one.
Not being familiar with shaders, these diagrams don’t really ring a bell. But AFAIK shader languages are normally representable in textual form, which basically means by definition you can turn them into s-expressions. Naively, I would say your diagram shows just (define z (add x y)), which allows you to re-use z as many times as you like. I’d need a whole lot more context to figure out what you mean here, so you don’t have to bother unless this is somehow central to your point.
C is only mentioned here as a language to create a bootstrapping compiler, the main purpose of which is to keep it easy to get your main compiler (the third one, written in the language itself) to your system, without trusting someone else’s binaries.
Having a second compiler implemented in C which is a subset of the actual language sounds like a whole lot of extra maintenance work as you would need to develop it in lock-step with your third, final compiler. Makes much more sense to have C as (one of the) compilation target, instead.
Shaders are mainly used in graphics programming, and can be thought of as procedures that are run over a field of inputs and produce another field of outputs. And maybe the simple example didn’t quite show what I meant. You got it right: this shader just takes two named inputs and outputs their sum into a named output, and it’s quite simple to convert it to a textual representation without changing semantics. But take this example:
When converting this to a simple textual representation, you now have to choose: should the result of add be assigned to a variable for reuse, or should we disallow the reuse of a result? Assignment to a variable changes the semantics, since in the graphical representation the pipes are anonymous. Disallowing the reuse quite obviously changes the semantics as well.
Having a second compiler implemented in C which is a subset of the actual language sounds like a whole lot of extra maintenance work as you would need to develop it in lock-step with your third, final compiler.
You should not develop the second compiler “in lock step” with the third. You should keep it up to date, still implementing all the features described in the standard, but without anything fancy: just enough to compile your third compiler, because hopefully, at that point, the language itself has stopped undergoing major changes, and the majority of changes are in the standard library and similar things.
Makes much more sense to have C as (one of the) compilation target, instead.
Compiling things into C is, in my opinion, holding a shotgun to your foot. C has a lot of undefined behavior, and it can get very hard to make the compiler not hit any of it. I’m not saying it’s not possible, but it’s not practical to have such a high-maintenance feature that has so few uses. Also, notably, this doesn’t get rid of the trust problem, since this machine-generated C wouldn’t be readable, so you are still trusting whoever ran the tool.
The assignment to a variable changes the semantics, since in the graphical representation, the pipes are anonymous. Disallowing the reuse quite obviously changes the semantics as well.
That’s not quite right.
Having the parser (or some other stage of a compiler) add automatically generated, unique names to “anonymous” program points doesn’t change the semantics of the language the user is using, since there is no way for the user to write that unique name in the input program.
In fact, compilers for all kinds of languages automatically generate names for intermediate values all the time, and that’s tame compared to other transformations that an industrial strength compiler will do (eg. change loop structure, convert to continuation passing style or static single assignment, etc).
However, that doesn’t change the semantics of the input language because you can’t write code in your input program that targets the transformed (intermediate or output) language code, only the input language.
I.e., you can’t target the CPS-transformed code in intermediate compiler stages any more than you can target the intermediate names given to “anonymous” values.
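A minimal sketch of that kind of name introduction, flattening nested expressions into three-address form (the `%t` prefix and instruction shape here are illustrative, not any particular compiler’s IR). Because users cannot write names starting with `%`, the transformation is invisible at the source level:

```python
# Flatten nested expressions into three-address instructions, inventing
# fresh names (%t0, %t1, ...) for anonymous intermediate values. The user
# can never write a '%' name, so source semantics are untouched.

import itertools

def flatten(expr, fresh, out):
    """expr is a variable name (str) or ("add", lhs, rhs); returns the
    name that holds the expression's value, appending instructions to out."""
    if isinstance(expr, str):
        return expr
    op, lhs, rhs = expr
    a = flatten(lhs, fresh, out)
    b = flatten(rhs, fresh, out)
    name = f"%t{next(fresh)}"
    out.append((name, op, a, b))
    return name

instrs = []
result = flatten(("add", ("add", "x", "y"), ("add", "x", "y")),
                 itertools.count(), instrs)
for name, op, a, b in instrs:
    print(f"{name} = {op} {a}, {b}")
# %t0 = add x, y
# %t1 = add x, y
# %t2 = add %t0, %t1
```

Note that whether the two identical `add x, y` instructions get merged is a separate, optional optimization decision; the naming itself changes nothing the user can observe.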
The top-level comment is absolutely right about the benefits of using something like s-expressions (and a high level language with rich data structures, garbage collection, etc) to go quickly to working on semantics.
There may be many reasons you may ultimately want a syntax other than s-expressions, but s-expressions are a great intermediate format for quick compiler prototyping.
Once you’ve nailed down the semantics, using the simplified intermediate syntax of s-expressions, you can add support for the “real” syntax by adding a parser to transform your input syntax into s-expressions.
If you are making a new language that is really just trying to innovate syntax, then sure, you probably want to spend your time prototyping various parsers.
If your language is doing something interesting and new with semantics, however, then fussing about with some complicated parser just to get started on prototyping your semantics is a waste of time.
And if what you really want to do is write parsers, in 2020 I still can’t recommend Lex and Yacc to anyone.
Additionally, as a Racket user I’m spoiled by its #lang protocol.
The #lang system allows you to specify a parser for your language provided by any arbitrary library.
Thus any user can add their own new syntax, including graphical syntaxes.
There is still research work to be done on aspects of mixing languages of different syntax (eg. I am writing in graphical language A but I’m getting an error from a library written in algol-ish language B, how should the error be displayed?), but it’s already pretty great.
After using it for a while, not only do you get to see how shallow a lot of syntax sugar is, but you start to think of it like any other (semi-) orthogonal, replaceable component of programs you write, such as your choice in data structures, utility libraries, databases, etc.
In other words, you start to consider the question of whether different syntax options will be more helpful (or harmful) for writing a particular program or component, independently of various other considerations.
Spoiler alert: most general-purpose code is best just written in s-expressions, possibly with some minor extra conveniences (such as the standard quote, quasiquote, unquote shorthands), while heavier syntax really comes into its own for domain-specific languages.
Compiling things into C in my opinion is holding a shotgun to your foot. C has a lot of undefined behavior, and it can get very hard to make the compiler not hit any of that.
This is quite similar to the problem of, well, writing things in C in the first place. Except that in this case the C shotgun is only an optional target, and it is perhaps easier to guard against undefined behavior within a few code generation templates than in the thousands of lines you need to write by hand to write a compiler in C.
(Actually, keeping a template UB-free is surely harder, but fixing a bug in a template fixes all uses of that template, and you will simply have less code in C-target templates than in an entire compiler written in C.)
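As an illustration of what such a template might look like (a toy emitter of my own, not taken from any real compiler): signed addition can be routed through unsigned arithmetic, whose wrap-around C does define, so fixing this one template once fixes every addition the backend will ever emit.

```python
def emit_add_i32(dst, a, b):
    """Emit C for `dst = a + b` on int32_t without signed-overflow UB.

    Unsigned wrap-around is defined behavior in C; converting the result
    back to int32_t is implementation-defined (two's complement in
    practice) rather than undefined."""
    return f"{dst} = (int32_t)((uint32_t)({a}) + (uint32_t)({b}));"
```

Auditing a handful of such emitters is a far smaller job than auditing every expression in a hand-written C compiler.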
The given example is not possible to replicate in s-expressions without adding a variable. If my language does not have variables, then I cannot use Lisp’s parser for prototyping my language. The fact that the compiler adds in variables is an implementation detail and should be completely irrelevant to the user. You cannot move the language I gave to s-expressions without changing the semantics.
Racket does have a neat system for DSLs. That does make it different from other Lisps, which from what I’ve seen just make you shove everything into s-expressions. Yet I feel like your Lisp love is a little bit overreaching. S-expressions are really hard to parse for a human. They might be very easy for a computer, but the sea of parentheses isn’t easy to discern and requires quite a bit of IDE tooling to work with comfortably. I believe that there are better syntax conventions that are more suited for enabling code understanding and allowing code to be read more easily.
I feel like trying to guard against undefined behavior in C in an automated fashion is a lot more difficult than writing a minimal-effort compiler/interpreter that is able to compile the last compiler. Also, C assumes quite a few things, and if your language design doesn’t agree with them then you will have a bad time dealing with that.
The given example is not possible to replicate in s-expressions without adding a variable. If my language does not have variables, then I cannot use Lisp’s parser for prototyping my language.
This is a distinction without a serious difference.
Perhaps you will actually be prototyping an intermediate language that has a certain feature you won’t make available in the outer language, or that lacks a certain feature that you will add to the outer language via syntax sugar.
But at a lower level, your graph language is already stored in some serialization format.
The given example is not possible to serialize without adding a variable (i.e. some kind of identifier that can be referenced).
Maybe you serialize it to something like:
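(A sketch in Python; the exact format and the node IDs and labels here are invented for illustration.)

```python
# Each node: (id, operation, ids of the nodes it feeds into).
graph = [
    (1, "input x", [2]),
    (2, "add", [4, 5]),
    (3, "input y", [2]),
]

# The IDs are pure plumbing, not semantics: they only encode the edges.
edges = [(src, dst) for (src, _, outs) in graph for dst in outs]
```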
You need some kind of ID for each node in the graph so other nodes can refer to them because you can’t nest the nodes cleanly as a tree.
Now let’s squint our eyes a little:
(1 "input x" (2))
(2 "add" (4 5))
...
Lo, s-expressions!
S-expressions are really hard to parse for a human.
This is the quintessential complaint against lisp. And… it’s just wrong!
I’ve never met someone who has seriously used a lisp (e.g. for a job, or really anything more than just “I have to use this wacko language for some university course”) that has ever struggled with this.
This is discussed in bewilderment at every lisp-related meetup – “Wow, I came in to it thinking the parentheses were a big problem, but quickly realized it was no problem!”, “We hire people with no lisp experience all the time, and the syntax is never the roadblock to learning Clojure.”, etc.
(On the other hand, students who grudgingly learn just enough to get by for a class frequently struggle with it. As a TA for a class like this, I saw it frequently.)
Yes, s-expressions are hard for humans to parse when they are formatted wrong (or minified).
Poorly formatted (or minified!) Javascript, C, etc. is harder for humans to parse!
I promise you (or almost anyone) that if you learn the standard rules for Lisp formatting (there are just a couple simple rules, far fewer than for formatting Algol-like languages), then stick to those formatting rules for reading and writing lisp, all your lisp reading problems will go away.
(Well, all your lisp reading problems relative to reading other languages – you still have to know how to read, probably in English, it’s easier if you can see rather than read Braille, etc. Deeply nested expressions using lots of custom-defined functions are difficult in any language, because you have to untangle what all these different functions mean. But that’s true of any language.)
Beginning CS courses (and books, etc) tend to cover language syntax for a lot of time.
What does x[y] mean?
What about curly braces?
What about curly braces in another context?
Where does the semicolon go?
Oh, semicolons aren’t necessary after braces for a block, but they are necessary after a struct declaration?
And, heaven help us, how do you parse x + y && a + b?
And this is all after literally more than a decade of repeated instruction in algebra notation, order of operations, etc.
University Lisp courses tend to assume “you’ll just pick it up”, or maybe spend a fraction of a lecture on syntax.
If people had as much instruction in lisp syntax as they get with C syntax it would not be a problem.
Before you say I’m just ignorant and living in a lisp bubble, let me say that I truly do sympathize with the complaint of s-expressions being hard.
The first time I used a lisp was for a university programming languages course.
I hated it!
I really, honestly struggled with the syntax.
(Of course, I tried to format it like C.)
It was bad enough that I badmouthed lisp for years after that.
But then I realized: “Hey, thousands of people use this, love it, and not only claim the syntax is not a problem but actually a very strong benefit. I know people who have defended lisp to me after I made fun of it, and they aren’t unrelatable super geniuses. Maybe I should actually give it serious consideration rather than adopting popular antagonistic beliefs about it after a brief, poorly instructed experience with it?”
When I finally sat down to seriously learn lisp and understand its weird syntax, it ended up taking less than a day to get pretty comfortable with it.
I’m not claiming that your (potential) negative experience with lisp matches mine, or that the ability to read and write lisp code isn’t something that you need to take some time to learn.
But learning to read and write lisp is not that hard.
I submit that it’s significantly easier than many things lots of programmers learn.
I think it’s significantly easier than learning how to type, learning how to use vim, or learning bash or similar shell languages (actually learning the bulk of the syntax, how substitutions work, etc, not just how to run some simple commands).
Thinking about it for a moment, I think it’s actually pretty similar in difficulty to learning to use a tiling window manager.
Not that hard, and well worth it.
They might be very easy for a computer, but the sea of parentheses isn’t easy to discern and requires quite a bit of IDE tooling to work with comfortably.
It is true that manually formatting lisp code tends to be more tedious than manually formatting languages like Javascript.
But lisp auto-indenters are relatively easy to write and widely available in text editors.
Your favorite editor probably already has one.
Auto-indentation is the single invaluable bit of editor/IDE tooling that’s required.
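As a sense of how simple that program is, here is a toy sketch. Real Lisp indenters align under the operator of the enclosing form and know about strings and comments; this one of mine only counts parentheses:

```python
def indent_sexp(src, width=2):
    """Naively re-indent s-expressions: each line is indented by the
    number of parens still open when the line begins."""
    out, depth = [], 0
    for line in src.splitlines():
        stripped = line.strip()
        out.append(" " * (width * max(depth, 0)) + stripped)
        # Update the open-paren depth for the following line.
        depth += stripped.count("(") - stripped.count(")")
    return "\n".join(out)
```

Even this crude version turns `(define (f x)` followed by an unindented `(+ x 1))` into correctly nested code, which hints at why every serious editor has shipped a real one for decades.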
Of course, once you also start using an editor extension that automatically keeps your parentheses balanced, then start using something like Paredit (I prefer SmartParens) to get 95% of the way to structural editing, then editing other syntaxes starts to feel like a chore when you can effortlessly navigate and transform your code structurally, always preserving syntactic correctness.
Admittedly Paredit and friends take some practice, but honestly you can master the essentials within a couple of hours, then just consciously make an effort to use it a little each work day over a short period so you remember it.
I could totally write a parser that parses the text as I pasted it and interprets it. Would it be easy? No. Would it be easy to use? No. But syntax is making a contract with the user, and I don’t want to make a contract that contains any kind of IDs. No non-graphical representation can do that. The fact that the graph itself might be stored as a JSON file with IDs is not relevant, as it is not part of the contract. Yes, you can theoretically express every data structure with s-expressions. You could just as well express every data structure with nested dictionaries or a bunch of bytes with pointers. It’s like being Turing complete: it means that you can do things, but not that it’s fit for all or any of them.
Parsing isn’t just reading the data that is written. It is also understanding what data is written. In Lisp there is no visual distinction between instructions and data, two things that usually don’t mix in general programming. Most other languages make one, because it is important to quickly see what is a simple data extraction and what is processing. Most languages have distinct looks for separate ways to look up data. Most languages have distinct looks for separate ways to store data. S-expressions don’t leave any space for those distinctions.
The fact that you need to format the code to be readable isn’t good. Self-formatting syntax, like Python’s, doesn’t leave room for that. Yes, I know, Python syntax has some bad parts; no need to get into that here.
About learning: sometimes, powerful tools take a lot of time to learn how to use. But when you do know how to use one, it can be tens of times more useful than a tool that’s simple to learn. Professional tools take time to learn.
If you have to write programs to help you manage the number of parentheses to write your code, maybe the syntax isn’t that fit for humans. As I said, great for machine consumption, not that good for humans.
It’s unfortunate that this is turning into another Lisp/s-expression syntax flamewar. It looks like you already decided that you don’t even want a textual representation for this language. This means that the question of using s-expressions or not is completely beside the point.
My original point was that you can punt on the “syntax” initially (which in your case would probably be some sort of visual diagram editor), and go straight to the usually more interesting parts. In practice, for your case that would mean simply defining a format for easy representation, whether that’s using s-expressions or JSON or a serialized Python dict with intermediate IDs.
All this means you don’t even have to spend time to build a parser, you can go straight to the part where you are trying out compiler techniques to see if this is even a viable thing to do. It wasn’t about whether you find this useful or readable for the end-user. Remember, Lisp itself was originally supposed to get a non-parenthesized surface syntax with M-expressions, the syntax we currently use was originally no more than an intermediate syntax which made thinking about language semantics easier.
It does have a textual representation - the ASCII art I pasted. You can write a parser to parse that. It won’t be easy, and defining the syntax will probably be a full research paper because all the current syntax definition languages only work in a single linear dimension, while this definition would have to work in 2d space. This would actually be a pretty interesting problem to take on now that I think about it.
Also, M-expressions actually look pretty good, and are a definite improvement over the s-expression syntax in my eyes. If only Lisp had actually implemented them from the start…
About learning: sometimes, powerful tools take a lot of time to learn how to use. But when you do know how to use one, it can be tens of times more useful than a tool that’s simple to learn. Professional tools take time to learn.
I 100% agree.
Of course, I think s-expressions are one of those powerful 10x professional tools (for language prototyping, or for more general programming when paired with macros).
(10x is probably an exaggeration, or you only get 10x for certain applications that are a minority of what you do, for s-expressions and each of those other things listed. The most productive programmer I know doesn’t use vim, barely uses any features of emacs, doesn’t use a tiling window manager, and doesn’t even type properly. Of course, I still think all of those things are worth learning and make me more efficient than I would otherwise be.)
I could totally write a parser that parses the text as I pasted it and interprets it. Would it be easy? No. Would it be easy to use? No. But syntax is making a contract with the user, and I don’t want to make a contract that contains any kind of IDs.
I thought we were discussing building a first prototype of a new programming language, not making contracts to users about an esolang.
The fact that you need to format the code to be readable isn’t good.
Every language needs formatting to be readable. In general, every programmer either lets a program (their editor) automatically format (indent, mostly) their code, or they do it manually. We give it little thought and do it nearly unconsciously in C and in Lisp.
Nobody writes code like that, everybody indents their code.
Furthermore, people indent their code in standard ways, not willy-nilly adding arbitrary amounts of indentation to whatever line, they follow logical rules.
Learning the standard way to indent (and use indentation for reading) Lisp code is no different than learning the standard way to indent Algol (C, Javascript, Python, …) code.
It’s just a different set of rules that programmers follow practically unconsciously once they’ve internalized them.
Self-formatting syntax, like Python’s, doesn’t leave room for that.
Python just makes some instances of poor formatting a syntax error. You still have to format it!
If you have to write programs to help you manage the number of parentheses to write your code, maybe the syntax isn’t that fit for humans. As I said, great for machine consumption, not that good for humans.
1: The only program you “need” is the auto-indenter. This simple program was written decades ago. The other programs just make you more efficient. I.e. they are unnecessary but useful “professional 10x tools” that you can learn, similar to Vim.
2: You need a program (text editor) to write <C, Javascript, Python, whatever> code? It must not be fit for humans. You need a program (compiler, interpreter) to be sure your types (or even your syntax) are correct? It must not be fit for humans. You need a program to actually run your program? Not fit for humans. You need a program to send instant messages? Instant messaging must not be fit for humans. This is a silly argument. You need programs to do anything with a computer, and it’s just as easy to argue that some program (or part of a program) in some other PL toolchain is a boondoggle that makes it somehow unfit for humans. Having a machine check your parentheses is no different than having a machine check that your program is correct in other ways, and every programmer uses such programs in an edit/run/debug loop.
Again, these arguments about lisp syntax being difficult or unsuited for humans are only made by people who have never seriously used a lisp. Thousands of normal, everyday humans read and write lisp code quite naturally. We read and write it on white boards, we read and write it in any and every text editor, we read and write it in REPL sessions, we read and write it in IDEs that draw fancy arrows. We make funny music videos about reading and writing lisp[2]. Children read and write lisp. Humans have been quite capably reading and writing Lisp since before 1960. Lisp is very suited to human use. Reading arguments about how lisp is not suited for humans is like reading arguments that human flight is impossible, or how you can’t make a wind powered vehicle that goes directly down wind faster than the wind. People have made many polished arguments to prove those points. But in the end you don’t need to wade through lots of fancy arguments to see that they are demonstrably false – you can just look at all the people that are doing these things.
[1] This is a slight modification of GPL-2 licensed code available in the Linux kernel in the kernel/sched/core.c file.
Every language needs formatting to be readable. In general, every programmer either lets a program (their editor) automatically format (indent, mostly) their code, or they do it manually. We give it little thought and do it nearly unconsciously in C and in Lisp.
Do you like to read code like this[1]:
No I do not. I also did not say that C syntax is good for this reason, I said it about Python. Python is a huge syntactical leap from other C-style languages (though Python syntax isn’t perfect, and I particularly don’t like that you can mix tabs and spaces which can bring a bunch of ambiguity). Languages with the Off-side rule are indented correctly by default. So I read code like this[1]:
if get_load_dotenv(load_dotenv):
    cli.load_dotenv()

    # if set, let env vars override previous values
    if "FLASK_ENV" in os.environ:
        self.env = get_env()
        self.debug = get_debug_flag()
    elif "FLASK_DEBUG" in os.environ:
        self.debug = get_debug_flag()
Yes, you can remove some whitespace, but removing the whitespace between operators doesn’t hurt readability as much as a lack of indentation does. Languages that don’t enforce indentation don’t provide this guarantee, so programmers have to rely on conventions, which are not always followed. In Lisp this problem is exacerbated by the fact that there is only a single syntactical character, blurring the line between control flow and function calls even more. Scheme did recognize this and has a very good write-up of the problems in one of the proposals to add indentation-based syntax to Scheme.
My dad still has his old notebook with hand-written FORTRAN code from his university times. It’s not too rare for me to scribble some code in my notebook when I don’t have a computer around (not too often an occasion with the pandemic on). I can read, understand, and quickly check Python, C, and FORTRAN for syntactical correctness on paper. With Lisp, I have to resort to the long and error-prone process of counting the parentheses. To be completely honest, I don’t understand how Lisp survived those times, when access to computers wasn’t as universal and most programmers didn’t have their own computers and wrote code by hand.
[1]: This code is from Flask src/flask/app.py file, licensed under BSD-3 clause license.
I’ve noticed a couple of languages that, instead of a compiler, implemented a translator into C: Nim and Vlang. What are the pros and cons of a compiler vs. a translator?
It’s easier to get something working if you generate code in another language rather than LLVM IR or ASM or whatever. But if you want quite different semantics to your host language then you’ll maybe have an easier time if you build up from something simpler.
Eiffel has been working like this more or less forever, and it was sometimes painful to adapt the to-C compiler when new C compilers started to exploit arbitrary combinations of undefined behavior to remove intended functionality in your program in favor of a 1/10000 speed improvement on some microbenchmark. You can be as careful as you want; the UB rules of C pretty much guarantee that you will emit UB code, and eventually some C compiler writer will pick the combination that you accidentally used for “optimization” and break your code generator.
Target a VM (JVM, CLR) or Pascal (Free Pascal is a pretty decent compiler) if you want a well-defined backend environment that isn’t changing in some drastic, unexpected and undocumented way every couple of years.
Great article, the perspective is good even if you don’t end up building your compiler in exactly these steps.
You had me at C and YACC. It’s weird to see a chicken developer hate on C.
I thought that was chicken’s superpower. From the highest level of abstraction to the lowest (figuratively at least) it has you covered. (I really like programming in chicken along with C.)
To all you vader haters out there, we’ll blow your planet up.
C is (or: was? before all the UB exploitation) a nice language to work with, but it is not sacred. It has a lot of issues (and we’ve certainly run into our share of issues particularly due to using it as a compilation target). It is great for low-level operations and for interfacing with existing operating systems. But for prototyping a new compiler? That’s silly. Manual memory management and its impoverished string processing functionality alone should put anyone off of using it for prototyping. At that stage, you don’t need performance or low-level control over anything, really.
C’s string handling sucks (“impoverished” is harsh), but you can learn to use it. However, string handling really doesn’t matter for a compiler (really, I’m serious). The only time the compiler ever handles strings is in the lexer, and it’s not especially more difficult in C to convert input text into a token stream. From that point forward, you’re just working with types.
The language which inspired this post has a lexer which is about 1 kloc, and literally none of the remaining stages use strings at all. The last stage, emit, does deal with strings insofar as it translates an AST back into text, but that’s accomplished entirely with fprintf calls and it’s quite straightforward.
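The shape of that boundary can be sketched in a few lines (in Python rather than C for brevity, and with a made-up token set): one function owns all the string handling, and everything downstream consumes tagged tokens.

```python
import re

# One regex per token shape; everything after lex() works on tuples.
TOKEN_RE = re.compile(r'\s*(?:(\d+)|([A-Za-z_]\w*)|(.))')

def lex(src):
    """Turn source text into a list of (kind, value) tokens."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        num, name, punct = m.groups()
        if num:
            tokens.append(("INT", int(num)))
        elif name:
            tokens.append(("IDENT", name))
        else:
            tokens.append(("PUNCT", punct))
        pos = m.end()
    return tokens
```

After this point the parser, type checker, and code generator never touch raw text again, which is the claim being made about the 1 kloc lexer above.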
I hope no one actually takes this article seriously, because:
A partial list of what’s missing (in my experience, which I don’t claim to be representative):
If you want to bootstrap, which is not a given (and that’s another flaw with the article), then Rust’s strategy is pretty decent and proven to have worked: OCaml then Rust.
Go has a similar strategy, except they didn’t start from scratch (that is important!), and they used a few bespoke tools to convert the compiler in C to the compiler in Go.
[1] https://github.com/oilshell/oil/wiki/BootstrappingCaseStudies
The first compiler is just a prototyping and language design tool, not a practical compiler. I can guarantee you that many of the projects on your page started with small experiments to test out the ideas they were working on before they went and wrote the compiler that they eventually shipped. Odds are that hardly anyone outside of your direct group of collaborators would use or even see your first compiler.
I have done exactly what this article describes, as a matter of fact. I wrote it based on my own experiences.
If these are problems for you, your language is probably too complicated. Remember: the first implementation can be (and inevitably will be) a great big mess. It doesn’t have to work well. It can have crippling bugs and limitations. It doesn’t necessarily have to implement the entire language. Its sole purpose is to reinforce your language design with proof that the ideas can be put into practice.
C is a great language for writing compilers.
Not true on any of these counts. It helps to have people interested, and it helps gather interest if you have something they can pick up and mess around with, but you can (1) build it without help, it just takes longer, (2) get people interested without having something readily available to play with. You should be able to elevator pitch your programming language to people you want to work with, and get them on board. If you can’t, your pitch (or your idea (or your associates)) needs work. And if you prioritize getting it useful ASAP then you’re likely to rush it and lay a poor foundation upon which the rest of your language can be built.
This is a guide for serious projects which want to design a robust, self-consistent programming language. If you’re just making a toy, then this level of rigor is not called for. A lot of what you’re arguing for is a reduction in rigor because, well, this is difficult. But difficult though it may be, the demanded rigor will create a significantly better result. Writing a sacrificial compiler as you develop the language design proves your ideas as you work. Throwing it away and starting over lets you create a fresh codebase using only proven ideas, and shrug off the tech debt that your first compiler has in spades. Doing it in C instead of C++ (or whatever else, you were nonspecific) ensures a broadly compatible bootstrap route - you called out Rust as doing this well, but I suspect you’ve never actually bootstrapped Rust if you think it works well.
So great that even gcc moved to C++ :-). Snark aside, the ML family is probably the best for writing compilers, they’ve been designed for decades for that very purpose. SML is barely younger than C, has a standard, multiple implementations, and is very simple; I’d say SML or OCaml are great for writing a bootstrapping compiler or even the main compiler.
Also, if you want to bootstrap, look at what OCaml itself does: rather than writing the whole compiler in C (😱), the compiler is written in itself, but can produce either native code, or bytecode for a quite simple virtual machine that is implemented in C. If you want to bootstrap on a new architecture you just have to port the virtual machine. If the only goal is to bootstrap (i.e. run literally one program before you get the main compiler), any form of simple interpreter should suffice.
One of the great tragedies of our time. Let’s not pretend that GCC is a shining example of good design and compiler philosophy, though.
The OCaml approach is pretty interesting, and definitely better than some approaches (cough Rust cough). I would prefer to be able to bootstrap compilers without cross-compiling them, however, and C is nice for that because it’s already universally supported. If there are any issues during the bootstrapping process, you don’t have that extra layer of indirection to debug through that the VM design introduces.
I had thought the same thing when Mesa3D started allowing C++, too, but later in life I realized that C and C++ have different angles of repose. There are real size limits on what can be built with C while maintaining an acceptable defect rate. C++ isn’t perfect, just not as bad.
This is modern-day vaxocentrism; just like how GNU/Linux is not the only production runtime, and the world’s not a VAX, C is not universally supported. For example, since C is not capability-safe or memory-safe, any systems which provably don’t have capability or memory errors will have to emulate parts of the C virtual machine if they want to host C programs; there will usually be a more direct and native language which is supported first, with C support being an extra feature which might not manifest.
C currently compiles to more systems than any other language. For bootstrapping, the compiler doesn’t need to run well, it has to run at all. C gives the most chances to do that on an arbitrary system.
The bootstrapping compiler doesn’t even need to run on an arbitrary system; it merely needs to run on the bootstrapping compiler author’s system. I personally have used Python as my bootstrapping environment, and I’ve seen other folks use D, Haskell, OCaml, Prolog, and Scheme.
That’s how you get trust problems. I trust the C compiler on my system. I can check the source of C programs and compile them with my trusted compiler to get trusted binaries. If you have a C bootstrapping compiler, I can compile it, and with it compile the final compiler, without breaking the trust chain and with me only needing to compile two things. If you need Python or some other language to bootstrap, now I need yet another thing to compile and, potentially, port to my system. Or, I could do as you suggest and break the trust chain I had by using the compiled binary from the developer. Now, with anything that compiler produces, I’m implicitly trusting the developer that provided me that binary.
Have you published the source to your compilers? It would have been helpful to mention them in the article, and refer to those experiences.
I disagree with most of the above, but open source has a good way of resolving these arguments (“show me the code”). That said, some short comments:
Anyway I wish you luck on your project and I hope to see blog posts with concrete experiences!
I don’t understand. You’re pretty confident that the number of bootstrapped programming language projects is exactly zero… then link to a list you wrote of a bunch of bootstrapped programming language projects? Am I misunderstanding something?
They give three criteria:
This is indeed a very rare collection of circumstances, because it implies that C is the preferred high-level prototyping language, which hasn’t been true for decades and in fact might have never been true. Combined with the original article’s advice, we might be misled to imagine that the second compiler is going to be thrown away immediately.
As we’ve bootstrapped the Monte programming language, we’ve basically had three compilers. The first compiler was written in Python, generated more Python, and was barely able to run basic Monte expressions; we used it to rewrite the parser and AST handling in Monte. It existed for around two years in toto, not including time spent on pre-Monte research. The second compiler, montec, is written in pure Monte and sends Monte source code to binary bytecode. This compiler has existed for about half a decade and probably will continue existing for another few years. Our third and final compiler will send pre-linked bytecode to C or assembly and use GCC or LLVM to do native compilation.
So, while it might seem like we followed the outlined pattern, our second compiler is written in pure Monte and uses standard Monte modules for linkage. Only the runtime interpreter has any C, and that particular codebase continues to shrink over time; it’s almost entirely Python except for libsodium and libuv.
If the second compiler were to be written in C, then the project would be moribund. Indeed, my partner in writing Monte wrote a C runtime for E as part of their pre-Monte research, and it turns out that even implementing a small slice of E’s relatively demanding runtime is a tall task for C. It is generally accepted knowledge in the E community that compiling E to native code requires a garbage collector, and at some point it is too difficult to implement capability-safety and memory-safety simultaneously in one single runtime.
The first compiler is the prototype compiler, written in whichever language. The article is pretty clear that the first implementation is to be thrown away. Further, the second compiler is written in C, but the second compiler isn’t a prototype; the article is pretty clear that it’s a form of reference implementation, which should be kept up to date as the formal specification evolves. I don’t see anything which demands C to be used as a prototyping language, or anything which leads one to believe that the second compiler is to be thrown away.
Though the rest of what you’re saying makes sense, and my original misunderstanding was that I interpreted “Bootstrapping is something I’ve researched [1], and I’m pretty confident that the number of projects that have done this is exactly zero” to mean that “the number of projects which have bootstrapped is exactly zero”. I realize now that they meant that “the number of projects which have gone through all three steps is exactly zero”.
As @andyc points out, I think it’s bad advice to tell beginner programming language creators to start with C/C++ and Lex and Yacc. The words “extraneous cognitive load” rush to mind immediately, especially for the first goal of a “prototype implementation”. When I first started working on compilers in high school, I only knew C and some C++, and I wasted so much time writing the most minute algorithms and transformations, messing with pointers and memory allocation and reference cycles and segmentation faults. If you want to design a brand new language, these should be the least of your worries. Dealing with these issues just distracts you from designing your actual language.
It was not until I took a programming languages course in university that I realized just how much easier it is to write programming languages in something like Haskell or OCaml: I can write an imperative language with functions, closures, different data types and much more in about an hour; it would still take me days to make the same thing work in C++, and although I'm no professional C++ developer, I doubt that it's purely a matter of skill. The ability to rapidly prototype and explore ideas for your language is a huge positive, since it drastically lowers the cost of making an incorrect design decision. Even Rust, a very much imperative language, had OCaml as its first compiler implementation language.
I wonder if the article’s author has ever tried functional languages, especially as metalanguages, and if they did, I wonder what they thought of them.
I agree that C/C++ and Lex/Yacc is a bad idea. It reeks of cargo culting – there is generally no reason to write any new code for a general-purpose machine in C/C++. Sure, you can LARP the '70s, but it detracts massively from the article.
Also, writing a self-hosting compiler is neither a requirement nor particularly useful; it certainly doesn't deserve its own bullet point.
I sincerely hope people avoid doing the things that are recommended in the article. Perhaps “(1990)” could be added to the title, such that people see at a glance that the advice is 30 years out of date?
Author here. First of all, C and C++ are completely different programming languages. Please do not conflate them. I didn’t mention C++ at all in my article, and I will reply to you from the C perspective only.
Quotes from my article:
This is not geared towards “beginners”. This is geared towards someone who is making a serious attempt at a language which they believe will be practical and useful beyond their own personal needs as a learning project. I don’t expect such a programmer to get bogged down in learning C, because I expect such a programmer to have already learned C, perhaps to an expert level.
I spoke to much of the same advantages in the article when talking about the sacrificial implementation. You can write this in whatever you want, and OCaml is a fine choice. I only mention C for the second compiler, which would evolve to become the bootstrap compiler for your future, self-hosted toolchain. Having a bootstrap compiler written in C has a lot of practical benefits, which I could elaborate on more in a future article, but at this point you are no longer in the prototyping phase where the advantage of rapid design iteration is entirely necessary. Hell, I also stated that you should be writing a specification at this point.
Yes, I have. I hate them passionately.
Knowing languages doesn’t make you any better at designing programming languages. You are, for all intents and purposes, still a “beginner” when it comes to compiler implementation until you’ve written a compiler. So, you are still talking to people who are “beginners”, just not beginner programmers. Furthermore, it’s not about “not knowing C”. You won’t get bogged down because you don’t know the language; you will get bogged down because the language is made to require far more precision than is needed for implementing a language.
I assumed that you would want the sacrificial implementation written in C because you mentioned Yacc, which is made for the C and C++ languages.
May I ask why? And also, may I ask why you think C is good for writing languages?
But not knowing languages does make you worse at this.
This is why you should be prepared to throw your first one away. It has nothing to do with your C skills.
I don’t agree that this is a symptom of using C.
I happened to write my sacrificial compiler in C, using yacc, but any parser generator would do. My article only mentions it as one possible option for this use-case. I understand how you could interpret it differently, though. To be clear: I don’t think it matters what language your first compiler is written in. You’re going to throw it away.
This is a deeper question than I have the patience to answer in an unrelated discussion.
Again, it doesn't really matter what language you use for the sacrificial compiler. But I do think that C has some advantages for a second compiler. It is universally supported, which makes your language easy to bootstrap. It's also quite fast, so as long as your algorithms are fast then your compiler will be fast. Writing parsers, representing an AST, representing your type systems: this is all fairly straightforward to do in C. It does have its limitations, and it would get annoying if you tried to solve every problem in the language design space with C, but that doesn't really matter for a simple compiler. Simple is all this compiler will ever need to be, because its primary purpose is to build your third compiler. Your third compiler should be written in your own language, and at that point it's on you to make your language suitable for any level of advanced compiler development.
As a regular (and happy) OCaml user, I’d be interested in your opinion as well. If you could write something about it on your blog some day, that would be great.
Can I ask for more information on this perspective? I’ve found functional programming protects me from so many bugs when mutating state, and I’m a little surprised that you are not on the hype train slash bandwagon.
Admittedly I’m a Lisp-head, but this advice makes so little sense to me it’s almost funny. It makes more sense to me to prototype in a high-level language, why involve C at any step, unless your explicit goal is to be compatible with C (and even then…)? Most of the historical 70s and 80s AI research which included a lot of language experimentation was done in Lisp, for good reason:
This is a great combination to get you to the juicy part, which is the semantics. You can figure out if your language works at all, and it’s easy to swap out different parsers or change aspects of the compiler radically, which is exactly what you’d want to be able to do when you’re exploring the design space. Of course, nowadays many of the above advantages are offered by other high level languages besides Lisp. Other commenters have already mentioned Haskell, OCaml and Python, all of which are fine ideas. But this post seems to suggest doing the second (and maybe also first; yacc is mentioned) implementation in C which to me sounds awfully backwards.
Syntax heavily influences semantics. Lisp cannot parse all syntaxes. For example, let’s say I’m writing a shader language. Those are often represented as graphs. I could for example structure my programs like this:
A Lisp parser definitely can't make anything useful out of that. You'll need to change the syntax, and with it the semantics: either the pipes now need to be named, or you cannot reuse the output twice.
C is only mentioned here as a language for the bootstrapping compiler, the main purpose of which is to make it easy to get your main compiler (the third one, written in the language itself) onto your system without trusting someone else's binaries. C is quite a ubiquitous language, so you probably already have a C compiler for that system, and therefore only need to build the second compiler in order to compile the third, main one.
Not being familiar with shaders, these diagrams don't really ring a bell. But AFAIK shader languages are normally representable in textual form, which basically means by definition you can turn them into s-expressions. Naively, I would say your diagram shows just (define z (add x y)), which allows you to re-use z as many times as you like. I'd need a whole lot more context to figure out what you mean here, so you don't have to bother unless this is somehow central to your point.
Having a second compiler implemented in C which is a subset of the actual language sounds like a whole lot of extra maintenance work, as you would need to develop it in lock-step with your third, final compiler. It makes much more sense to have C as (one of the) compilation targets instead.
Shaders are mainly used in graphics programming, and can be thought of as procedures that run over a field of inputs and output another field of outputs. And maybe the simple example didn't quite show what I meant. You got it right: this shader just takes two named inputs and outputs their sum into a named output, and it's quite simple to convert it to a textual representation without changing semantics. But take this example:
When converting this to a simple textual representation, you now have to choose: should the result of add be assigned to a variable for reuse, or should we disallow the reuse of a result? The assignment to a variable changes the semantics, since in the graphical representation the pipes are anonymous. Disallowing the reuse quite obviously changes the semantics as well.
You should not develop the second compiler "in lock step" with the third. You should keep it up to date, still implementing all the features described in the standard, but without anything fancy – just enough to compile your third compiler – because hopefully, at that point, your language itself has stopped undergoing major changes and most of the churn is in the standard library and similar things.
Compiling things into C is, in my opinion, holding a shotgun to your foot. C has a lot of undefined behavior, and it can get very hard to make the compiler avoid all of it. I'm not saying it's impossible, but it's not practical to have such a high-maintenance feature with so few uses. Also, notably, this doesn't get rid of the trust problem, since the machine-generated C wouldn't be readable, so you are still trusting whoever ran the tool.
That’s not quite right. Having the parser (or some other stage of a compiler) add automatically generated, unique names to “anonymous” program points doesn’t change the semantics of the language the user is using, since there is no way for the user to write that unique name in the input program. In fact, compilers for all kinds of languages automatically generate names for intermediate values all the time, and that’s tame compared to other transformations that an industrial strength compiler will do (eg. change loop structure, convert to continuation passing style or static single assignment, etc). However, that doesn’t change the semantics of the input language because you can’t write code in your input program that targets the transformed (intermediate or output) language code, only the input language. IE you can’t target the CPS-transformed code in intermediate compiler stages any more than you can target the intermediate names given to “anonymous” values.
The top-level comment is absolutely right about the benefits of using something like s-expressions (and a high level language with rich data structures, garbage collection, etc) to go quickly to working on semantics. There may be many reasons you may ultimately want a syntax other than s-expressions, but s-expressions are a great intermediate format for quick compiler prototyping. Once you’ve nailed down the semantics, using the simplified intermediate syntax of s-expressions, you can add support for the “real” syntax by adding a parser to transform your input syntax into s-expressions. If you are making a new language that is really just trying to innovate syntax, then sure, you probably want to spend your time prototyping various parsers. If your language is doing something interesting and new with semantics, however, then fussing about with some complicated parser just to get started on prototyping your semantics is a waste of time. And if what you really want to do is write parsers, in 2020 I still can’t recommend Lex and Yacc to anyone.
Additionally, as a Racket user I'm spoiled by its #lang protocol. The #lang system allows you to specify a parser for your language provided by any arbitrary library. Thus any user can add their own new syntax, including graphical syntaxes. There is still research work to be done on aspects of mixing languages of different syntax (eg. I am writing in graphical language A but I'm getting an error from a library written in algol-ish language B; how should the error be displayed?), but it's already pretty great. After using it for a while, not only do you get to see how shallow a lot of syntax sugar is, but you start to think of it like any other (semi-)orthogonal, replaceable component of the programs you write, such as your choice of data structures, utility libraries, databases, etc. In other words, you start to consider the question of whether different syntax options will be more helpful (or harmful) for writing a particular program or component, independently of various other considerations. Spoiler alert: most general-purpose code is best just written in s-expressions, possibly with some minor extra conveniences (such as the standard quote, quasiquote, unquote shorthands), while heavier syntax really comes into its own for domain-specific languages.
This is quite similar to the problem of, well, writing things in C in the first place. Except that in this case the C shotgun is only an optional target, and it is perhaps easier to guard against undefined behavior within a few code generation templates than in the thousands of lines you need to write by hand to write a compiler in C. (Actually, keeping a template UB-free is surely harder, but fixing a bug in a template fixes all uses of that template, and you will simply have less code in C-target templates than in an entire compiler written in C.)
The given example is not possible to replicate in s-expressions without adding a variable. If my language does not have variables, then I cannot use Lisp's parser for prototyping my language. The fact that the compiler adds in variables is an implementation detail and should be completely irrelevant to the user. You cannot move the language I gave to s-expressions without changing the semantics.
Racket does have a neat system for DSLs. That makes it different from other Lisps, which from what I've seen just make you shove everything into s-expressions. Yet I feel like your Lisp love is a little bit overreaching. S-expressions are really hard for a human to parse. They might be very easy for a computer, but the sea of parentheses isn't easy to discern, and working with it comfortably requires quite a bit of tooling from the IDE. I believe that there are better syntax conventions that are more suited to enabling code understanding and allowing code to be read more easily.
I feel like trying to guard against undefined behavior in C in an automated fashion is a lot more difficult than writing a minimal-effort compiler/interpreter that is able to compile the final compiler. Also, C assumes quite a few things, and if your language design doesn't agree with them, you will have a bad time dealing with that.
This is a distinction without a serious difference. Perhaps you will actually be prototyping an intermediate language that has a certain feature you won’t make available in the outer language, or that lacks a certain feature that you will add to the outer language via syntax sugar.
But at a lower level, your graph language is already stored in some serialization format. The given example is not possible to serialize without adding a variable (IE some kind of identifier that can be referenced). Maybe you serialize it to something like:
You need some kind of ID for each node in the graph so other nodes can refer to them because you can’t nest the nodes cleanly as a tree. Now let’s squint our eyes a little:
Lo, s-expressions!
This is the quintessential complaint against lisp. And… it's just wrong! I've never met someone who has seriously used a lisp (eg. for a job, or really anything more than just "I have to use this wacko language for some university course") that has ever struggled with this. It is discussed in bewilderment at every lisp-related meetup: "Wow, I came in thinking the parentheses were a big problem, but quickly realized it was no problem!", "We hire people with no lisp experience all the time, and the syntax is never the roadblock to learning Clojure.", etc.
(On the other hand, students who grudgingly learn just enough to get by for a class frequently struggle with it. As a TA for a class like this, I saw it frequently.)
Yes, s-expressions are hard for humans to parse when they are formatted wrong (or minified). Poorly formatted (or minified!) Javascript, C, etc is harder for humans to parse! I promise you (or almost anyone) that if you learn the standard rules for Lisp formatting (there are just a couple simple rules, far fewer than for formatting Algol-like languages), then stick to those formatting rules for reading and writing lisp, all your lisp reading problems will go away.
(Well, all your lisp reading problems relative to reading other languages – you still have to know how to read, probably in English, it’s easier if you can see rather than read Braille, etc. Deeply nested expressions using lots of custom-defined functions are difficult in any language, because you have to untangle what all these different functions mean. But that’s true of any language.)
Beginning CS courses (and books, etc) tend to spend a lot of time on language syntax. What does x[y] mean? What about curly braces? What about curly braces in another context? Where does the semicolon go? Oh, semicolons aren't necessary after braces for a block, but they are necessary after a struct declaration? And, heaven help us, how do you parse x + y && a + b? And this is all after literally more than a decade of repeated instruction in algebra notation, order of operations, etc.
University Lisp courses tend to assume "you'll just pick it up", or maybe spend a fraction of a lecture on syntax. If people had as much instruction in Lisp syntax as they get with C syntax, it would not be a problem.
Before you say I'm just ignorant and living in a lisp bubble, let me say that I truly do sympathize with the complaint of s-expressions being hard. The first time I used a lisp was for a university programming languages course. I hated it! I really, honestly struggled with the syntax. (Of course, I tried to format it like C.) It was bad enough that I badmouthed lisp for years after that. But then I realized: "Hey, thousands of people use this, love it, and not only claim the syntax is not a problem but actually a very strong benefit. I know people who have defended lisp to me after I made fun of it, and they aren't unrelatable super geniuses. Maybe I should actually give it serious consideration rather than adopting popular antagonistic beliefs about it after a brief, poorly instructed experience with it?" When I finally sat down to seriously learn lisp and understand its weird syntax, it ended up taking less than a day to get pretty comfortable with it.
I'm not claiming that your (potential) negative experience with lisp matches mine, or that the ability to read and write lisp code isn't something that you need to take some time to learn. But learning to read and write lisp is not that hard. I submit that it's significantly easier than many things lots of programmers learn. I think it's significantly easier than learning how to type, learning how to use vim, or learning bash or similar shell languages (actually learning the bulk of the syntax, how substitutions work, etc, not just how to run some simple commands). Thinking about it for a moment, I think it's actually pretty similar in difficulty to learning to use a tiling window manager. Not that hard, and well worth it.
It is true that manually formatting lisp code tends to be more tedious than manually formatting languages like Javascript. But lisp auto-indenters are relatively easy to write and widely available in text editors. Your favorite editor probably already has one. Auto-indentation is the single invaluable bit of editor/IDE tooling that’s required.
Of course, once you also start using an editor extension that automatically keeps your parentheses balanced, then start using something like Paredit (I prefer SmartParens) to get 95% of the way to structural editing, then editing other syntaxes starts to feel like a chore when you can effortlessly navigate and transform your code structurally, always preserving syntactic correctness. Admittedly Paredit and friends take some practice, but honestly you can master the essentials within a couple of hours, then just consciously make an effort to use it a little each work day over a short period so you remember it.
I could totally write a parser that parses the text as I pasted it and interprets it. Would it be easy? No. Would it be easy to use? No. But syntax is a contract with the user, and I don't want to make a contract that contains any kind of IDs; no non-graphical representation can avoid them. The fact that the graph itself might be stored as a JSON file with IDs is not relevant, as that is not part of the contract. Yes, you can theoretically express every data structure with s-expressions. You could just as well express every data structure with nested dictionaries, or a bunch of bytes and pointers. It's like being Turing complete: it means that you can do things, not that it's fit for all or any of them.
Parsing isn't just reading the data that is written; it is also understanding what is written. In Lisp there is no visual distinction between instructions and data, two things that usually don't mix in general programming. Most other languages do make that distinction, because it is important to quickly see what is a simple data extraction and what is processing. Most languages have distinct looks for the different ways to look up data, and distinct looks for the different ways to store data. S-expressions don't leave any space for those distinctions.
The fact that you need to format the code to be readable isn't good. Self-formatting syntax, as in Python for example, doesn't leave any chance for that. Yes, I know Python syntax has some bad parts; no need to argue about that.
About learning: sometimes powerful tools take a lot of time to learn how to use, but once you know how to use them, they can be tens of times more useful than a tool that's simple to learn. Professional tools take time to learn.
If you have to write programs to help you manage the number of parentheses in your code, maybe the syntax isn't that fit for humans. As I said: great for machine consumption, not that good for humans.
It’s unfortunate that this is turning into another Lisp/s-expression syntax flamewar. It looks like you already decided that you don’t even want a textual representation for this language. This means that the question of using s-expressions or not is completely besides the point.
My original point was that you can punt on the “syntax” initially (which in your case would probably be some sort of visual diagram editor), and go straight to the usually more interesting parts. In practice, for your case that would mean simply defining a format for easy representation, whether that’s using s-expressions or JSON or a serialized Python dict with intermediate IDs.
All this means you don’t even have to spend time to build a parser, you can go straight to the part where you are trying out compiler techniques to see if this is even a viable thing to do. It wasn’t about whether you find this useful or readable for the end-user. Remember, Lisp itself was originally supposed to get a non-parenthesized surface syntax with M-expressions, the syntax we currently use was originally no more than an intermediate syntax which made thinking about language semantics easier.
Sorry about that. Somehow last night I was just in a mood to keep replying.
It does have a textual representation: the ASCII art I pasted. You can write a parser to parse that. It won't be easy, and defining the syntax would probably be a full research paper, because all current syntax-definition languages only work in a single linear dimension, while this definition would have to work in 2D space. That would actually be a pretty interesting problem to take on, now that I think about it.
Also, m-expressions actually look pretty good, and are a definite improvement from the s-expression syntax in my eyes. If only lisp actually had them implemented from the start…
I 100% agree.
Of course, I think s-expressions are one of those powerful 10x professional tools (for language prototyping, or for more general programming when paired with macros).
(10x is probably an exaggeration, or you only get 10x for certain applications that are a minority of what you do, for s-expressions and each of those other things listed. The most productive programmer I know doesn’t use vim, barely uses any features of emacs, doesn’t use a tiling window manager, and doesn’t even type properly. Of course, I still think all of those things are worth learning and make me more efficient than I would otherwise be.)
I thought we were discussing building a first prototype of a new programming language, not making contracts to users about an esolang.
Every language needs formatting to be readable. In general, every programmer either lets a program (their editor) automatically format (indent, mostly) their code, or does it manually. We give it little thought and do it nearly unconsciously, in C and in Lisp.
Do you like to read code like this[1]:
Nobody writes code like that; everybody indents their code. Furthermore, people indent their code in standard ways, not willy-nilly adding arbitrary amounts of indentation to whatever line; they follow logical rules.
Learning the standard way to indent (and use indentation for reading) Lisp code is no different than learning the standard way to indent Algol (C, Javascript, Python, …) code. It's just a different set of rules that programmers follow practically unconsciously once they've internalized them.
Python just makes some instances of poor formatting a syntax error. You still have to format it!
1: The only program you "need" is the auto-indenter. This simple program was written decades ago. The other programs just make you more efficient; they are unnecessary but useful "professional 10x tools" that you can learn, similar to Vim.
2: You need a program (text editor) to write <C, Javascript, Python, whatever> code? It must not be fit for humans. You need a program (compiler, interpreter) to be sure your types (or even your syntax) are correct? It must not be fit for humans. You need a program to actually run your program? Not fit for humans. You need a program to send instant messages? Instant messaging must not be fit for humans. This is a silly argument. You need programs to do anything with a computer, and it's just as easy to argue that some program (or part of a program) in some other PL toolchain is a boondoggle that makes it somehow unfit for humans. Having a machine check your parentheses is no different than having a machine check that your program is correct in other ways, and every programmer uses such programs in an edit/run/debug loop.
Again, these arguments about lisp syntax being difficult or unsuited for humans are only made by people who have never seriously used a lisp. Thousands of normal, everyday humans read and write lisp code quite naturally. We read and write it on white boards, we read and write it in any and every text editor, we read and write it in REPL sessions, we read and write it in IDEs that draw fancy arrows. We make funny music videos about reading and writing lisp[2]. Children read and write lisp. Humans have been quite capably reading and writing Lisp since before 1960. Lisp is very suited to human use. Reading arguments about how lisp is not suited for humans is like reading arguments that human flight is impossible, or how you can’t make a wind powered vehicle that goes directly down wind faster than the wind. People have made many polished arguments to prove those points. But in the end you don’t need to wade through lots of fancy arguments to see that they are demonstrably false – you can just look at all the people that are doing these things.
[1] This is a slight modification of GPL-2 licensed code available in the Linux kernel in the kernel/sched/core.c file.
[2] This music video deserves to be shoehorned into programming conversations more frequently than it is. It’s great. https://www.youtube.com/watch?v=HM1Zb3xmvMc
No I do not. I also did not say that C syntax is good for this reason, I said it about Python. Python is a huge syntactical leap from other C-style languages (though Python syntax isn’t perfect, and I particularly don’t like that you can mix tabs and spaces which can bring a bunch of ambiguity). Languages with the Off-side rule are indented correctly by default. So I read code like this[1]:
Yes, you can remove some whitespace, but removing the whitespace between operators doesn't hurt readability as much as a lack of indentation does. Languages without the off-side rule don't provide this guarantee, so programmers have to rely on conventions, which is not always possible. In Lisp this problem is exacerbated by the fact that there is only a single syntactical character, blurring the line between control flow and method calls even more. Scheme did recognize this, and there is a very good write-up of the problems in one of the proposals to add indentation-based syntax to Scheme.
My dad still has his old notebook with hand-written FORTRAN code from his university days. It's not too rare for me to scribble some code in my notebook when I don't have a computer around (not too common an occasion with the pandemic). I can read, understand, and quickly check Python, C, and FORTRAN for syntactic correctness on paper. With Lisp, I have to resort to the long and error-prone process of counting parentheses. To be completely honest, I don't understand how Lisp survived the era when access to computers wasn't universal and most programmers didn't have their own machines and wrote code by hand.
[1]: This code is from Flask src/flask/app.py file, licensed under BSD-3 clause license.
I've noticed a couple of languages that, instead of a compiler, implement a translator into C: Nim and Vlang. What are the pros and cons of a compiler vs a translator?
It's easier to get something working if you generate code in another language rather than LLVM IR or assembly or whatever. But if you want semantics quite different from your host language's, you'll maybe have an easier time if you build up from something simpler.
Eiffel has been working like this more or less forever, and it was sometimes painful to adapt the to-C compiler when new C compilers started exploiting arbitrary combinations of undefined behavior to remove intended functionality from your program in favor of a 1/10000 speed improvement on some microbenchmark. You can be as careful as you want: the UB rules of C pretty much guarantee that you will emit UB code, and eventually some C compiler writer will pick the combination that you accidentally used for "optimization" and break your code generator.
Target a VM (JVM, CLR) or Pascal (free pascal is a pretty decent compiler) if you want a well-defined backend environment that isn’t changing in some drastic, unexpected and undocumented way every couple of years.
Here are some of the pros and cons vs writing a native code generator.

Pros:

- C compilers are riddled with workarounds for buggy processors.

Cons:
Great article, the perspective is good even if you don't end up building your compiler in exactly these steps. You had me at C and yacc. It's weird to see a CHICKEN developer hate on C; I thought that was CHICKEN's superpower. From the highest level of abstraction to the lowest (figuratively, at least), it has you covered. (I really like programming in CHICKEN along with C.) To all you Vader haters out there: we'll blow your planet up.
haha, hi :)
C is (or: was? before all the UB exploitation) a nice language to work with, but it is not sacred. It has a lot of issues (and we’ve certainly run into our share of issues particularly due to using it as a compilation target). It is great for low-level operations and for interfacing with existing operating systems. But for prototyping a new compiler? That’s silly. Manual memory management and its impoverished string processing functionality alone should put anyone off of using it for prototyping. At that stage, you don’t need performance or low-level control over anything, really.
C's string handling sucks ("impoverished" is harsh), but you can learn to use it. However, string handling really doesn't matter for a compiler (really, I'm serious). The only time the compiler ever handles strings is in the lexer, and it's not especially more difficult in C to convert input text into a token stream. From that point forward, you're just working with types.
The language which inspired this post has a lexer which is about 1 kloc, and literally none of the remaining stages use strings at all. The last stage, emit, does deal with strings insofar as it translates an AST back into text, but that’s accomplished entirely with fprintf calls and it’s quite straightforward.
Off-topic: sucks that Drew left the fediverse; I won't get to read any of his posts anymore (unless posted here).
Just use RSS if you want to keep the hot takes coming.
Agreed. Fedi isn’t the same anymore without Drew’s angry programmer rants. :(