I enjoy how vastly different are the proposed answers here to the title question — a version of C that’s even more minimal and low-level, to the point of being esoteric, and an cutting-edge dependently typed functional language that provides the only(?) practical implementation of Quantitative Type Theory.
One thing I can say as a (very incomplete yet) student of Idris is that even if you have zero intentions of actually using it, it will make you a better programmer in almost any language you’d work in. For example, I just learned of a concept called “total” (as in, “total function”) which is when a function can accept all its designated possible inputs; Idris checks for that upfront and lets you know when something is “not total”, and on the same day I learned that concept I actually had to fix a bug in a function I wrote which had to do with… not handling all possible inputs. >..<
Anyway, Idris 2 IS bootstrapped (implemented in itself), it can output a C version which can then be compiled down to a statically linked executable, and the code you end up with is apparently fairly fast, all things considered, though I have a bunch more testing/playing with it to do.
It’s pretty well-known by Haskell types as an “un-fucked Haskell” (literally what one person I know said about it), I found it while bemoaning the fact that Elixir doesn’t let you build a single binary out of it.
I count 2329 lines of C code in the C compiler (leaving out cvopt, a separate program I can’t identify).
There are also 2477 lines of assembly. This appears to be the runtime for compiled programs, not the compiler itself.
ncc.c is the front end program (cc), the command that you run from the shell. It invokes the first pass (/usr/lib/c0), the second pass (/usr/lib/c1) and finally the assembler (/bin/as) to generate machine code. c0 reads a C source file and outputs a mix of assembly language and some kind of IR representation for expressions. c1 reads the output of c0 and outputs pure assembly language.
This is the very first released C compiler written in C itself, for a PDP-11. You won’t be able to compile this with a modern C compiler. Based on the file nc0/c00.c, the reserved words are: int, char, float, double, auto, extern, static, goto, return, if, while, else, switch, case, break, continue, do, default. No struct yet. No header files or preprocessor yet.
I posted this before reading @andyc’s post on C4, which is an even shorter self-hosted C compiler designed along similar lines, although with far fewer language features. Only 528 lines of code. (Andy claims that C4 is “dynamically typed”, which isn’t true. It has int, char and pointer types: multiple static types. Type errors are reported at compile time, not at run time. There is no dynamic typing facility by which you can store a value of any type in a variable, and query its type at runtime.)
What type errors were you able to get from it? I tried the following:
char c;
int i;
int *p;
And you can freely assign i to c, i to p, etc. You can also compare them for equality freely.
There are no function parameters in C4, and thus there is no type checking of function parameters :)
It seems very much like old school B or BCPL – a structured assembler.
To me dynamic vs. static typing is a spectrum, with ANSI C being dynamically typed in many areas (implicit decay of array to pointer, implicit conversion of pointer to bool, of double to int – or maybe that’s a C++ overload rule, either way I just hit it). Most of this was an explicit design by Thompson and Ritchie – they had more of the dynamic typing philosophy.
But overall it still emits many type errors, so ANSI C is statically typed.
On the other hand, I was not able to get ANY type error out of C4. It has to emit at least one type error to be called statically typed :)
“bad function call”, where the expression in function position is not a function,
“bad dereference”, in *p where p does not have a pointer type,
“pointer type expected”, in p[i] where p does not have pointer type.
In B or BCPL, there’s only one type, the machine word. By contrast, C4 is what C was originally like. There is more than one type. int and char are distinct types that occupy different amounts of memory. And unlike B/BCPL, you need to keep track of pointer types so that you know the size and type of the object you are pointing to. So int*, int**, char*, char** are all different types. The compiler generates different code for dereferencing an int* vs a char* (LI vs LC). The type discipline in early C was less strict than today: the compiler performed more automatic conversions between types than what we now expect. But it’s still static typing, not dynamically typed, and not typeless like B or BCPL.
The symbol table is a global variable called sym. It is used for both global and local variables. When a function definition is entered, its local variables are added to the symbol table. On exiting a function definition, the local variable definitions are removed.
Each entry in the symbol table has 8 fields, defined in an enum at top of file. The Type field represents the static type of a variable.
Ah OK, thanks for the correction. Apparently I’m not very good at reading C4’s code!
I did try it on a bunch of snippets though.
So it’s interesting that it does have compound data structures in the form of malloc()ing an array, and then pointers and a[i] give enough access to that structure. Which I suppose is the cleverness of the original C.
I don’t see any usage of stack or global arrays, but it does have dynamically allocated arrays.
Trying to think of what simplifying assumptions you could make to “Design golf” a language that is both super simple and still usable enough to write a compiler for, without just saying everything is of type “object”.
I think if you entirely get rid of type inference and rely on the C preprocessor for language imports, you could get to something very small. Maybe a miserable language to use but it would work I think!
Smalltalk was originally created as a bet that you could fully specify a useful language on a single page of paper (I think US letter sized). It cheated slightly by punting a lot of things to the standard library (e.g. loops, conditional branches) but it provided everything that you needed to implement them. Its not statically typed though.
It’s also not that interesting to think about the compiler in isolation. Most languages can’t implement their own run-time system entirely. The C runtime includes some assembly, as does the C++ runtime. A Java VM requires either a load of C/C++ or extensions to Java like com.sun.Unsafe that let you bypass the safety guarantees of the language. For a high-level language, not being able to implement the language in itself is a feature: if I can’t do unsafe things, I can build systems that depend on the fact that no code that I pull in from third parties can either.
If you just think about compilers, then you’ll find that functional languages with pattern matching are really good for writing compilers (a compiler is a pure function that consumes some text, builds a tree, pattern matches over the tree, and produces some output). @ltratt has written a lot about the dangers of optimising a language for implementing the compiler for the language. I think these are less valid if you consider incremental recompilation, REPLs, and LSP tooling such as online refactoring to be part of the compiler, because then the compiler starts to look more like a representative subset of useful programs, but you don’t get these if you’re aiming for the smallest possible compiler.
One of the things that you can do to simplify bootstrapping for a static language is skip all of the type checking. If you have a bootstrap compiler that implements the language but generates weird output on ill-formed input, you can then implement the full compiler in the language. This is the opposite of the original question, which wanted a smaller implementation in itself, but it means that if the compiler compiles with itself then it will also compile with the bootstrap compiler for bring up on new platforms and you just have to worry about codegen in the bootstrap compiler.
The language has no compound data structures at all. No structs or arrays. How do you create a symbol table, to look up types of variables?
You don’t, you just have a single global variable ty that is mutated in the expr() function, which parses expressions, and does some very basic and incomplete type checks
It definitely has a non-zero amount of static typing. Although I wouldn’t argue if you said it wasn’t statically typed :)
So yeah probably the key concerns are how to represent the symbol table. That requires some power in the language.
Also if you want to skip creating any kind of tree, like c4 does. For what I’m doing, I would want a language powerful enough to create a tree :)
“lack of user-defined data structures” feels pretty far away from what I would find to be a satisfactory answer here, though honestly I bet you could probably implement structs with relatively little amounts of code with enough care. Famous last words perhaps (and of course when you introduce user-defined data structures suddenly you start wanting everything, and now you’re writing Agda)
I actually don’t know which Scheme-in-Scheme is canonical
http://www.paulgraham.com/rootsoflisp.html (you should click at “Complete Article (Postscript)”). This article is excellent. Yes, it is not about typed languages, but certainly worth reading. Yes, this is subset of CommonLisp, not of Scheme (i. e. it is subset-of-cl-in-subset-of-cl). But the code can be easily modified to became scheme-in-scheme. I have such scheme-in-scheme interpreter, I can send you one.
Also, that Lisp can be easily modified to become language where normal values (i. e. symbols and lists) are syntactically disjoint from functions. In such language one will be unable to pass functions as arguments to other functions and assign them to variables. So, the language will become “worse”, because now functions are not first-class. But now you have two different types: normal values and functions! So, this will be possibly the simplest statically typed language implemented in itself! With two types. And yes, you can easily add some exceptions, which will be thrown in case of syntax/type error.
Of course, this is not the only way. In any case, “start from above mentioned cl-in-cl and proceed from here” is good strategy. Say, you can start there and add two types: lists and numbers.
Yeah that’s kind of what I was thinking, and I thought someone would have a huge program that’s a clever church encoding of a PARSER and type checker for such a thing.
I’m defining the self-hosted language as a real program you can run from the command line, with parsing and printing.
So after querying Reddit and lobste.rs, it appears that this may not have been done before, which is surprising
Idris 2, maybe?
Idris is pretty cool btw
I enjoy how vastly different are the proposed answers here to the title question — a version of C that’s even more minimal and low-level, to the point of being esoteric, and an cutting-edge dependently typed functional language that provides the only(?) practical implementation of Quantitative Type Theory.
One thing I can say as a (very incomplete yet) student of Idris is that even if you have zero intentions of actually using it, it will make you a better programmer in almost any language you’d work in. For example, I just learned of a concept called “total” (as in, “total function”) which is when a function can accept all its designated possible inputs; Idris checks for that upfront and lets you know when something is “not total”, and on the same day I learned that concept I actually had to fix a bug in a function I wrote which had to do with… not handling all possible inputs. >..<
Anyway, Idris 2 IS bootstrapped (implemented in itself), it can output a C version which can then be compiled down to a statically linked executable, and the code you end up with is apparently fairly fast, all things considered, though I have a bunch more testing/playing with it to do.
It’s pretty well-known by Haskell types as an “un-fucked Haskell” (literally what one person I know said about it), I found it while bemoaning the fact that Elixir doesn’t let you build a single binary out of it.
Here’s the source code for Unix, second edition, from 1972: https://www.tuhs.org/cgi-bin/utree.pl?file=V2
The C compiler source code is linked from the page above, as http://www.tuhs.org/Archive/Applications/Early_C_Compilers/last1120c.tar.gz
I count 2329 lines of C code in the C compiler (leaving out
cvopt, a separate program I can’t identify).There are also 2477 lines of assembly. This appears to be the runtime for compiled programs, not the compiler itself.
ncc.cis the front end program (cc), the command that you run from the shell. It invokes the first pass (/usr/lib/c0), the second pass (/usr/lib/c1) and finally the assembler (/bin/as) to generate machine code.c0reads a C source file and outputs a mix of assembly language and some kind of IR representation for expressions.c1reads the output ofc0and outputs pure assembly language.This is the very first released C compiler written in C itself, for a PDP-11. You won’t be able to compile this with a modern C compiler. Based on the file nc0/c00.c, the reserved words are: int, char, float, double, auto, extern, static, goto, return, if, while, else, switch, case, break, continue, do, default. No
structyet. No header files or preprocessor yet.I posted this before reading @andyc’s post on C4, which is an even shorter self-hosted C compiler designed along similar lines, although with far fewer language features. Only 528 lines of code. (Andy claims that C4 is “dynamically typed”, which isn’t true. It has
int,charand pointer types: multiple static types. Type errors are reported at compile time, not at run time. There is no dynamic typing facility by which you can store a value of any type in a variable, and query its type at runtime.)What type errors were you able to get from it? I tried the following:
And you can freely assign i to c, i to p, etc. You can also compare them for equality freely.
There are no function parameters in C4, and thus there is no type checking of function parameters :)
It seems very much like old school B or BCPL – a structured assembler.
To me dynamic vs. static typing is a spectrum, with ANSI C being dynamically typed in many areas (implicit decay of array to pointer, implicit conversion of pointer to bool, of double to int – or maybe that’s a C++ overload rule, either way I just hit it). Most of this was an explicit design by Thompson and Ritchie – they had more of the dynamic typing philosophy.
But overall it still emits many type errors, so ANSI C is statically typed.
On the other hand, I was not able to get ANY type error out of C4. It has to emit at least one type error to be called statically typed :)
Some compile time type errors that I can see are
In B or BCPL, there’s only one type, the machine word. By contrast, C4 is what C was originally like. There is more than one type.
intandcharare distinct types that occupy different amounts of memory. And unlike B/BCPL, you need to keep track of pointer types so that you know the size and type of the object you are pointing to. Soint*,int**,char*,char**are all different types. The compiler generates different code for dereferencing anint*vs achar*(LIvsLC). The type discipline in early C was less strict than today: the compiler performed more automatic conversions between types than what we now expect. But it’s still static typing, not dynamically typed, and not typeless like B or BCPL.Ah OK, it has a single, global variable
tythat is mutated in the expr() function, while it parses.There is no symbol table – no mapping of variables to types, which is why many type errors go undetected.
But there are indeed some static type checks.
The global variable – an ingenious way to implement (some) type checking in a language without structs or arrays!
OK maybe C4 is indeed the smallest typed language implemented in itself :)
The symbol table is a global variable called
sym. It is used for both global and local variables. When a function definition is entered, its local variables are added to the symbol table. On exiting a function definition, the local variable definitions are removed.Each entry in the symbol table has 8 fields, defined in an enum at top of file. The
Typefield represents the static type of a variable.Ah OK, thanks for the correction. Apparently I’m not very good at reading C4’s code!
I did try it on a bunch of snippets though.
So it’s interesting that it does have compound data structures in the form of malloc()ing an array, and then pointers and a[i] give enough access to that structure. Which I suppose is the cleverness of the original C.
I don’t see any usage of stack or global arrays, but it does have dynamically allocated arrays.
Thanks for checking. Definitely a candidate, although it seems like we should be able to do better after 51 years!
Thompson C in its early days was very untyped, closer to BCPL, but I think it still counts as typed because it had structs.
I think this is exactly what you need: https://crypto.stanford.edu/~blynn/compiler/ . The compiler seems to be written in itself. And it is able to produce webasm binaries!
Trying to think of what simplifying assumptions you could make to “Design golf” a language that is both super simple and still usable enough to write a compiler for, without just saying everything is of type “object”.
I think if you entirely get rid of type inference and rely on the C preprocessor for language imports, you could get to something very small. Maybe a miserable language to use but it would work I think!
Smalltalk was originally created as a bet that you could fully specify a useful language on a single page of paper (I think US letter sized). It cheated slightly by punting a lot of things to the standard library (e.g. loops, conditional branches) but it provided everything that you needed to implement them. Its not statically typed though.
It’s also not that interesting to think about the compiler in isolation. Most languages can’t implement their own run-time system entirely. The C runtime includes some assembly, as does the C++ runtime. A Java VM requires either a load of C/C++ or extensions to Java like com.sun.Unsafe that let you bypass the safety guarantees of the language. For a high-level language, not being able to implement the language in itself is a feature: if I can’t do unsafe things, I can build systems that depend on the fact that no code that I pull in from third parties can either.
If you just think about compilers, then you’ll find that functional languages with pattern matching are really good for writing compilers (a compiler is a pure function that consumes some text, builds a tree, pattern matches over the tree, and produces some output). @ltratt has written a lot about the dangers of optimising a language for implementing the compiler for the language. I think these are less valid if you consider incremental recompilation, REPLs, and LSP tooling such as online refactoring to be part of the compiler, because then the compiler starts to look more like a representative subset of useful programs, but you don’t get these if you’re aiming for the smallest possible compiler.
One of the things that you can do to simplify bootstrapping for a static language is skip all of the type checking. If you have a bootstrap compiler that implements the language but generates weird output on ill-formed input, you can then implement the full compiler in the language. This is the opposite of the original question, which wanted a smaller implementation in itself, but it means that if the compiler compiles with itself then it will also compile with the bootstrap compiler for bring up on new platforms and you just have to worry about codegen in the bootstrap compiler.
I think C4 (discussed above) pretty much did it !
tythat is mutated in theexpr()function, which parses expressions, and does some very basic and incomplete type checksIt definitely has a non-zero amount of static typing. Although I wouldn’t argue if you said it wasn’t statically typed :)
So yeah probably the key concerns are how to represent the symbol table. That requires some power in the language.
Also if you want to skip creating any kind of tree, like c4 does. For what I’m doing, I would want a language powerful enough to create a tree :)
“lack of user-defined data structures” feels pretty far away from what I would find to be a satisfactory answer here, though honestly I bet you could probably implement structs with relatively little amounts of code with enough care. Famous last words perhaps (and of course when you introduce user-defined data structures suddenly you start wanting everything, and now you’re writing Agda)
Correction: as doug-moen pointed out, it does have arrays and it has a symbol table!
This code is very clever
http://www.paulgraham.com/rootsoflisp.html (you should click at “Complete Article (Postscript)”). This article is excellent. Yes, it is not about typed languages, but certainly worth reading. Yes, this is subset of CommonLisp, not of Scheme (i. e. it is subset-of-cl-in-subset-of-cl). But the code can be easily modified to became scheme-in-scheme. I have such scheme-in-scheme interpreter, I can send you one.
Also, that Lisp can be easily modified to become language where normal values (i. e. symbols and lists) are syntactically disjoint from functions. In such language one will be unable to pass functions as arguments to other functions and assign them to variables. So, the language will become “worse”, because now functions are not first-class. But now you have two different types: normal values and functions! So, this will be possibly the simplest statically typed language implemented in itself! With two types. And yes, you can easily add some exceptions, which will be thrown in case of syntax/type error.
Of course, this is not the only way. In any case, “start from above mentioned cl-in-cl and proceed from here” is good strategy. Say, you can start there and add two types: lists and numbers.
Yeah that’s kind of what I was thinking, and I thought someone would have a huge program that’s a clever church encoding of a PARSER and type checker for such a thing.
I’m defining the self-hosted language as a real program you can run from the command line, with parsing and printing.
So after querying Reddit and lobste.rs, it appears that this may not have been done before, which is surprising