I think you need to consider the Turing Tarpit. It’s true you can write anything in C in theory. However, programming languages are for humans, and you need to consider the human factor.
There are types of “simple” programs (e.g. the ones we use scripting languages for) that don’t need performance or low-level features. For these cases C is overkill, and therefore needlessly hard or tedious. The problem isn’t that C can’t do it, but that the skill and effort required needlessly increase development time and cost. C can do everything that Excel can, but when you have an Excel-sized problem, it’s cheaper to solve it in Excel.
And on the other end of the spectrum there are some complex problems for which C doesn’t help enough. You can do multi-threading in C, but doing so requires a very high level of diligence, or limiting yourself to only the simplest coarse-grained constructs. Same for security — in theory there isn’t anything in C that makes it insecure if you write it properly, but somehow we struggle to find programmers who never write bugs.
I think the size and complexity of a program changes what language is needed. If you can hold the whole program in your head, then you can use a small/simple language. OTOH if the program is so large it’s maintained by a whole team of programmers, over many years, with people coming and going, then you start benefiting from higher-level abstractions and from the language knowing the program better than you do. At such scale C is doable, but far from convenient.
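To make the multi-threading point concrete, here is a minimal sketch (my own illustration, not from any real project) of the kind of data race C compiles without complaint:

```c
#include <pthread.h>
#include <stdio.h>

/* A plain data race: counter++ is a non-atomic read-modify-write,
   so two threads doing it concurrently silently lose updates. */
static long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld\n", counter);   /* almost never 2000000 */
    return 0;
}
```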
It is worth remembering that there is an infinitely large set of programs but a very small set of useful programs. A programming language is a form of ad-hoc compression algorithm: it is intended to sort the set of programs so that the useful ones have shorter encodings than the undesirable ones. Programs with certain categories of error or security vulnerability should be harder or impossible to express. C does very poorly when considered in this way.
The abstract view of programming languages as ad-hoc compression algorithms does not motivate any particular criteria for which categories of error should be made difficult to express, or how difficult. So there is a lot missing to get to your conclusion.
This compression perspective doesn’t explain everything in language design, but I think it’s roughly correct. BTW, it is the basis of the Kolmogorov complexity of programs.
We don’t compress code in a literal byte-saving way (apart from code golfing), but we do like to reduce repetitiveness of code (the DRY principle), eliminate “boilerplate”, and replace detailed low-level constructs with terser high-level ones. All of these happen to reduce the number of symbols/constructs used to build the program.
When we discover “good” programming patterns, we invent shorter syntax for them and make languages based on that. When you look from the assembly perspective, even basic constructs like structs and functions with a stack are syntax sugar to codify common patterns and eliminate mistakes. When structs of function pointers became popular, we slapped a syntax on it and called it OOP. Some languages decided that callbacks are good for network code, but there are too many ways to mess up the callbacks, so there’s async/await. Even type systems are an example of the principle that good code is shorter: casts require extra code.
In the case of Rust, making safe syntax terse and unsafe syntax verbose is an explicit guiding principle. That’s why the single-char . operator performs automagic dereference on smart pointers, but the same on “raw” pointers is an eyesore: unsafe { (*x).foo() }.
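To illustrate the “structs of function pointers” pattern mentioned above, here’s a rough C sketch of the convention that OOP syntax later codified (all names are made up for illustration):

```c
#include <stdio.h>

/* Hand-rolled "virtual dispatch": a struct of function pointers acts
   as the vtable, and each object carries a pointer to its table. */
struct shape_ops {
    double (*area)(const void *self);
};

struct circle {
    const struct shape_ops *ops;   /* the "vtable" pointer */
    double radius;
};

static double circle_area(const void *self) {
    const struct circle *c = self;
    return 3.141592653589793 * c->radius * c->radius;
}

static const struct shape_ops circle_ops = { circle_area };

int main(void) {
    struct circle c = { &circle_ops, 2.0 };
    /* exactly what the `c.area()` sugar hides in an OOP language */
    printf("%f\n", c.ops->area(&c));
    return 0;
}
```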
All of these happen to reduce the number of symbols/constructs used to build the program.
I feel like this deserves more importance than it gets. All these discussions quickly turn into bikeshedding around type safety, type annotations, casts, type systems, etc.
Yet a terse, simple syntax that minimizes the amount of code that needs to be written to perform a task is often ignored.
The popularity of Python, shell scripts or even JavaScript is largely due to their ergonomics. Getting the usual tasks done requires writing little code and therefore little effort. None of these languages has strongly enforced type checking at compile time. I personally would like some level of type safety in these languages, but that is still secondary. And type declarations sometimes have to be written, which slows one down.
The most popular language on the planet by a huge margin, with over a billion users, is the Excel formula language, for exactly the reasons you describe. Users can start with some data and write a program to do something useful with it in a few seconds.
All true. But the abstract view doesn’t justify any particular notion of “safe” or “unsafe,” or any criteria for when a language is doing a good job. C requires extra syntax to use a value as a different type without warnings: (unsigned int) x;
I think it does: we’re interested in writing programs that don’t crash, so we optimize languages for not crashing, instead of giving equal importance to crashing and not crashing.
Since we’re unable to express this so directly, the actual syntax is a proxy of that via safe/not-sure-if-safe. For example, an incorrectly placed free can crash the program. So we make syntax for a correctly placed free shorter (e.g. zero in case of GC/ARC/RAII) and then from the pigeonhole principle the syntax for an incorrectly placed free is either impossible or has to be longer.
The C cast is an example where the language makes the safe syntax terser than the unsafe syntax. If the types are compatible (which C thinks is good and safe) you just use x, and when the types are not compatible (which may be unsafe), you have to use (type)x. C isn’t very good at it (some bad implicit conversions are allowed), but clearly it’s trying.
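For instance, a small sketch of both halves of that claim:

```c
#include <stdio.h>

int main(void) {
    /* Bad implicit conversions C allows with terse syntax: */
    int i = 3.99;            /* silently truncates to 3 */
    if (-1 < 1u)             /* -1 converts to a huge unsigned value... */
        puts("expected");
    else
        puts("surprise");    /* ...so this branch runs */

    /* Incompatible types: the potentially unsafe reinterpretation
       only compiles with an explicit, visible cast. */
    double d = 1.0;
    long *p = (long *)&d;
    printf("%d %p\n", i, (void *)p);
    return 0;
}
```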
The abstract view is that a programming language should let you write useful programs more easily/concisely than useless ones. That much is uncontroversial.
From that premise I don’t think you can say that avoiding crashing is the most important criterion. What if “useful” means a program runs as fast as possible using a minimum of resources? Then it’s impossible to express certain desirable programs in languages with automatic memory management, which could be expressed in C.
Crashing is never a goal or a desirable property (apart from xkcd 1172), therefore languages never optimize for it.
Programs that are too slow are not useful, so optimizing for useful (not too slow) programs may have to make compromises when safer constructs are impossible or not invented yet.
Yes, obviously.
Thank you for your thoughts - all good stuff!
Yeah yeah, mention Rust. Rust is too complicated to implement by one person.
I’m not sure that’s a practical metric by which to judge a tool. The C compilers that provide a practical foundation for modern software development were not implemented by one person either.
In general Turing completeness is necessary but not sufficient: it’s just one facet of what makes a language practically useful. There are many other properties that end up resulting in costs someone has to pay to use a language; e.g., is it memory safe, or will engineers and users alike be on the hook for an unknown number of egregious memory safety bugs?
Also mrustc has been implemented mostly by one person.
I knew this would be brought up; you know the effort they’ve had to do to achieve this? An incredible amount.
It’s 100K lines of code, and the majority of it was developed over a 2-3 year period (with ongoing development to catch up with the evolution of Rust). The number of commits and lines of code happens to be close to TCC’s.
It does take a couple of shortcuts: it’s a Rust-to-C compiler (no machine code generation) and it doesn’t perform borrow checking (the Rust language is carefully designed to make this optional: lifetimes are purely a compile-time lint, and don’t affect the generated code or its behavior).
I think overall in terms of implementation difficulty Rust is somewhere between C and C++. Parsing of Rust is much simpler than C++, and Rust has fewer crufty language features than C++ (there’s one way to initialize a variable), but some features are big-ish (borrow checker, type inference).
How hard it is to implement mainly depends on how high a quality of implementation you want. For example, LLVM is 85× larger than mrustc and tcc, with over 130× more commits. It’s a 20-year collaborative effort, likely not possible to do by a single person. The main rustc project is also maximalist like that, because it isn’t merely an effort to get Rust working, but to make it fast, efficient, user-friendly, well-documented, reliable, portable, etc., so much more work went into it beyond just the language implementation.
I cannot speak for mrustc, but 100k loc for tcc is bullshit. Just counting sources and headers in the top level, I get 55k loc (the remainder is taken up by tests and win32 headers). Close to 40k is taken up by target-specific code. The core compiler is about 10k loc.
The openhub stats I’ve quoted are for the whole repo, and I see 57K of .c and 38K of .h in there. This includes tests, so it’s indeed more than just the compiler.
If I run a word count on everything in the ‘src’ directory of mrustc, I get about 130k loc. I therefore conclude that mrustc’s rust compiler is approximately 10x larger and more complex than tcc’s c compiler. Recall that tcc also includes assemblers and linkers, and supports many targets.
I mean if 3 years is not a lot of effort then cheers to you! You must be an absolute coding beast.
I feel like this is a fairly disingenuous and dismissive argument - your original post stated that “Rust is too complicated to implement by one person.” The comment you were responding to was making the point that not only is there an implementation of Rust by primarily one person, but a single-contributor C implementation is a comparable size and would theoretically take a similar amount of effort to implement. People here aren’t trying to say it’s not a lot of effort, but that it does exist and you may be trivializing the amount of effort needed for a C implementation.
Sorry, I didn’t mean to dismiss anything! Isn’t the statement still true, given that it’s been mentioned they still got help?… Regardless, the general sentiment is right. I should have said instead that it’s not reasonable!
I may very well be trivializing the effort for a C implementation. In my mind C’s type system, lack of a borrow checker, and other features make its implementation maybe an order of magnitude easier. I could be completely wrong though, and please elaborate if that’s the case!
A non-optimizing C89 or C90 compiler is relatively simple to implement, with only minor inconveniences from the messy preprocessor, bitfields, and the parsing ambiguities of dangling else and typedef (did you know it can be scoped and nested, and this affects syntax around it!?). There aren’t any things that are hard per se, mostly just tedious and laborious, because there’s a lot of small quirks underneath the surface (e.g. arrays don’t always decay to pointers, sizeof evaluates things differently, there are rules around “sequence points”).
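A minimal example of the typedef quirk — the same tokens parse differently depending on what is in scope:

```c
/* typedef names are ordinary scoped identifiers, so parsing
   depends on name lookup. */
typedef int x;

void f(void) {
    x *y;      /* x names a type here: declares y as a pointer to int */
    (void)y;
}

void g(void) {
    int x = 2, y = 3;
    x * y;     /* the local x shadows the typedef: this is a
                  multiplication whose result is discarded */
}
```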
There are corners of C that most users don’t use, but a compiler in theory needs to support, e.g. case doesn’t have to be at the top level of switch, but can be nested inside other arbitrary code. C can generate “irreducible” control flow, which is hard to reason about and hard to optimize. In fact, a lot of optimization is pretty hard due to aliasing, broken const, and the labyrinth of what is and isn’t UB described in the spec.
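The canonical demonstration of a nested case is Duff’s device, where the switch labels sit inside a do-while (sketched here; assumes count > 0):

```c
/* Duff's device: case labels nested inside a loop, interleaving
   the switch's dispatch with an unrolled copy. Assumes count > 0. */
void copy8(char *to, const char *from, int count) {
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```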
There are corners of C that most users don’t use, but a compiler in theory needs to support, e.g. case doesn’t have to be at the top level of switch, but can be nested inside other arbitrary code
It’s worth noting that, since you said ‘non-optimising’, these things are generally very easy in a non-optimising compiler. You can compile C more or less one statement at a time, including case statements, as long as you are able to insert labels after you insert a jump to them (which you can with most assembly languages). Similarly, sequence points matter only if you’re doing more than just evaluating expressions as you parse them.
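Written out by hand, that single-pass strategy for a switch looks roughly like this (illustrative C-with-gotos, not real compiler output):

```c
/* What a one-pass compiler can emit for:
 *   switch (v) { case 1: r = 10; break; case 2: r = 20; break; default: r = 0; }
 * Jump forward to a dispatch block, plant a label at each case as it
 * is parsed, and emit the recorded comparisons at the closing brace. */
int describe(int v) {
    int r;
    goto dispatch;        /* emitted at `switch (v) {`                */
case_1:                   /* emitted at `case 1:`                     */
    r = 10;
    goto end;             /* emitted at `break;`                      */
case_2:                   /* emitted at `case 2:`                     */
    r = 20;
    goto end;
dispatch:                 /* emitted at `}`, when all cases are known */
    if (v == 1) goto case_1;
    if (v == 2) goto case_2;
    r = 0;                /* default */
end:
    return r;
}
```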
The original C compiler ran on a computer that didn’t have enough memory for a full parsed AST and so the language had to support incremental code generation from a single-pass compiler.
LLVM was originally just Chris Lattner. I think the question isn’t “Can one person build it?” It’s “Can one person build it to the point where it has enough value for other people to work on it too?”
Several of the folks in / formerly in Vikram Adve’s group at UIUC would be quite surprised to learn that.
I actually looked at Wikipedia first before my comment, but that made it seem like it was Lattner’s project under Adve’s mentorship. I’ll take your word for it that it was a group effort from the start.
This was my first thought as well. There are a lot of very useful things that are too complicated to be implemented by one person - the current state of Linux probably falls into that category, and I know that at least I wouldn’t want to go back to even a version from 5 years ago, much less back to a version that could have been implemented by a single person.
…And there are a lot of useful things that are simple enough for one person to implement! :D
Ha, I agree with that, was mostly just highlighting that I don’t feel like “too complicated to implement by one person” is a good reason to dismiss Rust’s potential usefulness.
For myself, I originally got frustrated with Rust not allowing me to do things; eventually, I realized that it was statically removing bad habits that I’d built in the past. Now I love when it yells at me :)
[Tool] is too complicated to implement by one person.
I’m not sure that’s a practical metric by which to judge a tool
I am. Short term, that means the tool will cost much less: less time to make, fewer bugs, more opportunities for improvement. Long term it means other people will be able to rebuild it from scratch if they need to. At a lower cost.
The flip side of this is that the tool will do much less. A wooden hammer is a tool that a single person can make. A hammer with a steel head that can drive in nails requires a lot more infrastructure (smelting the ore and casting the head are probably large enough tasks that you’ll need multiple people before you even get to adding a wooden handle). An electric screwdriver requires many different parts made in different factories. If I want to fix two pieces of wood together, then a screw driven by an electric screwdriver is both easier to use and produces a much better result than a nail driven by a wooden hammer.
Obviously I was limiting my analysis to software tools, where the ability of a single person to make it is directly tied to its complexity.
One fair point you do have is how much infrastructure the tool sits upon. Something written in Forth needs almost nothing besides the hardware itself. Something written in Haskell is a very different story. Then you need to choose what pieces of infrastructure you want to depend on. For instance, when I wrote my crypto library I chose C because of its ubiquity. It’s also a guarantee of fairly extreme stability. There’s a good chance that the code I write now will still work several decades from now. If I wanted to maximise safety instead, I would probably have picked Rust.
Obviously I was limiting my analysis to software tools, where the ability of a single person to make it is directly tied to its complexity.
My point still applies. A complex software tool allows me to do more. In the case of a programming language, a more complex compiler allows me to write fewer bugs or more features. The number of bugs in the compiler may be lower for a compiler written by a single person but I would be willing to bet that the number of bugs in the ecosystem is significantly higher.
The compiler and standard library are among the best places for complexity in an ecosystem because the cost is amortised across a great many users and the benefits are shared similarly. If physical tools were, like software, zero marginal cost goods, then nail guns, pillar drills, band saws, and so on would all be ubiquitous. If you tried to make the argument that you prefer a manual screwdriver to an electric one because you could build one yourself if you needed then you’d be laughed at.
For instance, when I wrote my crypto library I chose C because of its ubiquity. It’s also a guarantee of fairly extreme stability
It also gives you absolutely no help in writing constant-time code, whereas a language such as Low* allows you to prove constant-time properties at the source level. The Low* compiler probably depends on at least a hundred person-years of engineering, but I’d consider it very likely that the EverCrypt implementations of the same algorithms would be safer to use than your C versions.
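To illustrate: both of these are equally valid C, and nothing in the language marks the first one as leaking timing information (a sketch; real crypto libraries take additional care):

```c
#include <stddef.h>

/* Leaky comparison: returns at the first mismatch, so the running
   time depends on the secret data being compared. */
int cmp_leaky(const unsigned char *a, const unsigned char *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Constant-time by convention: accumulate differences and inspect
   them only at the end. C offers no way to *guarantee* the compiler
   preserves this property - it is a convention, not a checked contract. */
int cmp_ct(const unsigned char *a, const unsigned char *b, size_t n) {
    unsigned char d = 0;
    for (size_t i = 0; i < n; i++)
        d |= (unsigned char)(a[i] ^ b[i]);
    return d == 0;
}
```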
I reckon amortized cost is a strong argument. In a world where something is built once and used a gazillion times, the cost analysis is very different from something that only has a couple of users. Which is why, by the way, I have a very different outlook for Oberon and Go: the former was used in a single system, and the cost of a more powerful compiler could easily outweigh the benefits across the rest of the system; while Go set out to be used by a gazillion semi-competent programmers, and the benefit of some conspicuously absent features would be multiplied accordingly.
Honestly, I’m not sure where I stand. For the things I make, I like to keep it very, very simple. On the other hand, if I’m being honest with myself, I have few qualms sitting on a mountain of complexity, provided that foundation is solid enough.
Do you have a link to Low*? My search engine is failing me.
It’s important because, fast forward 300 years, no one uses your language anymore. It must be reasonable for future humans to write a compiler on their own if they want to run your program.
I’m really trying to encourage people to think beyond their own lives in the software realm lately, just as we need to do the same for the ecosystem.
trying to build software to last 300 years seems like it would limit hardware development
and everyone implements C compatibility in their new hardware so that people will use it
if people can figure out quantum computers and computers not based on binary, they’ll probably need to figure out what the next C will be for that new architecture
if you want your software to last 300 years, write it in the most readable and easy-to-understand manner, and preserve its source so people can port it in the future
And this is why C is not good for longevity, but languages which are more abstracted are. Thank you for that! Completely agree with what you’re thinking here.
i don’t think the biggest blockers to software longevity are language choices or even hardware, it’s the economy/politics of it… long lasting anything doesn’t fit in well with our throw-away society, and since it can’t be monetized, the capitalist society snubs its nose at it
Today’s humans were able to create Rust, so I don’t see why future humans wouldn’t. Future humans will probably just ask GPT-3000 to generate the compiler for them.
If you’re thinking about some post-apocalyptic scenario with a lone survivor rebuilding the civilisation, then our computing is already way beyond that. In the 1960s you were able to hand-stitch RAM, but to even hold the source code of modern software, let alone compile and run it, you need more technology than a single person can figure out.
C may be your point of reference, because it’s simple by contemporary standards, but it wasn’t a simple language back when the hardware was possible to comprehend by a single person. K&R C and single-pass C compilers for PDP-11 are unusable for any contemporary C programs, and C is too complex and bloated for 8-bit era computers.
If GPT can do that for us then hey, I will gladly gladly welcome it. I’m not thinking about a post-apocalyptic scenario but I can see the relationship to it.
It’s important because, fast forward 300 years, no one uses your language anymore. It must be reasonable for future humans to write a compiler on their own if they want to run your program.
If you’re considering a person 300 years in the future, then you should also consider that they will have tools 300 years more advanced than ours. 30 years ago, writing a simple game like Space Invaders was weeks’ worth of programming; now it’s something that you can do in an afternoon, with significantly better graphics. In the same time, parser generators have improved hugely, reusable back ends are common, and so on. In 300 years, it seems entirely feasible that you’d be able to generate a formal specification for a language from a spoken description and synthesise an optimising compiler directly from the operational semantics.
You’re right, I haven’t considered this! I don’t know what to say immediately other than I think this is very important to think about. I’d like to see what others have to comment on this aspect too…!
In which case, implementing the compiler is one of the easiest parts of the problem. I could build a simple mechanical computer that could execute one instruction every few seconds out of the kind of materials that a society with a Victorian level of technology could produce, but that society existed only because coal was readily accessible. I’ve seen one assessment that said that if the Victorians had needed to use wood instead of coal to power their technology they’d have completely deforested Britain in a year. You can smelt metals with charcoal, but the total cost is significantly higher than with coal (ignoring all of the climate-related externalities).
Going from there to a transistor is pretty hard. A thermionic valve is easier, but it requires a lot of glass blowing (which, in turn, requires an energy-dense fuel source such as coal to reach the right temperatures) and the rest of a ‘50s-era computer required fairly pure copper, which has similar requirements. Maybe a post-collapse civilisation would be lucky here because there’s likely to be fairly pure copper lying around in various places.
Doping silicon to produce integrated circuits requires a lot of chemical infrastructure. Once you can do that, the step up to something on the complexity of a 4004 is pretty easy but getting lithography to the point where you can produce an IC powerful enough to run even a fairly simple C program is nontrivial. Remember that C has a separate preprocessor, compiler (which traditionally had a separate assembler), and linker because it was designed for computers that couldn’t fit more than one of those in RAM at a time. Even those computers were the result of many billions of dollars of investment from a society that already had mass production, mass mining, and large-scale chemistry infrastructure.
C code today tends to assume megabytes of RAM, at a minimum. Magnetic core storage could do something like 1 KiB in something the size of a wardrobe. Scaling up production to the point where 1 MiB is readily available requires ICs, so any non-trivial C program is going to have a dependency on at least ’80s-era computing hardware.
TL;DR: If a society has collapsed and recovered to the point where it’s rediscovering computers, writing a compiler for a fairly complex language is going to be very low cost in comparison to building the hardware that the compiler can target.
Well, I wasn’t anticipating such a hard collapse. I was imagining a situation where salvage is still a thing, or where technology doesn’t regress that far. Still, you’re making a good point.
That’s an interesting middle ground. It’s hard for me to imagine a scenario in which computers are salvageable but storage is all lost to the point where a working compiler is impossible to find. At the moment, flash loses its ability to hold charge if not powered for a few years, but spinning rust is still fine, as is magnetic tape, for a much longer period, so you’d need something else to be responsible for destroying them. Cheap optical storage degrades quite quickly, but there are archive-quality disks that are rated for decades. If anything, processors are more fragile than storage.
In the event of a collapse of society, I think it’s a lot more likely that copies of V8 would survive longer than any computer capable of running them. The implicit assumption in the idea that the compiler would be a bottleneck in recovering from a collapse of society is that information is more easily destroyed than physical artefacts. This ignores the fact that information is infinitely copyable, whereas the physical artefacts in question are incredibly complex and have very tight manufacturing tolerances.
Of course, this is assuming known threats. It’s possible that someone might create a worm that understands a sufficiently broad range of vulnerabilities that it propagates into all computers and erases all online data. If it also propagates into the control systems for data warehouses then it may successfully destroy a large proportion of backups. Maybe this could be combined with a mutated bacterium that ate something in optical disks and prevented recovering from backup DVDs or whatever. Possibly offline storage will completely go out of fashion and we’ll end up with all storage being some form of RAM that is susceptible to EMP, with all data erased by a solar flare.
It really depends on what we can salvage, and what chips can withstand salvage operations. In a world where we stop manufacturing computers (or at least high-end chips), I’d expect chips to fail over the years, and the most complex ones would likely go first. And those that don’t will be harder to salvage for various reasons: how thin their connection pins are, ball arrays, multi-layer board requirements, and the stupidly fast rise times that are sure to cause cross-talk and EMI problems with the hand-made boards of a somewhat collapsed future.
In the end, many of us may be stuck with fairly low-end microcontrollers and very limited static memory chips (forget about controlling DRAM; it’s impossible to do even now without a whole company behind you). In that environment, physical salvage is not that horrible, but we’d have lost enough computing power that we’ll need custom software for it. Systems that optimise for simplicity, like Oberon, might be much more survivable in this environment.
C code today tends to assume megabytes of RAM, at a minimum.
In this hypothetical future, that is relevant indeed. Also, I believe you. But then the first serious project I wrote in C, Monocypher, requires only a couple of KB of stack memory (no heap allocation) for everything save password hashing. The compiled code itself requires less than 40KB of memory. Thing is, I optimised it for simplicity and speed, not for memory usage (well, I did curb memory use a bit when I heard I had embedded users).
I suspect that when we optimise for simplicity, we also tend to use less resources as a side effect.
Now sure, those simple systems will take no time to rebuild from scratch… if we have the skills. In our world of bigger and faster computers, with a tower of abstraction taller than Everest, I feel most of us simply don’t have those skills.
Now sure, those simple systems will take no time to rebuild from scratch… if we have the skills. In our world of bigger and faster computers, with a tower of abstraction taller than Everest, I feel most of us simply don’t have those skills.
While it’s an interesting thought exercise, I think this really is the key point. The effort in salvaging a working compiler to be able to run some tuned C code in a post-apocalyptic future may be significantly higher than just rewriting it in assembly for whatever system you were able to salvage (and, if you can’t salvage an assembler, you can even assemble it by hand after writing it out on some paper. Assuming cheap paper survives - it was very expensive until a couple of hundred years ago).
Most of us probably don’t have the skills to reproduce the massive towers of abstraction that we use today from scratch but my experience teaching children and young adults to program suggests that learning to write simple assembly routines is a skill that a large proportion of the population could pick up fairly easily if necessary. If anything, it’s easier to teach people to write assembly for microcontrollers than JavaScript for the web because they can easily build a mostly correct mental model of how everything works in the microcontroller.
Perhaps more importantly, it’s unlikely that any software that you write now will solve an actual need for a subsistence level post-apocalyptic community. They’re likely to want computers for automating things like irrigation systems or monitoring intrusion sensors. Monocypher is a crypto library that implements cryptosystems that assume an adversary who has access to thousands of dedicated ASICs trying to crack your communications. A realistic adversary in this scenario would struggle to crack a five-wheel Enigma code, and that would be something that you could implement in assembly in a few hours and then send the resulting messages in Morse code with an AM radio.
Most of us probably don’t have the skills to reproduce the massive towers of abstraction that we use today from scratch but my experience teaching children and young adults to program suggests that learning to write simple assembly routines is a skill that a large proportion of the population could pick up fairly easily if necessary.
I feel a little silly for not having thought of that. Feels obvious in retrospect. If people who have never programmed can play Human Resource Machine, they can probably learn enough assembly to be useful.
Perhaps more importantly, it’s unlikely that any software that you write now will solve an actual need for a subsistence level post-apocalyptic community.
But why one person? I think we’ll still write software in teams in 2322, if we write software at all by that point instead of flying spaceships and/or farming turnips in radioactive wastelands. Software is written by teams today, and I think, if it needs to be rewritten, it will be rewritten by teams in the future.
I would also be careful about timespans here. computers haven’t been around for a century yet, so who knows what things will be like 100 years from now? I don’t even know if it’s possible to emulate an ENIAC and run old punch card code on modern hardware, that’s the sort of change we’ve seen in just 75y. maybe multicore x86 machines running windows/*nix/BSD will seem similarly arcane 300y from now.
Wouldn’t a published standard be more important to future programmers? Go might be a wonderful language, but is there a standards document from which an implementation can be written?
I think that if I were forced to work in C every day for every problem I need to solve I would be deeply unhappy and would definitely consider a new field of work :)
I know and like C, but my brain rebels, wriggling and squirming in discomfort at needing to work at such an incredibly low level of abstraction.
I’m glad C exists, and for solving certain classes of systems problems it very clearly has earned a place in the programming language pantheon, but I personally do not enjoy working in it and would much prefer working at a higher level of abstraction for the problems I need to solve.
Almost all programs written today are hardware-dependent. It’s a necessity to get good performance and to interact with the outside world. The rare exceptions are pure mathematics, e.g. proofs in Coq, Agda, Lean, Kind.
There are various VMs, e.g. the JVM, BEAM, etc., that can in theory be ported to any hardware, but in reality they and the programs written for them are designed for the dominant hardware of the day. One in-development exception is the HVM. Hardware optimized for the HVM does not currently exist, but it’s possible that it might one day. Nonetheless, the syntax and semantics are small enough to be easily ported to any hardware and expressive enough to run any algorithm (particularly when extended with appropriate IO).
One goal I have for Dawn is to have a high-level layer of the language that allows expression of purely mathematical logic (including business logic) without considering the underlying hardware, and lower-level layers that take into account more and more aspects of the underlying hardware. These lower-level layers would act as both computer and human readable and writable intermediate representations, with the goal of providing the best of both worlds: high level hardware-independent expression by default, and low level hardware-optimized control when needed. My hope is that, over time, compilers for this language will become able to perform the kinds of sophisticated transformations, e.g. data-oriented design to maximize cache locality, that are currently done almost exclusively by humans.
Almost all programs written today are hardware-dependent.
How is this true in a world where Python, JavaScript, and Java exist? I politely disagree.
It’s a necessity to get good performance
There are numerous examples of performant abstract code? It’s not a necessity.
but in reality they and the programs written for them are designed for the dominant hardware of the day
It’s not “in theory”, the JVM has been ported to tons of non-dominant hardware too.
I wish you luck on Dawn, but I think you’re fighting a really tough battle, one where hundreds of specialists have failed even! I’m beginning to come to terms with C being a necessary intermediate for all higher-level abstractions. I tell myself: maybe if I call C something else… like “Intermediate Hardware Language”, would that make it all seem better? And it does! It does to me at least. “C as an IHL” is just a bootstrapping language for higher-order code. In the future Rust will be as ubiquitous as C as an IHL.
So, Python/JS/Java may not be written for specific CPUs, but their implementations almost always are written for a set of CPUs, and what the runtimes are able to make fast influences how performance-reliant code gets written. And what the runtimes are able to make fast is based on the intersection of the language design (where Python’s design tends to be slow in favor of making things flexible, but is able to drop hot loops into C, or into primitives in C) and the hardware. When people care about performance, they usually reach for faster code in their existing language before diving a layer down.
There’s nothing wrong with that; it’s an implementation optimization which is abstracted away. I don’t understand how that negatively affects software longevity, since the language semantics themselves don’t rely on the underlying hardware, unlike C!
I think trying to make a language do both high level and low level is essentially just smashing two DSLs together. Might as well do C and whatever else and FFI between them?
Can you explain further? I ask, because the C standard goes out of its way to avoid any particular hardware implementation (which is why it has a lot of undefined behavior).
Thanks for your reply! Please see my responses, below.
How is this true in a world where Python, JavaScript, and Java exist? I politely disagree.
Well, even ignoring that the by far dominant Python implementation is written in C, the vast majority of Python code is calling out to C code that is designed to run on the dominant hardware of the era. C is one of the most portable languages in existence, but that’s in part because no one comes out with hardware that can’t be targeted by C compilers. It would be economic suicide to do so.
Similarly, JavaScript is littered with hints of its dependence on current hardware. Its primary numeric type is Float64, for example, because that was the most general numeric type supported by hardware at the time it was designed.
Similarly, Java is a monstrosity of complexity on top of C’s execution model, and again, C, despite being one of the most portable languages, was designed for the current paradigm of Von Neumann, binary, fixed word-size hardware.
There are numerous examples of performant abstract code? It’s not a necessity.
Could you share one or two examples that come to mind?
It’s not “in theory”, the JVM has been ported to tons of non-dominant hardware too.
Has it been ported to run directly on GPUs, or to hardware designed to execute Forth, or to any hardware not designed as a target for C?
I wish you luck on Dawn, but I think you’re fighting a really tough battle, one where hundreds of specialists have failed even!
Thanks! Yeah, don’t hold your breath. This is definitely a many year project, perhaps even a many decade project, if I’m successful at all.
I’m beginning to come to terms with C being a necessary intermediate for all higher level abstractions.
Its execution model is definitely a necessary intermediate for targeting the current paradigm of hardware. I’m currently working on a low-level, first-order variant of the multi-stack concatenative calculus that closely matches C’s execution model for precisely this reason.
However, C, the language, is not a good intermediate representation. Its abstract machine is fine, and compatibility with its ABI is a requirement in order to interface with currently dominant operating systems. But this entire industry is extremely young, and all of our tools have been cobbled together haphazardly. C won out through trial and error, but there’s no reason to believe it is the global optimum. I believe we can move past it, in time. We cannot abandon it by fiat, but we can design languages with less ad-hoc and hardware-specific semantics that still interface well with existing software and hardware, while opening the door to significant departures from the current paradigm.
C is pretty simple but it feels pretty “large” if I just want to make a quick tool to automate some tedious task. I rather use Python for that.
By large I mean setting things up and building. With Python I can just write the script and run it, and test it faster. And I don’t care if the quick tool I run every now and then takes 200ms more than the C equivalent. What matters most is that I spent less time solving the tedious problem, so I can move on to more interesting ones.
I could do everything with C, yes. But I prefer working faster, depending on what I am solving. All languages have their weights, and we can choose the best tool for the job.
I love C, but I don’t think I could use it for everything without losing my mind, because I would feel like I’m doing excessive work.
Have you looked into Knuth’s efforts? For TAOCP he defined his own assembly language (MIX, later MMIX) for this exact reason. TeX was implemented as a literate program in Pascal, which is an easy-to-implement, very stable target (it was designed to be implemented in a compiler class, after all). He is probably the person whose software has the longest intentional life of anything running today.
FWIW I’ve been enjoying your exploration of ideas around software longevity and hope you keep exploring ideas. You’ve helped push me to take a look at SML and provided an extra dimension of consideration around some half-baked ideas I was chewing on.
Just as an aside, the next “step” for me is writing my first real 100-year program; I’m still unsure what it will be. Most likely it’ll be something which replaces what I use frequently, like IRC, email, an RSS feed reader, or a basic HTTP client to view and interact with simple web pages like lobste.rs. Thanks to the “simple web” movement the latter is becoming more useful again! My day job, though, involves full-featured web browsers, so it won’t sever me from them completely, unfortunately.
I’ve also had the idea of writing a Pidgin-like clone for services I use, so I can get a seamless experience using IRC, Facebook Messenger, Discord, Matrix, etc. I think the best idea is to transform all those messages into ActivityPub messages and then consume them… So essentially a bunch of ActivityPub bridges.
A data point. One program I wrote back in 1991 I still use to this day. It’s in C, with no dependencies other than a standard C compiler (and the standard C library). It has been modified since then (in 1992, 2009, 2011, 2012, 2015, 2020 and 2021) so I’m unsure what that means in this context. Another program I wrote in 1999 that I still use today is in C (my blogging engine). And just for kicks, I just typed in a C program published in 1985 (but probably written earlier) and it worked as is (but with some warnings).
Now, am I saying that C should be a 100-year language? Well, it’s halfway there. The only other contenders I see are Fortran, LISP, and Cobol (from an “it exists and is still used” perspective).
If Cobol counts, then so do Ada, Forth and Basic, all three still used in specific domains. If LISP counts, then so do SQL, APL and Prolog, whose original implementations aren’t used anymore but still live through other incarnations.
And I’m sure I’ve forgotten a few others. That makes C a little less impressive I think.
Did they produce device drivers? FoxNet was a network stack in SML, and that too in userspace — not exactly device drivers. With the exception of MLton, SML runtimes (and OCaml too) use a uniform value representation, and therefore use a few bits of the pointer for the type tag. GC also typically takes away a bit, which means full-width machine words need to be boxed, and that can be inefficient.
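A rough sketch of that uniform representation (illustrative only; real ML runtimes differ in details):

```c
#include <stdint.h>
#include <stdio.h>

/* Uniform one-word values: the low bit tags immediate integers, so
   ints carry only 63 bits on a 64-bit machine; full-width words and
   everything else must live behind an (aligned, low-bit-zero) pointer. */
typedef uintptr_t value;

static value    tag_int(intptr_t i) { return ((uintptr_t)i << 1) | 1; }
static int      is_int(value v)     { return (int)(v & 1); }
static intptr_t untag_int(value v)  { return (intptr_t)(v >> 1); } /* non-negative case */

int main(void) {
    value v = tag_int(21);
    if (is_int(v))
        printf("%ld\n", (long)untag_int(v));   /* prints 21 */
    return 0;
}
```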
ATS, though, is the perfect combination of ML and C, barring the syntax. The metasepi project is using ATS for kernel components[1].
C is just simple enough. Python is simple enough, too.
I use the simplest parts of C++, but a language cannot expect to be adopted if it’s not simple enough to use. It’s the common denominator for all languages.
C++ has been widely adopted despite its complexity. Java is also complex (considering its ecosystem of frameworks, annotations, DI containers etc.) and it is widely used too. Rust is also comparably complex and is quite popular/hyped. …
C is very powerful, and with C structures, functions and pointers you can do almost anything (I mean achieve a reasonable level of abstraction). But what is a real pain in C is memory management (and error handling).
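For instance, the usual goto-cleanup idiom that every fallible C function ends up needing (a sketch):

```c
#include <stdio.h>
#include <stdlib.h>

/* Every error path must manually release whatever was acquired so far;
   the standard idiom is a chain of cleanup labels in reverse order. */
int process(const char *path) {
    int ret = -1;
    char *buf = NULL;

    FILE *f = fopen(path, "rb");
    if (!f)
        goto out;

    buf = malloc(4096);
    if (!buf)
        goto close_file;

    if (fread(buf, 1, 4096, f) == 0)
        goto free_buf;

    ret = 0;                  /* success: fall through the cleanup */
free_buf:
    free(buf);
close_file:
    fclose(f);
out:
    return ret;
}
```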
FWIW, I’d love to read a good tutorial/book on how to write a non-toy SML compiler, with an approachable explanation of the type inference algorithm(s). In either C or Pascal (ideally, targeting the same VM/IR as the Pascal one, with some further compiler of the IR to x86 machine code & Linux+Mac+Windows binary packages). It need not emit super optimized code, but something that would let me write real software. I’d seriously consider this for a foundation to write all my personal software in then. I tried to understand H-M a few times, but still didn’t manage to find an explanation I’d grasp in the limited time and attention span I have for it, nor further understand how to apply it to an actual compiler. Plus obviously all the other stuff needed, like a GC, some register spilling algorithm, scope tracking, etc.
For a number of reasons: portability, ease of understanding for me (I’m not experienced esp. with functional languages), and importantly, not glossing over important features like GC (which in more advanced languages can often be just reused from their runtime).
As for the tutorial - thanks, I’ll definitely try to read it! Although it explicitly speaks about implementing a toy language, and not in C/Pascal but in some functional language called Miranda that I hadn’t even heard about before, so I’m not sure I’ll be able to follow; plus I presume GC will be glossed over and strings or FFI won’t be supported. But I will try based on your recommendation; H-M is bugging me, so if that’s a good chance to try cracking it, try I will! :) Also it mentions some parallel VM interpreter, which could be a notable high point OTOH IIUC.
There is a lot of literature on GC-writing, though I can see why it would be nice to have a fully integrated tutorial. I might look at ‘Crafting Interpreters’ for that; it walks through an implementation of a simple bytecode VM for a Python-like language, including GC.
There was a good tutorial on HM type inference I saw at one point, but I cannot find it again. Much material on the topic is full of notation which may seem opaque. Best of luck.
not glossing over important features like GC (which in more advanced languages can often be just reused from their runtime)
C doesn’t really help you there. If you want to use the GC from a runtime written in another language then you need one of the following:
The GC doesn’t run while there are runtime functions on the stack. That’s fine if all of your runtime services are leaf functions.
The GC doesn’t relocate objects (or, at least, doesn’t relocate objects that have stack references to them)
The GC is able to update pointers on the stack.
You can do the last one of these with C (there’s a paper about it) but it’s quite cumbersome - you need to build a linked list of on-stack objects that contain all of your roots and ensure that you don’t keep any temporary pointers live across function calls.
If anything, the fact that C (practically, if not according to a strict reading of the standard) conflates pointers and integers makes this harder.
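A rough sketch of that linked-list-of-roots technique (the paper in question is likely Henderson’s “Accurate garbage collection in an uncooperative environment”; gc_alloc here is a hypothetical entry point for a moving collector):

```c
#include <stddef.h>

/* Each function pushes a frame listing the addresses of its GC-visible
   locals, so a moving collector can walk gc_top and update them. */
struct gc_frame {
    struct gc_frame *prev;   /* caller's frame                     */
    size_t nroots;
    void **roots[2];         /* addresses of local object pointers */
};

static struct gc_frame *gc_top;

extern void *gc_alloc(size_t size);  /* hypothetical: may move objects */

void *make_pair(void *a, void *b) {
    /* register a and b before any allocation that could trigger GC */
    struct gc_frame f = { gc_top, 2, { &a, &b } };
    gc_top = &f;

    void **p = gc_alloc(2 * sizeof(void *));
    p[0] = a;                /* a and b reread after possible relocation */
    p[1] = b;

    gc_top = f.prev;         /* pop the frame on every exit path */
    return p;
}
```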
H-M is basically a handful of rules for the syntax-driven specification of a first-order unification problem, and then solving that unification problem, so I would suggest learning / implementing first-order unification first. Then try to understand the typing rules for H-M, which are what define the unification problem.
Wikipedia is sufficient for both, as long as you’re okay with learning the mathematical notation. I’ve implemented unification several times now, in multiple ways, as part of my work on Dawn. I’d be happy to answer any questions you might have. Feel free to DM me.
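In that spirit, here is a compact sketch of first-order unification over type terms (occurs check omitted for brevity; a real implementation needs it to reject infinite types):

```c
#include <stdbool.h>
#include <string.h>

/* A type term is either a variable (possibly bound to another term)
   or a constructor with arguments, e.g. "->" applied to two types. */
typedef struct Type Type;
struct Type {
    enum { TVAR, TCON } kind;
    const char *name;   /* constructor name for TCON, e.g. "->", "int" */
    int arity;
    Type **args;        /* arity arguments for TCON                    */
    Type *bound;        /* for TVAR: what it was unified with, or NULL */
};

static Type *resolve(Type *t) {      /* chase variable bindings */
    while (t->kind == TVAR && t->bound)
        t = t->bound;
    return t;
}

/* Make a and b equal by binding variables; returns false on a clash. */
static bool unify(Type *a, Type *b) {
    a = resolve(a);
    b = resolve(b);
    if (a == b)
        return true;
    if (a->kind == TVAR) { a->bound = b; return true; }
    if (b->kind == TVAR) { b->bound = a; return true; }
    if (strcmp(a->name, b->name) != 0 || a->arity != b->arity)
        return false;                /* constructor clash */
    for (int i = 0; i < a->arity; i++)
        if (!unify(a->args[i], b->args[i]))
            return false;
    return true;
}
```

H-M’s typing rules then just generate calls to unify: e.g. for an application f x, unify the type of f with a -> b, where a is the type of x and b is a fresh variable.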
I think you need to consider the Turing Tarpit. It’s true you can write anything in C in theory. However, programming languages are for humans, and you need to consider the human factor.
There are types of “simple” programs (e.g. the ones we use scripting languages for) that don’t need performance or low-level features. For these cases C is an overkill, and therefore needlessly hard or tedious. The problem isn’t that C can’t do it, but that the skill and effort required needlessly increases development time and cost. C can do everything that Excel can, but when you have an Excel-sized problem, it’s cheaper to solve it in Excel.
And on the other end of the spectrum there are some complex problems for which C is not helping enough. You can do multi-threading in C, but doing so requires a very high level of diligence or limiting yourself to only simplest coarse-grained constructs. Same for security — in theory there isn’t anything in C that makes it insecure if you write it properly, but somehow we struggle to find programmers who never write bugs.
I think size and complexity of programs changes what language is needed. If you can hold the whole program in your head, then you can use a small/simple language. OTOH if the program is so large it’s maintained by a whole team of programmers, over many years, with people coming and going, then you start benefiting from higher-level abstractions and the language knowing the program better than you do. At such scale C is doable, but far from convenient.
It is worth remembering that there is an infinitely large set of programs but a very small set of useful programs. A programming language is a form of ad-hoc compression algorithm, it is intended to sort the set of useful programs to have shorter encodings than the undesirable ones. Programs with certain categories of error or security vulnerability should be harder or impossible to express. C does very poorly when considered in this way.
The abstract view of programming languages as ad-hoc compression algorithms does not motivate any particular criteria for which categories of error should be made difficult to express, or how difficult. So there is a lot missing to get to your conclusion.
This compression perspective doesn’t explain everything in language design, but I think it’s roughly correct. BTW, it is the basis of Kolmogorov complexity of programs.
We don’t compress code in a literal byte-saving way (apart from code golfing), but we do like to reduce repetitiveness of code (the DRY principle), eliminate “boilerplate”, and replace detailed low-level constructs with terser high-level ones. All of these happen to reduce the number of symbols/constructs used to build the program.
When we discover “good” programming patterns, we invent shorter syntax for them and make languages based on that. When you look from assembly perspective, even basic constructs like structs and functions with a stack are syntax sugar to codify common patterns and eliminate mistakes. When structs of function pointers became popular, we’ve slapped a syntax on it and called it OOP. Some languages decided that callbacks are good for network code, but there are too many ways to mess up the callbacks, so there’s
async
/await
. Even type systems are an example of the principle that good code is shorter: casts require extra code.In case of Rust making safe syntax terse and unsafe syntax verbose is an explicit guiding principle. That’s why single-char
.
operator performs automagic dereference on smart pointers, but the same on “raw” pointers is an eyesore:unsafe { (*x).foo() }
.I feel like this would deserve more importance tha it gets. All these discussions quickly turn into bikeshedding around type safety, type anotations, casts, type systems, etc.
Yet a terse simple syntax that minimizes the amount of code that needs to be written to perform a task, is often ignored.
The popularity of python, shellscripts or even JavaScript is much due to their ergonomics. Getting the usual tasks done requires writting little code and therefore little effort. None of these languages has strong enforced type checking on compile time. I personally would like some level of type safety in these languages, but that is still secondary. And also, type declarations have to be written sometimes that slows one down.
The most popular language on the planet by a huge margin, with over a billion users, is the Excel formula language, for exactly the reasons you describe. Users can start with some data and write a program to do something useful with it in a few seconds.
All true. But the abstract view doesn’t justify any particular notion of “safe” or “unsafe,” or any criteria for when a language is doing a good job. C requires extra syntax to use a value as a different type without warnings: (unsigned int) x;
I think it does: we’re interested in writing programs that don’t crash, so we optimize languages for not crashing, instead of giving equal importance to crashing and not crashing.
Since we’re unable to express this so directly, the actual syntax is a proxy of that via safe/not-sure-if-safe. For example, an incorrectly placed
free
can crash the program. So we make syntax for a correctly placedfree
shorter (e.g. zero in case of GC/ARC/RAII) and then from the pigeonhole principle the syntax for an incorrectly placedfree
is either impossible or has to be longer.The C cast is an example where language makes the safe syntax terser than the unsafe syntax. If the types are compatible (which C thinks is good and safe) you just use
x
, and when the types are not compatible (which may be unsafe), you have to use(type)x
. C isn’t very good at it (some bad implicit conversions are allowed), but clearly it’s trying.The abstract view is that a programming language should let you write useful programs more easily/concisely than useless ones. That much is uncontroversial.
From that premise I don’t think you can say that avoiding crashing is the most important criterion. What if “useful” means a program runs as fast as possible using a minimum of resources? Then it’s impossible to express certain desirable programs in languages with automatic memory management, which could be expressed in C.
Crashing is never a goal or a desirable property (apart from xkcd 1172), therefore languages never optimize for it.
Programs that are too slow are not useful, so optimization for useful (not too slow) programs may have to make compromises if safer constructs are impossible/not invented yet.
Yes, obviously.
Thank you for your thoughts - all good stuff!
I’m not sure that’s a practical metric by which to judge a tool. The C compilers that provide a practical foundation for modern software development were not implemented by one person either.
In general Turing completeness is necessary but not sufficient: it’s just one facet of what makes a language practically useful. There are many other properties that end up resulting in costs someone has to pay to use a language; e.g., is it memory safe, or will engineers and users alike be on the hook for an unknown number of egregious memory safety bugs?
Also mrustc has been implemented mostly by one person.
I knew this would be brought up; you know the effort they’ve had to do to achieve this? An incredible amount.
It’s 100K lines of code, and majority of it was developed over a 2-3 year period (with ongoing development to catch up with evolution of Rust). The number of commits and lines of code happens to be close to TCC:
It does take a couple of shortcuts: it’s a Rust-to-C compiler (no machine code generation) and it doesn’t perform borrow checking (the Rust language is carefully designed to make it optional. Lifetimes are purely a compile-time lint, and don’t affect generated code or its behavior).
I think overall in terms of implementation difficulty Rust is somewhere between C and C++. Parsing of Rust is much simpler than C++, and Rust has fewer crufty language features than C++ (there’s one way to initialize a variable), but some features are big-ish (borrow checker, type inference).
How hard it is to implement mainly depends on how good quality of implementation you want to have. For example, LLVM is 85× larger than mrustc and tcc, with over 130× more commits. It’s a 20-year collaborative effort, likely not possible to do by a single person. The main
rustc
project is also maximalist like that, because it isn’t merely an effort to get Rust working, but to make it fast, efficient, user-friendly, well-documented, reliable, portable, etc., so much much more work went into it beyond just the language implementation.I cannot speak for mrustc, but 100k loc for tcc is bullshit. Just counting sources and headers in the top level, I get 55k loc (the remainder is taken up by tests and win32 headers). Close to 40k is taken up by target-specific code. The core compiler is about 10k loc.
openhub stats I’ve quoted are for the whole repo, and I see 57K
.c
and 38K.h
in there. This includes tests, so it’s indeed more than just the compiler.If I run a word count on everything in the ‘src’ directory of mrustc, I get about 130k loc. I therefore conclude that mrustc’s rust compiler is approximately 10x larger and more complex than tcc’s c compiler. Recall that tcc also includes assemblers and linkers, and supports many targets.
I mean if 3 years is not a lot of effort then cheers to you! You must be an absolute coding beast.
I feel like this is a fairly disingenuous and dismissive argument - your original post stated that “Rust is too complicated to implement by one person.” The comment you were responding to was making the point that not only is there an implementation of Rust by primarily one person, but a single-contributor C implementation is a comparable size and would theoretically take a similar amount of effort to implement. People here aren’t trying say it’s not a lot of effort, but that it does exist and you may be trivializing the amount of effort needed for a C implementation.
Sorry, I didn’t mean to dismiss anything! Isn’t the statement still true if it’s been mentioned they still got help?… Regardless the general sentiment is right. I should have said instead that it’s not reasonable!
I may very well be trivializing the effort for a C implementation. In my mind C’s type system, lack of borrow checker, and other features make its implementation maybe a magnitude easier. I could be completely wrong though and please elaborate if that’s the case!
A non-optimizing C89 or C90 compiler is relatively simple to implement, with only minor inconveniences from the messy preprocessor, bitfields, and the parsing ambiguities of dangling else and typedef (did you know it can be scoped and nested, and that this affects the syntax around it!?). There aren’t any things that are hard per se, mostly just tedious and laborious ones, because there are a lot of small quirks underneath the surface (e.g. arrays don’t always decay to pointers, sizeof evaluates things differently, there are rules around “sequence points”).

There are corners of C that most users never touch, but that a compiler in theory needs to support, e.g. case doesn’t have to be at the top level of switch, but can be nested inside other arbitrary code. C can generate “irreducible” control flow, which is hard to reason about and hard to optimize. In fact, a lot of optimization is pretty hard due to aliasing, broken const, and the labyrinth of what is and isn’t UB described in the spec.
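To make those quirks concrete, here is roughly what the typedef ambiguity and a nested case look like (a sketch for illustration; copy8 is a made-up name, and the second function is just Duff’s device):

    /* typedef changes how later tokens parse: */
    typedef int T;
    void f(void) {
        T * x;      /* declares x as int*, because T names a type...          */
        (void)x;    /* ...if T were a variable, "T * x;" would be a multiply. */
    }

    /* case labels nested inside other code -- Duff's device: */
    void copy8(char *to, const char *from, int count) {  /* assumes count > 0 */
        int n = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *to++ = *from++;
        case 7:      *to++ = *from++;
        case 6:      *to++ = *from++;
        case 5:      *to++ = *from++;
        case 4:      *to++ = *from++;
        case 3:      *to++ = *from++;
        case 2:      *to++ = *from++;
        case 1:      *to++ = *from++;
                } while (--n > 0);
        }
    }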
It’s worth noting that, since you said “non-optimising”, these things are generally very easy in a non-optimising compiler. You can compile C more or less one statement at a time, including case statements, as long as you are able to insert labels after you have emitted a jump to them (which you can with most assembly languages). Similarly, sequence points matter only if you’re doing more than just evaluating expressions as you parse them.
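Concretely, the only machinery that requires is backpatching; a sketch (the buffer layout, opcode, and names are all made up for illustration):

    /* One-pass forward jumps: emit the jump now with a placeholder
       offset, and patch it once the label's position is known. */
    #define MAX_FIXUPS 64

    static unsigned char code[4096];   /* output buffer */
    static int here;                   /* current emit position */
    static int fixups[MAX_FIXUPS];     /* operand positions awaiting a target */
    static int nfixups;

    static int emit_forward_jump(void) {
        code[here++] = 0xE9;           /* e.g. x86 "jmp rel32" */
        fixups[nfixups] = here;        /* remember where the operand lives */
        here += 4;                     /* leave room for the 32-bit offset */
        return nfixups++;
    }

    static void bind_label(int fixup) {
        int at = fixups[fixup];
        int rel = here - (at + 4);     /* rel32 is relative to the next insn */
        for (int i = 0; i < 4; i++)    /* little-endian patch */
            code[at + i] = (unsigned char)(rel >> (8 * i));
    }

Nested case labels fall out of the same trick: each case is just a label you bind when you reach it.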
The original C compiler ran on a computer that didn’t have enough memory for a full parsed AST and so the language had to support incremental code generation from a single-pass compiler.
LLVM was originally just Chris Lattner. I think the question isn’t “Can one person build it?” It’s “Can one person build it to the point where it has enough value for other people to work on it too?”
Several of the folks in / formerly in Vikram Adve’s group at UIUC would be quite surprised to learn that.
I actually looked at Wikipedia first before my comment, but that made it seem like it was Lattner’s project under Adve’s mentorship. I’ll take your word for it that it was a group effort from the start.
This was my first thought as well. There are a lot of very useful things that are too complicated to be implemented by one person - the current state of Linux probably falls into that category, and I know that at least I wouldn’t want to go back to even a version from 5 years ago, much less back to a version that could have been implemented by a single person.
…And there are a lot of useful things that are simple enough for one person to implement! :D
Ha, I agree with that, was mostly just highlighting that I don’t feel like “too complicated to implement by one person” is a good reason to dismiss Rust’s potential usefulness.
For myself, I originally got frustrated with Rust not allowing me to do things; eventually, I realized that it was statically removing bad habits that I’d built in the past. Now I love when it yells at me :)
I am. Short term, that means the tool will cost much less: less time to make, fewer bugs, more opportunities for improvement. Long term it means other people will be able to rebuild it from scratch if they need to. At a lower cost.
The flip side of this is that the tool will do much less. A wooden hammer is a tool that a single person can make. A hammer with a steel head that can drive in nails requires a lot more infrastructure (smelting the ore and casting the head are probably large enough tasks that you’ll need multiple people before you even get to adding a wooden handle). An electric screwdriver requires many different parts made in different factories. If I want to fix two pieces of wood together, then a screw driven by an electric screwdriver is both easier to use and produces a much better result than a nail driven by a wooden hammer.
Obviously I was limiting my analysis to software tools, where the ability of a single person to make a tool is directly tied to the tool’s complexity.
One fair point you do have is how much infrastructure the tool sits upon. Something written in Forth needs almost nothing besides the hardware itself. Something written in Haskell is a very different story. Then you need to choose which pieces of infrastructure you want to depend on. For instance, when I wrote my crypto library I chose C because of its ubiquity. It’s also a guarantee of fairly extreme stability. There’s a good chance that the code I write now will still work several decades from now. If I wanted to maximise safety instead, I would probably have picked Rust.
My point still applies. A complex software tool allows me to do more. In the case of a programming language, a more complex compiler allows me to write fewer bugs or more features. The number of bugs in the compiler may be lower for a compiler written by a single person but I would be willing to bet that the number of bugs in the ecosystem is significantly higher.
The compiler and standard library are among the best places for complexity in an ecosystem, because the cost is amortised across a great many users and the benefits are shared similarly. If physical tools were, like software, zero-marginal-cost goods, then nail guns, pillar drills, band saws, and so on would all be ubiquitous. If you tried to argue that you prefer a manual screwdriver to an electric one because you could build one yourself if you needed to, you’d be laughed at.
It also gives you absolutely no help in writing constant-time code, whereas a language such as Low* allows you to prove constant-time properties at the source level. The Low* compiler probably depends on at least a hundred person-years of engineering, but I’d consider it very likely that the EverCrypt implementations of the same algorithms would be safer to use than your C versions.
I reckon amortized cost is a strong argument. In a world where something is built once and used a gazillion times, the cost analysis is very different from that of something that only has a couple of users. That, by the way, is why I have a very different outlook for Oberon and Go: the former was used in a single system, where the cost of a more powerful compiler could easily outweigh the benefits across the rest of the system; while Go set out to be used by a gazillion semi-competent programmers, so the benefit of the features it conspicuously lacks would be multiplied accordingly.
Honestly, I’m not sure where I stand. For the things I make, I like to keep it very, very simple. On the other hand, if I’m being honest with myself, I have few qualms about sitting on a mountain of complexity, provided that foundation is solid enough.
Do you have a link to Low*? My search engine is failing me.
This paper is probably the best place to start
Right, but there are many C compilers that were written by one person and still work. To me, that’s the important part. Thank you for your thoughts!
Why is that important?
It’s important because, fast-forward 300 years, no one uses your language anymore. It must be reasonable for future humans to write a compiler on their own if they want to run your program.
I’m really trying to encourage people to think beyond their own lives in the software realm lately, just as we need to do for the ecosystem.
trying to build software to last 300 years seems like it would limit hardware development
and everyone implements C compatibility in their new hardware so that people will use it
if people can figure out quantum computers and computers not based on binary, they’ll probably need to figure out what the next C will be for that new architecture
if you want your software to last 300 years, write it in the most readable and easy-to-understand manner, and preserve its source so people can port it in the future
And this is why C is not good for longevity - more abstracted languages are. Thank you for that! I completely agree with what you’re thinking here.
i don’t think the biggest blocker to software longevity is language choice or even hardware, it’s the economy/politics of it… long-lasting anything doesn’t fit in well with our throw-away society, and since it can’t be monetized, the capitalist society snubs its nose at it
Hehe, an interesting thread of thought we could travel down here. I’ll just say I agree to a degree.
Today’s humans were able to create Rust, so I don’t see why future humans wouldn’t. Future humans will probably just ask GPT-3000 to generate the compiler for them.
If you’re thinking about some post-apocalyptic scenario with a lone survivor rebuilding the civilisation, then our computing is already way beyond that. In the 1960’s you were able to hand-stitch RAM, but to even hold source code of modern software, let alone compile and run it, you need more technology than a single person can figure out.
C may be your point of reference, because it’s simple by contemporary standards, but it wasn’t a simple language back when the hardware was possible to comprehend by a single person. K&R C and single-pass C compilers for PDP-11 are unusable for any contemporary C programs, and C is too complex and bloated for 8-bit era computers.
If GPT can do that for us then hey, I will gladly gladly welcome it. I’m not thinking about a post-apocalyptic scenario but I can see the relationship to it.
If you’re considering a person 300 years in the future then you should also consider that they will have tools 300 years more advanced than ours. 30 years ago, writing a simple game like space invaders was weeks worth of programming, now it’s something that you can do in an afternoon, with significantly better graphics. In the same time, parser generators have improved hugely, reusable back ends are common, and so on. In 300 years, it seems entirely feasible that you’d be able to generate a formal specification for a language from a spoken description and synthesise an optimising compiler directly from the operational semantics.
You’re right, I haven’t considered this! I don’t know what to say immediately other than I think this is very important to think about. I’d like to see what others have to comment on this aspect too…!
Unless there has been a collapse in between. With climate change and peak oil, we have some serious trouble ahead of us.
In which case, implementing the compiler is one of the easiest parts of the problem. I could build a simple mechanical computer that could execute one instruction every few seconds out of the kind of materials that a society with a Victorian level of technology could produce, but that society existed only because coal was readily accessible. I’ve seen one assessment that said that if the Victorians had needed to use wood instead of coal to power their technology they’d have completely deforested Britain in a year. You can smelt metals with charcoal, but the total cost is significantly higher than with coal (ignoring all of the climate-related externalities).
Going from there to a transistor is pretty hard. A thermionic valve is easier, but it requires a lot of glass blowing (which, in turn, requires an energy-dense fuel source such as coal to reach the right temperatures) and the rest of a ‘50s-era computer required fairly pure copper, which has similar requirements. Maybe a post-collapse civilisation would be lucky here because there’s likely to be fairly pure copper lying around in various places.
Doping silicon to produce integrated circuits requires a lot of chemical infrastructure. Once you can do that, the step up to something on the complexity of a 4004 is pretty easy but getting lithography to the point where you can produce an IC powerful enough to run even a fairly simple C program is nontrivial. Remember that C has a separate preprocessor, compiler (which traditionally had a separate assembler), and linker because it was designed for computers that couldn’t fit more than one of those in RAM at a time. Even those computers were the result of many billions of dollars of investment from a society that already had mass production, mass mining, and large-scale chemistry infrastructure.
C code today tends to assume megabytes of RAM, at a minimum. Magnetic core storage could do something like 1 KiB in something the size of a wardrobe. Scaling up production to the point where 1 MiB is readily available requires ICs, so any non-trivial C program is going to have a dependency on at least ’80s-era computing hardware.
TL;DR: If a society has collapsed and recovered to the point where it’s rediscovering computers, writing a compiler for a fairly complex language is going to be very low cost in comparison to building the hardware that the compiler can target.
Well, I wasn’t anticipating such a hard collapse. I was imagining a situation where salvage is still a thing, or where technology doesn’t regress that far. Still, you’re making a good point.
That’s an interesting middle ground. It’s hard for me to imagine a scenario in which computers are salvageable but storage is all lost, to the point where a working compiler is impossible to find. At the moment, flash loses its ability to hold charge if not powered for a few years, but spinning rust is still fine, as is magnetic tape, for a much longer period, so you’d need something else to be responsible for destroying them. Cheap optical storage degrades quite quickly, but there are archive-quality disks that are rated for decades. If anything, the processors are the more fragile part.
In the event of a collapse of society, I think it’s a lot more likely that copies of V8 would survive longer than any computer capable of running them. The implicit assumption in the idea that the compiler would be a bottleneck in recovering from a collapse of society is that information is more easily destroyed than physical artefacts. This ignores the fact that information is infinitely copyable, whereas the physical artefacts in question are incredibly complex and have very tight manufacturing tolerances.
Of course, this is assuming known threats. It’s possible that someone might create a worm that understands a sufficiently broad range of vulnerabilities that it propagates into all computers and erases all online data. If it also propagates into the control systems for data warehouses then it may successfully destroy a large proportion of backups. Maybe this could be combined with a mutated bacterium that ate something in optical disks and prevented recovering from backup DVDs or whatever. Possibly offline storage will completely go out of fashion and we’ll end up with all storage being some form of RAM that is susceptible to EMP and have all data erased by a solar flare.
It really depends on what we can salvage, and what chips can withstand salvage operations. In a world where we stop manufacturing computers (or at least high-end chips), I’d expect chips to fail over the years, and the most complex ones will likely go first. And those that don’t will be harder to salvage for various reasons: how thin their connection pins are, ball arrays, multi-layer board requirements, and the stupidly fast rise times that are sure to cause cross-talk and EMI problems with the hand-made boards of a somewhat collapsed future.
In the end, many of us may be stuck with fairly low-end microcontrollers and very limited static memory chips (forget about controlling DRAM; it’s impossible to do even now without a whole company behind you). In that environment, physical salvage is not that horrible, but we’d have lost enough computing power that we’d need custom software for it. Systems that optimise for simplicity, like Oberon, might be much more survivable in this environment.
In this hypothetical future, that is relevant indeed. Also, I believe you. But then, the first serious project I wrote in C, Monocypher, requires only a couple KB of stack memory (no heap allocation) for everything save password hashing. The compiled code itself requires less than 40KB of memory. Thing is, I optimised it for simplicity and speed, not for memory usage (well, I did curb memory use a bit when I heard I had embedded users).
I suspect that when we optimise for simplicity, we also tend to use fewer resources as a side effect.
Now sure, those simple systems will take no time to rebuild from scratch… if we have the skills. In our world of bigger and faster computers, with a tower of abstraction taller than Everest, I feel most of us simply don’t have those skills.
It’s an interesting thought exercise, but I think this really is the key point. The effort of salvaging a working compiler to be able to run some tuned C code in a post-apocalyptic future may be significantly higher than that of just rewriting the code in assembly for whatever system you were able to salvage (and, if you can’t salvage an assembler, you can even assemble it by hand after writing it out on some paper - assuming cheap paper survives; it was very expensive until a couple of hundred years ago).
Most of us probably don’t have the skills to reproduce the massive towers of abstraction that we use today from scratch but my experience teaching children and young adults to program suggests that learning to write simple assembly routines is a skill that a large proportion of the population could pick up fairly easily if necessary. If anything, it’s easier to teach people to write assembly for microcontrollers than JavaScript for the web because they can easily build a mostly correct mental model of how everything works in the microcontroller.
Perhaps more importantly, it’s unlikely that any software you write now will solve an actual need of a subsistence-level post-apocalyptic community. They’re likely to want computers for automating things like irrigation systems or monitoring intrusion sensors. Monocypher is a crypto library that implements cryptosystems that assume an adversary with access to thousands of dedicated ASICs trying to crack your communications. A realistic adversary in this scenario would struggle to crack a five-wheel Enigma code, and that would be something you could implement in assembly in a few hours, then send the resulting messages in Morse code with an AM radio.
I feel a little silly for not having thought of that. Feels obvious in retrospect. If people who have never programmed can play Human Resource Machine, they can probably learn enough assembly to be useful.
Yeah, I have to agree there.
But why one person? I think we’ll still write software in teams in 2322, if we write software at all by that point instead of flying spaceships and/or farming turnips in radioactive wastelands. Software is written by teams today, and I think that, if it needs to be rewritten, it will be rewritten by teams in the future.
I would also be careful about timespans here. computers haven’t been around for a century yet, so who knows what things will be like 100 years from now? I don’t even know if it’s possible to emulate an ENIAC and run old punch card code on modern hardware, that’s the sort of change we’ve seen in just 75y. maybe multicore x86 machines running windows/*nix/BSD will seem similarly arcane 300y from now.
Wouldn’t a published standard be more important to future programmers? Go might be a wonderful language, but is there a standards document from which an implementation can be written?
Clearly C is a distraction from ALGOL 68. ;)
I think that if I were forced to work in C every day for every problem I need to solve I would be deeply unhappy and would definitely consider a new field of work :)
I know and like C, but my brain rebels, wriggling and squirming in discomfort at needing to work at such an incredibly low level of abstraction.
I’m glad C exists, and for solving certain classes of systems problems it very clearly has earned a place in the programming language pantheon, but I personally do not enjoy working in it and would much prefer working at a higher level of abstraction for the problems I need to solve.
Almost all programs written today are hardware-dependent. It’s a necessity to get good performance and to interact with the outside world. The rare exceptions are pure mathematics, e.g. proofs in Coq, Agda, Lean, Kind.
There are various VM’s, e.g. JVM, BEAM, etc. that can in theory be ported to any hardware, but in reality they and the programs written for them are designed for the dominant hardware of the day. One in-development exception is the HVM. Hardware optimized for the HVM does not currently exist, but it’s possible that it might one day. Nonetheless, the syntax and semantics are small enough to be easily ported to any hardware and expressive enough to run any algorithm (particularly when extended with appropriate IO).
One goal I have for Dawn is to have a high-level layer of the language that allows expression of purely mathematical logic (including business logic) without considering the underlying hardware, and lower-level layers that take into account more and more aspects of the underlying hardware. These lower-level layers would act as both computer and human readable and writable intermediate representations, with the goal of providing the best of both worlds: high level hardware-independent expression by default, and low level hardware-optimized control when needed. My hope is that, over time, compilers for this language will become able to perform the kinds of sophisticated transformations, e.g. data-oriented design to maximize cache locality, that are currently done almost exclusively by humans.
How is this true in a world where Python, JavaScript, and Java exist? I politely disagree.
There are numerous examples of performant abstract code? It’s not a necessity.
It’s not “in theory”, the JVM has been ported to tons of non-dominant hardware too.
I wish you luck on Dawn, but I think you’re fighting a really tough battle, one where even hundreds of specialists have failed! I’m beginning to come to terms with C being a necessary intermediate for all higher-level abstractions. I tell myself: maybe if I call C something else… like “Intermediate Hardware Language”, would that make it all seem better? And it does! It does to me, at least. “C as an IHL” is just a bootstrapping language for higher-order code. In the future, Rust will be as ubiquitous an IHL as C.
So, Python/JS/Java may not be written for specific CPUs, but their implementations almost always are written for a set of CPUs, and what the runtimes are able to make fast influences how performance-reliant code gets written. And what the runtimes are able to make fast is based on the intersection of the language design (Python’s design tends to be slow in favor of making things flexible, but it’s able to drop hot loops into C, or into primitives implemented in C) and the hardware. When people care about performance, they usually reach for faster code in their existing language before diving a layer down.
There’s nothing wrong with that; it’s an implementation optimization which is abstracted away. I don’t understand how that negatively affects software longevity, since the language semantics themselves don’t rely on the underlying hardware, unlike C!
I think trying to make a language do both high level and low level is essentially just smashing two DSLs together. Might as well do C and whatever else and FFI between them?
Can you explain further? I ask, because the C standard goes out of its way to avoid any particular hardware implementation (which is why it has a lot of undefined behavior).
Thanks for your reply! Please see my responses, below.
Well, even ignoring that the by far dominant Python implementation is written in C, the vast majority of Python code is calling out to C code that is designed to run on the dominant hardware of the era. C is one of the most portable languages in existence, but that’s in part because no one comes out with hardware that can’t be targeted by C compilers. It would be economic suicide to do so.
Similarly, JavaScript is littered with hints of its dependence on current hardware. Its primary numeric type is Float64, for example, because that was the most general numeric type supported by hardware at the time the language was designed.
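For what it’s worth, the cliff this creates is easy to demonstrate; a quick sketch in C, since the underlying representation is the same IEEE-754 double:

    #include <stdio.h>

    int main(void) {
        /* A 64-bit float has a 53-bit significand, so integers
           above 2^53 can no longer be represented exactly. */
        double big = 9007199254740992.0;   /* 2^53 */
        printf("%.0f\n", big + 1.0);       /* prints 9007199254740992 again */
        return 0;
    }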
Similarly, Java is a monstrosity of complexity on top of C’s execution model, and again, C, despite being one of the most portable languages, was designed for the current paradigm of von Neumann, binary, fixed-word-size hardware.
Could you share one or two examples that come to mind?
Has it been ported to run directly on GPUs, or to hardware designed to execute Forth, or to any hardware not designed as a target for C?
Thanks! Yeah, don’t hold your breath. This is definitely a many year project, perhaps even a many decade project, if I’m successful at all.
Its execution model is definitely a necessary intermediate for targeting the current paradigm of hardware. I’m currently working on a low-level, first-order variant of the multi-stack concatenative calculus that closely matches C’s execution model for precisely this reason.
However, C the language is not a good intermediate representation. Its abstract machine is fine, and compatibility with its ABI is a requirement in order to interface with currently dominant operating systems. But this entire industry is extremely young, and all of our tools have been cobbled together haphazardly. C won out through trial and error, but there’s no reason to believe it is the global optimum. I believe we can move past it, in time. We cannot abandon it by fiat, but we can design languages with less ad-hoc and hardware-specific semantics that still interface well with existing software and hardware, while opening the door to significant departures from the current paradigm.
A terse RISC is all we need. Everything else can reduce to it through composition.
Sure, I mean sand, water and fire is all you need to build stuff… but it’s a lot easier with more advanced tools.
C is pretty simple, but it feels pretty “large” if I just want to make a quick tool to automate some tedious task. I’d rather use Python for that.
By large I mean setting things up and building. With Python I can just write the script and run it, and test it faster. And I don’t care if a quick tool I run every now and then takes 200ms more than the C equivalent would. What matters most is that I spent less time solving the tedious problem, so I can move on to more interesting ones.
I could do everything with C, yes. But I prefer working faster, depending on what I am solving. All languages have their weights, and we can choose the best tool for the job.
I love C, but I don’t think I could use it for everything without losing my mind, because I would feel like I’m doing excessive work.
HTTP is unnecessary, TCP is all you need.
Then show me what I have been missing because C is just a dangerous tool to me.
Looking to start a dialog and nothing else really. I’m having doubts about my recent thoughts on software longevity.
Have you looked into Knuth’s efforts? For TAOCP he defined his own assembly language for this exact reason. TeX was implemented as a literate program in Pascal, which is an easy-to-implement, very stable target (it was designed to be implemented in a compiler class, after all). He is probably the person whose software has the longest intentional life of anything running today.
I haven’t but that sounds perfect to check out - thank you!
FWIW I’ve been enjoying your exploration of ideas around software longevity and hope you keep exploring ideas. You’ve helped push me to take a look at SML and provided an extra dimension of consideration around some half-baked ideas I was chewing on.
Thank you, it’s motivating to know I’m not alone.
Just as an aside, the next “step” for me is writing my first real 100 year program; I’m still unsure what it will be. Most likely it’ll be something which replaces what I use frequently, like IRC, email, RSS feed reader, or a basic HTTP client to view and interact with simple web pages like lobste.rs. Thanks to the “simple web” movement the latter is becoming more useful again! My day job though involves full featured web browsers so it won’t sever me completely unfortunately.
I’ve also had the idea of writing a pidgin-like clone for services I use, so I can get a seamless experience using IRC, Facebook Messenger, Discord, Matrix, etc. I think the best idea is to transform all those messages into ActivityPub messages and then consume them… So essentially a bunch of ActivityPub bridges.
A data point. One program I wrote back in 1991 I still use to this day. It’s in C, with no dependencies other than a standard C compiler (and the standard C library). It has been modified since then (in 1992, 2009, 2011, 2012, 2015, 2020 and 2021) so I’m unsure what that means in this context. Another program I wrote in 1999 that I still use today is in C (my blogging engine). And just for kicks, I just typed in a C program published in 1985 (but probably written earlier) and it worked as is (but with some warnings).
Now, am I saying that C should be a 100 year language? Well, it’s halfway there. The only other contenders I see are Fortran, LISP, and Cobol (from an “it exists and is still used” perspective).
If Cobol counts, then so do Ada, Forth and Basic, all three still used in specific domains. If LISP counts, then so do SQL, APL and Prolog, whose original implementations aren’t used anymore but still live through other incarnations.
And I’m sure I’ve forgotten a few others. That makes C a little less impressive I think.
Yes, this is quite like what the Fox Project in the 90s did.
Did they produce device drivers? FoxNet was a network stack in SML, and in userspace at that - not exactly device drivers. With the exception of MLton, SML runtimes (and OCaml too) use a uniform value representation, and therefore use a few bits of the pointer for the type tag. The GC also typically takes away a bit, which means full-width machine words need to be boxed, and that can be inefficient.
ATS, though, is the perfect combination of ML and C, barring the syntax. The Metasepi project is using ATS for kernel components [1].
[1] https://metasepi.org/en/tags/ats.html
C is just simple enough. Python is simple enough, too.
I use the simplest parts of C++, but a language cannot expect to be adopted if it’s not simple enough to use. Simplicity is the common denominator for all languages.
C++ has been widely adopted despite its complexity. Java is also complex (considering its ecosystem of frameworks, annotations, DI containers etc.) and it is widely used too. Rust is also comparably complex and is quite popular/hyped. …
C is very powerful, and with C structures, functions, and pointers you can do almost anything (I mean, achieve a reasonable level of abstraction). But the real pain in C is memory management (and error handling).
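For example, even straightforward resource handling tends to force you into the goto-cleanup idiom; a sketch (process and the buffer size are made up for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    /* Every acquisition needs a matching release on every error path,
       conventionally funneled through cleanup labels. */
    int process(const char *path) {
        int ret = -1;
        FILE *f = NULL;
        char *buf = NULL;

        f = fopen(path, "rb");
        if (!f)
            goto out;

        buf = malloc(4096);
        if (!buf)
            goto out_close;

        if (fread(buf, 1, 4096, f) == 0)
            goto out_free;

        ret = 0;  /* success: use the data here */

    out_free:
        free(buf);
    out_close:
        fclose(f);
    out:
        return ret;
    }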
Given that Rust is an ML, I’m not sure what the hot take is here?
FWIW, I’d love to read a good tutorial/book on how to write a non-toy SML compiler, with an approachable explanation of the type inference algorithm(s). In either C or Pascal (ideally targeting the same VM/IR as the Pascal one, with some further compiler from the IR to x86 machine code & Linux+Mac+Windows binary packages). It need not emit super optimized code, but it should be something that would let me write real software. I’d seriously consider it as a foundation to write all my personal software in. I’ve tried to understand H-M a few times, but still didn’t manage to find an explanation I could grasp in the limited time and attention span I have for it, nor to further understand how to apply it to an actual compiler. Plus there’s obviously all the other stuff needed, like a GC, some register spilling algorithm, scope tracking, etc.
I am not sure why you want to write it in C, but you may find this tutorial from Simon Peyton Jones helpful.
For a number of reasons: portability, ease of understanding for me (I’m not experienced esp. with functional languages), and importantly, not glossing over important features like GC (which in more advanced languages can often be just reused from their runtime).
As for the tutorial - thanks, I’ll definitely try to read it! Although it explicitly speaks about implementing a toy language, and not in C/Pascal but in Miranda, a functional language that I hadn’t even heard of before, so I’m not sure I’ll be able to follow; plus, I presume GC will be glossed over, and strings or FFI won’t be supported. But I will try, based on your recommendation - H-M is bugging me, so if that’s a good chance to try cracking it, try I will! :) Also, it mentions some parallel VM interpreter, which could be a notable high point OTOH IIUC.
There is a lot of literature on GC-writing, though I can see why it would be nice to have a fully integrated tutorial. I might look at Crafting Interpreters for that; it walks through an implementation of a simple bytecode VM for a Python-like language, including GC.
There was a good tutorial on HM type inference I saw at one point, but I cannot find it again. Much material on the topic is full of notation which may seem opaque. Best of luck.
C doesn’t really help you there. If you want to use the GC from a runtime written in another language then you need one of the following:

- a conservative collector that scans the stack and treats anything that looks like a pointer as a root,
- compiler support for emitting stack maps, so the collector can find roots precisely, or
- to explicitly register every on-stack root with the collector yourself.
You can do the last one of these with C (there’s a paper about it), but it’s quite cumbersome - you need to build a linked list of on-stack objects that contains all of your roots, and ensure that you don’t keep any temporary pointers live across function calls.
If anything, the fact that C (practically, if not according to a strict reading of the standard) conflates pointers and integers makes this harder.
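For the curious, this is roughly what that linked-list-of-roots (shadow stack) approach looks like; a sketch, where gc_alloc stands in for a hypothetical collector entry point:

    #include <stddef.h>

    extern void *gc_alloc(size_t size);  /* hypothetical GC allocation call */

    /* Each function pushes a frame recording the addresses of its
       pointer-valued locals, so the collector can find every root. */
    struct gc_frame {
        struct gc_frame *prev;
        int nroots;
        void **roots[2];   /* two root slots, enough for this example */
    };

    static struct gc_frame *gc_top;  /* the linked list of live frames */

    void *make_pair(void *car, void *cdr) {
        /* Register the locals as roots before anything can trigger a collection. */
        struct gc_frame frame = { gc_top, 2, { &car, &cdr } };
        gc_top = &frame;

        void **p = gc_alloc(2 * sizeof(void *));  /* may collect or move objects */
        p[0] = car;   /* car and cdr are re-read in case the GC moved them */
        p[1] = cdr;

        gc_top = frame.prev;  /* must be popped on every return path */
        return p;
    }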
H-M is basically a handful of rules for the syntax-driven specification of a first-order unification problem, plus the solving of that unification problem, so I would suggest learning / implementing first-order unification first. Then try to understand the typing rules for H-M, which are what define the unification problem.
Wikipedia is sufficient for both, as long as you’re okay with learning the mathematical notation. I’ve implemented unification several times now, in multiple ways, as part of my work on Dawn. I’d be happy to answer any questions you might have. Feel free to DM me.
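For what it’s worth, the unification half really is small. Here’s a rough C sketch over a toy type representation (no occurs check, and it assumes constructor names are interned so they can be compared as pointers):

    #include <stddef.h>

    /* A type is either an unbound variable or a constructor
       applied to up to two subterms (e.g. "->" for function types). */
    typedef struct type {
        enum { VAR, CON } tag;
        struct type *link;    /* VAR: what it was unified with, or NULL */
        const char *name;     /* CON: interned constructor name */
        struct type *arg[2];  /* CON: subterms, NULL if unused */
    } Type;

    /* Chase variable links to the representative term. */
    static Type *prune(Type *t) {
        while (t->tag == VAR && t->link)
            t = t->link;
        return t;
    }

    /* Returns 1 on success, 0 on a type error. */
    static int unify(Type *a, Type *b) {
        a = prune(a);
        b = prune(b);
        if (a == b)
            return 1;
        if (a->tag == VAR) { a->link = b; return 1; }
        if (b->tag == VAR) { b->link = a; return 1; }
        if (a->name != b->name)  /* distinct constructors never unify */
            return 0;
        for (int i = 0; i < 2; i++)
            if (a->arg[i] && !unify(a->arg[i], b->arg[i]))
                return 0;
        return 1;
    }

The H-M part is then mostly walking the syntax tree, generating unify calls from the typing rules, plus generalisation at let bindings.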