Threads for andyc

  1. 4

    He’s obviously being defensive, but he has a good point about considering other types of safety than just memory. For example, languages without something like RAII don’t have a good way to enforce the cleanup of resources in a timely way — you have to remember to use optional constructs like “defer” to call cleanup code, otherwise the cleanup won’t happen until the GC decides to finalize the owning object, or maybe never. The arbitrary nature of finalizers has been a pain point of Java code for as long as I can remember, when working with any resource that isn’t Pure Java(tm).
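    For contrast, Rust gets the same deterministic cleanup via its Drop trait. A minimal sketch (the Guard type and event log are illustrative, not from any real library):

```rust
use std::cell::RefCell;

thread_local! {
    // Records events so the cleanup ordering is observable.
    static LOG: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

struct Guard(&'static str);

impl Drop for Guard {
    // Runs deterministically when the value goes out of scope,
    // unlike a GC finalizer, which may run late or never.
    fn drop(&mut self) {
        LOG.with(|l| l.borrow_mut().push(self.0));
    }
}

fn demo() -> Vec<&'static str> {
    LOG.with(|l| l.borrow_mut().clear());
    {
        let _g = Guard("resource closed");
        LOG.with(|l| l.borrow_mut().push("resource in use"));
    } // _g dropped here: cleanup happens now, not at the GC's leisure
    LOG.with(|l| l.borrow().clone())
}

fn main() {
    assert_eq!(demo(), vec!["resource in use", "resource closed"]);
}
```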

    1. 8

      Part of the problem, though, is that: a) that is a deflection from the entire point of the NSA document Stroustrup is ostensibly replying to, which is that almost all serious safety problems are memory safety problems of some kind, which C++ cannot seriously mitigate; and b) the ‘other forms of safety’ that Stroustrup talks about in the linked letter and positions as better without actually arguing for them explicitly (what he calls ‘type-and-resource safety’) are also things that C++ fundamentally can never deliver - the linked documents are about as serious an approach to getting the described properties for C++ as the smart-pointer work was to getting memory safety for C++.

      Like, C++ doesn’t have memory safety (and also some related things like ‘iterator invalidation safety’) and fundamentally cannot get it without massively breaking changes, and (specifically) the lack of temporal memory safety and aliasing safety means that their approaches to ‘type-and-resource safety’ will fundamentally do essentially nothing.

      This is part of a long pattern of Stroustrup trying to stop any possibility of progress on safety by (amongst other things) using his name, reputation, and any position he is able to get to push for the diversion of effort and resources into big and difficult projects of work that will look like progress but fundamentally cannot ever achieve anything good.

      1. 13

        I would argue that memory safety is not a problem of the C++ language, it’s a problem of implementations. Real Soon Now[1], my team is going to be open sourcing a clean slate RTOS targeting a CHERI RISC-V core. The hardware enforces memory safety, the key RTOS components are privilege separated and the platform has been an absolute joy to develop.

        Languages like Rust have a stronger guarantee: they check a lot of properties at compile time, which avoids the bugs rather than simply preventing them from being exploitable. This comes with the caveat that the only data structure you can express without dipping into unsafe (either explicitly or via the standard library) is a tree, and then you need to reason about all of the ways in which those unsafe behaviours interact, without any help from the type system. The Oakland paper from a while back that found a couple of hundred CVEs in Rust crates by looking for three idioms where people misuse things that hide unsafe behind ‘safe’ interfaces suggests that people are not good at this.

        The other problem that we’ve seen with Rust is that the compiler trusts the type system. This is fine if all of the code is within the Rust abstract machine, but is a nightmare for systems that interact with an adversary. For example, we saw some code that read a Rust enumeration from an MMIO register and checked that it was in the expected range. The compiler knew that enumerations were type safe so elided the check, introducing a security hole. The correct fix for this is to move the check into the unsafe block that reads from the MMIO register, but that’s the kind of small detail that’s likely to get overlooked (and, in code review, someone may well say ‘this check isn’t doing anything unsafe, can you move it out of the unsafe block?’ because minimising the amount of unsafe code is normally good practice). We need to check a bunch of things at API boundaries to ensure that the caller isn’t doing anything malicious and, in Rust, all of those things would be things that the compiler would want to assume can never happen.
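        To make the failure mode concrete, here is a minimal sketch (names and values are hypothetical; a real driver would do a volatile read inside unsafe). The point is that the range check has to act on the raw integer, because once a value has an enum type the compiler is entitled to assume it is valid and may elide any later check:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    Idle = 0,
    Run = 1,
}

// Stand-in for the raw register read; on real hardware this would be an
// unsafe volatile MMIO read, and a glitched device could return any bits.
fn read_mmio_raw() -> u32 {
    7
}

// The range check lives next to the raw read, while the value is still a
// plain integer that the compiler makes no assumptions about.
fn read_mode() -> Option<Mode> {
    match read_mmio_raw() {
        0 => Some(Mode::Idle),
        1 => Some(Mode::Run),
        _ => None, // out-of-range: rejected at the boundary
    }
}

fn main() {
    // The glitched value 7 never becomes a `Mode`.
    assert_eq!(read_mode(), None);
}
```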

        We will probably rewrite a chunk of the code in Rust at some point (once the CHERI support in the Rust compiler is more mature) because there are some nice properties of the language, but we have no illusions that a naive Rust port will be secure.

        [1] I think we have all of the approvals sorted now…

        1. 3

          The other problem that we’ve seen with Rust is that the compiler trusts the type system. This is fine if all of the code is within the Rust abstract machine, but is a nightmare for systems that interact with an adversary. For example, we saw some code that read a Rust enumeration from an MMIO register and checked that it was in the expected range. The compiler knew that enumerations were type safe so elided the check, introducing a security hole.

          Heh, Mickens was right – you can’t just place a LISP book on top of an x86 chip and hope that the hardware learns about lambda calculus (or, in this case, type theory…) by osmosis :-).

          This is one of the things I also struggled with back when I thought I knew enough Rust to write a small OS kernel and I was a) definitely wrong and b) somewhat disappointed. I ran into basically the same problem – reading an enum from a memory-mapped config register. As usual, you don’t just read it, because some ranges are valid, some are reserved, some are outright invalid, and of course they’re not all consecutive ranges, so “reading” is really just the happy ending of the range checks you do after a memory fetch.

          At the time, I figured the idiomatic way to do it would be via the TryFrom trait, safely mapping config register values to my enum data type. The unsafe code block would read a word and not know/care what it means, then I’d try to build the enum separately from that word, which was slower and more boilerplatey than I’d have wished but would prevent the compiler from “helping” me along. That looked cool both on paper and on screen, until I tried to support later revisions of the same hardware. Teaching it to deal with different hardware revisions, where valid and reserved ranges differ, turned out to be really stringy and more bug-prone than I would’ve wanted.
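          For what it’s worth, that idiom might look roughly like this (a sketch; the field name and the valid range are made up):

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq, Clone, Copy)]
enum CfgField {
    Disabled = 0,
    Enabled = 1,
}

impl TryFrom<u32> for CfgField {
    type Error = u32;

    // All validity knowledge lives here; the unsafe register read just
    // produces a raw word and neither knows nor cares what it means.
    fn try_from(raw: u32) -> Result<Self, Self::Error> {
        match raw {
            0 => Ok(CfgField::Disabled),
            1 => Ok(CfgField::Enabled),
            other => Err(other), // reserved or invalid ranges
        }
    }
}

fn main() {
    assert_eq!(CfgField::try_from(1), Ok(CfgField::Enabled));
    assert_eq!(CfgField::try_from(9), Err(9)); // reserved value rejected
}
```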

          My first instinct had been to read and range-check the values in the unsafe block, then build the enum From that, which was at least slightly faster and more condensed (since it was guaranteed to succeed) – or skip enumerating values separately altogether. However, it seemed that was just safety theater, as the conversion was guaranteed to succeed only insofar as the unsafe check was right, thus reducing the whole affair to C with a very verbose typecast syntax.

          Frankly, I’m still not sure what the right answer would be, or rather, I haven’t found a satisfactory one yet :(.

          1. 1

            Haha, great quote … the way I phrase it is: “When models and reality collide, reality wins.”

            Type systems are models, not reality … I see a lot of solipsistic views of software that mistake the map for the territory

            Previous comment on “the world”: https://lobste.rs/s/9rrxbh/on_types#c_qanywm

            Not coincidentally, it also links to an article about interfacing Rust with hardware

            1.  

              It’s hard to say without looking at a specific example, but a common trap with Rust enums is that often you don’t want an enum, you want an integer with a bunch of constants:

              struct Flag(u8);
              
              impl Flag {
                const Foo: Flag = Flag(0);
                const Bar: Flag = Flag(1);
              }
              
              1.  

                I may be misunderstanding some details about how this works, but in the context of interfacing with the underlying hardware I think I generally want both: a way to represent related values (so a struct Flag(u8) with a bunch of constant values) and an enumerated set of valid flag values, so that I can encode range checks in TryFrom/TryInto. Otherwise, if I do this:

                let flags = cfg.get_flags(dev.id)?
                

                where

                fn get_flags(&self, id: DeviceId) -> Result<Flag>
                

                I will, sooner or later, write get_flags in terms of reading a byte from a corrupted flash device and I’ll wind up trying to write Flag(42) to a config register that only takes Flag::Foo or Flag::Bar.

                Having both means that my config read/write chain looks something like this: I get a byte from storage, I build my enum Flag instance based on it. If that worked, I now know I have a valid flag setting that I can pass around, modulo TryFrom<u8> implementation bugs. To write it, I hand it over to a function which tl;dr will turn my flags into an u8 and yell it on the bus. If that function worked, I know it passed a valid flag, modulo TryInto<u8> implementation bugs.

                Otherwise I need to hope that my read_config function checked the byte to make sure it’s a valid flag, and that my set_config function checked the flag I got before handing it to bus_write, and I do not want to be that optimistic :(.
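                As a sketch, that read/write chain could look like this (reusing the hypothetical Foo/Bar flag values from upthread):

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Flag {
    Foo = 0,
    Bar = 1,
}

impl TryFrom<u8> for Flag {
    type Error = u8;

    // Read chain: a byte from (possibly corrupted) storage only becomes
    // a Flag if it passes the range check.
    fn try_from(byte: u8) -> Result<Self, u8> {
        match byte {
            0 => Ok(Flag::Foo),
            1 => Ok(Flag::Bar),
            other => Err(other), // corrupted flash, reserved ranges, ...
        }
    }
}

// Write chain: infallible, because a Flag is valid by construction, so
// there is no way to reach the bus with Flag(42).
fn flag_to_byte(f: Flag) -> u8 {
    f as u8
}

fn main() {
    // A corrupted byte from storage never becomes a Flag...
    assert_eq!(Flag::try_from(42), Err(42));
    // ...and a valid Flag round-trips through the bus representation.
    assert_eq!(flag_to_byte(Flag::try_from(1).unwrap()), 1);
}
```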

            2. 2

              I would argue that memory safety is not a problem of the C++ language, it’s a problem of implementations. Real Soon Now[1], my team is going to be open sourcing a clean slate RTOS targeting a CHERI RISC-V core. The hardware enforces memory safety, the key RTOS components are privilege separated and the platform has been an absolute joy to develop.

              That’s cool. I’m quite excited for CHERI. My question is this - when you do run into a memory safety issue with CHERI what is the dev experience? In Rust you get a nice compiler error, which feels much “cheaper” to handle. With CHERI it feels like it would be a lot more expensive to have these bugs show up so late - although wayyyyyyy better than having them show up and be exploitable.

              The Oakland paper from a while back that found a couple of hundred CVEs in Rust crates by looking for three idioms where people misuse things that hide unsafe behind ‘safe’ interfaces suggests that people are not good at this.

              For sure. Rudra is awesome. Unsafe is hard. Thankfully, the tooling around unsafe for Rust is getting pretty insane - miri, rudra, fuzzing, etc. I guess it’s probably worth noting that the paper is actually very positive about Rust’s safety.

              My opinion, and what I have observed, is that while there will be unsafety in rust it’s quite hard to exploit it. The bug density tends to be very low, low enough that chaining them together can be tough.

              This is fine if all of the code is within the Rust abstract machine, but is a nightmare for systems that interact with an adversary. For example, we saw some code that read a Rust enumeration from an MMIO register and checked that it was in the expected range. The compiler knew that enumerations were type safe so elided the check, introducing a security hole.

              I don’t understand this. What are you referring to with regards to “an adversary”? Did an attacker already have full code execution and then leverage a lack of a check elsewhere? Otherwise, if the compiler eliminated the check, it shouldn’t be possible to reach that without unsafe elsewhere. Or did you do something like cast the enum from a value without checking? I don’t really understand.

              We need to check a bunch of things at API boundaries to ensure that the caller isn’t doing anything malicious and, in Rust, all of those things would be things that the compiler would want to assume can never happen.

              I’m just not understanding who this attacker is.

              1. 2

                That’s cool. I’m quite excited for CHERI. My question is this - when you do run into a memory safety issue with CHERI what is the dev experience? In Rust you get a nice compiler error, which feels much “cheaper” to handle. With CHERI it feels like it would be a lot more expensive to have these bugs show up so late - although wayyyyyyy better than having them show up and be exploitable.

                It’s all run-time trapping. This is, I agree, much worse than catching things at compile time. On the other hand, running existing code is a better developer experience than asking people to rewrite it. If you are writing new code, please use a memory-safe (and, ideally, type-safe) language.

                I don’t understand this. What are you referring to with regards to “an adversary”.

                One of the problems with Rust is that all non-Rust code is intrinsically unsafe. For example, in our model, we can pull in things like the FreeRTOS network stack, mBedTLS, and the Microvium JavaScript VM without having to rewrite them. In Rust, any call to these is unsafe. If an attacker compromises them, then it’s game over for Rust (this is no different from C/C++, so Rust at least gives you attack-surface reduction).

                If a Rust component is providing a service to untrusted components then it can’t trust any of its arguments. You (the programmer) still need to explicitly check everything.

                Did an attacker already have full code execution and then leverage a lack of a check elsewhere?

                This case didn’t have an active adversary in software. It had an attacker who could cause power glitches that caused a memory-mapped device to return an invalid value from a memory-mapped register. This is a fairly common threat model for embedded devices. If the out-of-range value is then used to index something else, you can leverage it to gain memory corruption and possibly hijack control flow and then you can use other paths to get arbitrary code execution.

                I’m just not understanding who this attacker is.

                Everyone else who provides any code that ends up in your program, including authors of libraries that you use. Supply chain vulnerabilities are increasingly important.

                1. 1

                  On the other hand, running existing code is a better developer experience than asking people to rewrite it.

                  For sure. Mitigations like CHERI are critical for that reason - we can’t just say “well you should have used Rust”, we need practical ways to make all code safer. 100%.

                  If an attacker compromises them,

                  So basically the attacker has full code execution over the process. Yeah, unless you have a virtual machine (or hardware support) I don’t think that’s a problem you can solve in Rust or any other language. At that point the full address space is open to the attacker.

                  It had an attacker who could cause power glitches that caused a memory-mapped device to return an invalid value from a memory-mapped register.

                  This sounds like rowhammer, which I can’t imagine any language ever being resistant to. That has to happen at a hardware level - I think that’s your point? Because even if the compiler had inserted the check, if the attacker here can flip arbitrary bits I don’t think it matters.

                  Supply chain vulnerabilities are increasingly important.

                  For sure, and I think perhaps we’re on the same page here - any language without a virtual machine / hardware integration is going to suffer from these problems.

                  1. 1

                    So basically the attacker has full code execution over the process

                    That’s the attacker’s goal. Initially, the attacker has the ability to corrupt some data. They may have the ability to execute arbitrary code in some sandboxed environment. They are trying to get arbitrary code execution outside of the sandbox.

                    This sounds like rowhammer, which I can’t imagine any language ever being resistant to. That has to happen at a hardware level - I think that’s your point? Because even if the compiler had inserted the check, if the attacker here can flip arbitrary bits I don’t think it matters.

                    You get equivalent issues from converting an integer from C code into an enumeration where an attacker is able to do something like a one-byte overwrite and corrupt the value.

                    Typically, attacks start with something small, which can be a single byte corruption. They then chain together exploits until they have full arbitrary code execution. The problem is when the Rust compiler elides some of the checks that someone has explicitly inserted defensively to protect against this kind of thing. Note that this isn’t unique to Rust. C/C++ also has this problem to a degree (for example, eliding NULL checks if you accidentally dereference the pointer on both paths) but it’s worse in Rust because the language abstract machine guarantees more in type-safe Rust than it does in C.

                    1. 1

                      I don’t really agree with this premise but that’s fine.

                      1. 1

                        You get equivalent issues from converting an integer from C code into an enumeration where an attacker is able to do something like a one-byte overwrite and corrupt the value.

                        I’m confused, you mean copying the int into a Rust enum too narrow for it?

                        The problem is when the Rust compiler elides some of the checks that someone has explicitly inserted defensively to protect against this kind of thing.

                        Are you referring to checks at the boundary, or checks far behind it?

                        1. 2

                          I’m confused, you mean copying the int into a Rust enum too narrow for it?

                          No, the flow is a C function returning an enumeration that you coerce into a Rust enumeration that holds the same values. An attacker is able to trigger a one-byte overwrite in the C code that means that the value returned is not actually a valid value in that enumeration range. The Rust programmer doesn’t trust the C code and so inserts an explicit check that the enumeration is a valid value. The Rust compiler knows that enumerations are type safe and so elides the check. Now you have a way for an attacker with a one-byte overwrite in C code to start a control-flow hijacking attack on the Rust code.
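                          A sketch of the fix (hypothetical names; a plain function stands in for the C side): declare the FFI boundary as returning a raw integer rather than the enum, so the defensive check operates on a value the compiler makes no assumptions about:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Status {
    Ok = 0,
    Busy = 1,
}

// Stand-in for the C function: declared as returning a raw u32, NOT a
// Status. If it were declared as returning the enum, the Rust compiler
// could assume the value is in range and elide any later check.
fn c_get_status() -> u32 {
    200 // simulate a one-byte overwrite on the C side producing garbage
}

fn get_status() -> Option<Status> {
    // The check happens while the value is still an untyped integer.
    match c_get_status() {
        0 => Some(Status::Ok),
        1 => Some(Status::Busy),
        _ => None,
    }
}

fn main() {
    assert_eq!(get_status(), None); // corrupted value caught, not coerced
}
```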

                          Are you referring to checks at the boundary, or checks far behind it?

                          Checks in the trusted (Rust) code, outside of unsafe blocks.

                2. 2

                  The Oakland paper from a while back that found a couple of hundred CVEs in Rust crates by looking for three idioms where people misuse things that hide unsafe behind ‘safe’ interfaces suggests that people are not good at this.

                  Can you share a link to that paper?

                  1. 2

                    I misremembered, it was at SOSP, not Oakland (it was covered here). The title was:

                    Rudra: Finding Memory Safety Bugs in Rust at the Ecosystem Scale

                    1. 1

                      Thanks! That one I do remember 😄

                  2. 1

                    Languages like Rust have a stronger guarantee: they check a lot of properties at compile time, which avoids the bugs rather than simply preventing them from being exploitable. This comes with the caveat that the only data structure you can express without dipping into unsafe is a tree

                    (Tracing) gc has no trouble with actual graphs, and still prevents all those nasty bugs by construction.

                    The other problem that we’ve seen with Rust is that the compiler trusts the type system

                    Yes—I am still waiting for capability-safety to be table stakes. Basically no one should get the ‘unsafe’ god-capability.

                    1. 3

                      (Tracing) gc has no trouble with actual graphs, and still prevents all those nasty bugs by construction.

                      But it does have problems with tail latency and worst-case memory overhead, which make it infeasible in the kinds of scenarios where you should consider C++. If neither of those is a constraint for your problem domain, C++ is absolutely the wrong tool for the job.

                      Yes—I am still waiting for capability-safety to be table stakes. Basically no one should get the ‘unsafe’ god-capability.

                      Unfortunately, in Rust, core standard-library things like Rc depend on unsafe, and so everything would need to hold the capability to perform unsafe in order to pass it down to those crates, unless you have a compile-time capability model at the module level.

                      1. 2

                        Unsafe can be switched off at the module level, and the module is indeed also the boundary of unsafe in Rust.

                        A mistake with unsafe may be triggered from the outside, but a correct unsafe implementation is well-encapsulated. That very effectively reduces the scope of review.
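                        For reference, that switch is the unsafe_code lint; a small sketch:

```rust
// No code inside this module can use `unsafe`; violating this is a
// compile error, not a warning.
#[forbid(unsafe_code)]
mod pure_logic {
    pub fn checked_add(a: u32, b: u32) -> Option<u32> {
        a.checked_add(b)
    }
}

fn main() {
    assert_eq!(pure_logic::checked_add(1, 2), Some(3));
    assert_eq!(pure_logic::checked_add(u32::MAX, 1), None);
}
```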

                  3. 2

                    I basically agree with you! I haven’t been aware of these tendencies of his, but I’m not surprised.

                    But I think the types of safety provided by RAII are valuable too. My day job these days is mostly coding in Go and I miss RAII a lot. Just yesterday I had to debug a deadlock produced by a high-level resource issue (failure to return an object to a pool) that wouldn’t have occurred in C++ because I would have used some RAII mechanism to return it.

                    1. 1

                      a) That is a deflection from the entire point of the NSA thing that Stroustrup is ostensibly replying to, which is that almost all serious safety problems are memory safety problems of some kind, which C++ cannot seriously mitigate

                      Thank you. I’m soooo sick of seeing “but Rust doesn’t solve all forms of safety so is it even safe?”. “Rust is safe” means “Rust is memory safe”. That’s a big deal, memory safety vulnerabilities are highly prevalent and absolutely worst-case.

                      The whole post by him is really ignorant.

                      1. 1

                        That is a deflection from the entire point of the NSA thing that Stroustrup is ostensibly replying to, which is that almost all serious safety problems are memory safety problems of some kind,

                        That would have to be heavily qualified to a domain – otherwise I’d say it’s just plain untrue.

                        String injection like HTML / SQL / Shell are arguably worse problems in the wide spectrum of the computing ecosystem, in addition to plain mistakes like logic errors and misconfiguration.

                        As far as I can tell, none of these relate to memory safety:

                        https://www.checkpoint.com/cyber-hub/cloud-security/what-is-application-security-appsec/owasp-top-10-vulnerabilities/

                    1. 6

                      Ha, glad he mentioned Herb Sutter’s Cpp2 at the end, because that’s basically what was going through my head while reading it.

                      I watched this 2 hour video back in November, expecting not to really like it, but it has a number of good points about language design, and is well-motivated:

                      https://www.youtube.com/watch?v=ELeZAKCN4tY

                      It shows why syntax does matter – for generic programming and compile-time metaprogramming. If you have wildly different syntax for data values, type values, and function values, then you can’t easily write “higher order” programs.

                      He doesn’t mention Zig, but I believe that compile-time metaprogramming is the rationale for its somewhat unusual syntax, e.g.

                      const Config = struct {
                          vals: struct { testing: u8, production: u8 },
                          uptime: u64,
                      };
                      

                      rather than struct Config { } like in C.

                      Although Zig shows pub fn main() on its home page – I thought you could also do something like main = fn() { ... } ?

                      https://ziglang.org/


                      Though to be honest, for a brief while, I wrote JavaScript like var f = function() { ... }, but then I switched back to function f(), simply because 95%+ of code doesn’t need any kind of reflection or metaprogramming.

                      And often if you have 20% of code that DOES use reflection/metaprogramming, then it’s poorly factored. You can separate it into a “compiler” and a “runtime”, not mixing the two things haphazardly.

                      So I guess that is the counterpoint: it doesn’t matter most of the time, which is probably why it’s far down on the list of issues with Rust.

                      1. 5

                        Since files are also structs, Zig’s import syntax is unified with struct declarations:

                        const std = @import("std");
                        

                        I love how simple this is. “Fewer composable general features” as Herb quotes in that presentation.

                        Also thanks for linking this talk - watching it now!

                        1. 2

                          Ahh I didn’t realize that, very cool!

                          It honestly reminds me of the Perlis-Thompson Principle

                          wiki: https://github.com/oilshell/oil/wiki/Perlis-Thompson-Principle

                        2. 3

                          Although Zig shows pub fn main() on its home page …

                          The proposal #1717 would change the syntax to make functions follow the const name = ... pattern as well. Andrew accepted it in 2018 but then removed the accepted label in 2019. Looks like they’re still considering it and haven’t decided yet.

                        1. 3

                          I like that distinction. Another way I’ve phrased it is that interpreters are a function of argv, but compilers aren’t. (Obviously a compiler has its own argv, but you can probably see what I mean)

                          For example, I remember someone on Reddit made the claim that Python has compile-time metaprogramming.

                          Python has a bytecode compiler, and you can do stuff before main() in Python. And that “stage” is often used to do things like what you would do with constexpr in C++ or comptime in Zig.

                          But it’s not compile-time – it’s interpreted runtime. Because you can use sys.argv in the stage before main. So Python has a single stage of execution with runtime reflection, not 2 stages (compile-time metaprogramming, then runtime)

                          (Python does have a bytecode compiler, but no compile-time metaprogramming. It doesn’t expose a hook in the actual bytecode compilation phase, although it does export a reflective bytecode compiler interface at runtime. [1])


                          One thing I found interesting is that AFAIK PyPy’s RPython (which is different than Python) makes this real: the compiler “starts” at main().

                          I do a similar thing in Oil:

                          • In the interpreted Python mode, it computes a bunch of stuff before main(), and then main() uses that data.
                          • In the translation to C++, the C++ code generators import the stuff before main(), evaluate it, and then write the result to global C++ data structures. The C++ code then uses those “baked in” globals, incurring zero startup time.

                          For example, this is done with the huge array of regular languages in the lexer, flag parser definitions, computed string constants, computed lookup tables (e.g. Pratt parser precedence), and more! Other shells have traditional C++ code gen steps to do the same thing, but I’ve come to like this “one program but partially evaluated” style.


                          Request: someone should write a follow-up explaining Futamura projections :) As I understand it, it’s a form of partial evaluation where you take an interpreter + a partial evaluator, and get a compiler for free.

                          I think it should be possible to do it with “Make a Lisp” which has lots of real implementations – https://github.com/kanaka/mal

                          And then maybe compile to: https://en.wikipedia.org/wiki/SECD_machine

                          For that case, the problem with BF is that nobody knows how to write a BF compiler in BF, and even if they could, nobody could read it. i.e. the compiler/interpreter examples were in Rust and not BF

                          Ideas here: https://old.reddit.com/r/ProgrammingLanguages/comments/10m1d76/distinguishing_an_interpreter_from_a_compiler/j618x8j/


                          [1] edit: just realized that the author’s Converge language explored precisely that issue for a Python-like language :) https://convergepl.org/about.html

                          1. 3

                            Is there really anyone around who would argue to the contrary in 2023? I mean sure, there are plenty of people still using string based approaches, but only because they’re easy, or because the codebase already works that way, surely not because they actually think they’re better?

                            1. 3

                              What’s the argument precisely? That all languages should have JSX-like literals?

                              But many don’t currently (Go, Rust, Ruby, Python), so you will almost certainly have to use string-based approaches sometimes.

                              I agree that manual escaping is not really acceptable anymore. Auto-escaping is more desirable, but still has limitations. It’s also not widely deployed.

                              The major limitation of the AST approach is that it only goes one level deep, whereas JS, CSS, and URIs are commonly embedded in HTML. e.g. my comment above: https://lobste.rs/s/j4ajfo/producing_html_using_string_templates#c_zyoyqn

                              So honestly the problem is UNSOLVED and there’s not “one answer in 2023” that everyone should use.

                              1. 5

                                What’s the argument precisely? That all languages should have JSX-like literals?

                                I haven’t used JSX; the argument I’m thinking of is more that the ERb/PHP style of “here’s an HTML file; just smush some strings into it” is not as good as using your language’s native data structures to represent the document’s tree structures. For instance, I use this in Fennel to create an HTML list, but it’s the same notation any application would use for data structures:

                                [:li {} [:a {:href "https://www.minetest.net/"} "video"]
                                        [:a {:href "https://love2d.org"} "game"]
                                        [:a {:href "https://tic80.com"} "dev"]]
                                

                                The major limitation of the AST approach is that it only goes one [language] deep, whereas JS, CSS, and URIs are commonly embedded in HTML.

                                I understand that this is supported by HTML, but I have never understood how this could be considered a desirable thing in nontrivial cases. I see it done a lot, but every time I do it myself, I end up regretting it, and it goes much better when you have separate languages in separate files. (for JS/CSS, not URIs). The fact that this is not supported counts as a big plus in my book.

                                1. 1

                                  That approach doesn’t seem to be popular, for whatever reason. If I had to guess, I’d say it’s < 5% of web apps that use it, while string templates or JSX-like literals are 95%.

                                  For example lobste.rs almost certainly doesn’t use it, since it’s written in Rails. (Probably one of the few web apps that does is Hacker News, which is written in a Lisp dialect!)

                                  So I think plenty of people would argue the contrary in 2023 – like the people writing 95% of the web apps we use :)

                                  I think the reason is pretty mundane: You can do some version of that in Ruby and Python (and Lua), but it’s cumbersome and ugly.
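For reference, the AST approach itself fits in a few lines of Python (a hypothetical minimal renderer, not any real library); the cumbersomeness is in writing the nested-list markup by hand, not in the renderer:

```python
import html

def render(node):
    """Render a hiccup-style nested list to an HTML string.

    A node is either a string (escaped) or a list: [tag, {attrs}, *children].
    """
    if isinstance(node, str):
        return html.escape(node)
    tag, attrs, *children = node
    attr_str = "".join(
        f' {k}="{html.escape(str(v), quote=True)}"' for k, v in attrs.items()
    )
    inner = "".join(render(c) for c in children)
    return f"<{tag}{attr_str}>{inner}</{tag}>"

print(render(["li", {}, ["a", {"href": "https://love2d.org"}, "game"]]))
# -> <li><a href="https://love2d.org">game</a></li>
```

Note that escaping happens automatically at every text node, which is the main selling point of the approach.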

                                  I also wonder if it’s better to have the same syntax for HTML/JSX across N languages, than to use N different syntaxes for the same thing. It certainly makes it easier to copy example HTML / CSS / JS from the web.

                                  1. 2

                                    Most Rails apps I’ve worked on use Slim (or Haml in the old days); you can argue whether that counts as pure AST, but it’s not just smushing strings like ERB/PHP.

                                    1. 2

                                      OK interesting.

                                      Well lobste.rs itself uses ERB, which appears to be “string templating with HTML auto-escaping”, as far as I can tell. It looks like you call raw() to disable HTML escaping.

                                      https://github.com/lobsters/lobsters/blob/master/app/views/messages/index.html.erb

                                      Now that I think about it more, it’s not clear to me that this solution is worse in terms of escaping than the AST approach.

                                      It understands a single level of escaping, and AFAICT so do most of the AST approaches.
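To make the “single level” limitation concrete, here is a small sketch of my own (hypothetical names, not from any framework): HTML-escaping protects the HTML layer, but does nothing about the URI layer nested inside it.

```python
import html

# Attacker-controlled "profile URL" smuggled into an href attribute.
user_url = "javascript:alert(1)"

# HTML auto-escaping handles <, >, &, and quotes -- but the string
# contains none of those, so the javascript: scheme survives untouched.
link = f'<a href="{html.escape(user_url)}">profile</a>'
print(link)
```

A contextually aware escaper would have to recognize that this value lands in a URI context and reject or sanitize the scheme, which is exactly the part that isn’t widely deployed.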

                                      According to the article, examples of the AST approach include:

                                      • SXML as used by Lisp/Scheme and Stan for Python.
                                      • DSLs
                                        • These are DSLs implemented fully independently. Examples include Haml and Slim (both Ruby) and Hamlet (used by Haskell’s Yesod web framework). A large number of variant languages inspired by Haml also exist.

                                      But I think all of them are DSLs for HTML. Ditto with the XML-based solutions, which also seem rarely used in practice.

                                      Do they understand

                                      • URI syntax inside HTML
                                      • JavaScript syntax inside HTML, including JavaScript string literals
                                      • URIs inside JavaScript inside HTML

                                      ?

                                      What do you think @hlandau ?

                                      1. 1

                                        Back when I did Rails (and this was well over a decade ago) 95% of people used ERb because that’s what Rails used out of the box, presumably because it was easy to learn due to its similarity to PHP. But everyone I know who actually bothered to try one of the alternatives (haml, markaby, whatever) hated ERb, but got stuck using it anyway, because it was too hard to change once the project got going, or they had to make concessions to teammates, or whatever.

                                        That’s why my original question was not “does anyone use the string approach”; I was more trying to get at “does anyone who has tried both approaches still actually think the string one is better”.

                                        1. 2

                                          I mostly use the string approaches after trying structured approaches. I don’t think it’s better in every way, but the structured approaches all have drawbacks IMO.

                                          IME the problem with the HAML-like external DSLs is that the languages aren’t expressive enough for logic.

                                          Or you make them expressive enough, and now you’re programming in a crappy template language rather than a real language.

                                          When using HAML, don’t you ever want to call back to Ruby to do something tricky? (honest question)

                                          The internal DSL approach with Lisp-like structures avoids that problem, because you can interleave data and code. (Actually Oil is trying to do that for shell: https://www.oilshell.org/release/latest/doc/hay.html Basically make data and code similar and parsed the same way)

                                          But like I said, there is the mundane reason that doing such things in Python / Ruby isn’t convenient.


                                          So what I’m claiming is that there is still a problem to be solved. I don’t think the answer is “everyone needs to use AST based approaches, and anyone who tries will realize that it’s better”.

                                          Probably the very first web app I wrote in Python was using the AST approach, because it has obvious appeal. But IME it just doesn’t work well in practice. It’s harder to read and write, and it doesn’t give you additional security over a string-based template engine.

                                          And the issue of copying markup from examples is not trivial either. Most programmers don’t remember all the nooks and crannies of HTML. It’s nice to be able to test things out in a file first, and then copy it into a dynamic program.


                                          My impression is that the AST approach works fine for small or toy websites (which is kind of what I write, very simple websites).

                                          But perhaps people who write “real websites” have settled on the string-based approach for good reasons – probably the issue of “volume”. i.e. making a website with 10,000 templates probably changes how you think about the tradeoffs.

                                          I believe string-based approaches are the default in nearly all web frameworks. Certainly in Django and Flask, in addition to rails. I don’t think the authors of those frameworks have simply never tried AST approaches.

                                          (and I would be interested in how many Rails apps use HAML vs. more string-based approaches.)

                                          1. 1

                                            When using HAML, don’t you ever want to call back to Ruby to do something tricky? (honest question)

                                            Yeah, so my memory here was pretty fuzzy; I forgot that HAML was an external thing. It’s much better than ERb but I don’t advocate its approach for the reasons you outlined above. What I prefer is something that uses the native data structures and methods of the language in question.

                                            But like I said, there is the mundane reason that doing such things in Python / Ruby isn’t convenient.

                                            I disagree; markaby here (AST-based) is quite clear and readable:

                                                h1 "Boats.com has great deals"
                                                ul do
                                                  li "$49 for a canoe"
                                                  li "$39 for a raft"
                                                  li "$29 for a huge boot that floats and can fit 5 people"
                                                end
                                            
                                            1. 1

                                              (late reply)

                                              Markaby looks nice! But it has the awkwardness of not having a place for the HTML attributes.

                                              It invents a special language for “class” and “id”, e.g. li.myclass and li.myid!

                                              https://github.com/markaby/markaby

                                              In Python it would be even more awkward:

                                              with h1("Boats"):
                                                with ul(class_="foo"):  # "class" is a Python keyword, so class_ is needed
                                                  li("$49", id="bar")
                                                  a("my link", href=url_for(mydict))
                                              

                                              You would have to put the attributes at the end.

                                              (Also I saw that Markaby was invented by “why”, who was early in Ruby’s community … I sort of refuse to believe that the authors of Rails and the like simply never tried it.)


                                              But interestingly a shell-like syntax would work quite well, and give you a lot of flexibility around quoting and interpolation:

                                              h1 "Boats" {
                                                ul class="foo" {
                                                   li id="bar" '$49'
                                                   a href=$[url_for(mydict)] 'my link'
                                                }
                                              }
                                              

                                              I thought about this for Oil, but I am a little wary of it since JSX-style literals seem to be ubiquitous. They have a good “copy-paste” story.

                                              But being able to interleave code and data seamlessly (which we can do with Oil + Hay-style data blocks) is a killer feature.

                                              Hmm …

                                  2. 1

                                    I’m not sure where that approach came from but I first saw it in Seaside and it was fantastic for two reasons:

                                    • it let you build modular components easily, where the outer view created a div and then called the inner view’s draw method.
                                    • it composed very cleanly with caching: if a view didn’t mark itself as needing a redraw then the framework could just reuse the same HTML (data structure or string) as last time.
                                  3. 1

                                    small update since I can’t edit: contextual auto-escaping isn’t widely deployed, but auto-escaping is

                                1. 17

                                  Kubernetes is our generation’s Multics.

                                  I (too?) believe it exists because UNIX never moved past being a multi-user operating system for a solitary minicomputer. Users and groups are vestigial organs. Berkeley sockets was a hack. And Plan9 failed.

                                  Erlang is the Galapagos. 😉

                                  1. 17

                                    (author here) Thanks for mentioning it :) I had collected some more links about fundamental design problems with Kubernetes, but I haven’t had time to put them on the blog. I’ll drop a couple good recent links here:

                                    Kubernetes Was Never Designed for Batch Jobs - https://lobste.rs/s/z8gxe6/kubernetes_was_never_designed_for_batch - 4 months ago on lobste.rs

                                    Kubernetes StatefulSets are Broken - https://www.plural.sh/blog/kubernetes-statefulsets-are-broken/

                                    https://news.ycombinator.com/item?id=32439255 - 5 months ago

                                    The tagline is VERY right:

                                    Kubernetes was originally intended to act as a container orchestration platform for stateless workloads, not stateful apps.

                                    An important thing that seems to be lost in the hype is that Google’s Borg (which Kubernetes began as a clone of) was ALSO “intended to act as a container orchestration platform for stateless workloads, not stateful apps”. (I used it for several years across many teams starting ~2007, and tried to write something of my own many years ago.)

                                    That’s because Borg was built for Google search in ~2003-2004. This app requires mind-boggling amounts of CPU/memory/storage/networking, but is very “stateless”. That is, the state was basically tiers of enormous immutable files produced by inverting the whole web. There is no online state; there is no notion of consistency and thus no notion of transactions.

                                    This was a world before Gmail or Google Docs existed as public products (Maps and YouTube too).

                                    In the years after Borg was first deployed, the difficulty of running regular non-search, stateful apps on it became well known. The Broccoli man video went viral in Google because so many engineers (including myself) had experienced that pain:

                                    https://www.youtube.com/watch?v=3t6L-FlfeaI

                                    (Although my memory is that it made fun of what I think of as the M requirements x N applications problem [1], not necessarily stateful apps, but both were problems. The status quo for stateful apps was to write your own application-specific distributed database, like GMail, or try to latch on to Bigtable, which was a non-relational DB with no transactions and limited querying ability. It was designed for search.)

                                    I think not until Spanner was there a true solution for fine-grained and consistent state, in applications like Gmail or Google Docs. (FWIW Google ads was the early app other than search, and it ran MySQL clusters for a long time – well into the tens / hundreds of billions of dollars range.)


                                    Anyway, my point is that overall there is a huge problem to be solved, and Kubernetes didn’t solve it. It basically took all the Broccoli Man problems we had over 10 years ago and exported them to the rest of the world. It was NOT designed to run your applications well – hence all the layers of additional stuff you need to run the most basic workloads, like a web app with a database.

                                    The problem is to create abstractions that let you run distributed applications on your own hardware, not the cloud. All cloud providers and self-hosters like Hey need to solve this problem! This is similar to Unix letting you bring up say a typesetting package on any machine to create documents, without writing assembly code for that machine.

                                    (Maybe it’s worth rabble-rousing a bit more, as I noticed a few months ago that a nascent Kubernetes / systemd “competitor” called Aurae cited my “narrow waist” posts as inspiration:

                                    https://medium.com/@kris-nova/why-fix-kubernetes-and-systemd-782840e50104

                                    https://lobste.rs/s/dwui6j/why_fix_kubernetes_systemd – followup: I didn’t manage to crawl out from under a big pile of work, which is good because we got the garbage collector working :) I’m still looking for people who want to be paid to write this kind of code under our second NLnet grant – contact me if interested.)


                                    [1] Good references on this architectural problem:

                                    https://www.oilshell.org/blog/2022/02/diagrams.html#these-abstractions-can-be-designed

                                    https://lobste.rs/s/mjo19d/unix_microservice_platforms

                                    1. 2

                                      Mild nit, because it really has no bearing on your well-written comment:

                                      BigTable does have transactions, they’re just single row, where a row is an arbitrary collection of columns (KV pairs) grouped into column families. This property was/is leveraged pervasively to build transactional applications on BigTable. Sadly (or, perhaps blessedly) some of these systems never had papers published, and avoided opensource projects making their mistakes.

                                      It probably doesn’t reveal too much to say that most applications you’re familiar with that you might guess are using Spanner today predated the wide deployment of Spanner.

                                    2. 6

                                      and https://www.unison.cloud/ is another planet :)

                                      1. 1

                                        Indeed. I wonder if a single-language distributed runtime can succeed?

                                        1. 1

                                          I wonder as well! But in the end it’s no more closed than other runtimes (a “traditional” language isn’t better at interop per se). Also, if the protocol used for distribution is open, competing runtimes could participate in the cluster. Reinventing a whole ecosystem (again, applicable to non-distributed runtimes as well) is what’s risky.

                                      2. 4

                                        And Plan9 failed.

                                        I’ve been playing with Plan 9 recently and love it. I’m leaning towards the conclusion, though, that I wouldn’t give up Common Lisp programming to switch.

                                        1. 2

                                          Why not both? All you gotta do is port SBCL …

                                          1. 1

                                            I’m not planning to be unemployed for that long ;-P

                                            Seriously, beyond the time issue, there’s also the skill and experience. I’m still planning to (somehow) get mDNS working & integrated with 9front (whether they accept the patch or not is another matter …). But a SBCL port I think would be aiming a bit too high.

                                      1. 1

                                        He’s using Python as an example of non-refcounted garbage collection, but historically Python was refcounted with cycle breakers – is it now fully mark-sweep?

                                        1. 2

                                          Python is still ref counting with cycle detection. It’s extraordinarily difficult to change because every function in every C extension is littered with Py_INCREF, Py_DECREF, etc.
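The current split between refcounting and the cycle detector is easy to demonstrate from pure Python (a small sketch of my own; `gc.disable()` is just there to make it deterministic, since `gc.collect()` still works when automatic collection is off):

```python
import gc
import weakref

gc.disable()  # make the demonstration deterministic

class Node:
    pass

# A reference cycle: the refcounts of a and b can never drop to zero.
a, b = Node(), Node()
a.other, b.other = b, a
probe = weakref.ref(a)

del a, b
assert probe() is not None   # still alive: refcounting alone can't free a cycle

gc.collect()                 # run the cycle detector explicitly
assert probe() is None       # now reclaimed
```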

                                          FWIW one thing I really learned in implementing a collector is how “littered all over” the codebase memory management is. It’s hard to change later, unless you have a layer of indirection like PyPy or Oil, which I mentioned here:

                                          https://www.oilshell.org/blog/2023/01/garbage-collector.html#a-delicate-octopus-with-thousands-of-arms

                                          I think non-moving tracing (e.g. mark and sweep) is the least “littered all over” of all strategies.


                                          Aside: Looks like they have a new dev guide which is kinda cool, haven’t seen this:

                                          https://devguide.python.org/internals/garbage-collector/

                                          1. 2

                                            Yeah, I would have been shocked if they managed to change it, though there are ways (for example you could have a non-zero refcount push an object into a separate root table, though then you need to manage cycle breaking again :) ).

                                        1. 9

                                          Eeeeeeeh. The point is pretty decent but to me the argument is mediocre.

                                          There are various ways of implementing parsers, but for me there are generally only two which we really need to consider: LR parsing and recursive descent parsing.

                                          I think what they are really trying to say is that there is “automatically generated parsers” and “hand written parsers”, and they want to argue in favor of automatically generated parsers. There’s nothing, as far as I know, stopping anyone from writing a parser generator that outputs a recursive descent parser.

                                          Although few people think of it as such, LR is effectively a type checker for grammar ambiguity: if you don’t get any errors, your grammar is guaranteed to be unambiguous.

                                          …however, there are unambiguous grammars that are not LR. Try feeding one of those into yacc and see how much it likes it.
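A standard example of this gap (my own illustration, not from the post): the grammar of even-length palindromes over {a, b} is unambiguous, yet not LR(k) for any k, because a shift-reduce parser has no way to deterministically locate the midpoint of the input:

```
S → a S a
  | b S b
  | ε
```

Feed that to yacc and you get shift/reduce conflicts even though every string has exactly one parse.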

                                          …learners cannot, in any sensible time, understand from the recursive descent parser the language’s syntax.

                                          Few people have a perfect understanding of any syntax more complicated than Scheme. It’s not as though understanding the nuances of an LR syntax is easy; if it were, writing input for a LR parser generator would be less of a gigantic pain in the ass. You talk about the bugs you’ve discovered in recursive descent parsers; reasonable, now may I talk about the bugs I’ve created trying to use LR parser generators?

                                          Although there aren’t, to the best of my knowledge, modern performance numbers, it is reasonable to assume that recursive descent parsing is generally faster than LR parsing.

                                          Bae, if you don’t have numbers to support it, do the honorable thing and say “I don’t know” instead of making “reasonable assumptions”. I would reasonably assume the opposite, that table-based parsers are generally faster than recursive descent parsers, since a small table and a small amount of driver code should all be able to fit in the CPU’s cache.

                                          1. 3

                                            @ltratt: “There are various ways of implementing parsers, but for me there are generally only two which we really need to consider: LR parsing and recursive descent parsing. “

                                            Parent: I think what they are really trying to say is that there is “automatically generated parsers” and “hand written parsers”

                                            I think what @ltratt is trying to say here is that for him there are only two sane choices. He is not trying to differentiate between handwritten and LR parsers.

                                            He also clarifies later on:

                                            @ltratt: In practice, a recursive descent parser can be thought of as meaning “a hand-written parser”: unless you go out of your way to do otherwise, nearly all hand-written parsers are recursive descent parsers.

                                            Of course, ANTLR, GLL, etc exist, but I think that is external from the point he is making, which is having a sort of (sound, but incomplete) ambiguity checker.

                                            and they want to argue in favor of automatically generated parsers.

                                            No, @ltratt wants to argue in favor of LR parsing, specifically because @ltratt believes that LR parsers are able to identify and reject some of the bad grammars (here, bad grammars are those that contain ambiguity). What counts as a bad grammar is, of course, subjective.

                                            Fundamentally, recursive descent parsers have no equivalent “type checker”, and we know that we can’t build such an equivalent for the general case.

                                            Personally, I prefer LL(1) grammars, which also provide us with a type checker like the one he claims for LR.

                                            @ltratt: There is an additional, easily over-looked, factor that comes from relying on recursive descent parsers: learners cannot, in any sensible time, understand from the recursive descent parser the language’s syntax.

                                            Here though, I think he is talking about the traditional handwritten style that implements a PEG.

                                            1. 2

                                              I would reasonably assume the opposite, that table-based parsers are generally faster than recursive descent parsers, since a small table and a small amount of driver code should all be able to fit in the CPU’s cache.

                                              I think LALRPOP optimizes LR by emitting a recursive ascent parser (essentially compiling the table to a bunch of functions). It feels (:)) like that should be faster – you’re essentially saying that an interpreter is faster than a compiler :)

                                              1. 1

                                                Looks table driven to me. Digging a bit deeper, it appears it can do both.

                                                I’m saying that function calls have overhead and that walking through a loop and switch statement can be faster than calling lots of functions. ie, a threaded interpreter vs tree-walking. I won’t speculate more without data though.

                                                1. 2

                                                  Performance is discussed a bit (/s) in this post: https://smallcultfollowing.com/babysteps/blog/2015/09/14/lalrpop/

                                              2. 2

                                                I agree, the OP is conflating “recursive descent” with “hand-written.”

                                                There’s nothing, as far as I know, stopping anyone from writing a parser generator that outputs a recursive descent parser.

                                                Like ANTLR, one of the most popular parser generators, which produces RD parsers.

                                                IIRC, PEG parser generators also output recursive descent parsers. I may be misremembering this, but I’ve looked at the gnarly C code emitted by one and it consisted of a whole lot of little functions, not the giant tables used by LALR parsers.

                                                1. 4

                                                  I think you’re looking for “top-down”

                                                  • LL parsers and recursive descent parsers are top-down; LR parsers are bottom-up.
                                                  • ANTLR produces top-down parsers, not recursive descent parsers. (it has had a variety of top-down algorithms over the years, e.g. LL(*) )
                                                  • PEGs are grammars that can be recognized with 2 different top-down parsing techniques: memoized packrat parsing, or backtracking.

                                                  I would also say all recursive descent parsers are hand-written; I’ve never seen any other usage. Recursive descent is basically a technique for taking a grammar and writing a bunch of recursive procedures in a language like C or Pascal.

                                                  There is also the case where you just write a bunch of ad hoc recursive procedures without knowing precisely what language you’re recognizing (i.e. viewed from the generative, grammar-based perspective).
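The “one procedure per nonterminal” idea can be sketched in a few lines (my own illustrative grammar and code, not from the thread):

```python
# Minimal recursive descent parser for a hypothetical grammar:
#   expr -> term ('+' term)*
#   term -> NUMBER | '(' expr ')'
# Each nonterminal becomes one mutually recursive procedure.

def parse(src):
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        assert peek() == tok, f"expected {tok!r}, got {peek()!r}"
        pos += 1

    def expr():                      # expr -> term ('+' term)*
        value = term()
        while peek() == "+":
            eat("+")
            value += term()
        return value

    def term():                      # term -> NUMBER | '(' expr ')'
        if peek() == "(":
            eat("(")
            value = expr()
            eat(")")
            return value
        tok = peek()
        eat(tok)
        return int(tok)

    result = expr()
    assert pos == len(tokens), "trailing input"
    return result

print(parse("1 + (2 + 3)"))  # -> 6
```

The shape of the grammar is directly visible in the call structure, which is both the appeal and (per the article’s complaint) the limit of how much a reader can recover from it.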

                                                  1. 2
                                                    1. 1

                                                      PEG grammars can be converted directly into a recursive descent parser, but with worst-case exponential time complexity, so they are usually implemented as memoized packrat parsers, which have linear time complexity but use considerably more memory.

                                                      1. 1

                                                        Well, it looks like Terence Parr, the creator of ANTLR, disagrees with you, in a paper he co-wrote in 2011 (p.2.):

                                                        ANTLR generates top-down, recursive-descent, mostly non-speculating parsers…

                                                        1. 2

                                                          Ah OK sorry, I thought only the earliest versions of ANTLR generated RD parsers, and then the more advanced top-down algorithms they developed moved away from that approach. But I guess I have not looked at their generated code in a very very long time.

                                                          If they generate a bunch of procedures where each one goes with a rule in the grammar, then I’d agree that’s recursive descent (wish I could correct my original comment).

                                                          But there is a fuzzy line, because it probably has to call procedures for lookahead, some of which may involve tables … i.e. I’m pretty sure the LL(*) has to do something pretty complex for lookahead, but I don’t have ANTLR handy now to look at its output.


                                                          I think PEGs are basically a formalization of backtracking recursive descent parsers, but they have the optional faster algorithm of packrat parsing (which is not recursive descent, but is top-down). They precisely define what lookahead constructs are allowed.

                                                          FWIW the parser I use for Oil is based on Python’s top-down parser generator, which is an LL, table driven parser (also not RD).


                                                          edit: I looked it up, and the newer ALL(*) algorithm used by ANTLR 4 is also described as generating recursive descent with a custom lookahead function:

                                                          https://www.antlr.org/papers/allstar-techreport.pdf

                                                          (I last used ANTLR around the time ANTLR 4 was coming out)

                                                        2. 1

                                                          What would you consider recursive descent? Would something like peg_parse() here qualify as recursive descent? Or do the nonterminals have to be implemented as procedures?

                                                          1. 1

                                                            I just looked it up, and my conception is EXACTLY what Wikipedia says as its first sentence.

                                                            https://en.wikipedia.org/wiki/Recursive_descent_parser

                                                            In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar.

                                                            The key point is that you take a grammar and write a program based on it. (I would disagree with “non-recursive equivalent” but meh)

                                                            It’s one form of top-down parsing, which is hand-written

                                                            I would call peg_parse() a top-down parser but not a recursive descent parser (assuming it’s doing what I think it is)

                                                            1. 1

Ah I see, so recursive descent is sort of a grammar compiled into a custom parser. Would you call GLL such a parser? It substitutes label-based jumping for subroutine call and return, but I think that is more of an optimization than anything else.

                                                    1. 11

                                                      Couple of thoughts:

                                                      • Pedagogically, I think it’s better to talk about recursive descent+Pratt parsing, or just Pratt parsers. Pure recursive descent is indeed quite inadequate, and many CS students (raises hand) indeed don’t realize how to parse arithmetic expressions manually.
• Checking a grammar for the absence of ambiguities is an important task! I am not quite sold on CFGs being the right tool here. With regular languages, it’s pretty clear that that’s the formalism to use, because the underlying computational device (fsm) is a very well-contained thing. With CFGs it’s much more murky. You don’t want to use the whole class, as it includes ambiguous grammars. “An unambiguous CFG grammar” is not something we can easily recognize. LL(k), LR(k), LL(*) always looked like arbitrary restrictions to me.
• Real grammars never looked particularly declarative to me. Like, the hard part is parsing expressions, and that requires either contorting the grammar into an unreadable mess, or precedence annotations. Another case which is handled badly by grammars is flags. In Rust, there are grammatical categories of “expressions which are not a struct literal” and “block-like expressions”; expressing that in the grammar requires forking the whole expression grammar twice.
• There’s also a tension that grammars recognize strings, but what you want out of the parser is an AST. The more you tweak the grammar to make it amenable to automated parsing, the worse the resulting AST structure becomes.
                                                      • To make it easier to understand what a hand written parser parses (or, really, any small function which parses text), I strongly recommend including the example of the parsed thing in the comment, or, better, actually writing tests next to each function and each branch in the function.

                                                      Ultimately, I think we have an array of tools which are good for some jobs, but we don’t have a universal thing.

                                                      • If you want a human-readable grammar which nails down the AST, you want ungrammar.
• if you want a parser, write a Pratt parser by hand, fuzz it against the grammar to suss out positive and negative bugs. (If you want parsers for many languages, throw tree-sitter at it).
• if you want to make sure your language has an unambiguous grammar, then, ideally, you’d write your grammar in a well-defined formalism, at least ungrammar-style recursive regular expressions, but with some way to deal with precedence and flags. Then there’d be a separate tool which would check this grammar for ambiguities, and spit out a suite of tests for the language.
                                                      • we don’t have such a magical grammar checker yet, so you have to get by with yacc style parser generators.
                                                      1. 4

                                                        Pedagogically, I think it’s better to talk about recursive descent+Pratt parsing, or just Pratt parsers. Pure recursive descent is indeed quite inadequate, and many CS students (raises hand) indeed don’t realize how to parse arithmetic expressions manually.

I just do not get this. Pratt parsers, shunting, etc. are way more complex than simple, hand-done recursive descent parsers. I’m lazy, hence I use recursive descent.

                                                        The discussions here about + - * / seem overblown to me. Take a language def something like what was found in one of the other posts:

                                                        <E> ::= <E> <op> <E>
                                                             | ( <E> )
                                                             | <D>
                                                        <op> ::= + | - | * | /
                                                        <D> = [0-9]+
                                                        

                                                        Convert it to an LALR(1) form:

                                                        Expression
                                                             AdditiveExpression
                                                        
                                                        AdditiveExpression
                                                            MultiplicativeExpression
                                                            AdditiveExpression "+" MultiplicativeExpression
                                                            AdditiveExpression "-" MultiplicativeExpression
                                                        
                                                        MultiplicativeExpression
                                                            PrimaryExpression
                                                            MultiplicativeExpression "*" PrimaryExpression
                                                            MultiplicativeExpression "/" PrimaryExpression
                                                        
                                                        PrimaryExpression
                                                            "(" Expression ")"
                                                            Digit*
                                                        

                                                        And you’re already pretty much done. Now you just need some code, something like:

                                                        Expression parseExpression()
                                                            {
                                                            return parseAdditive();
                                                            }
                                                            
                                                        Expression parseAdditive()
                                                            {
                                                            Expression expr = parseMultiplicative();
                                                            while (true)
                                                                {
                                                                switch (peek())
                                                                    {
                                                                    case Add:
                                                                    case Sub:
                                                                        expr = new RelationalExpression(expr, current(), parseMultiplicative());
                                                                        break;
                                                        
                                                                    default:
                                                                        return expr;
                                                                    }
                                                                }
                                                            }
                                                        
                                                        Expression parseMultiplicative()
                                                            {
                                                            Expression expr = parsePrimary();
                                                            while (true)
                                                                {
                                                                switch (peek())
                                                                    {
                                                                    case Mul:
                                                                    case Div:
                                                                        expr = new RelationalExpression(expr, current(), parsePrimary());
                                                                        break;
                                                        
                                                                    default:
                                                                        return expr;
                                                                    }
                                                                }
                                                            }
                                                        
                                                        Expression parsePrimary()
                                                            {
                                                            switch (peek())
                                                                {
                                                                case LParen:
                                                                    expect(LParen);
                    Expression expr = parseExpression();
                                                                    expect(RParen);
                                                                    return expr;
                                                                    
                                                                case Digit:
                                                                default:
                                                                    // this will correctly report an error if there are no digits
                                                                    return eatDigitsAsLiteral();
                                                                }
                                                            }
                                                        

Sorry for so much code, but after you do this a few times, the scary complexity disappears, and it’s just cut & paste. But I admit, the first time it was really hard to understand how to transform weird EBNF (etc.) into something simpler to comprehend. It’s the one time that the dragon book came in handy.

                                                        1. 2

                                                          What I usually do is

                                                          Expression parseAdd(Expression left, int myLevel) {
                                                              while (true) {
                                                                  if (accept(Add))
                                                                      left = new AddExpr(left, parseArithmetic(myLevel + 1));
                                                                  else if (accept(Sub))
                                                                      left = new SubExpr(left, parseArithmetic(myLevel + 1));
                                                                  else return left;
                                                              }
                                                          }
                                                          ...
                                                          Expression parseArithmetic(int level = 0) {
                                                              Expression left = parseExpression();
                                                              if (level <= 1) left = parseMul(left, 1);
                                                              if (level <= 0) left = parseAdd(left, 0);
                                                              return left;
                                                          }
                                                          

                                                          That way I can easily reorder my precedence and insert new operators. And it’s easy to intuitively understand what the code does: each parse term picks up “its level” of operators, while using parseArithmetic to eat terms with higher precedence.

                                                          Not sure if that’s an established thing.

                                                          1. 4

                                                            Not sure if that’s an established thing.

                                                            It is! It actually is a Pratt parser, with its characteristic shape:

                                                            parseExpr() {
                                                              while (true) {
                                                                ... parseExpr()
                                                              }
                                                            }
                                                            

If you expand ..., you’d have parseMul which would look exactly like parseAdd, and, if you DRY them out, you get exactly the classical Pratt parser. But that’s just a detail, the important thing is the while loop.

                                                          2. 2

                                                            I would call this a Pratt parser though! The core thing is having a combination of a while loop and recursion. To get canonical Pratt, you need to dry out multiplicative and additive branches:

                                                            Expression parseExpression(int level = 0)
                                                                {
                                                                Expression expr = parsePrimary();
                                                                while (true)
                                                                    {
                                                                    switch (peek())
                                                                        {
                                                                        case Add:
                                                                        case Sub:
                        if (level > 0) return expr;
                                                                            expr = new RelationalExpression(expr, current(), parseExpression(1));
                                                                            break;
                                                                        
                                                                        case Mul:
                                                                        case Div:
                        if (level > 1) return expr;
                                                                            expr = new RelationalExpression(expr, current(), parseExpression(2));
                                                                            break;
                                                            
                                                                        default:
                                                                            return expr;
                                                                        }
                                                                    }
                                                                }
                                                            

                                                            And, having done that, you can skip the first step of eliminating the left recursion from the grammar.

                                                            1. 4

                                                              FWIW it’s possible to parse expressions in a recursive descent style without pratt parsing.

                                                              For example, bash does that simply by encoding the precedence levels in the grammar, and then preserving that structure in recursive procedures.

                                                              https://opensource.apple.com/source/bash/bash-94.1.2/bash-3.2/expr.c.auto.html

                                                              I noted that here and probably in a few other places:

                                                              https://www.oilshell.org/blog/2017/03/30.html

                                                              It’s very wordy for C-style arithmetic, but definitely works.

                                                              That seems to be what @cpurdy is doing.

                                                              The while loop handles associativity, not precedence.

                                                              I’d say the difference between the two algorithms is whether the precedence is data or code. When it’s data, you have less recursion.
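To make the “precedence as data” variant concrete, here is a minimal Python sketch (my own illustration, not code from any parser mentioned above; tokens are single characters, binary operators only):

```python
# Minimal Pratt-style parser: precedence lives in a table, not in
# one function per level. Tokens are single characters.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse(tokens, pos=0, min_prec=1):
    # primary: a single digit or a parenthesized expression
    tok = tokens[pos]
    if tok == "(":
        left, pos = parse(tokens, pos + 1, 1)
        pos += 1  # skip the ")"
    else:
        left, pos = tok, pos + 1
    # the characteristic Pratt loop: consume operators whose
    # precedence is at least min_prec, recursing with prec + 1
    # for left associativity
    while pos < len(tokens) and tokens[pos] in PRECEDENCE:
        op = tokens[pos]
        prec = PRECEDENCE[op]
        if prec < min_prec:
            break
        right, pos = parse(tokens, pos + 1, prec + 1)
        left = (op, left, right)
    return left, pos

tree, _ = parse(list("1+2*3"))
# "*" binds tighter, so tree is ("+", "1", ("*", "2", "3"))
```

Here the loop handles associativity (as noted above) and the table lookup handles precedence; adding an operator is one dictionary entry rather than a new function.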

                                                              1. 2

                                                                The while loop handles associativity, not precedence.

                                                                Correct.

                                                                Here’s an example of a full Java parser from ’97 or so: https://github.com/oracle/coherence/blob/30b2188ddee49718a9496d636b81ce8239b5d62a/prj/coherence-core/src/main/java/com/tangosol/dev/compiler/java/Compiler.java#L1355

                                                                1. 1

                                                                  Not sure, for me passing precedence as a (compile time) parameter feels equivalent to manually “monomorphising” each precedence level to a differently named function.

                                                                  1. 1

                                                                    Right. You’re recursing on the same method, but with a parameter indicating that it has an ever-so-slightly different purpose.

                                                              2. 1

                                                                Now add another 4 precedence levels, prefix, and postfix operators (including function calls) to your LALR(1) form. That’s the bit that always made me sad.

                                                                1. 3

                                                                  Each precedence level is a “descent”, i.e. just another function. I don’t personally find it to be a bad thing 🤔

                                                                  OTOH, I can imagine that other people have other tastes in code layout, and for them, having more small functions might seem ugly. And obviously, when you’re stepping through in a debugger, the 10 levels of precedence you have to step through (i.e. 10 function calls) gets old in a hurry 🤣

                                                              3. 2

                                                                Pure recursive descent is indeed quite inadequate, and many CS students (raises hand) indeed don’t realize how to parse arithmetic expressions manually.

                                                                Is this a hard problem? if you have a grammar say

                                                                <E> ::= <E> <op> <E>
                                                                     | ( <E> )
                                                                     | <D>
                                                                <op> ::= + | - | * | /
                                                                <D> = [0-9]+
                                                                

which produces various parses for 1+2*3 (e.g. [1+2]*3, 1+[2*3], etc), isn’t it sufficient to extract the first parse, and apply shunting yard on the tree to correctly manage precedence? e.g. in any tree that has a bracketed child, flatten the node post-parse, and apply the shunting yard on the children of the node.

                                                                1. 3

This sounds like a shunting yard with extra steps? But you generally don’t want shunting yard, at least not for “student writes their first compiler”, as its imperative formulation with an extra stack is not super intuitive. You want a Pratt parser here (which is a recursive version of shunting yard).

This also matches my knowledge at the end of university: to parse expressions, you either eliminate left recursion, do shunting yard, or use yacc’s precedence annotations. I took 5 or 6 grammar/compiler courses, and none of them taught Pratt, which actually is the first thing one should reach for when parsing expressions.
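For reference, the shunting-yard formulation being contrasted here looks roughly like this in Python (my own sketch; single-character tokens, left-associative binary operators only):

```python
# Classic shunting yard: an explicit operator stack instead of
# recursion. Produces the expression in postfix (RPN) order.
BINARY_PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(tokens):
    output, ops = [], []
    for tok in tokens:
        if tok in BINARY_PREC:
            # pop operators of higher-or-equal precedence
            # (equal, because these ops are left associative)
            while ops and ops[-1] != "(" and BINARY_PREC[ops[-1]] >= BINARY_PREC[tok]:
                output.append(ops.pop())
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                output.append(ops.pop())
            ops.pop()  # discard the "("
        else:
            output.append(tok)  # operand
    while ops:
        output.append(ops.pop())
    return output

# 1+2*3 becomes 1 2 3 * +  (the "*" is applied first)
```

The extra operator stack is exactly the state that a Pratt parser keeps implicitly on the call stack, which is why the two are equivalent but the Pratt version tends to read more naturally.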

                                                                  1. 2

                                                                    You want a Pratt parser here (which is recursive version of shunting yard).

                                                                    What I wanted to illustrate was that you can decompose the problem of parsing languages containing arithmetic expressions into two steps — the first, where you parse the syntactic structure, and second, where you move from a parse tree to an AST.

There are numerous algorithms that let you do parsing. IMO, it is better for the students to learn how to implement precedence in infix expressions separately, without tying it to a specific parsing algorithm.

                                                                    1. 1

                                                                      What I wanted to illustrate was that you can decompose the problem of parsing languages containing arithmetic expressions into two steps — the first, where you parse the syntactic structure,

                                                                      I haven’t seen this approach in practice. Is there some example which shows how it generalizes to complex expression grammars (with unary, binary and ternary operators, and - which is both prefix and binary)?

                                                                      1. 2

                                                                        Not so far; I am a new teacher, developing my course materials now. I will certainly try both, and I should be able to answer which works best for students in a couple of years.

                                                                        1. 1

                                                                          By the way, do you have specific examples of this kind? which are particularly difficult to parse otherwise?

                                                                          1. 3

I don’t know off the top of my head what would be the most difficult for the suggested two-phase parsing, but I think it’s useful to parse basically everything:

                                                                            • + - * / as the basic case
                                                                            • ., function composition, as an example of right associative operator
                                                                            • prefix + - to cover both prefix and ambiguous ops
                                                                            • ! factorial as a basic postfix op
                                                                            • [] postfix array access with an argument.
                                                                            • ?: ternary
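FWIW, a couple of the trickier cases on this list (prefix minus and a postfix op) slot into a Pratt-style loop fairly naturally. A sketch in Python (my own illustration; single-character tokens, and the node names neg/fact are invented here):

```python
# Sketch: Pratt-style parsing with prefix minus and postfix "!".
# Prefix ops are handled where primaries are parsed; a postfix op
# is just an infix op that consumes no right-hand side.
OP_PREC = {"+": 1, "*": 2, "!": 3}

def parse_expr(toks, pos=0, min_prec=1):
    tok = toks[pos]
    if tok == "-":
        # prefix minus: parse its operand at a high binding level,
        # so -2*3 means (neg 2)*3 but -2! still means neg(2!)
        operand, pos = parse_expr(toks, pos + 1, 3)
        left = ("neg", operand)
    else:
        left, pos = tok, pos + 1
    while pos < len(toks) and toks[pos] in OP_PREC:
        op = toks[pos]
        prec = OP_PREC[op]
        if prec < min_prec:
            break
        if op == "!":
            # postfix: no recursive call for a right operand
            left, pos = ("fact", left), pos + 1
        else:
            right, pos = parse_expr(toks, pos + 1, prec + 1)
            left = (op, left, right)
    return left, pos
```

The ternary ?: fits the same loop as a low-precedence “infix” operator whose handler parses two more operands, but that is left out here to keep the sketch short.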
                                                                            1. 2

                                                                              Thank you! much appreciated.

                                                                  2. 2

                                                                    “An unambiguous CFG grammar” is not something we can easily recognize.

Offhand I believe I’ve read that telling whether an arbitrary CFG is ambiguous or not is undecidable, on par with the halting problem.

                                                                    1. 1

                                                                      Any chance of a reference? I am curious how this proof was formulated.

                                                                      1. 3

I think it’s a corollary of the Post Correspondence Problem. Roughly: a PCP instance with word lists α, β has a solution iff the grammar

                                                                        S → A | B
                                                                        A → (x ∈ α)A | ε
                                                                        B → (y ∈ β)B | ε
                                                                        

                                                                        is ambiguous.

                                                                        1. 1

                                                                          Thank you!, I will look it up.

                                                                          1. 1

                                                                            Well-stated! I think that you’re right. I was going to mention some folklore, but I see that Wikipedia now has a list of undecidable facts about context-free grammars, complete with bluelinks, and I’ll just link to that section.

                                                                      2. 2

There’s a point in the design space I don’t think has been well explored: declarative Pratt parsing with some extensions like unary/binary and multifix. Simple implementation, linear time parsing, really good error messages. Probably not quite powerful enough to parse big existing languages but I haven’t tested enough to find out. The parse tree it produces is a bit further away from ASTs than what a CFG parser would produce. I implemented a parser generator in this space:

                                                                        https://github.com/justinpombrio/panfix

                                                                        Example grammar: https://github.com/justinpombrio/panfix/blob/master/tests/minilang.rs

                                                                        1. 1

It is a very lax parser, and will happily generate a parse tree in the presence of a variety of errors

Niiiiiice! https://github.com/JetBrains/Grammar-Kit is the most production-ready tool in this space that I know of

                                                                      1. 8

                                                                        I agree with 90%-95% of the article. I would phrase it more pithily as:

                                                                        1. Use recursive descent for recognizing existing languages, where you have a good test corpus. (And very good point about testing INVALID inputs too!)

                                                                        2. Use grammars to design new languages

                                                                        I follow this with https://www.oilshell.org/ – the compatible OSH is a huge recursive descent parser, and Oil is designed with a grammar.

                                                                        Guy Steele and Russ Cox also mentioned this methodology – design the language with a grammar, and then implement it a second time by hand. Then you REALLY know it! :)

                                                                        It is definitely possible to not know the language you’re implementing / creating!


                                                                        As an undergrad, I never tried to design a language, or even thought about it honestly. But I agree that undergrads should know of both approaches, without necessarily getting into the details of CFGs.

                                                                        It would be a mistake to just teach recursive descent.

                                                                        1. 3

                                                                          If I recall correctly, Haml didn’t actually build an AST - it just translated the syntax back to strings. And if you had a sub-template, you could call it (regardless of whether it was Haml or Erb) and get the results into the Haml via a string. So just having something that looks like it’s an AST doesn’t mean it actually has the security benefits of one.

                                                                          Note that nowadays most “string” solutions automatically escape their template-injected strings unless you explicitly ask it not to, so the problem isn’t as big as it used to be (although there’s still the problem of contextual placement, as the article points out). Nevertheless, I totally agree that a structural approach where you actually represent the tree you’re trying to write out as a string is the proper solution.

                                                                          1. 2

                                                                            Yeah I agree that the issue is more complex. On closer inspection, the article doesn’t actually justify what the title claims.

                                                                            i.e. It’s not clear to me that AST approaches are actually more secure than string-based ones in practice.

                                                                            https://lobste.rs/s/j4ajfo/producing_html_using_string_templates#c_smlpfm

                                                                            The comparison seems to be between real deployed string approaches, vs. mostly hypothetical AST approaches.

                                                                            AFAICT, the real AST-based approaches ALSO have the contextual escaping problem. Interested in counterexamples.

                                                                            1. 3

                                                                              AFAICT, the real AST-based approaches ALSO have the contextual escaping problem. Interested in counterexamples.

                                                                              If you treat URLs as proper objects, this is not a problem.

For example, let’s say we want to construct a URI based on a base URI with two params (for the sake of example they’re hardcoded here):

                                                                              (let ((link-url (update-uri base-url query: '((param-a "this is" param-b "some&special stuff"))))
                                                                                     (link-text "some&other special < stuff"))
                                                                                `(a (@ (href ,(uri->string link-url))) ,link-text))
                                                                              

                                                                              This then encodes to:

                                                                              <a href="http://www.example.com/some/page?param-a=this%20is&amp;param-b=some%26special%20stuff">some&amp;other special &lt; stuff</a>
                                                                              

                                                                              The uri library is responsible for correctly url-encoding query parameters and path components, while the SXML library is responsible merely for correctly HTML/XML-encoding the attribute and text nodes. I think this is as it should be, the XML stuff shouldn’t even be aware of URI-encoding.
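The same layering can be sketched in Python with just the standard library (note that `urlencode` uses `+` for spaces rather than `%20`, but the division of labor is identical):

```python
import html
from urllib.parse import urlencode, urlsplit, urlunsplit

def update_uri(base, query_params):
    # URI layer: responsible only for percent-encoding the query string.
    parts = urlsplit(base)
    return urlunsplit(parts._replace(query=urlencode(query_params)))

def attr(value):
    # Markup layer: responsible only for HTML/XML attribute escaping.
    return html.escape(value, quote=True)

link_url = update_uri("http://www.example.com/some/page",
                      {"param-a": "this is", "param-b": "some&special stuff"})
link_text = "some&other special < stuff"
print('<a href="%s">%s</a>' % (attr(link_url), html.escape(link_text)))
```

The `&` in the parameter value is percent-encoded by the URI layer, and the `&` joining the query parameters is entity-encoded by the markup layer; neither layer needs to know about the other.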

                                                                              NOTE: JavaScript in HTML is a different kettle of fish, unfortunately it doesn’t compose very well - AFAIK, JavaScript cannot be XML-encoded, so for example the following would be the logical thing to do when encoding JS inside XML and hopefully also HTML:

                                                                              <script type="text/javascript">
                                                                              if (10 &gt; 2) {
                                                                                window.alert("News flash: 10 is bigger than 2");
                                                                              }
                                                                              </script>
                                                                              

                                                                              Unfortunately, this is a syntax error. <script> blocks behave more like cdata sections than regular tags. This is no doubt for ergonomic reasons, but it also means that any script handling needs to be special cased. In SXML you could typically do that with a preorder rule that shortcuts the regular encoding. But it’s ugly no matter how you slice it.
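One common workaround (my illustration, not something claimed in the comment above) is to special-case script blocks by embedding only data, JSON-encoded with the troublesome characters rewritten as `\uXXXX` escapes so the literal sequence `</script>` can never appear inside the block:

```python
import json

def script_safe_json(value):
    # json.dumps output is (almost) valid JavaScript; additionally rewrite
    # the characters that could close the <script> element or start a
    # comment, so untrusted data cannot break out of the block.
    return (json.dumps(value)
            .replace("<", "\\u003c")
            .replace(">", "\\u003e")
            .replace("&", "\\u0026"))

payload = {"msg": "</script><script>alert(1)</script>"}
print("<script>var data = %s;</script>" % script_safe_json(payload))
```

The `\u003c` escapes decode back to `<` inside the JavaScript engine, so the data round-trips exactly while the markup parser never sees a closing tag.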

                                                                              1. 2

                                                                                Right, good examples. I’m not saying the problem can’t be solved in principle – just saying that I don’t see a lot of practical and robust solutions for it. That is, what are the solutions in HAML and XHTML approaches for embedded languages and auto escaping (JS, URIs, URIs in URIs), and are they deployed in real apps?

It does seem like the Lisp solutions have the strongest claim, but I think they are mainly used for “simple” sites like Hacker News (which is good IMO – I like simple sites).

                                                                                But I’d say it’s a little like comparing Python and Standard ML … Standard ML is nice and clean in its domain, but not really a fair comparison, because Python has to deal with a wider range of problems, because it’s used 1000x more

                                                                                1. 2

                                                                                  That is, what are the solutions in HAML and XHTML approaches for embedded languages and auto escaping (JS, URIs, URIs in URIs), and are they deployed in real apps?

                                                                                  The idea would work just fine in any other language. For example, if you’re using an embedded DSL like haml, you could do exactly the same thing:

                                                                                  - some_url = url_for(action: 'index', params: {param_a: 'this is', param_b: 'some&special stuff'})
                                                                                  %a{href: some_url.to_s} Some text
                                                                                  

                                                                                  Unfortunately, this requires Rails. AFAICT, the standard Ruby URI library doesn’t have a clean way to add parameters from a hash to a query string (with auto-escaping, no less), so the example is not the best.

                                                                                  BTW, if you have automatic string coercion methods you wouldn’t even have to call to_s on the result. I can’t remember if Haml does that.

                                                                                2. 1

                                                                                  One example of a system for safely building URLs programmatically is Ruby on Rails’s feature of path and URL helpers. However, that only helps with building URLs internal to the app.

                                                                                  The way it works is that the most common ways of defining a route also define methods that can build a URL to that route. For example, if you create a route /comments/:id, then in your view (whether it’s ERB, Haml, or another templating system), you can call the Ruby method comment_path(123) to get /comments/123 or comment_url(123) to get http://example.com/comments/123.
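The pattern isn’t Rails-specific; this toy Python sketch (hypothetical API, not Rails’s implementation) shows the core idea of deriving URL builders from route definitions, with per-segment escaping built in:

```python
from urllib.parse import quote

class Routes:
    """Toy sketch of Rails-style named route helpers."""
    def __init__(self, host):
        self.host = host
        self.patterns = {}

    def add(self, name, pattern):
        # e.g. add("comment", "/comments/:id")
        self.patterns[name] = pattern

    def path(self, name, **params):
        path = self.patterns[name]
        for key, value in params.items():
            # Each dynamic segment is percent-encoded as it is substituted.
            path = path.replace(":" + key, quote(str(value), safe=""))
        return path

    def url(self, name, **params):
        return self.host + self.path(name, **params)

routes = Routes("http://example.com")
routes.add("comment", "/comments/:id")
print(routes.path("comment", id=123))  # → /comments/123
print(routes.url("comment", id=123))   # → http://example.com/comments/123
```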

                                                                                  1. 1

                                                                                    I always really liked the syntax for constructing internal URLs. It’s too bad they didn’t build something equally ergonomic for constructing external URLs (AFAIK).

                                                                            1. 4

                                                                              Nice article. Note that the recent matchertext draft paper proposes a solution to the problem that plagues both approaches:

                                                                              The only real limitation of the AST-based approach is that some strings in an AST may embed other languages. This is the same circumstance where the previously mentioned context-sensing autoescaping approaches can easily fall down.

                                                                              https://lobste.rs/s/9ttq0x/matchertext_escape_route_from_language

                                                                              Admittedly it has a slim chance of being adopted! But most commenters didn’t seem to get the idea, probably because you had to follow the blog post into a paper to see the details.

                                                                              https://bford.info/pub/lang/matchertext/

                                                                              The matchertext paper specifically addresses:

                                                                              • URL in HTML (which the blog post does too)
                                                                              • URL in URL

                                                                              and a few others.


                                                                              Good link to the Go work at the end:

                                                                              https://rawgit.com/mikesamuel/sanitized-jquery-templates/trunk/safetemplate.html#problem_definition

                                                                              I would be very interested in some further analysis of where Go’s html/template falls down.

                                                                              It defines “context” as

                                                                              A parser state in the combined HTML, CSS, and JavaScript grammar used to determine the stack of sanitization routines that need to be applied to any untrusted data interpolated at that point to preserve the security properties outlined here.

                                                                              So presumably it does not handle autoescaping of URLs in HTML? Since it doesn’t mention that language.

                                                                              I know that work is over 10 years old, and Go is very popular. So I’m surprised with all the Go rants that I haven’t seen people poke at its limitations?

                                                                              Is it just better than everything else and that’s what we should use?

                                                                              Why hasn’t it been adopted in other languages?

                                                                              If it works reliably, then it seems better than the AST approach, which only solves “one level” of the problem as mentioned.

The other solution is to try to get rid of escaping entirely by adjusting all the text formats, like matchertext, although that is probably only deployable in the very long term. It’s not a near-term solution.

                                                                              1. 4

                                                                                The Go approach is being revived as a HTML/JS spec draft called Trusted Types. Really fascinating stuff (both the original Safe Types from Mike Samuel et al as well as the new TT). Chrome already supports that draft spec and it seems to work for Google web properties. However, they didn’t manage to convince other browsers yet.

Meanwhile, I believe an HTML sanitizer might also help a lot. So I’m working on specifying that (prototype implementations exist in Chrome and Firefox).

                                                                                1. 3

                                                                                  OK interesting! I’ll have to look up Trusted Types.

                                                                                  I remember ~10 years ago when Mike Samuel was going around adding auto-escaping to all of Google’s template engines – there were at least 3 or 4 in common use, and most of them weren’t open source. And then he added it to Go’s templates, which was new at the time.

                                                                                  But then I feel like I never heard about it again? Were there just lots of silent happy users?

                                                                                  It always interested me, though I guess I don’t know enough about the specifications of HTML, CSS, and JS to really judge it (and I’m not a Go user, so I didn’t get experience with it.) I think my issue is that it seemed to be “brute force of special cases”, though I would want to look at the code to understand it more.

                                                                                  I just read over the original paper again, and this part jumped out …

                                                                                  The last of the security properties that any auto-sanitization scheme should preserve is the property of least surprise. The authors do not know how to formalize this property.

                                                                                  Developer intuition is important. A developer (or code reviewer) familiar with HTML, CSS, and JavaScript; who knows that auto-sanitization is happening should be able to look at a template and correctly infer what happens to dynamic values without having to read a complex specification document.

                                                                                  I think my issue is that whenever I have more than 1 or 2 levels of escaping, I try to break it up into dynamic composition, not textual substitution. So I manually avoid complex escaping, but not everyone does.

                                                                                  I have been meaning to write a blog post about that, with examples in shell.

                                                                                  Still it’s an interesting problem that needs to be solved …

                                                                              1. 6

                                                                                I know Rust has many advantages over C++, but I’m already worried about the compile times…

                                                                                1. 17

                                                                                  chromium has one of the longest compile times of any open source project, I’m not sure rust would make it substantially worse.

                                                                                  1. 6

                                                                                    As a Gentoo user I regularly witness hours-long chromium compile times and know the pain all too well, even when it’s running in the background. Isn’t it scary to think that we might reach new extremes in chromium compile times now?

                                                                                    1. 7

From what I’ve heard, this might not be a relevant metric for the Chromium project? AFAIU most of the contributors compile “in the cloud” and don’t care about downstream compilation :/

                                                                                      1. 1

                                                                                        I have first-hand experience with this. My compiling machine was a jaw-droppingly beefy workstation which sat at my desk. A cached compilation cycle could take over an hour if it integrated all of Chromium OS, but there were ways to only test the units that I was working on. Maybe this is a disappointing answer – Google threw money at the problem, and we relied upon Conway’s Law to isolate our individual contributions.

                                                                                    2. 4

Chromium’s build tool Goma supports distributed remote builds. So as long as you have a server farm to support Goma, the build is actually pretty fast.

                                                                                      Similarly, at Mozilla they used distcc+ccache with Rust. So the compilation has always been distributed to the data center instead of running locally.

                                                                                      1. 12

Either you own your computers or you do not. If I need a datacenter/cluster to make software compilation at least half-bearable, the problem is software complexity/compilation speed and not too little computational power. And even in the cloud it takes many minutes, also for incremental builds.

                                                                                        The first step to solving a problem is admitting there is one.

                                                                                        1. 37

                                                                                          The reason Chrome exists is to allow a multinational advertising company to run their software on your computer more efficiently. Slow compile times are not a problem they experience and they are not a problem they will address for you.

                                                                                          1. 1

I agree that the software complexity has grown. I don’t necessarily think of it as a problem though. Chromium to me is the new OS of the web. You build apps running on top of this OS and there are complex security, performance, and multi-tenancy concerns.

                                                                                            IMO, modern software has gotten complicated but that simply follows a natural growth as technologies mature and progress over time. You could solve this “problem” by inventing better tools, and better abstractions… that help you navigate the complexity better. But I don’t think reducing the complexity is always possible.

                                                                                            1. 7

                                                                                              This is where I fundamentally disagree. There are many good examples which tell that complexity is often unnecessary. Over time, software tends to build up layers over layers, often completely for no reason other than historic growth and cruft.

The system will not collapse, I think, but fall into disrepair. Currently, there is enough money in companies to pay armies of developers to keep this going, but now that we are already deep in a recession and going into a depression, it might be that the workload to feed these behemoths exceeds the available manpower. This might then motivate companies to give more weight to simplicity, as it directly affects the bottom line and competitiveness.

                                                                                              1. 2

                                                                                                Systems tend to collapse, replacing complex mechanisms with simpler equivalents. This used to be called “systems collapse theory” but apparently is now called collapsology. For example, we are seeing an ongoing migration away from C, C++, and other memory-unsafe languages; the complexity of manual memory safety is collapsing and being replaced with automatic memory management techniques.

                                                                                                1. 1

                                                                                                  Thanks for the pointer (pun intended!) at collapsology!

                                                                                        2. 2

                                                                                          This is a bit like saying “well the browser already has so many features: versions of HTML and CSS to support, Bluetooth, etc. – could adding more make it substantially worse?”

                                                                                          Yes, it could – there’s no upper bound on compile times, just like there’s no upper bound on “reckless” features.

                                                                                          That said, I only use Chrome as a backup browser, so meh

                                                                                        3. 4

                                                                                          Rust compile times are really good now. It’s not C fast but a lot better than C++. At this point it’s a non-issue for me.

                                                                                          (YMMV, depends on dependencies, etc etc)

                                                                                          1. 3

                                                                                            Is there any evidence to support the claim that replacing C++ with Rust code would substantially slow down compile times? As someone who writes C++ code every day and also has done some Rust projects that see daily use at work, I really don’t see much of a difference in terms of compile time. It’s slow for both languages.

                                                                                            1. 6

In the context of Gentoo, you now need to compile both clang and rustc; this is probably a substantial increase.

                                                                                              1. 3

                                                                                                It appears to be about the same, perhaps slightly worse.

                                                                                              2. 2

AFAIR a clean build in Rust and C++ has similar performance metrics.

                                                                                                1. 2

                                                                                                  Builds would be way quicker if they just ported everything from C++ to C.

                                                                                                  Memory safety problems would be way worse, but this thread seems unconcerned with that question.

                                                                                                1. 1

                                                                                                  It will allow recursive JSON-like data structures in Oil.

                                                                                                  This is an odd justification for a GC, because if you only want JSON-like things (i.e. trees) you can represent this with reference counting. Non-atomic reference counting (which is all that you need since shells are single threaded) can handle this. Reference counting with autorelease pools[1] is a fairly nice programming model for single-threaded systems and has some very nice properties with respect to deterministic memory consumption.

                                                                                                  [1] I’ve not seen these anywhere other than OpenStep / Cocoa, but they basically give you an operation that is ‘don’t decref this now, but do it later’ so the object remains valid until the end of the autorelease scope. This is really nice for interop with C things, because you can have temporary pointers floating around and as long as they’re gone by the end of the scope of the autorelease pool (which may be a few stack frames up), they remain valid. This addresses the f(g(), h()) for cases where f wants to take arguments that are not smart pointers.
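A single-threaded sketch of the autorelease idea in Python, with explicit refcount fields standing in for real memory management (illustrative only, not Cocoa’s actual machinery):

```python
_pools = []  # stack of active autorelease pools (single-threaded)

class Object:
    def __init__(self):
        self.refcount = 1
        self.freed = False

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # stand-in for actually freeing the memory

class AutoreleasePool:
    # 'with AutoreleasePool():' defers releases to the end of the scope,
    # so temporaries stay valid until the pool (possibly a few stack
    # frames up) is drained.
    def __enter__(self):
        _pools.append([])
        return self

    def __exit__(self, *exc):
        for obj in _pools.pop():
            obj.release()

def autorelease(obj):
    _pools[-1].append(obj)  # "don't decref this now, but do it later"
    return obj

def g():
    return autorelease(Object())  # caller gets a plain, valid reference

with AutoreleasePool():
    tmp = g()
    assert not tmp.freed  # still valid for the rest of the pool's extent
assert tmp.freed          # reclaimed deterministically at pool exit
```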

                                                                                                  1. 3

                                                                                                    That’s occurred to me, but how do you limit your structures to trees when the mutator can do arbitrary computation?

                                                                                                    You would need some kind of cycle detection algorithm, which means you need basically all the metadata that a tracing GC has

                                                                                                    The problems are roughly equivalent, and I think there’s a fundamental reason that essentially no language has trees but not cyclic graphs


                                                                                                    Also note the other justification is memory safety at the C++ metalanguage level. The C++ code does have a few cycles, with the mutually recursive evaluators. Right now, it would probably be OK if they leak, but that could change and I’d rather not have that in the back of my mind :)

                                                                                                    1. 4

                                                                                                      That’s occurred to me, but how do you limit your structures to trees when the mutator can do arbitrary computation?

                                                                                                      It depends on whether you have aliasing in the language. If you have a JSON-like data model, then you can’t express cycles because you have no way of naming an object that is contained in another object, everything is an explicit path.

                                                                                                      You would need some kind of cycle detection algorithm, which means you need basically all the metadata that a tracing GC has

                                                                                                      Not quite. You remove the need to track roots. Whenever an object’s reference count is decremented and the object is not deleted, you colour it in the header to mark it as possible garbage and add it to a queue. When an object’s reference count is incremented, you colour it to mark it as not-garbage. When the queue is full, you scan it and, for any objects that are possibly garbage, you do a graph walk and decrement reference counts of every object that you find but don’t delete them. If the object that you started with ends with a reference count of 0, then it is garbage and you free it and decref all objects reachable from it. If there are external references, the reference count will not be 0, but you don’t care where they are: they could be on the stack, in globals, in C memory, or anywhere else that you can stash roots.
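A toy sketch of that trial-deletion idea (in the spirit of Bacon-style cycle collectors; explicit refcount fields, not a real allocator): if every reference into the suspect’s subgraph comes from inside that subgraph, the whole thing is garbage, and no root set was needed to decide it.

```python
class Node:
    def __init__(self):
        self.refcount = 0
        self.children = []

def link(parent, child):
    parent.children.append(child)
    child.refcount += 1

def is_garbage_cycle(candidate):
    # Collect the subgraph reachable from the suspect object.
    nodes, stack, seen = [], [candidate], set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        nodes.append(node)
        stack.extend(node.children)
    # Count the references contributed by edges internal to the subgraph.
    internal = {}
    for node in nodes:
        for child in node.children:
            internal[id(child)] = internal.get(id(child), 0) + 1
    # If every node's refcount is fully explained by internal edges, no
    # external reference exists anywhere (stack, globals, C memory, ...)
    # and the whole subgraph is garbage.
    return all(n.refcount == internal.get(id(n), 0) for n in nodes)

a, b = Node(), Node()
link(a, b)
link(b, a)                  # a <-> b cycle with no external references
print(is_garbage_cycle(a))  # → True

root = Node()
link(root, b)               # now something outside points at b
print(is_garbage_cycle(a))  # → False: the cycle is still live
```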

                                                                                                      If your data structures are mostly acyclic then this is very efficient. I’m curious how it would be for your use case.

                                                                                                      There are also incremental variants of this scheme that rely on noting when you incref an object that is being scanned and causing the cycle detector to backtrack. There’s also a fully concurrent version developed by David F. Bacon, but I can only keep how it works in my head for about 20 minutes after I read the paper so I’m not even going to try to explain it here.

                                                                                                      The problems are roughly equivalent, and I think there’s a fundamental reason that essentially no language has trees but not cyclic graphs

                                                                                                      Erlang springs to mind and it’s far from the only one. So does Rust, unless you use unsafe (explicitly or via a crate that uses unsafe internally).

                                                                                                      1. 3

Well remember in this case the language is C++, which has aliasing and cycles. The collector works on all C++ objects in the program, not just Oil objects. It’s basically the equivalent of having written the shell in Go and used its GC.

One downside I considered of ref counting is that it has the “false write” problem, which I took pains to avoid with the separate mark bitmap. That is, the headers are immutable in our design to promote page sharing with fork().

                                                                                                        The previous post has links about that: Ruby’s bitmap marking and a post about Instagram/Python/ref counting:

                                                                                                        https://www.oilshell.org/blog/2022/10/garbage-collector.html

That doesn’t necessarily mean ref counting is worse … but yeah at this point I’m happy to be mostly past the design phase and more into optimization. As mentioned it’s still spec-driven, so in theory it can be changed at any point, probably when a lot more people are using it and there are 10x demands in terms of heap size / perf.

                                                                                                        The only thing that would prevent changes is if we expose some sort of shared library native interface (which bash actually has). But we don’t have that yet

                                                                                                    2. 3

                                                                                                      Also I should repeat the point of the PyPy quote – any memory management scheme can be plugged into Oil

                                                                                                      It’s a spec-driven project, which takes longer, but it means it’s decoupled from implementation details

It would actually be a great project for someone who wants to compare moving vs. non-moving, ref counting, etc. on a real workload

                                                                                                      All the benchmarks and tests are already set up, and run in the CI on every commit. The codebase is tiny, with fast incremental builds.

                                                                                                      The Cheney collector still builds – the main difference is that with Cheney you have to register all copies of a root, not just one copy, so they can be moved in addition to being traced. That’s in our “back pocket”, but I don’t think we’ll need it for the “main project”

                                                                                                      Ref counting would be interesting as well

                                                                                                      https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start

                                                                                                    1. 16

                                                                                                      As a markup language, the main problem is that it is not friendly for human editing. It’s not robust against common typos and it’s verbose enough that they are easy to make. That’s less annoying if you use an XML editor but I’ve never found a good one.

The main complaints about XML have nothing to do with this, though; they’re about the complexity: XML entities alone are a nightmare to parse safely. XML namespaces are nice in principle (being able to say ‘now I am switching to another XML dialect until the closing tag’ is great) but the renaming magic that they allow is terrifying.

                                                                                                      The big problem is that it tries to be a rich generic document language, a text markup language, and a structured object model all at once and when these goals invariably come into conflict it adds complexity to address the differences. This complexity, in turn, means that almost nothing can parse XML. I wrote a parser for the tiny subset that XMPP needs but a complete parser is huge and complicated. This means that interoperability between things that actually use more than a tiny subset of XML features is more or less nonexistent.

                                                                                                      1. 6

                                                                                                        It’s not robust against common typos

                                                                                                        Isn’t it? Every opening tag has a matching closing tag. If you forget a </p>, your linter can tell you right away. And (devil’s advocate) the rules are simple, unlike Common Mark.
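For instance, Python’s stdlib XML parser rejects a forgotten close tag immediately, with a line and column:

```python
import xml.etree.ElementTree as ET

try:
    ET.fromstring("<ul><li>one</ul>")   # forgot </li>
except ET.ParseError as err:
    print("rejected:", err)             # reports a mismatched tag with line/column
```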

                                                                                                        1. 3

We know from prior experience that humans and XML are not compatible. There are just too many “XML” standards where humans would write stuff and it would be invalid, even things like RSS that were ostensibly XML from day 1.

The problem is that XML by humans means requiring a human to be as flawless as an XML parser, which honestly places unreasonable expectations on both :)

                                                                                                        2. 5

                                                                                                          And as a structured object model it is poor because it can’t include bytestrings, which are important for a lot of use cases. (Except by base64 or similar embeddings, which inflate the data and take time to parse.)
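For a quick sense of the cost (Python stdlib): base64 turns every 3 bytes into 4, roughly 33% inflation, before any parse time is spent.

```python
import base64

raw = bytes(range(256)) * 4        # 1024 bytes of arbitrary binary data
enc = base64.b64encode(raw)
print(len(raw), len(enc))          # 1024 1368: ~33% larger
```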

                                                                                                          1. 3

                                                                                                            It feels like some slightly stricter subset of HTML5 would work OK as a “modern XML” ?

                                                                                                            Ironically HTML5 has the opposite problem – any typo is valid, and the browser guesses an often wrong interpretation of your document. I’m thinking about what happens when you mess up the balancing of <table> <tr> <td> – you often get something weird.

                                                                                                            CSS selectors are also a pretty nice complement to HTML that are widely used and understood, and I’m not sure what the XML equivalent is, but it’s probably not better.

                                                                                                            1. 3

                                                                                                              XML has XPath, which is harder to use than query selectors, IMO.
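For example, Python’s stdlib ElementTree speaks a small XPath subset (the document here is made up); the CSS-selector equivalent of this query would be roughly `li.a`:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<ul><li class="a">one</li><li>two</li></ul>')
# ElementTree supports a limited XPath dialect, e.g. attribute predicates:
hits = doc.findall('.//li[@class="a"]')
print([e.text for e in hits])   # ['one']
```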

                                                                                                              1. 1

                                                                                                                Which also comes with the “oh, this document uses namespaces… I guess I’ll google how to handle those again”. But apart from that, at least xpath has more features like partially matching the text inside a node.

                                                                                                              2. 2

                                                                                                                Ironically HTML5 has the opposite problem – any typo is valid, and the browser guesses an often wrong interpretation of your document

                                                                                                                So this is not true. A big part of the html5 process and the related ES3.1/5 process was fixing the horrors of the 90s and 00s. I recall us having to spend a lot of time working out how html was actually parsed and the DOM trees built to ensure we could actually define things. The end result of all of this is not necessarily the nicest language specification (in an academic grammar purity sense), but there is no guessing involved. The parsing of everything is exactly specified today, the result of typos, mismatched tags, etc is all well defined.

                                                                                                                It was the same in JS the “specifications” that the 90s and 00s provided were shitty: they were either incomplete or ambiguous, or represented what some spec writer wanted rather than what happened (presumably as they were all retroactively specified by people who knew what they wanted to be the case, and thought telling people that their existing content was now wrong was somehow a good idea)

                                                                                                                1. 3

                                                                                                                  Right that’s actually what I meant. As far as I know there are no syntax errors in HTML5, and every byte sequence is valid.

                                                                                                                  Except it’s probably hard for a human to predict what document many of those byte sequences produce, e.g. things like

                                                                                                                  <table> a <tr> b <td> c </td> d <table> </td> e <tr> f </td>  g
                                                                                                                  

                                                                                                                  and so forth

                                                                                                                  I think that decision to define everything (I guess based on exhaustive testing of dominant implementations) was probably right given the state of the web …

                                                                                                                  But it does mean that it’s harder to write HTML by hand, because you don’t have any feedback about mistakes in the doc. In practice what people do is just fix the ones that “make it look weird”.

                                                                                                                  1. 1

                                                                                                                    It’s not that loose in practice. There’s the w3c validator https://validator.w3.org/nu/#textarea and for some things it does complain a lot. For example for your fragment:

                                                                                                                    Error: Misplaced non-space characters inside a table.

                                                                                                                    Fatal Error: Cannot recover after last error. Any further errors will be ignored.

                                                                                                                    1. 1

                                                                                                                      In practice, I believe validator use is low to nonexistent

                                                                                                                      I tried to use them a few years ago for my website, and they were in very poor shape

                                                                                                                      Not to mention that W3C itself is unfortunately a lagging indicator these days, of WHATWG and HTML5

                                                                                                            1. 8

                                                                                                              I also really like XML! It has two more things that make it really good for text markup:

                                                                                                              1. It preserves whitespace, so you can embed code and other whitespace-sensitive stuff in it
                                                                                                              2. You can nest tags inside content, <a>lik<b>e thi</b>s</a>.

                                                                                                              I used both features to cut a week off the delivery time of learntla, in a way that I couldn’t have done with JSON, YAML, or even rST.
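In Python’s stdlib ElementTree, for instance, that mixed content shows up as `.text` on the element and `.tail` on the nested child:

```python
import xml.etree.ElementTree as ET

el = ET.fromstring('<a>lik<b>e thi</b>s</a>')
print(el.text)           # 'lik'   -- text before the nested <b>
print(el[0].text)        # 'e thi' -- text inside <b>
print(el[0].tail)        # 's'     -- text after </b>, still inside <a>
```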

                                                                                                              1. 1

                                                                                                                Could you have done this with HTML?

                                                                                                                I think HTML has all the nice properties of XML, without a lot of weird stuff like XML namespaces

                                                                                                                The downside of HTML is that any typo still makes a valid doc. The browser often guesses what you mean in a weird way (i.e. non-balanced tags)

                                                                                                                It feels like HTML is too liberal, and XML is too strict, although I only really use HTML.

                                                                                                                What libraries did you use?

                                                                                                                1. 4

                                                                                                                  The downside of HTML is that any typo still makes a valid doc. The browser often guesses what you mean in a weird way (i.e. non-balanced tags)

                                                                                                                  That’s a pretty big downside. Also, HTML is now defined by a “living” standard put forth by the browser cartel, AKA WHATWG. In other words, HTML is what the big browser makers say it is. That’s a poor foundation on which to build.

All of that quirkiness in HTML, like optional closing tags and so forth, is a relic of bygone days when most HTML was written by hand. It made the format friendlier to human authors. I really don’t want to ever write XML or HTML by hand in the general case. Doing so is the equivalent of manually speaking SMTP to a mail server. Sometimes I do manually speak SMTP to mail servers for debugging and such, but for the most part, that’s my MUA’s job, not mine.

                                                                                                                  1. 3

                                                                                                                    That’s a pretty big downside.

                                                                                                                    It does however have the benefit of matching what actually happens on the web. You can argue it’s bad, but the alternative is XML, and xhtml, which have repeatedly failed due to the strictness requirements vs human editing.

                                                                                                                    Also, HTML is now defined by a “living” standard put forth by the browser cartel,

Well, that’s nonsense. Anyone can make proposals and contribute to the various HTML specs, which pre-HTML5/WHATWG they could not.

                                                                                                                    In other words, HTML is what the big browser makers say it is.

                                                                                                                    I mean yes: if the browsers don’t implement something, then the fact that it’s in a “spec” is irrelevant. If the browser do implement something, then you want it to be specified so you don’t recreate the pre-whatwg version of html, where the spec does not match reality and cannot be used to write a browser.

                                                                                                                    That’s a poor foundation on which to build.

I’m sorry, this is complete BS. Before the WHATWG and the “browser cabal”, what we had was a bunch of specifications that were incomplete, ambiguous, and oftentimes just outright incorrect. The entire reason the WHATWG came to exist is that groups like W3C, ECMA, etc. would define what they wanted things to be, and not what they actually were. You want the spec to be “living”, because what exactly do you think the alternative is? Again, we’ve had static “versioned” specs, and they were bad.

I get that many people have started their careers in the last decade or so, and I think that means they are unaware of what the pre-WHATWG world was like, and the costs that came with it. Similarly, there are strange beliefs about how these standards bodies actually operate, which can only come from never actually interacting with them (and certainly from never trying to participate in the pre-WHATWG bodies, because you generally needed to be an invited expert or a paying corporate member, which seems less open than the apparently evil cabal).

                                                                                                                  2. 3

                                                                                                                    HTML isn’t extensible. It’s an object model customized to displaying content in a browser. You could technically replace all markup usages of XML with HTML but it gets ugly fast

                                                                                                                    • docx, pdf — uses XML to create a different object model that’s meant for being printed
                                                                                                                    • translating a work into lots of languages
                                                                                                                    • annotating music or lyrics

Markup isn’t necessarily for display; it’s adding data to text

                                                                                                                    1. 2

                                                                                                                      What’s not extensible about HTML?

                                                                                                                      That was the idea behind microformats

                                                                                                                      https://developer.mozilla.org/en-US/docs/Web/HTML/microformats

                                                                                                                      I guess I’m mainly talking about the syntax of HTML vs. the syntax of XML.

I know XML has DOM and SAX APIs. Let’s leave out DOM altogether

If you just consider a SAX API (event-driven), then I don’t see a huge difference between XML and HTML, besides syntax (and all the optional elaborations that are only used by certain apps?)


                                                                                                                      I’m asking what libraries people use because maybe XML has a bunch of great libraries that make things easier than using HTML, but so far I am not aware of the benefit
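To illustrate with Python’s stdlib (caveat: `html.parser` is not a full spec-compliant HTML5 tree builder, but its callback interface is SAX-like): both parsers deliver essentially the same start/text/end event stream for simple input.

```python
from html.parser import HTMLParser
import xml.sax

# Collect (event, value) pairs from each parser to compare their shape.
class HtmlEvents(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs): self.events.append(("start", tag))
    def handle_data(self, data):           self.events.append(("text", data))
    def handle_endtag(self, tag):          self.events.append(("end", tag))

class XmlEvents(xml.sax.ContentHandler):
    def __init__(self):
        self.events = []
    def startElement(self, name, attrs): self.events.append(("start", name))
    def characters(self, content):       self.events.append(("text", content))
    def endElement(self, name):          self.events.append(("end", name))

def merged(events):
    # Parsers may split text nodes into multiple callbacks; merge adjacent ones.
    out = []
    for kind, val in events:
        if kind == "text" and out and out[-1][0] == "text":
            out[-1] = ("text", out[-1][1] + val)
        else:
            out.append((kind, val))
    return out

h = HtmlEvents(); h.feed("<p>hi <b>there</b></p>")
x = XmlEvents();  xml.sax.parseString(b"<p>hi <b>there</b></p>", x)
print(merged(h.events) == merged(x.events))   # True for this input
```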

                                                                                                                      1. 3

                                                                                                                        Isn’t HTML specifically about using a specific set of tags? Like if you are writing HTML with a bunch of custom tag names you’re actually writing XML and not HTML?

                                                                                                                        1. 1

                                                                                                                          I think that was the theory with XHTML, but it never happened in practice. XHTML meant that HTML was just XML with specific tags.

                                                                                                                          But we don’t use XHTML anymore – we use HTML5, which is more compatible with HTML 4, HTML 3, etc.

                                                                                                                          All browsers have always ignored HTML tags they don’t understand, because otherwise HTML could never be upgraded … e.g. I just tested this and it just shows “foo”. Maybe it makes it behave like <span> or something.

                                                                                                                          echo '<mytag>foo</mytag>' > _tmp/i.html
                                                                                                                          
                                                                                                                          1. 1

                                                                                                                            Fun fact, if you put a hyphen in a tag name in html, it’s specifically not a spec tag, so you can create your own formatting around custom tags instead of classes

                                                                                                                            1. 1

That’s not quite correct. All tags are explicitly valid; what changes is whether a tag has

• Additional non-display semantics (e.g. …)

• Builtin non-default styling - e.g. … (lol!), and other tags that predate the ability to style/layout things that aren’t explicitly built in to the browser

But fundamentally the tag name of an element is not important - the DOM APIs are generic and string-based (document.getElementsByTagName), and CSS’s sigil-free (so arguably “default”) identifier matches the tag of an element.

                                                                                                                  1. 18

There’s a reason we stopped using CGI the way we did, though. You wouldn’t want to use escript for CGI unless you’re happy to wait for BEAM startup on every single request. Then memory is used per request, so a few slow requests can kill your server. And you lose in-process caches and any connection pooling too. I find this a bit annoying every time someone says that Lambda is just CGI these days…

                                                                                                                    CGI has some basic uses, but these days, I can’t think of any reason I would use it instead of FastCGI or something similar.

                                                                                                                    1. 3

                                                                                                                      CGI has some basic uses, but these days, I can’t think of any reason I would use it instead of FastCGI or something similar.

                                                                                                                      One of the nicest things is if you write a program in such a way it works with basic cgi, it will also work in almost any other deploy system. You can transparently switch it to fastcgi or scgi or an embedded http server since the handler, by definition, will not rely on any of that cross-request in-process stuff (though you can still provide it if you want as a bonus, of course). There’s a lot of benefit in this - you can also do horizontal scaling with the model quite easily.
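A minimal sketch of that portability in Python (stdlib only, hypothetical handler): write the handler against a neutral interface like WSGI, and the same function runs under classic CGI, an embedded HTTP server, or a FastCGI/SCGI bridge.

```python
from wsgiref.handlers import CGIHandler
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # No cross-request, in-process state, so this runs under any deploy model.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from " + environ.get("SERVER_PROTOCOL", "?").encode()]

# Under classic CGI, each request execs the script:
#     CGIHandler().run(app)
# The same handler under an embedded server, with no changes to `app`:
#     make_server("", 8000, app).serve_forever()
```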

                                                                                                                      a few slow requests can kill your server.

this tends to be the case on a lot of setups; CGI is fairly resilient to it, since you can always spawn more processes or kill individual requests from the outside.

                                                                                                                      1. 2

Exactly. The reason they listed CGI for Perl is presumably because Perl didn’t have anything better. CGI is very much “the simplest thing that could possibly work”, but it has terrible performance since it launches a new process to handle every request. It’s ok for a toy project, but not much more.

                                                                                                                        1. 6

                                                                                                                          but it has terrible performance since it launches a new process to handle every request

                                                                                                                          We should be careful to distinguish between Unix processes and interpreters / VMs

                                                                                                                          A process in C / C++ / Rust should start in 1 ms or so, which is plenty fast for many sites.

                                                                                                                          The real problem is that Perl’s “successors” like Python and Ruby start an order of magnitude slower – 30 ms is probably a minimum, and it’s easy to get up to 300 ms when you import libraries
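A rough way to see the gap yourself (the numbers in the comments are illustrative and machine-dependent): time a full fork/exec from Python.

```python
import subprocess
import sys
import time

def startup_ms(argv):
    # Wall-clock time to spawn a process and wait for it to exit.
    t0 = time.perf_counter()
    subprocess.run(argv, check=True)
    return (time.perf_counter() - t0) * 1000

print(startup_ms(["true"]))                        # small C binary: on the order of 1 ms
print(startup_ms([sys.executable, "-c", "pass"]))  # bare interpreter: typically tens of ms
```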

                                                                                                                          node.js and Java have additional JIT startup time, and basically the JIT will not get warmed up at all.

A program that proves the point is cgit, which is a CGI written in C, and you can see it’s plenty fast.

                                                                                                                          https://git.zx2c4.com/cgit/

                                                                                                                          (Not making any comment on the wisdom of CGIs in C here :) But they made it work )

                                                                                                                          1. 4

                                                                                                                            My blog is a CGI program in C and it can return a page in 0.01 seconds. I’ve never really had an issue with performance in the 23 years I’ve been using the program (and that’s even when I first started out using a 66MHz 32-bit x86 Linux server in 1999). I think establishing the TLS connection is more system intensive than my CGI program.

                                                                                                                            1. 2

                                                                                                                              It pains me to think of launching a process as “fast”, but yeah, you’re right about that. A ms isn’t unreasonable latency.

Of course even a C/Rust CGI will slow down if it has to do something like open a connection to a database server or parse a big config file.

                                                                                                                              FastCGI isn’t much harder to implement and saves you the startup overhead. (Although TBH I can’t recall whether it was FastCGI I used or a different similar protocol.)

                                                                                                                              1. 2

                                                                                                                                I think basically what happened is that everything else got so slow – Python/Ruby startup time, TLS, web pages do 100 network round trips now, etc. – that Unix processes are fast now :)

It’s basically like bash saying it’s “too big and too slow” in the man page written in the 90s, but now it’s fast and small compared to modern programs

                                                                                                                                Database connections are an issue, especially on shared hosting, but there seems to be this newfound love for sqlite which would almost eliminate that :)


                                                                                                                                I currently use FastCGI on Dreamhost, but my impression was that it was kind of a weird protocol without many implementations. It seems to be almost a PHP implementation technique – for some reason, it’s not widely used for Python / node.js / etc.

                                                                                                                                I actually had to download an ancient version of a FastCGI Python library to get it work. It’s clear to me that very few people use it on shared hosting

                                                                                                                                Which is shame because as mentioned in this thread, AWS Lambda is basically FastCGI – it has cold starts, and then it’s fast.

                                                                                                                                1. 4

                                                                                                                                  FastCGI was a big deal ~2000, less so now. There used to be a lot of stuff using it. It still has currency in PHP-land because php-fpm is the only reasonable way to run PHP besides Apache mod_php, and people have various reasons for not wanting to run Apache, or not wanting to have mod_php in their Apache.

                                                                                                                                2. 1

                                                                                                                                  For the uninitiated could you break down the difference between CGI and FastCGI. I understand the flow of CGI being request comes in, program runs and its stdout is sent back to the client.

                                                                                                                                  But in FastCGI is their persistence processes involved?

                                                                                                                                  1. 2

                                                                                                                                    Yeah, with fastcgi the launcher can spawn and kill worker processes as needed, and each worker process can handle an arbitrary number of requests in sequence. They might live a long time or a short time depending on load and configuration.

                                                                                                                                    You can abstract it so a single program works as cgi or fastcgi pretty easily.

                                                                                                                                    There’s also an SCGI alternative, which aims to be a simplified version of FastCGI but is generally the same thing; just the communication protocol is a bit different.
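To make the “same program as CGI or FastCGI” point concrete, here’s a minimal Python sketch (the app and its response are hypothetical; `wsgiref.handlers.CGIHandler` is in the standard library). The application callable is identical either way; only the harness that invokes it changes:

```python
# Hypothetical minimal app: the same WSGI callable can be served
# one-process-per-request under classic CGI, or kept alive by a
# FastCGI/SCGI launcher, without changing the app itself.
def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    path = environ.get("PATH_INFO", "/")
    return [b"hello from " + path.encode()]

def main_cgi():
    # Classic CGI entry point: the web server re-runs this script
    # for every request and reads the response from stdout.
    from wsgiref.handlers import CGIHandler
    CGIHandler().run(app)
    # A FastCGI launcher (e.g. the third-party flup package) would
    # instead keep this process alive and loop over many requests.
```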

                                                                                                                                    1. 1

                                                                                                                                      So, simply more control instead of directly spawning a process on each visit. Awesome, gotcha!

                                                                                                                              2. 5

                                                                                                                                To be fair, that’s what “lambda”/Function-as-a-Service things (AWS) do. Worse, they bring an entire virtual machine up when a request is made. They seem to be used in production, albeit in very narrow workflows where the latency requirements aren’t tight.

                                                                                                                                1. 8

                                                                                                                                  That’s exactly what I mentioned above… No, lambda is not even close to CGI. Lambda spawns an environment on your first request and retains it for a predefined time, so that other requests reuse it. Under enough load it will spawn more of those environments to keep the responses quick. This allows for in memory caching and preserving database connections. There’s also no “entire virtual machine”, but a thin implementation called firecracker, which is not that much slower than starting a standard process. It comes with lots of resource usage controls and scaling options which CGI itself has no concept of. If you’re fine with the first request in a batch taking over 100ms (which honestly most services are), you shouldn’t suffer from lambda latency. (Officially listed as 150ms https://firecracker-microvm.github.io/ )

                                                                                                                                  It’s closer to a FastCGI container autoscaled in a cluster.

                                                                                                                                  1. 4

                                                                                                                                    I guess there’s a line to be drawn on whether you consider “CGI” and “CGI with bolted-on process/env persistence” to be completely separate things.

                                                                                                                                    I’m not sure I do.

                                                                                                                                    1. 5

                                                                                                                                      If you don’t, then everything is CGI with bolted-on persistence. Literally anything that handles HTTP requests is CGI with bolted-on persistence if Lambda is. Why even stop there? qmail is CGI with a bolted-on email protocol. That would make “CGI” itself meaningless.

                                                                                                                                      Or from the other side - since CGI is defined as starting a new process per connection, which implies memory separation between requests, why would anything that doesn’t implement that core idea be called CGI?

                                                                                                                                      1. 4

                                                                                                                                        There are a lot of “CGI-ish” protocols which have kept many of the conventions of CGI even if they haven’t kept exactly the same mechanics as CGI. So, as I pointed out in another comment, Python’s WSGI literally reuses a lot of the special names used by CGI; it just puts them as keys in a hash table instead of as names of env vars.

                                                                                                                                        To me, it doesn’t matter if the particular mechanics allow for persistence; if the overall conventions of how I’m supposed to understand a “request” are the same as CGI, I don’t meaningfully distinguish it, as a programming model, from CGI.
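A sketch of what that looks like in practice (the request values are made up): the keys a WSGI app reads are the classic CGI variable names, delivered as dict entries rather than process environment variables:

```python
# The environ dict a WSGI server passes in reuses the CGI names:
# REQUEST_METHOD, PATH_INFO, QUERY_STRING, etc.
def describe_request(environ):
    return "{} {}?{}".format(
        environ["REQUEST_METHOD"],
        environ.get("PATH_INFO", "/"),
        environ.get("QUERY_STRING", ""),
    )

# Under plain CGI the very same names arrive via os.environ instead:
#   import os; describe_request(os.environ)
```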

                                                                                                                                        1. 2

                                                                                                                                          And FastCGI which is the de facto way of running PHP using nginx and others.

                                                                                                                                          1. 1

                                                                                                                                            This has truly been one of the most entertaining comment threads I have read. The Lambda thing that AWS has is interesting. I don’t use any AWS services; I only know that their products are all Amazon employees talk about. But Lambda does sound like CGI for 2022, done with modern tech.

                                                                                                                                2. 3

                                                                                                                                  The process that it launches can be very lightweight though. A lot of CGI scripts that I’ve seen were punting most of the logic to another process. For example, connecting to a daemon and putting all of the logic there, or simply serving a static file and prodding an IPC channel that coalesced messages and regenerated the file if it had grown stale.

                                                                                                                                  1. 2

                                                                                                                                    No, it’s because Perl was huge (much as Rails was once huge) back when no one had anything better. Perl has long since gotten a Rack/WSGI-like API and a nice little ecosystem of servers and middlewares and whatnot (and, of course, the ability to run anything that’s compliant with that API under CGI or FastCGI if that’s what floats your boat). But there was a time when “dynamic web content” meant “CGI”, and “CGI” meant “Perl”.

                                                                                                                                1. 3

                                                                                                                                  The biggest drag on C++ compile times is #include statements; header files also make C++ more verbose, as the author notes.

                                                                                                                                  But C++20 has modules, which are supposed to fix these problems. Someone should compare C++20 with modules against Rust; the benchmarks should show better results.

                                                                                                                                  1. 4

                                                                                                                                    One thing I’m doing on Oil is to keep track of the total size of expanded input passed to the compiler.

                                                                                                                                    Basically I have Ninja targets for each file that do gcc -E, and then another target that passes them to wc.

                                                                                                                                    It is pretty eye opening – it fluctuates a lot as the code changes, and keeping an eye on it has resulted in real build time improvements. It doesn’t perfectly correlate with build times, but it’s a lot easier to measure and optimize – by removing unused #includes and restructuring translation units.

                                                                                                                                    https://www.oilshell.org/release/0.13.1/pub/metrics.wwz/preprocessed/cxx-opt.txt

                                                                                                                                         60603 _build/preprocessed/cxx-opt/cpp/core.cc
                                                                                                                                         73984 _build/preprocessed/cxx-opt/mycpp/gc_str.cc
                                                                                                                                         77099 _build/preprocessed/cxx-opt/cpp/frontend_match.cc
                                                                                                                                         98702 _build/preprocessed/cxx-opt/_gen/bin/osh_eval.mycpp.cc
                                                                                                                                       1133793 total
                                                                                                                                    

                                                                                                                                    This relies on our new “declarative Ninja / mini-bazel” build system, which I am growing very fond of:

                                                                                                                                    http://www.oilshell.org/blog/2022/10/garbage-collector.html#declarative-ninja-mini-bazel


                                                                                                                                    In the next release, the total will be down to ~795K lines (from 1.1M you see) because I removed <unordered_set> from the GC implementation. As mentioned, the total size fluctuates a lot!
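The metric can be approximated in a few lines of Python (the compiler invocation and paths here are assumptions; Oil’s real version is wired into Ninja targets): run only the preprocessor on each translation unit and total the output lines, like `gcc -E` piped to `wc -l`:

```python
import subprocess

def preprocessed_lines(source, cmd=("g++", "-E")):
    """Run just the preprocessor (g++ -E by default) on one
    translation unit and count the lines it emits."""
    out = subprocess.run(
        [*cmd, source], capture_output=True, text=True, check=True
    ).stdout
    return out.count("\n")

def report(sources, cmd=("g++", "-E")):
    """Print a wc-style per-file listing plus a grand total."""
    counts = {src: preprocessed_lines(src, cmd) for src in sources}
    for src, n in sorted(counts.items(), key=lambda kv: kv[1]):
        print(f"{n:>10} {src}")
    total = sum(counts.values())
    print(f"{total:>10} total")
    return total
```

Tracking this total over time is cheap, and as the comment notes, trimming it (removing unused #includes, restructuring translation units) tends to translate into real build-time wins.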

                                                                                                                                    1. 2

                                                                                                                                      Unfortunately, modules still don’t work in most compilers. import <iostream>; gives an error.

                                                                                                                                    1. 8

                                                                                                                                      Sometimes, oddly, the best syntax is no syntax. For example, in OCaml, there is no special multi-line string syntax. Instead, you can just add line breaks inside normal double-quoted strings.

                                                                                                                                      I also find the ability to add your own infix operators a very nice micro-feature. However, no micro-feature comes at zero cost.

                                                                                                                                      1. 10

                                                                                                                                        Yeah… it feels like single-line strings are fixing something that’s no longer a problem: strings accidentally being left unclosed and causing confusing errors. With syntax highlighting and halfway decent error messages it’s not that important.

                                                                                                                                        BUT, the one case where I’d like a multi-line string syntax is something like:

                                                                                                                                        def run_query():
                                                                                                                                            sql = """
                                                                                                                                                SELECT * FROM x
                                                                                                                                            """
                                                                                                                                        

                                                                                                                                        Where I’d like that literal to actually resolve to "SELECT * FROM x" – dropping the leading and trailing empty line and any consistent leading whitespace (with a special case for internal empty lines).
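Python’s literal doesn’t do this, but the standard library can: `inspect.cleandoc` performs exactly the transformation described (drop the leading and trailing blank lines, then strip the common indentation). A small sketch:

```python
import inspect

sql = """
    SELECT * FROM x
"""

# cleandoc drops the blank edge lines and the common leading
# whitespace, leaving just the statement text:
clean = inspect.cleandoc(sql)  # -> "SELECT * FROM x"

# Internal line structure survives, minus the shared indent:
multi = inspect.cleandoc("""
    SELECT *
    FROM x
""")  # -> "SELECT *\nFROM x"
```

(`textwrap.dedent` is the related tool, but it keeps the surrounding blank lines, so `cleandoc` is the closer match to the wished-for literal.)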

                                                                                                                                        1. 2

                                                                                                                                          Oil has that!

                                                                                                                                          https://www.oilshell.org/blog/2021/09/multiline.html#multi-line-string-literals-and-and

                                                                                                                                          The indentation of the closing quote determines the leading whitespace that’s stripped.

                                                                                                                                          And if there’s an initial newline it’s dropped. But the last newline isn’t dropped, which I think makes sense? (You can leave it off if you want by putting it on the same line as the last line of text.)

                                                                                                                                          oil$ var x = """
                                                                                                                                          >   hello
                                                                                                                                          >   there
                                                                                                                                          >   """
                                                                                                                                          
                                                                                                                                          
                                                                                                                                          oil$ = x
                                                                                                                                          (Str)   'hello\nthere\n'
                                                                                                                                          

                                                                                                                                          FWIW this pretty much comes from Julia, which has Python-like multi-line strings, except that it strips leading whitespace.

                                                                                                                                          The multi-line strings are meant to supplant the 2 variants of here docs in shell.

                                                                                                                                          There are unfortunately still 3 kinds of multi-line strings to go with the 3 kinds of shell strings. But I’m trying to reduce that even further:

                                                                                                                                          https://lobste.rs/s/9ttq0x/matchertext_escape_route_from_language#c_vire9r

                                                                                                                                          https://lobste.rs/s/9ttq0x/matchertext_escape_route_from_language#c_tlubcl

                                                                                                                                          1. 1

                                                                                                                                            Val has this: https://github.com/val-lang/specification/blob/main/spec.md#string-literals

                                                                                                                                            And I’m quite sure to have seen other languages do this, but can’t recall now.

                                                                                                                                            1. 2

                                                                                                                                              Java’s new-ish text blocks do this too!

                                                                                                                                              1. 3

                                                                                                                                                Aha, right. For anyone else curious it’s described in JEP 378: Text Blocks.

                                                                                                                                            2. 1

                                                                                                                                              Zig does that very well. You write:

                                                                                                                                              const sql =
                                                                                                                                                  \\SELECT *
                                                                                                                                                  \\FROM x
                                                                                                                                              ;
                                                                                                                                              

                                                                                                                                              and you get the string “SELECT *\nFROM x”. There’s no ambiguity about leading spaces. If you wanted a leading space before SELECT or FROM, you’d just put a space there.

                                                                                                                                              I’m just confused why it uses \\ instead of the more obvious “””.

                                                                                                                                            3. 4

                                                                                                                                              Yeah, Lua’s insistence on requiring a special syntax for strings just because they contain newlines makes zero sense. A syntax for strings that don’t require escaping backslashes is fine, but there’s no reason to couple that to newline support.

                                                                                                                                              1. 2

                                                                                                                                                OCaml does actually have a special literal-string syntax. It’s not required for multi-line strings, as you mention, but can be quite useful because you never need to (nor can!) escape anything in it. The delimiters are {[ and ]}, but also to cover the edge case of strings that contain the closing delimiter, you can put any identifier consisting of a-z or _ between the two characters of the opening delimiter, and then the closing delimiter must contain the same identifier.

                                                                                                                                                1. 1

                                                                                                                                                  This is wrong. There is no {[ ]} syntax in OCaml as of 5.0.0. You may be referring to the {| |} syntax which is not a special syntax for multi-line strings (since there’s no need for a special syntax, as normal strings already allow newlines) — it’s there to allow unescaped quotes in string literals. The side effect is that you indeed cannot escape anything.

                                                                                                                                              1. 54

                                                                                                                                                Totally agreed about kebab case. It’s an unusually major quality-of-life improvement.

                                                                                                                                                I’d also add being allowed to use ? in an identifier. user-record-valid? is pretty clear, both as a function or as a variable.

                                                                                                                                                1. 55

                                                                                                                                                  The argument I hear against kebab case is that it makes it impossible to write subtraction as foo-bar but like … that’s … good actually? Why are we designing our syntax specifically in order to accommodate bad readability patterns? Just put a space in there and be done with it. Same logic applies to question marks in identifiers. If there’s no space around it, it’s part of the identifier.

                                                                                                                                                  1. 12

                                                                                                                                                    Agreed! (hi phil 👋)

                                                                                                                                                    This is mentioned in the article too in a way. In addition to the readability point you make, the author makes the argument that most of us use multi-word identifiers far, far more often than we do subtractions.

                                                                                                                                                    1. 9

                                                                                                                                      I dunno, I think there are a lot of pesky questions here. Are all mathematical operators whitespace-sensitive, or just -? Is kebab-case really worth the pesky errors when someone doesn’t type things correctly?

                                                                                                                                                      I format my mathematical operators with whitespace, but I also shotgun down code and might leave out the spaces, then rely on my formatter to correct it.

                                                                                                                                                      Basically, I think kebab-case is nice, but properly reserved for lisps.

                                                                                                                                                      1. 28

                                                                                                                                                        Are all mathematical operators whitespace sensitive?

                                                                                                                                        Yes, of course! There’s no reason to disallow tla+ as an identifier either, or km/h for a variable holding a speed, other than “that’s the way it’s been done for decades”.

                                                                                                                                                        I also shotgun down code and might leave out the spaces, then rely on my formatter to correct it.

                                                                                                                                                        The compiler should catch it immediately since it’d be considered an unrecognized identifier.
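That rule is cheap to implement. A toy Python lexer sketch (the grammar is made up for illustration): identifiers may contain interior hyphens and a trailing ?, while a bare - lexes as the minus operator only when it stands alone, so a mistyped foo-bar surfaces as a single unrecognized identifier rather than a silent subtraction:

```python
import re

# Toy tokenizer: an identifier is letters/digits joined by single
# interior hyphens, optionally ending in '?'; a lone '-' is the
# minus operator; any other non-space char is its own token.
# (Unary minus / prefixed '-bar' handling is deliberately omitted.)
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*\??|-|\S")

def lex(src):
    return TOKEN.findall(src)
```

With this, `lex("foo-bar")` yields one token and `lex("foo - bar")` yields three, which is exactly what lets the compiler flag a missing space as an unknown name.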

                                                                                                                                                        1. 3

                                                                                                                                          I’m not sure if this is an argument for or against what you’re saying here, but this discussion reminded me of the old story about how Fortran 77 and earlier just ignore all spaces in code:

                                                                                                                                                          There is a useful lesson to be learned from the failure of one of the earliest planetary probes launched by NASA. The cause of the failure was eventually traced to a statement in its control software similar to this:

                                                                                                                                                          DO 15 I = 1.100

                                                                                                                                                          when what should have been written was: DO 15 I = 1,100

                                                                                                                                                          but somehow a dot had replaced the comma. Because Fortran ignores spaces, this was seen by the compiler as:

                                                                                                                                                          DO15I = 1.100

                                                                                                                                                          which is a perfectly valid assignment to a variable called DO15I and not at all what was intended.

                                                                                                                                                          (from https://www.star.le.ac.uk/~cgp/prof77.html)

                                                                                                                                                        2. 8

                                                                                                                                                          If I see x-y, I always parse it visually as a single term, not x minus y. I think that’s a completely fair assumption to make.

                                                                                                                                                      2. 16

                                                                                                                                                        I have always found kebab-case easier on the eyes than snake_case, I wish the former was more prevalent in languages.

                                                                                                                                                        1. 14

                                                                                                                                                          Raku (previously known as Perl 6) does exactly this: dashes are allowed in variables names, and require spaces to be parsed as the minus operator.

                                                                                                                                                          1. 6

                                                                                                                                                            Crazy idea: reverse _ and - in your keyboard map :)

                                                                                                                                            Probably would work out well for programmers: all your variables become easier to type.

                                                                                                                                            When you need minus, which is not as often, you press shift.

                                                                                                                                                            1. 10

                                                                                                                                                              More crazy ideas.

                                                                                                                                                              • Use ASCII hyphen (-) in identifiers, and use the Unicode minus sign (−) for subtraction.
                                                                                                                                                              • Permit -- (two hyphens) as a synonym for − (minus). Related to the fact that some languages let you write ≤ instead of <=, and so on. Related to the fact that -- turns to – in markdown.
                                                                                                                                                              • Your text editor automatically converts -- to − and <= to ≤.
                                                                                                                                                              • This makes more sense if you are viewing source code using a proportional font. Identifiers consume less precious horizontal screen space in a proportional font. Hyphens are shorter than underscores, so it looks better and is nicer to read.
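To make the hyphen-vs-minus idea concrete, here is a toy lexer sketch in Python (entirely hypothetical, not taken from any real language) where the ASCII hyphen is part of identifiers and the Unicode minus sign U+2212 is a separate subtraction operator:

```python
import re

# Toy lexer: ASCII hyphen (-) joins identifier characters,
# while the Unicode minus sign (U+2212) is its own operator token.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9-]*|\u2212|\d+")

def tokenize(src):
    return TOKEN.findall(src)

print(tokenize("total-count \u2212 1"))  # ['total-count', '−', '1']
```

So `total-count` lexes as one identifier, and the minus sign stands alone even without surrounding spaces.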
                                                                                                                                                              1. 15

                                                                                                                                                                Use ASCII hyphen (-) in identifiers, and use the Unicode minus sign (−) for subtraction.

                                                                                                                                                                #include <vader.gif> Nooooooo!!!!!!!!!

                                                                                                                                                                I really don’t like this idea. I’m all for native support for Unicode strings and identifiers. And if you want to create locale-specific keywords, that is also fine. I might even be OK with expanding the set of common operators to specific Unicode symbols, provided there is a decent way to input them. [1]

                                                                                                                                                                But we should never, ever use two visually similar symbols for different things. Yes, I know, the compiler will immediately warn you if you mixed them up, but I would like to strongly discourage ever even starting down that path.

                                                                                                                                                                [1] Something like :interpunct: for the “·” for example. Or otherwise let’s have the entire world adopt new standard keyboards that have all the useful mathematical symbols. At any rate, I’d want to think about more symbols a lot more before incorporating it into a programming language.

                                                                                                                                                                1. 4

The hyphen and minus sign differ greatly in length, and are easily distinguished, when the correct character codes and a properly designed proportional font are used. According to The TeXbook (Donald Knuth, page 4), a minus sign is about 3 times as long as a hyphen. Knuth designed the standards we still use for mathematical typesetting.

                                                                                                                                                                  When I type these characters into Lobsters and view in Firefox, Unicode minus sign (−) U+2212 is about twice the width of Unicode hyphen (‐) U+2010. I’m not sure if everybody is seeing the same font I am, but the l and I are also indistinguishable, which is also bad for programming.

                                                                                                                                                                  A programming language that is designed to be edited and viewed using traditional mathematical typesetting conventions would need to use a font designed for the purpose. Programming fonts that clearly distinguish all characters (1 and l and I, 0 and O), are not a new idea.
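For reference, the three look-alike characters under discussion have distinct code points and official names, which a quick Python check confirms:

```python
import unicodedata

# The three easily-confused characters: ASCII hyphen-minus,
# Unicode hyphen, and Unicode minus sign.
for ch in ["-", "\u2010", "\u2212"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+002D HYPHEN-MINUS
# U+2010 HYPHEN
# U+2212 MINUS SIGN
```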

                                                                                                                                                                2. 7

Sun Labs’ Fortress project (an HPC language from ~15 years ago, and a one-time friendly competitor to Chapel, mentioned in the article) had some similar ideas: Unicode characters were allowed in programs, and there were specific rules for how to render Fortress programs when they were printed or even edited. For example:

                                                                                                                                                                  (a) If the identifier consists of two ASCII capital letters that are the same, possibly followed by digits, then a single capital letter is rendered double-struck, followed by full-sized (not subscripted) digits in roman font.

                                                                                                                                                                  QQ is rendered as ℚ

                                                                                                                                                                  RR64 is rendered as ℝ64

                                                                                                                                                                  it supported identifier naming conventions for superscripts and subscripts, overbars and arrows, etc. I used to have a bookmark from that project that read “Run your whiteboard!”

                                                                                                                                                                  the language spec is pretty interesting to read and has a lot of examples of these. I found one copy at https://homes.luddy.indiana.edu/samth/fortress-spec.pdf

                                                                                                                                                                  1. 9

                                                                                                                                                                    Thanks, this is cool!

                                                                                                                                                                    I feel that the programming community is mostly stuck in a bubble where the only acceptable way to communicate complex ideas is using a grid of fixed width ASCII characters. Need to put a diagram into a comment? ASCII graphics! Meanwhile, outside the bubble we have Unicode, Wikipedia and technical journals are full of images, diagrams, and mathematical notation with sophisticated typography. And text messages are full of emojis.

                                                                                                                                                                    It would be nice to write code using richer visual notations.

                                                                                                                                                                  2. 3

                                                                                                                                                                    Use dieresis to indicate token break, as in some style guides for coöperate:

                                                                                                                                                                    kebab-case

                                                                                                                                                                    infix⸚s̈ubtract

                                                                                                                                                                    (Unserious!)

                                                                                                                                                                    1. 1

                                                                                                                                                                      Nice. All the cool people (from the 1800’s) spell this word diaëresis, which I think improves the vibe.

                                                                                                                                                                      1. 2

                                                                                                                                                                        Ah yes, but if you want to get really cool (read: archaic), methinks you’d be even better served by diæresis, its ligature also being (to my mind at least) significantly less offensive than the Neëuw Yorker style guide’s abominable diære…sizing(?) ;-)

                                                                                                                                                                        1. 3

                                                                                                                                                                          Thank you for pointing this out. I think that diæresis is more steampunk, but diaëresis is self-referential, which is a different kind of cool.

                                                                                                                                                                  3. 7

I’ve tried that before, and it turns out the dash is more common than the underscore even in programming. For example, terminal stuff is riddled with dashes.

                                                                                                                                                                    1. 7

                                                                                                                                                                      For me, this is not at all about typing comfort, it’s all about reading. Dashes, underscores and camel case all sound different in my head when reading them, the underscore being the least comfortable.

                                                                                                                                                                      1. 10

                                                                                                                                                                        For me, this is not at all about typing comfort, it’s all about reading. Dashes, underscores and camel case all sound different in my head when reading them

                                                                                                                                                                        I am the same way, except they all sound different from my screenreader, not just in my head. I prefer dashes. It’s also a traditional way to separate a compound word.

                                                                                                                                                                        1. 2

                                                                                                                                                                          Interesting, you must have some synesthesia :-)

As far as I can tell, different variable styles don’t sound like anything in my head. They make it harder for me to read when it’s inconsistent, and I have to adjust to different styles, but an all_underscore codebase is just as good to me as an all-camelCase one.

I use Ctrl-N in vim, so typing underscore names doesn’t seem that bad. Usually the variable is already there somewhere. I also try to read and test “what I need” and then think about the code away from the computer, without referring to specific names.

                                                                                                                                                                      2. 5

                                                                                                                                                                        I like ? being an operator you can apply to identifiers, like how it’s used with nullables in C#, or, as I recall, some kind of test in Ruby.

                                                                                                                                                                        1. 6

                                                                                                                                                                          In Ruby, ? is part of the ternary operator and a legal method suffix so method names like dst? are idiomatic.

                                                                                                                                                                          1. 1

                                                                                                                                                                            Ah, that makes sense. I don’t use Ruby so I wasn’t sure, I just knew I had seen it.

                                                                                                                                                                          2. 3

In Zig, maybe.? asserts that maybe is not null and unwraps it, and it’s an error if it is null.

                                                                                                                                                                            maybe? is different, in my mind.

                                                                                                                                                                            1. 1

In Ruby it’s just a convention to name your function valid? instead of the is_valid or isValid you have in most languages. The ? is just part of the function name.

                                                                                                                                                                          1. 3

                                                                                                                                                                            Woah very interesting! I’ve been thinking about escaping and quoting a lot recently, and the mess of JSON, CSV, HTML, shell, etc.

                                                                                                                                                                            • JSON: C-style backslash escapes like \n and \\ [1]
                                                                                                                                                                            • CSV: double quote fields with commas, and then sometimes double the double quotes? Wah?
                                                                                                                                                                            • HTML: &quot; style escapes. Sometimes single quotes are special, but you need a hex escape for that?
                                                                                                                                                                            • Shell: bash has 7 different kinds of string literal with different escaping rules
                                                                                                                                                                              1. double quoted, respecting $ and backticks and \
                                                                                                                                                                              2. unquoted: respecting $ and backticks, but a different meaning for \ [2]
                                                                                                                                                                              3. raw single quoted strings, unlike C/Python/JS
                                                                                                                                                                              4. strings with C-style $'\n' (or the POSIX echo -e format, which is slightly different)
                                                                                                                                                                              5. here docs that respect substitution
                                                                                                                                                                              6. here docs that don’t
                                                                                                                                                                              7. dynamic printf format strings
                                                                                                                                                                            • arguably JSON strings are an 8th – they appear in bash scripts, and are different than bash’s rules

                                                                                                                                                                            So it’s a big mess
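The divergence is easy to see from Python’s standard library, which can produce three of the escaping styles above for the same input string:

```python
import json, csv, html, io

s = 'He said "hi", then left'

# JSON: C-style backslash escapes
print(json.dumps(s))                # "He said \"hi\", then left"

# CSV: the field is quoted, and the inner double quotes are doubled
buf = io.StringIO()
csv.writer(buf).writerow([s])
print(buf.getvalue().rstrip())      # "He said ""hi"", then left"

# HTML: entity escapes
print(html.escape(s, quote=True))   # He said &quot;hi&quot;, then left
```

Three unrelated quoting conventions for one string, before even getting to shell.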


                                                                                                                                                                            The solution looks a little unconventional, but I guess if it weren’t, there would be a solution already! Definitely will take a closer look and see how other languages including data languages could be adapted (e.g. I’m sure people here have encountered CSV inside JSON, HTML inside JSON, or even JSON inside JSON, which seems to be a common bug)

                                                                                                                                                                            FWIW Bryan Ford is the inventor of Parsing Expression Grammars (2004), which are now used in the parsers of CPython (as of ~2018) and Zig. A nice case of theory to practice in a relatively short amount of time :)

[1] aside: I discovered that in JS and Python, \xff is treated as a synonym for \u{ff}, which IMO is a huge mistake. \xff should be a byte, not a code point.

                                                                                                                                                                            [2] echo \n outputs n, while echo "\n" usually outputs \n, or sometimes a newline!
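The [1] footnote can be checked directly in Python, where \xff in a str literal is a code point, not a byte:

```python
# In a str literal, \xff means the code point U+00FF, same as \u00ff,
# which encodes to TWO bytes in UTF-8. Only a bytes literal gives a raw byte.
assert "\xff" == "\u00ff"
assert "\xff".encode("utf-8") == b"\xc3\xbf"
assert b"\xff" == bytes([255])
print("ok")
```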

                                                                                                                                                                            1. 2

                                                                                                                                                                              This reminds me of my old idea for generalized string literals.

                                                                                                                                                                              A couple of years ago I did a survey of the state of string literals in various programming languages because a lot of languages have grown more complicated literal syntax in the decade since I wrote down my idea. When I started thinking about generalized literals, the syntax I had in mind was rather outlandishly complicated, but by today’s standards it seems quite reasonable!

                                                                                                                                                                              1. 2

                                                                                                                                                                                Update: I think I decided that like C++ (thanks to your page) the tag can be a suffix.

Honestly I’ve never used that in C++, but it seems like it’s OK? This seems to make sense:

                                                                                                                                                                                echo "foo $var"
                                                                                                                                                                                echo "foo $var"html
                                                                                                                                                                                echo y"foo \n \{var}"
                                                                                                                                                                                echo y"foo \n \{var}"html
                                                                                                                                                                                

                                                                                                                                                                                And then multiline versions work the same way:

proc p {
  echo """
  foo $var
  line two, with leading space stripped
  """html
}
                                                                                                                                                                                

                                                                                                                                                                                I feel like this makes sense:

                                                                                                                                                                                • the prefix r or y, similar to Python and Rust, tells you how you PARSE it
                                                                                                                                                                                • the suffix like html, similar to C++, tells you how to EVALUATE it

                                                                                                                                                                                ?

                                                                                                                                                                                Interested in feedback

                                                                                                                                                                                https://github.com/oilshell/oil/issues/1442
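A rough Python sketch of that prefix/suffix split (the y_html name and its semantics are hypothetical, just to illustrate “the prefix says how to parse, the suffix says how to evaluate”):

```python
import html

# Hypothetical: the "y" prefix would control parsing (backslash escapes
# are interpreted), while the "html" suffix controls evaluation
# (interpolated values are HTML-escaped on substitution).
def y_html(template, **vars):
    escaped = {k: html.escape(str(v), quote=True) for k, v in vars.items()}
    return template.format(**escaped)

print(y_html("foo {var}", var="<b>bold</b>"))  # foo &lt;b&gt;bold&lt;/b&gt;
```

The point being that the escaping rule lives with the string, not with every call site.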

                                                                                                                                                                                1. 1

                                                                                                                                                                                  Interestingly, Python is experimenting with tagged strings, too: https://github.com/jimbaker/tagstr

                                                                                                                                                                                  1. 1

                                                                                                                                                                                    Hm interesting! didn’t know that

                                                                                                                                                                                    Definitely seems so

                                                                                                                                                                                    https://github.com/jimbaker/tagstr/blob/6ac5fcf23bdf81a9a2f26534d39ec9eb5d2e724b/examples/sql.py#L180

                                                                                                                                                                                    Although I wonder if you have sql"" where you put the r"" or the b"" etc.

                                                                                                                                                                                    Also it says PEP 999 in github, but that doesn’t exist: https://peps.python.org/pep-0000/

                                                                                                                                                                                    1. 1

                                                                                                                                                                                      I think the “999” is a placeholder.

                                                                                                                                                                                2. 1

                                                                                                                                                                                  I’ve been following this project to add something very similar to Python: https://github.com/jimbaker/tagstr

                                                                                                                                                                                  1. 1

                                                                                                                                                                                    Hm great survey! I hadn’t seen that page, and some of the languages have indeed gotten more elaborate than I thought

About choosing {} [] () as delimiters: I made a similar choice >10 years ago for a “JSON Template” language that ended up influencing Go’s text/template, although Go abandoned that feature and chose {{ }}.

                                                                                                                                                                                    So I see the logic, but for some reason (maybe not a good one?) I’m reluctant to make that choice again … But I’d be curious about your experience with it.


                                                                                                                                                                                    I am trying to reduce the number of string literal types from ~8 to 3 or 4. I think I have “one string literal to rule them all”, called YSTR – JSON-compatible, C escapes, Swift-style interpolation

                                                                                                                                                                                    It’s a successor to QSN. https://www.oilshell.org/release/latest/doc/qsn.html

                                                                                                                                                                                    The main flaw of QSN is that it’s not JSON compatible. It was designed to be consistent with Oil string literals, but I think it’s actually more important to be consistent with JSON/JS/Python string literals.

                                                                                                                                                                                    Probably the best description of YSTR is on the feedback I just gave about matchertext:

                                                                                                                                                                                    https://github.com/dedis/matchertext/issues/1#issuecomment-1371826392

                                                                                                                                                                                    There is also a ton of disjointed brainstorming on #oil-discuss-public : https://oilshell.zulipchat.com/#narrow/stream/325160-oil-discuss-public/topic/Unifying.20String.20Notation.3A.20JSON.2C.20CSV.2C.20HTML.2C.20C.2C.20Shell

                                                                                                                                                                                    I’d definitely be interested in feedback. I think what you’re suggesting is very similar, although I think YSTR is a little more conventional in the common cases, while still supporting rare cases.


Though your post did remind me that I haven’t managed to stuff in JavaScript-style tags.

                                                                                                                                                                                    I wonder if that can be done in the parser, not in the lexer.

Oil has an “atom” / symbol syntax %myatom, which is essentially an opaque string with different syntax.

                                                                                                                                                                                    So I wonder if we could do something like:

                                                                                                                                                                                    var x = "raw interpolation of ${var}"
                                                                                                                                                                                    var y = %html "escaped interpolation of $[var]"
                                                                                                                                                                                    
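The semantics I have in mind for the tag, as a rough Python sketch (the `interpolate` helper is hypothetical, not Oil code): the tag just selects an escaping function that’s applied to each interpolated value before substitution.

```python
import html

def interpolate(tag, template, **vars):
    # Hypothetical sketch: a 'html' tag applies html.escape() to each
    # interpolated value; no tag means raw substitution.
    escape = html.escape if tag == 'html' else (lambda s: s)
    out = template
    for name, val in vars.items():
        out = out.replace('$[%s]' % name, escape(str(val)))
    return out

print(interpolate('html', 'escaped: $[var]', var='<b>hi</b>'))
# escaped: &lt;b&gt;hi&lt;/b&gt;
print(interpolate(None, 'raw: $[var]', var='<b>hi</b>'))
# raw: <b>hi</b>
```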

It’s basically treating the space as an operator.

I guess one downside is that you can’t put that in commands:

                                                                                                                                                                                    echo %html "escaped interpolation of $[var]"
                                                                                                                                                                                    

That doesn’t make sense there …

The reason I don’t want a prefix is that we already have r'' like Python, and probably y"" for YSTR.