Threads for fanf

  1.  

    Q: is it possible to make periodic tiling with this? Is it hard or would it be likely if you’re just laying them randomly? Is it possible to get stuck when tiling, forcing you to backtrack to avoid overlaps?

    1.  

      Q: is it possible to make periodic tiling with this?

      No! Hats (the previous paper) and Specters (this paper) only admit aperiodic tilings. This is in contrast to Penrose tilings (either rhombuses, or kits&darts) which can be tiled periodically unless you add additional matching rules (lines drawn on top of the tiles that must match up across tiles). A quote from the original paper about this: “In this paper we present the first true aperiodic monotile, a shape that forces aperiodicity through geometry alone, with no additional constraints applied via matching conditions.”

      Is it possible to get stuck when tiling, forcing you to backtrack to avoid overlaps?

      I suspect so? But very curious if anyone knows for sure.

      1.  

        The base shape (with straight edges) is “weakly aperiodic”, which (in the terminology of the paper) means that it can tile periodically if you allow both the tile and its reflection, but must be aperiodic if you disallow reflections (but allow translation and rotation). The spectre variant has curved edges which prevent tiles from making a tiling with their reflections, so it is aperiodic whatever planar isomorphisms you allow.

      1. 1

        [You have to jump through a clickwall to get the PDF but you don’t have to give them your email address]

          1. 3

            This took a weird turn from GPS to UTC, but otherwise I liked it.

            1. 1

              heh, I thought the weird turn was from the present tense (how things work now) to the past (how they came to be) but it seems to have been subtle enough :-)

            1. 2

              Starlark, an embeddable dialect of Python, specifically the Golang implementation starlark-go.

              I’m at the RIPE DNS Hackathon in Rotterdam. Our team is prototyping a mechanism for scriptable network measurements using the RIPE Atlas distributed probing system. At the moment Atlas supports a fixed collection of measurement types; scripting should make the measurements extensible, and allow data reduction on the probes.

              For this purpose we wanted a scripting language that’s reasonably friendly for data scientists, and which can be sandboxed so that the Atlas probes don’t become a programmable botnet, and which has a host language that’s quick for prototyping.

              1. 1

                I think that this really comes down to the important distinction between abstraction and obfuscation. I don’t believe that RPC is an abstraction on-top of message passing, I believe it is an obfuscation. Even their fix is a bit wrong, because as they said, for a different use-case in their code those values were correct. If they were simply passing protobuf messages back and forth rather than using an RPC obfuscation that they don’t fully understand, this bug would have been much easier to find, indeed probably would have never occurred. The weird thing, is that RPC generally doesn’t even reduce the LOC count vs simple message passing. You have a caller and a receiver and you have to deal with async just the same…

                1. 1

                  I don’t think the higher-level model is a problem in this case, but instead a combination of a couple of things:

                  • GRPC includes fairly complicated service discovery and failure recovery machinery; they fixed their problem by reconfiguring it. A message-passing system would still need service discovery and failure recovery, so it could still cause overloads in the same way.

                  • Their orchestration around restarts is ungraceful: they tear down the old service before traffic has moved over to the new service, and they heavily rely on their ingress layer to hold requests while the pods restart. They could also have solved the problem by starting the new pods first, reconfiguring service discovery, waiting for traffic to migrate, and finally tearing down the old pods.

                  (There is, in fact, a small DNS issue, in that GRPC ignores the DNS TTL and has its own configuration its DNS cache refresh timer; in this case the GRPC timer was longer than the DNS TTL so it made restarts slower, exacerbating the problem.)

                1. 5

                  This is a really good debugging deep dive, so many twists and turns!

                  1. 17

                    The haiku at the end has the wrong syllable count. LLMs cannot count syllables or letters in words reliably. I have been informed that this is an artifact of the tokenization process.

                    I think LLMs are neat, but a lot of “this time is different” just assumes that the current progress will continue. Maybe? I don’t have a prediction one way or the other. But I will say that at the current level of technology, if you lose your job, it was a bullshit job.

                    1. 4

                      It is indeed an open question whether the recent LLM progress was just a fluke, and it’s already approaching the ceiling, or is it just a baby step in a whole new field. For now, it seems like it’s the latter. GPT 2 -> 3 -> 4 were noticeable improvements, and there’s no sign of them stopping. We still have possibility of both developing more powerful hardware and throwing more data at them, as well as continued advancement in reducing model sizes and improving training. We’re also at inflection point where LLMs are useful for preparing and cleaning training data for themselves.

                      if you lose your job, it was a bullshit job.

                      That’s meaningless. There was a time when being a messenger was an important job. There was a time when being a watchmaker was a proper craft.

                      Is programmer a bullshit job? Lawyer? Pharmacist? Psychotherapist? Is Prompt Engineer a serious job?

                      1. 5

                        I mean “Bullshit Job” in the sense of the kind of job in David Graeber’s book of the same name and I mean if you lose your job to an LLM with the capabilities it has in May 2023. There are lots of jobs today that are made more efficient by an LLM, but if it can be made so efficient that it can be eliminated entirely, I don’t think the job needed to be done in the first place. I dunno, maybe some companies can go from having three people in marketing down to two, but I can’t think of anyone at my small company who we could eliminate today with an LLM. Everybody does a lot of different small things, and getting faster in one area just means we can do something else with the left over time.

                        One of my grandfathers was a traveling salesman, and part of his job was adding up the numbers for his sales. The adding part was made obsolete by Lotus 123, and travel part is being obviated by Zoom. I’m not totally sure, but I think his old job doesn’t exist anymore because the company went broke due to globalization.But now I have a job that didn’t exist then. I don’t think it makes sense to worry about technological unemployment. People have enough to do to keep us all busy!

                        1. 9

                          As I recall, Graeber deliberately defines a bullshit job as one that the person doing it thinks is bullshit. If you start defining other people’s jobs for them then that’s not quite what he was saying, besides being ruder and perhaps patronising.

                          1. 3

                            Ironically, I think many of Graeber’s bullshit jobs are fairly safe from AI: a lot of them are the kind where the employer wants warm bodies.

                            1. 1

                              I remember I was visiting San Francisco when I read Bullshit Jobs and the neighborhood I was visiting had little advertising banners for the neighborhood itself up on the lamp posts with a hashtag on them. (I can’t find a picture of it on my phone, but it was something dumb like “#feelmore in Filmore!”) I remember thinking that it was clearly a bullshit job to have to think up the hashtag for the campaign, because there was no way any human being would ever bother to tweet the hashtag or search for it. I bet you could get an LLM to spit out hashtags for a neighborhood very quickly.

                            2. 1

                              I think you’re stretching David Graeber’s Bullshit Job definition. We have obviously useful value-producing creative and intellectual jobs threatened by AI. Not just data entry assistants to middle-managers for regional branch compliance advisory departments, but doctors, graphic artists, and programmers that until recently were thought irreplaceable. I do expect that for now many will keep jobs to supervise AI and have things to do thanks to Jevon’s paradox, but if we get closer to AGI, it will turn things upside down.

                            3. 2

                              We still have possibility of … throwing more data at them

                              Do we? What data have they not used that they could? I find it hard to believe that a company with such enormous financial resources would not already have, to all intents and purposes, exhausted all possible training data.

                              We’re also at inflection point where LLMs are useful for preparing and cleaning training data for themselves.

                              Are we? How do we know that? Has it happened? Can you give a reference?

                            4. 1

                              I think there are some realistic jobs that could feel threatened. GPT-3 may not be great at spitting out correct python code (in my experience), but it has been exceptional when I ask it for very formulaic stuff. I’d expect copy writers, maybe people dreaming up marketing campaigns, legal interns typing up form letters, and anything that plays into its strength of “Make up something that sounds reasonable” should be concerned.

                              That said, I believe this will then add jobs onto people checking the output, doing editing, and the like. Similarly, for the image generating AIs, I could see smart artists adding a “Artisanal computer graphics made by a human” sticker onto their work and charging more as the flood of AI-generated fluff starts floating around.

                              LLMs will likely impact jobs that today are needed, but that doesn’t mean the job isn’t valuable to someone today.

                              I say this entrenched in the “ChatGPT couldn’t spit out a single working program for me” camp, seeing AI as an assistant, not a replacement, in any field that requires correctness (such as compiling code).

                            1. 10

                              We have a family account at Fastmail, using our family domain and my personal domain.

                              In a previous job I looked after email for Cambridge University, alongside my colleague David Carter. David developed a brilliant replication system for Cyrus IMAP, which allowed us to store everyones email in multiple sites. Great operational peace of mind. (In 2003 it was several years ahead of the state of the art.)

                              Bron Gondwana and the team at Fastmail adopted David’s tech, got it incorporated into upstream Cyrus, and made some great improvements to it. They are good supporters of open source and open standards.

                              So when I stopped doing email professionally, Fastmail seemed like a good choice: good people, good tech, and financially stable and reliable for decades.

                              1. 4

                                We have a family account at Fastmail, using our family domain and my personal domain.

                                It looks as if they stopped offering the family plans a few years ago, which is a shame because they looked great value. I’d like to stop self hosting email for my family at some point and they look like a good option, but the per-user quotas are likely to cause some headache.

                                I’d also really like to see a company like fastmail use SEV-SNP / TDX / ACCA to provide technical guarantees (with the client checking a remote attestation) that they can’t see your email.

                                1. 2

                                  Not sure what the family plan was, but if you have a standard account with custom domain names, you can then add basic accounts to share those domain names.

                                  So, let’s say that you need 3 users, that’s going to be $50 + $30 + $30 per year, or $9.2 per month.

                                  Not great, but not terrible.

                                  1. 2

                                    It’s a similar price to self hosting (though, hopefully, less work). It doesn’t look great in comparison to something like Microsoft 365, which gives me 6 users each with 50 GB of mail storage for £80/year (in addition to Office on multiple platforms and 1 TB of cloud storage per user). If I buy cloud storage, even with geographic replication, I’m looking at $0.04/GB/Month for hot storage (and, honestly, 90% of mail is cold and can be moved to a cheaper tier), so the incremental cost of adding more users is very small.

                                    Fastmail used to offer a suite of family plans, where you paid for the total storage amount that your family used and could share that between users. With their current plan, I’d probably have to put most people on their $50/year plan to get 30 GB of space, even though, between us all, we’re likely using under 50 GB.

                                    1. 1

                                      Microsoft 365’s pricing is very competitive, I’m also on a family subscription for their apps and OneDrive storage.

                                      On the other hand, the hidden cost to adopting well-integrated solutions from big corporations is lock-in. Cost-effectiveness is how people were lured into Google’s ecosystem, only to find later that they have to pay with their privacy, and that the competition is increasingly vanishing, such that more and more options are off the table. And the sad part is that well-integrated solutions are often inferior compared with their competition (e.g., Teams vs. Slack, OneDrive vs. Dropbox, etc.), but only big companies can afford the integration and the resulting cost-effectiveness.

                                      And, AFAIK, custom domains are no longer available for family users (see announcement). So, for people comfortable with the lock-in of an @outlook.com email address, might as well go for Gmail because Microsoft doesn’t strike as very different from Google (US-based, owns an advertising platform), and Gmail is technically the superior option.

                                      1. 1

                                        That’s fair. I mostly compare the cost of FastMail to the cost of self hosting. With the low price of VMs and storage, it looks high and (after the initial setup) I don’t really do much to maintain a mail server. I would expect a medium sized company to have better economies of scale. I used to run a mail system for a few hundred users when I was a student and that was about as much effort as running one for a single user, so I’d expect it to be a bit cheaper, even including a healthy profit. For a family, it is currently a bit more expensive.

                              1. 4

                                Sanitizers are not a security boundary…

                                1. 3

                                  A setuid executable linked with a sanitizer is a root privilege escalation vulnerability

                                  Also, the “no-names” are ISRG, the people behind Let’s Encrypt. (If you are a non-native speaker you might not know that “no-name” is an insult saying that someone is insignificant, has failed to get a (good) reputaion, to “make a name for themselves”.)

                                  (Even so, in my opinion sudo is a lost cause, it is solving the wrong problem in the wrong way, and usually a much simpler tool can do the job it is used for with much less risk.)

                                  1. 1

                                    I do not really understand what you really mean. Why rust borrow checker is security boundary?

                                    1. 3

                                      Sanitizers at best make bad behavior nominally obvious, hopefully, most of the time, by exhaustion or induction/deduction in some cases. Systems with type and CFG level knowledge can make it impossible by construction.

                                      1. 3

                                        Sanitizers are probabilistic bug detection tools. They are not designed to be robust in the presence of an active adversary. The Rust borrow checker is a type system property. It is (modulo soundness bugs) a deterministic predicate over some correctness properties of the program, which will hold irrespective of input (modulo bugs in unsafe snippets and FFI).

                                        1. 1

                                          Sanitizers are probabilistic bug detection tools

                                          gwasan maybe, but not asan.

                                          The Rust borrow checker is a type system property.

                                          I understand that.

                                          But I really do not understand why adding boundary checks at every memory access, like asan does, is not equivalent of some deterministic predicate about memory safety of running program.

                                          Asan ensures us that there will not be memory errors, because before any memory error really happens we will have abort().

                                          1. 3

                                            gwasan maybe, but not asan.

                                            Yes it is and I have no idea why you would think otherwise. ASan has no model of pointer provenance and so, at best, it deterministically catches linear overflows. It probabilistically catches other kinds of error (non-linear overflows, pointer injection, use after free).

                                            Asan ensures us that there will not be memory errors, because before any memory error really happens we will have abort().

                                            This is absolutely untrue. If I write a+b, where a is a pointer and b is an unchecked integer that comes from outside of the program (a very common kind of memory-safety bug that leads to vulnerabilities), then ASan will catch it if a+b lands in one of its guard regions. This will happen if the computed address points just after the end of an object, into unmapped memory, or hits a guard between two other allocations. This is great as a bug finding tool because it’s far more likely that fuzzing will trigger such a fault than that fuzzing will trigger out-of-bounds errors that are sufficiently large that they will hit an unmapped page. It is not a security tool, because an attacker who can craft the value of b can easily find a displacement that doesn’t fault but does cause the kind of memory corruption that can be used to elevate to arbitrary code execution.

                                            In contrast, any such thing in a safe language carries the bounds with it. A Rust (or JavaScript, or Go, or whatever) array / slice has a base and a bounds and any indexing is checked against that range and guarantees that it will raise an error if the array index is out of bounds.

                                            CHERI platforms also carry bounds and so get the same dynamic checking for C/C++ and our CHERIoT platform also provides deterministic temporal safety, so Rust has no significant confidentiality or integrity benefits there (though it can have significant availability benefits from preventing many of these bugs at compile time, rather than catching them at run time).

                                            1. 1

                                              an attacker who can craft the value of b can easily find a displacement that doesn’t fault but does cause the kind of memory corruption that can be used to elevate to arbitrary code execution.

                                              Ok, now I understand.

                                              CHERI platforms also carry bounds and so get the same dynamic checking for C/C++ and our CHERIoT platform also provides deterministic temporal safety, so Rust has no significant confidentiality or integrity benefits there (though it can have significant availability benefits from preventing many of these bugs at compile time, rather than catching them at run time).

                                              I also mention about such systems in my text, where suggest to compile C into some tagged memory arch, and interpret resulting code.

                                              1. 3

                                                I also mention about such systems in my text, where suggest to compile C into some tagged memory arch, and interpret resulting code.

                                                Now you’ve got the whole of QEMU or similar in your TCB. These emulators are a lot less well tested than real hardware and, as with sanitisers, are not intended to be security boundaries. QEMU is not a sandbox (QEMU + KVM uses KVM to enforce the sandbox boundary). Now an attacker has a choice of bugs in QEMU and bugs in your program to attack. This probably isn’t so bad for sudo, because most of the ways of breaking memory safety with CHERI QEMU that I know of require multiple threads but it’s still a huge pile of code that was never written with security in mind added to your TCB.

                                                You also mention web assembly, which has no memory safety guarantees within the sandbox (and, in fact, generally has less security than native code compiled with the default set of mitigations).

                                                Since joining lobste.rs, you have:

                                                • Submitted two stories, both to your own blog.
                                                • Posted eight comments, all to the threads about your own blog.

                                                Both of the blogs that you’ve posted have been half-baked ideas that you’ve clearly not thought through properly and the second one starts by denigrating people who have thought through some of the problems and come up with a different solution to you. I’d suggest that you:

                                                • Learn a bit more about the topics that you’re writing about before you write more.
                                                • Write down more of the ideas than ‘hey, wouldn’t it be cool if’ style blogs. Properly think through your ideas, don’t just throw out a few sentences and call it done.
                                                • Actually participate in the lobste.rs community rather than treating it as a promotion channel for your blog.
                                                1. 1

                                                  Now you’ve got the whole of QEMU or similar in your TCB. These emulators are a lot less well tested than real hardware and, as with sanitisers, are not intended to be security boundaries.

                                                  Yes, I also mention about this in my text.

                                                  Actually participate in the lobste.rs community rather than treating it as a promotion channel for your blog.

                                                  Yes, it is perfectly clear to me :)

                                    1. 3

                                      Common lisp has support for “custom-width integer types”

                                      For example: http://clhs.lisp.se/Body/f_by_by.htm

                                      I don’t have much to say on that apart that this is why we often see octet in common lisp where you would usually see byte in other languages.

                                      1. 6

                                        While “octet” does reinforce, I think seeing it often in Lisp may have more to do with Lisp culturally pre-dating the 8-bit byte by a couple decades. People who deal with networking standards / documents with traditions dating to before memory settled on being 8-bit addressable (e.g., the 4004, 6502, etc. were early to mid-1970s) also use “octet” just to be precise / specific.

                                        1. 2

                                          Octet was broadly used in PDP-11 lingo, as opposed to (16-bit) word, and probably in other mini systems.

                                          1. 2

                                            Common Lisp was heavily influenced by the PDP-10 lisps of the 1970s and 1980s; the PDP-10 was a 36 bit computer.

                                        1. 6

                                          Because capabilities are linear, they are not copyable.

                                          Capabilities were originally studied in the Cartesian-closed context, where they are copyable. Today, cryptographic capabilities, which are difficult-to-guess (“unguessable”) rather than unforgeable or uncopyable, are commonplace.

                                          Capabilites are, by default, irrevocable.

                                          This isn’t a limitation, but a good design choice. Revoking any sort of built-in capability is something that must be done via runtime API, rather than inside user code; users can build revocable tokens themselves using a single mutable cell, and Capability Myths Demolished claims that this pattern was known in 1974, around the time of the Lambda Papers.

                                          Every programming language needs a escape hatch. When interacting with the outside world, you have to do unsafe things.

                                          One of the goals of E was to show that this isn’t true. In Monte, a flavor of E that I helped build, there are no escape hatches; every foreign and unsafe feature of the outside world is carefully given a handcrafted wrapper, a process known as taming. We didn’t implement FFI. Network access, filesystem access, system timers and clocks, cryptographic routines, subprocessing, even examining caught exceptions was privileged, and capabilities have to be explicitly imported. The least safe thing in the prelude was access to the unit-test and benchmark runners!

                                          1. 1

                                            I want some kind of capability system for JS so bad…. just… by default don’t let code do anything and I have to explicitly pass in (or otherwise grant) capabilities for network access, fs access, etc. Even a rough, flawed system would be a huge boon. Do you have any recommendations?

                                            1. 2

                                              You want to search for “Secure ECMAScript”, often just called SES. E’s authors have been steadily improving ECMAScript for nearly a decade, too. Outside of SES, common style improvements are actually also capability-safety improvements: Use modules, freeze immutable objects, ponder WeakMap, etc.

                                              1. 1

                                                There was Caja developed by Mark Miller at Google. I gather it failed because it was too difficult to tame the browser APIs into a capability-secure subset.

                                                1. 1

                                                  That’s part of the elevator pitch for Deno.

                                                  1. 2

                                                    Deno’s permissions seem to be too coarse to me. AIUI they apply to the entire program, so they don’t help with things like restricting third party libraries.

                                                    1. 1

                                                      I agree, but it’s an important first step.

                                                      I’m not a JS expert, but given how dynamic it is, it might not be possible to create and enforce real capabilities in it (that sort of concern was raised in the article too.)

                                              1. 5

                                                It’s great to see someone writing this up, but a few comments:

                                                • The usual extension for preprocessed C is .mii, not .tu. Using this will let syntax highlighting work correctly.
                                                • The first figure with the pipeline is missing the assembler step. In clang, this is integrated, but historically it is not (and I think GCC keeps it separate).
                                                • Having the compiler and driver in a single binary and the linker in a separate one is historical: clang needed a gcc-compatible driver and already had a load of the argument-parsing machinery. There was an effort for a while to provide a universal compiler driver in the LLVM project, which would probably have been cleaner (to drive clang, flang, and so on), but it was abandoned.
                                                • cc is not just convenient, it is part of POSIX.
                                                • GNU Binutils is available on all of the listed platforms, but it isn’t the default on more than half of them.
                                                • The multi-file section is very *NIX-specific. Windows build systems typically make the opposite tradeoff (invoking the compiler with multiple translation units) because process-creation costs are higher (and, I think, the visual studio compiler will do some transparent sharing of common include processing if you do).
                                                • The authors of lld and mold might be surprised to learn that linking cannot be parallelised. The authors of LINK.EXE and mold would be surprised to learn that you need a full relink from scratch if a single input file changes.
                                                • The language detection section contradicts itself. In this example, the compiler (not the driver) is setting up the default search paths.
                                                1. 4

                                                  The usual extension for preprocessed C is .mii, not .tu.

                                                  GCC will use .i (traditionally for C) and .ii (for C++). .mii is used for Objective-C++.

                                                  The first figure with the pipeline is missing the assembler step. In clang, this is integrated, but historically it is not (and I think GCC keeps it separate).

                                                  Yeah, it’s separate in GCC. GCC’s code generator outputs assembly and then invokes as. This is actually a bit of a pain sometimes because not only do you need an assembler, but there are cases you won’t know the size of an instruction during codegen because the assembler might be able to change it.

                                                  1. 4

                                                    GCC will use .i (traditionally for C) and .ii (for C++). .mii is used for Objective-C++.

                                                    You’re right. I tend to default to .mii because then it doesn’t matter whether the input it C, C++, Objective-C, or C++ for syntax highlighting to work.

                                                    GCC’s code generator outputs assembly and then invokes as. This is actually a bit of a pain sometimes because not only do you need an assembler, but there are cases you won’t know the size of an instruction during codegen because the assembler might be able to change it.

                                                    This is even more true with the Plan 9 toolchain (which Go uses), which expands pseudos that contain relocations at link time. Depending on the distance / address of the target, you may need 1-3 instructions on modern architectures to materialise the address and the Plan 9 linker picks the shorter sequence. RISC-V tries to do the inverse and emit the inefficient sequence in the compiler and then ‘relax’ it back by deleting instructions and updating all other label addresses, which causes a huge amount more (complex) work in the linker than any other modern architecture / ABI.

                                                    1. 1

                                                      RISC-V tries to do the inverse and emit the inefficient sequence in the compiler and then ‘relax’ it back by deleting instructions and updating all other label addresses, which causes a huge amount more (complex) work in the linker than any other modern architecture / ABI.

                                                      We have examined this kind of approach in our linker work and it really seems RISC-V made a mistake here.

                                                      1. 3

                                                        it really seems RISC-V made a mistake here.

                                                        This statement works in almost any context.

                                                    2. 1

                                                      Yes, .i can be found un the 7th Edition cc source

                                                    3. 2

                                                      Nit: c99 is POSIX, not cc.

                                                      About parallel linking, that was in reference to using the historical linker, and lld and mold etc are mentioned later on.

                                                      1. 2

                                                        Ah, you’re right. I’m fairly sure cc was in POSIX 1997 but I can’t work out how to search that version.

                                                        1. 1

                                                          It was, but deprecated in favour of the c89 command. Putting the language revision in the command name seems like a mistake… https://pubs.opengroup.org/onlinepubs/007908799/xcu/cc.html

                                                          1. 1

                                                            Putting it in the name made some sense because you could detect c89 or c99 support by just checking for the file. C99 also introduced some breaking changes and so you generally didn’t want cc to compile with an unspecified dialect because that would break either old or new code.

                                                      2. 1

                                                        The authors of lld and mold might be surprised to learn that linking cannot be parallelised. The authors of LINK.EXE and mold would be surprised to learn that you need a full relink from scratch if a single input file changes.

                                                        I interpreted that part to mean that the end-result of the linking operation is a single entity, as opposed to compilation, where every compilation unit can be done in parallel independently of one another.

                                                        So it’s possible to link in parallel, but you still need to aggregate everything together in a single linked artifact. (This also seems to be strongly implied by the diagram for the linker chapter)

                                                        1. 2

                                                          That’s kind-of true, but there’s a lot of nuance. Most compilers do some form of separate compilation, but there’s a case to be made (especially on modern hardware) for doing whole-program compilation and some languages do. Modern C/C++ compiler have an option to do this, though they still build IR for translation units one at a time. Linking involves several logical steps that broadly fit into two categories:

                                                          • Resolving symbols
                                                          • Copying (the sections that contain) referenced symbols into the resulting output.

                                                          Both of these can be done somewhat incrementally and this is actually what mold does: it starts running before all object code is available. Resolving symbols can be done as soon as the symbol definition is available and sections can be copied into the output eagerly if there is space reserved for them.

                                                          1. 1

                                                            but there’s a case to be made (especially on modern hardware) for doing whole-program compilation and some languages do.

                                                            Absolutely. We are trying to get our customers to build with LTO on embedded projects because they are small enough that modern hardware can actually do the complex whole-program optimizations that were mostly dreamed about back in the 80s and 90s. (Although in practice, all you really need for the big benefit is to be able to get the whole call-graph into memory so you can inline effectively.) The largest programs for most customers will be maybe 100 compilation units with a total program size of about 1MB, with roughly 1000 functions. With current workstation hardware, that will easily fit in memory and can be analyzed quickly.

                                                      1. 5

                                                        I would add to this list:

                                                        • macro or template programming; the purest example in common use is m4
                                                        • command processors, which are characterized by literal strings being unquoted and variables having sigils

                                                        Tcl is a beautiful fusion of both. Current programming is infested with too many bad versions of them!

                                                        • SQL
                                                        • spreadsheets
                                                        1. 1

                                                          I think lisp covers the macro category well. Your others are worth calling out specifically though.

                                                          1. 5

                                                            I think textual macros have a significantly different flavour from Lisp macros, and a distinct history, going back to Strachey’s GPM in the 1960s.

                                                            1. 2

                                                              TRAC and SAM76 are other significant macro languages. (Ted Nelson teaches TRAC in his classic book “Computer Lib”.)

                                                        1. 5

                                                          I think there’s an underexplored happy place (that D doesn’t entirely reach, but sort of gestures at) involving a fusion of Algol, Self, ML and the unlisted eighth language concept, metaprogramming (templates, macros, :gag: C preprocessor): pervasive type inference, mutable state in objects, immutable values with rich built-in semantics (tuples, sumtypes), lambdas and functional algorithms at the coarse domain level but (opt-in) mutable state (!) at the lowest level. The two big weaknesses of the ML family (IMO) are coarse state mutation and very fine algorithms; tail recursion is cute but iteration will always be nearer to the CPU’s heart. And even a ML program does (as a matter of observable behavior) have mutable state when it’s executed; OOP lets you model this fact where insistence on purity at all levels does not.

                                                          Then you end up, in the hexagonal model [1], with three levels that almost use entirely distinct languages:

                                                          • at the infrastructure (RPC, IO) level, pervasive metaprogramming uses compile-time introspection to mostly autogenerate encode/decode logic, IO functions and automatically check the code against the interface spec.
                                                            • This is tested sparsely by running an instance of the program against wire-level mock services.
                                                          • at the application (state mutation) level, classes separate concerns and encapsulate mutations of different aspects of the application’s state, and also process inbound and outbound messages at a high level.
                                                            • This is tested with class mocking and stub services.
                                                          • at the domain (computation of business outcomes) level, pure type-inferred algorithms operate on immutable values, making them easy to test and understand. Here there is heavy use of ML features.
                                                            • This is tested with unittests.
                                                          • at the (not listed in the hexagonal model) implementation level, where the rubber hits the road, it’s C again, or rather Fortran: a moderately high-level representation of CPU features in a format easily digestible for the compiler. Opt-in mutation, opt-in pointers. Small units of work.

                                                          Addendum: I think OOP gets a bad rap because people try to use it at the domain level, not the application level. A warning sign is when you see a class hierarchy that’s deeper than two levels. The important insight that classes should encapsulate components of the service as a whole that can change state and trigger actions independently. They should never represent values.

                                                          [1] https://vaadin.com/blog/ddd-part-3-domain-driven-design-and-the-hexagonal-architecture

                                                          1. 5

                                                            I would be more precise: Actors are missing. Actors are like objects and also like lambda calculi, so actor languages are like Lisps or Self; but the semantics are usually concurrent in a way which the typical Lisp or Self descendant lacks. The ur-language for concurrent actors is presumably E.

                                                            1. 7

                                                              Or maybe Erlang?

                                                              1. 1

                                                                Oh yeah, I wasn’t even thinking of that.

                                                                Arguably actors live on most prominently in the microservices/event-sourcing design.

                                                            1. 14

                                                              See also: Why Lexing and Parsing should be separate (from the oil shell wiki)

                                                              notably,

                                                              separating lexing and parsing reduces the n in your O(n^3). In any real program, there are many more characters than tokens

                                                              1. 4

                                                                That wiki page is a great summary.

                                                                My main intuition is: lexing can be done in a fixed amount of memory with a finite-state machine (or equivalent regular language); parsing requires an arbitrary amount of memory (i.e. it needs the stack) to recognize recursive/nested structures.

                                                                1. 3

                                                                  (or equivalent regular language)

                                                                  That’s not true for many practical languages: things like nested comments or Rust-style raw strings are not regular. Conversely, lex-style lexer generators allow for custom state manipulation.

                                                                  So, while to the first approximation it is true that lexing is regular and parsing is context free, the reality is quite a bit more murky (eg, Rust lexing isn’t even context free).

                                                                  1. 5

                                                                    This is why I say

                                                                    • lexing is non-recursive
                                                                    • parsing is recursive

                                                                    The regular / context-free thing that’s taught is actually the wrong distinction. Oil’s lexer maintains a stack of “lexer hints”, so I guess it’s a pushdown automata (which is non-recursive).

                                                                    The lexer USES regular languages, but the lexer itself isn’t entirely a regular language. It has both lexer modes and lexer hints.

                                                                    (I don’t think I wrote about the hints that much, but it’s for figuring out what right paren ) means, which is a tricky problem in shell, e.g. due to weird unbalanced ALGOL-style case syntax, nested with command sub, etc. )


                                                                    Likewise with Rust, you can match raw strings non-recursively, but it’s not a regular language.

                                                                    1. 1

                                                                      A pushdown automaton is enough to parse a context free language. Lexer modes are equivalent to selecting between different lexers at run time.

                                                                      What is really fun is string interpolation, especially when the complete expression grammar is available inside an interpolation, including interpolated strings. This looks like arbitrary recursion, but i think the way it is usually implemented is "string${ is treated as a combination of the first part of the string literal and an opening bracket, and }string" is the corresponding close bracket and the remainder of the string. (mutatis mutandis for multiple interpolations)

                                                                      1. 2

                                                                        A pushdown automaton is enough to parse a context free language.

                                                                        I’ve written an entire series on Earley parsing, and I highly doubt that. I mean if that were true why would we need GLR and Earley in the first place?

                                                                        I’m not entirely sure but as far as I know a pushdown automaton (finite state machine + stack) can only parse LL grammars. I’m pretty sure it cannot parse all LR grammars, let alone all non-ambiguous context free grammars, let alone all context-free grammars.

                                                                        1. 1

                                                                          Oops, yes, I missed out the magic word “deterministic” before “context-free”, which in grammar terms roughly means lacking ambiguity.

                                                                          1. 1

                                                                            Err, not quite, actually:

                                                                            According to Wikipedia, “Every context-free grammar can be transformed into an equivalent nondeterministic pushdown automaton”, so in a sense you were exactly write: push down automata can parse all CFGs, even the ambiguous ones.

                                                                            However if we restrict ourselves to the deterministic ones, still according to Wikipedia, “there are non-deterministic unambiguous CFLs”, so some unambiguous grammars still cannot be parsed by a deterministic pushdown automaton.

                                                                            Still, I stand corrected: deterministic pushdown automatons don’t seem to be limited to LL grammars.

                                                                            1. 1

                                                                              Indeed, my weasel word “roughly” meant to say DCFGs are a subset of unambiguous CFGs.

                                                                              I wonder if the kind of tricks that regex engines use to get the speed of DFAs with the compactness of NFAs might be useful for implementing NPDAs - otherwise I doubt they would be an attractive implementation technique in the way deterministic pushdown automata are. Tho my very vague handwavy understanding of GLR suggests it is a bottom-up analogue of an NPDA.

                                                                              1. 2

                                                                                An LR(k) parser can be implemented using a DPDA. When there are ambiguities, the algorithm naturally extends to an NPDA.

                                                                                A GLR implementation is essentially an efficient interpretation of this NPDA, somewhat analogous to Thompson’s trick to recognize a regexp in linear time on an NFA (using a set of integers to represent all the active states, and advance them in lock step).

                                                                                A naive NPDA traversal would require maintaining a list of stacks (one per “thread”). Instead you maintain a DAG that represents all the stacks by sharing common chunks, so that as much work as possible can be shared between the different analyses.

                                                                2. 2

                                                                  Also, most syntax highlighters are based on lexing. So you’ll want a distinct lexing grammar, anyways.

                                                                  1. 1

                                                                    Though if your parser isn’t at least in the ballpark of O(n), you have way bigger problems than the size of the constant.

                                                                  1. 10

                                                                    The only reason to separate lexing and parsing into two distinct phases, it because it’s easier to write. Otherwise, doing the lexing “on-demand” is always better, as it allows the parser to provide context, or even switch entirely between lexers as necessary, which also solves the problem of composition that the article mentions.

                                                                    (still, it’s usually a good idea to use different DSLs to specify terminals vs rules, and different mechanism to extract them from the text)

                                                                    1. 2

                                                                      The only real reason to split it into two phases is because it performs better that way. Parsing is slow, lexing is fast. Do as much work in your lexer and only involve the parser when you need it.

                                                                      1. 2

                                                                        Parsing is not slow. Parsing (at least with all the parsers I’ve worked with) is nearly instantaneous O(1) stuff, and is stunningly fast even for large code bases.

                                                                        I’m not claiming that it’s impossible to build a slow parser, but that would seem to require some effort. (Edit: Like using regex or something.)

                                                                        1. 2

                                                                          Relative to lexing, it’s slow

                                                                          It’s not a bottleneck in many compilers, but certain workloads definitely expose it. For example:

                                                                          • C/C++ – I remember an early Lattner talk on Clang – a lot of it was about parsing speed relative to GCC
                                                                          • JavaScript and v8 – parsing is extremely optimized, because it adds latency to web pages.
                                                                          1. 3

                                                                            I just find it funny when people talk about lexing and parsing. They are like 1% of writing any compiler of any value, yet they get 99% of the research papers and blog posts and articles written about them. They represent less than 1% of the wall clock run time of a typical compiler. It would be like if all the press covered the original launch of Doom by examining only the icon that was installed on the Windows desktop, and never actually launching the game. It’s skin deep, at most.

                                                                            I understand that in 1970, lexing and parsing may have been a significant part of the cost of compilation. But back then, adding two numbers was a good excuse to run to lunch while the calculation was in progress. And compiles were done overnight.

                                                                            Nonetheless, these are fascinating subjects, and I still love working on lexers and parsers. And optimizing them. And reading papers about them. So obviously I’m a flaming hypocrite.

                                                                            (You can safely ignore this message. I’m tired.)

                                                                            Also, this is slightly out of date (the bnf has been updated a few times since this was last updated), but this is the Ecstasy lexer, written in Ecstasy: https://github.com/xtclang/xvm/blob/master/lib_ecstasy/src/main/x/ecstasy/lang/src/Lexer.x

                                                                            1. 5

                                                                              they get 99% of the research papers

                                                                              We must be reading different papers.

                                                                              1. 3

                                                                                99% of the teaching material people actually read, maybe?

                                                                                1. 2

                                                                                  I’ve been known to exaggerate 473% of the time.

                                                                                2. 3

                                                                                  Compilers aren’t the only thing that process source code! The bar has been raised in the last 10 years, which is probably why people talk about parsers more.

                                                                                  clang-format is huge, and now every language needs something like it. Tools like ASAN raised the bar and rely on the same location info. Debuggers use location info, etc.

                                                                                  JS processors like WebPack are absolutely bottlenecked on parsing (hence esbuild, etc.).

                                                                                  Also, languages have grown more elaborate syntax over the years.


                                                                                  Does Ecstasy have an auto-formatter, linter, and debugger support with accurate location info? What about a language server?

                                                                                  If not, people probably won’t use it. That functionality depends on a well-architected parser. It’s also a lot of work.

                                                                                  Most real languages also have at least 2 parsers. It would be nice if you could just write 1, but that seems to be an open problem.


                                                                                  (On the other hand, I kind of agree that it’s “obvious” why lexers and parsers should be separate, for essentially any language, and maybe weird to have a huge discussion about it. There are a bazillion production codebases out there that show you that. But there does seem to be some genuine confusion – there were production codebases on my wiki page that did otherwise, and then they split it up, and it was faster. I wrote the page because it wasn’t theoretical.)

                                                                                  1. 1

                                                                                    Does Ecstasy have an auto-formatter, linter, and debugger support with accurate location info? What about a language server?

                                                                                    Debugger is just a prototype (not supporting existing debuggers via dwarf or something, just our own debugger). But yes, accurate location info, type and variable info, break points, frames, evals, whatever, it’s in there.

                                                                                    We had a prototype of an IntelliJ IDEA plug-in, but gave up on that route; instead, we have a language server project now under way (not yet ready). But I’ve built custom language editors in the past with syntax highlighting etc. (previous life, Windows 4gl stuff; don’t ask, won’t tell).

                                                                                    On the other hand, I kind of agree that it’s “obvious” why lexers and parsers should be separate, for essentially any language, and maybe weird to have a huge discussion about it.

                                                                                    Ironically, this morning I learned a bunch about parser combinators from Jamie Willis. So now I have an entire new view on the possibility of not splitting a lexer out separately 🤣

                                                                            2. 1

                                                                              Actually, that’s not the only reason. Most parsers are constrained by the requirement to be context-free, mainly for performance reasons. So the fact that most lexers use regular expressions is very convenient, because they are capable of several things that parsers can’t do, such as back-reference (\1 etc.), and arbitrary lookahead/lookbehind.

                                                                              1. 3

                                                                                Backreferences make regexes non-regular, so the usual algorithms for reducing a lexer to a DFA do not work. I can’t remember off the top of my head if the same is true for lookbehind and lookahead. (I guess lookbehind is analogous to a backref; lookahead seems to be benign enough.) For example, flex and re2c have lookahead, but not lookbehind nor backrefs.

                                                                            3. 1

                                                                              The only reason to separate lexing and parsing into two distinct phases, it because it’s easier to write. Otherwise, doing the lexing “on-demand” is always better

                                                                              Wait a minute, the two aren’t mutually exclusive. Especially if what matters to you is the ease of writing: typically lexer.next_token() would be defined separately from your parser. Heck, it’s this very separation that allows your parser from switching lexers without causing your source code to be a complete mess: one file per lexer, another file for the parser, all neatly separated.

                                                                              It’s not a runtime separation of phases (lexing and parsing would actually be interleaved), but that hardly hurts source code clarity.

                                                                              1. 3

                                                                                No argument here. When I said two distinct phases I meant doing all lexing first, and then doing parsing.

                                                                                I agree that they can and should live in different components.

                                                                                Sorry if I was unclear in my phrasing.

                                                                            1. 2

                                                                              I have tried using lpeg (the Lua library for parsing expression grammars) for a complicated parsing task. The main thing that caused me trouble was how to handle whitespace in a way that is systematic without polluting the higher-level grammar, eg. space term space addop space term. I made much better progress with a traditional lexer/parser split. (The complicated task was a revamp of unifdef which has to deal with C’s early translation phases, which are lexically very tangled.)

                                                                              Also, wrt the discussion of scannerless parsers, PEGs have a way to express reject rules, and their ordered choice operator is a bit more systematic than lex’s disambiguation rule. On the other hand I don’t know if there are good linters for PEGs like there are for classic context free grammars.

                                                                              1. 2

                                                                                Yes, I get tripped up on PEGs for the same reason: what are distinct phases in my mind get complected together, and I have to remember that. It is really handy to have all the parsing rules in one place, but it more complicated. Really depends on the task at hand.

                                                                                I think the traditional way to handle what you’re talking about is to settle on a convention of consuming excess whitespace either immediately before or after reading in a lexical entity.

                                                                                1. 1

                                                                                  PEG libraries that I’ve used have an explicit token wrapper that means ‘match this with no whitepace’ and then they permit whitepace (or white space and things that match the comment rule, if you’re discarding comments) in any other matches. Does lpeg not have some equivalent?

                                                                                  1. 1

                                                                                    It isn’t that hard to write your own :-) TBH a large part of my problem was to do with the complexity of whitespace in the C preprocessor, and the requirement to keep it intact as much as possible. The fenceposts start to matter a lot, such as which tokens own leading or trailing space (and comments) and how space attaches to brackets.