1. 32
    1. 5

      the biggest example of this that i can think of is google’s “gcl” or “borgcfg” (two names for the same config language)

      it’s the language with the most lines of code at google, by a wide margin (though, much of that is probably generated/templated, not artisanally crafted). it’s flexible, concise, fairly intuitive to use, and it’s used everywhere pretty reliably

      it’s also a lazily-evaluated language with dynamic scope and inheritance. i don’t think anybody knows a concise description of how it resolves symbols, including the team that owns it. it’s usually described by example in documentation, and forensically in postmortems

      1. 2

        It does have a terrible reputation, although honestly I thought it was pretty OK?

        To me it’s basically JSON-like types (which eventually encode as protobufs) + lambda functions + some kind of template inheritance. Which is not that big of a language.

        One of my coworkers wrote thousands of lines from scratch … kind of as a “real programming language”, not cut and paste.

        And I took a cue from that and really learned the language (>10 years ago), and I don’t think it’s that bad.

        Once you understand what it evaluates to, the real hard part is understanding how Borg actually interprets that, and I think that’s what leads to the outages … But it’s true that the syntax is too uniform, and if you don’t understand the evaluation, then that can be 2 surprises on top of each other.

        Funny thing is that it seems very close to the Nix language (which I haven’t really used, but have read about)

        I think the big difference is that GCL is eager while Nix is lazy (?)

        Otherwise they are both expression languages for computing dicts and arrays, basically

        I think they both share the “dict composition” operator, which is basically like Python 3’s {**mydict, **otherdict}
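
        A minimal Python sketch of that composition, for readers who have not used it. The keys and values are made up for illustration; nothing here is taken from GCL or Nix themselves.

```python
# Right-biased merge: when both dicts define a key, the right-hand one wins.
# All keys and values are hypothetical, chosen only for illustration.
base = {"cpu": 1, "ram_gb": 4, "flags": ["--verbose"]}
override = {"ram_gb": 16, "replicas": 3}

merged = {**base, **override}

# The merge is shallow: a nested dict is replaced wholesale, not merged.
nested_base = {"resources": {"cpu": 1, "ram_gb": 4}}
nested_override = {"resources": {"ram_gb": 16}}
nested_merged = {**nested_base, **nested_override}
```

        Note that the inputs are left untouched, and that the nested "cpu" key is lost in the second merge because only the top level is combined.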

        1. 2

          i also actually like gcl. it’s really straightforward, intuitive, and readable like 98% of the time

          i’m not sure about nix, but gcl does use lazy evaluation. the “alien artifact” aspect of it comes up when you mix the dynamic scope, laziness, and inheritance (they call the composition thing inheritance). it can be basically impossible to figure out what a variable’s value is going to be without running it.

          i had lunch with someone on the gcl team when i worked there. the team does plenty of work on the parser, and they add functions and features. but, he told me that they basically can’t touch some core parts of the runtime, because it’s a ~12 year old tangled mess written by one person. they can’t rewrite it, because they aren’t entirely sure what it does in the edge cases, and so much depends on it that they’re almost guaranteed to break something if they touch it

          (also, fun fact: lambdas are syntactic sugar for objects, and calls are sugar for inheritance. lambda(x) { return x + 1} becomes something like { x = args[0]; return = x + 1 }, and foo(2) becomes something like (foo { args = [2] }.return). they’re entirely in the parser, and don’t make it to the evaluation of the language)
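
          A toy Python model of the desugaring described above, and of why mixing laziness with inheritance makes values hard to predict. This is a sketch under assumptions, not GCL's actual semantics: `Obj`, `extend`, and `get` are hypothetical names, and late binding through `self` only stands in for GCL's real scoping rules.

```python
class Obj:
    """Toy object: each field is a thunk that receives the final object."""

    def __init__(self, fields):
        self.fields = dict(fields)

    def extend(self, overrides):
        # "Inheritance": a new object with the parent's fields plus overrides.
        return Obj({**self.fields, **overrides})

    def get(self, name):
        # Lazy: a field is evaluated only when read, against *this* object.
        return self.fields[name](self)


# lambda(x) { return x + 1 } modeled as an object with 'x' and 'return' fields.
inc = Obj({
    "x": lambda self: self.get("args")[0],
    "return": lambda self: self.get("x") + 1,
})

# inc(2) modeled as: inherit from inc, supply args, then read 'return'.
result = inc.extend({"args": lambda self: [2]}).get("return")

# The same mechanism lets a child change what a parent's field evaluates to
# without touching that field.
base = Obj({
    "port": lambda self: 8080,
    "url": lambda self: f"http://localhost:{self.get('port')}",
})
child = base.extend({"port": lambda self: 9090})
```

          Here `result` is 3, and `child.get("url")` differs from `base.get("url")` even though `child` never redefines `url`. With many layers of overrides, that is the "can't tell without running it" effect.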

    2. 4

      Some exercises for the reader:

      1. Are alien artefacts truly an anti-pattern, or are they a natural artifact of the engineering process?
      2. On what timescale is it worth it to derisk an artefact? Are you likely to benefit your career by doing so? If you aren’t, what does that tell us about the maintenance incentives for a team employing artefacts?
      3. To what extent do techniques like microservices and containers enable/bias towards the creation of alien artefacts? What does that tell us about the long-term sustainability of those techniques?
    3. 4

      My friend pointed me to a post which in all likelihood is the prior art I alluded to in the first footnote, and I have updated it to point to Jon Eaves’ Building Alien Artifacts as a result. Part of me is slightly sad that I didn’t in fact come up with the term, but another part of me is really happy that writing about it allowed me to rediscover the likely origins of this term I’ve been using for years.

    4. 4

      I love the way you call them, “alien artefacts”.

      In my experience so far, their fate is invariably the same: they get rewritten as soon as they become a nuisance to the business, because you cannot evolve them past a given point.

      In a minority of cases, I’ve seen them isolated, and their functionality augmented with modern code, in a Frankensteinian turn of events.

      1. 6

        I’d argue that anything that can be rewritten easily isn’t an “alien artifact” … the interesting cases are the ones that last decades because they are too useful to give up (produce too much revenue etc.), and too hard / risky to rewrite, or the organization lacks the expertise to rewrite

        1. 2

          I agree. I didn’t put it explicitly in the post, but one of the things I was alluding to with the name “alien artefact” is that they are rare. In my 20 years in this business I’ve only encountered a handful I would classify as such, and only at two of the six jobs I’ve held in that time.

          1. 3

            I’ve seen quite a few in the VFX industry, usually an arcane bit of Perl that’s both been code golfed to hell and is doing something arcane and inscrutable to begin with… if they didn’t have a commit history — always a single event, showing up fully formed from out of the ether — attached to the name of a long-retired but known human grey beard, I’d think they were from Ceti Alpha V.

            1. 1

              Funny you should mention that. I used to work at a feature animation studio with a coworker who had previously worked at a competing studio. He once told me how they’d basically lost the institutional knowledge for how to make any significant changes to their humanoid rigging system. That led to a period in which many of their shows had a very specific look to their humanoid animations. That was the first thing I thought of on seeing this.

              (To be fair, I think they must have either figured it out or replaced it, judging by more recent shows.)

    5. 3

      This reminds me of SKS Keyserver Network Under Attack, in particular

      The software is Byzantine. The standard keyserver software is called SKS, for “Synchronizing Key Server”. A bright fellow named Yaron Minsky devised a brilliant algorithm that could do reconciliations very quickly. It became the keystone of his Ph.D thesis, and he wrote SKS originally as a proof of concept of his idea. It’s written in an unusual programming language called OCaml, and in a fairly idiosyncratic dialect of it at that. This is of course no problem for a proof of concept meant to support a Ph.D thesis, but for software that’s deployed in the field it makes maintenance quite difficult. Not only do we need to be bright enough to understand an algorithm that’s literally someone’s Ph.D thesis, but we need expertise in obscure programming languages and strange programming customs.

    6. 1

      Supposedly the Windows NTFS driver is an example of this; the person who wrote it left and it’s gnarly enough that nobody dares change it anymore. Hence you don’t see features added to NTFS nowadays, in favour of making new filesystems like ReFS.

      1. 1

        While it’s possible this is the case, I’m not convinced MSFT would allow such an important component to have such a big bus factor. NTFS is getting long in the tooth anyway, so implementing a new filesystem would make sense in any case (to account for more computing taking place “in the cloud” as opposed to on physical hardware, for example)

    7. 1

      Any examples? I don’t think I’ve dealt with such artefacts, though I do try to avoid alienating future maintainers when I have to write something clever.

      1. 12

        The ultimate example might be TeX, which is one of the most popular typesetting systems in the world, and is also written by a reclusive professor in a preprocessor that generates a home-grown dialect of Pascal, which must be translated to C by a Perl program originally written in the 80s.

        More commonly, consider a tightly-optimized chunk of numeric processing in Fortran. It’s too portable to justify rewriting for new architectures, too friendly to compiler optimizations to justify rewriting for better SIMD, and there is nobody in the world who looks forward to the idea of studying five thousand lines (in any language) of dense physics simulation written by a PhD.

        1. 5

          TeX is a great example! It even has the accretion disc of adapters: LaTeX, dvips, ps2pdf—to name a few.

        2. 4

          Thanks! I was considering something like LAPACK as it can be the only Fortran in some high performance programs, used by people who know no Fortran, but that’s in large part because it just has a routine for everything already and those devs don’t need to modify it. TeX is aggressively hyper documented (as I performatively pull TeX: The Program off my bookshelf), even if it’s also hyper idiosyncratic.

          And in both of these cases, though possibly not your example of some one-off Fortran, these are public and there are still a few people around who wrote or understand them. The identifying feature of the alien artefact is, I think, that there’s no one to explain it, and it won’t speak for itself. Like the techniques excavated from archives of long Ascended civilizations in A Fire Upon the Deep, useful, but might as well be magic.

      2. 9

        If we take the definition as:

        • highly useful
        • complex
        • written by smart people that are no longer at the company
        • and therefore last a long time

        Then these examples come to mind (and I do think it’s a fun/interesting topic):

        • I remember when I worked at EA there was some lore about the game logic for a driving game, probably NASCAR. These are the kinds of games that sell millions of copies, and they grow 10,000 line functions to compute game mechanics people find “fun”. Games have some amounts of irreducible complexity, built up over time, and you have to put it somewhere … Nowadays the best practice is to make it more data-driven, and have gameplay engineers develop and test the logic. But back in those days it was very common for programmers to just write huge piles of C that did it. If the game sells year-over-year, then you probably have to drag along that code.

        • I met a guy who went to work on the Excel team ~10 years ago, after Excel itself was already 30 years old. He said the core computation engine (when to update a cell) had hand-coded assembly on every platform, basically because it’s both expensive and in the inner loop of every Excel user’s core workflow. It was technically wrong in both directions – it would fail to update cells that were dirty, and it would recompute cells that were clean. But you can’t change this because millions of spreadsheets in the wild rely on it. If you recompute more cells, it will be too slow for some monster spreadsheet. I think there were also huge numbers of interactions with every Excel feature – it’s NOT a clean incremental computation engine.

        • There is some lore about Google search ranking. There were huge functions in C++ written by some very smart people which happened to score highly on a lot of metrics. I remember Peter Norvig (former director of search quality, former director of research) said he couldn’t understand a lot of it. Contrary to popular belief, Google beat all its competitors in the early 2000’s without using AI. In those days, AI “didn’t work” for such problems, and like AI today, data quality is the competitive edge. It wasn’t really PageRank either – that’s just one signal and it became less useful over time from what I understand.

        • Speaking of AI, my knowledge is all old, but my impression is that the code for training is actually pretty small and tight, but the data is where all the secret sauce is .. So you probably have big curated pipelines of data cleaning (e.g. of Reddit posts) at OpenAI somewhere, which they don’t talk about. Although much like Google there is also the hidden work of human raters / human trainers.
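
        For contrast with the Excel engine described in the list above, here is what a “clean” incremental recompute looks like as a toy Python sketch. It assumes no cyclic references, the `Sheet` class and cell names are hypothetical, and it says nothing about how Excel is actually implemented.

```python
class Sheet:
    """Toy spreadsheet: a cell is a constant or a formula over other cells."""

    def __init__(self):
        self.formulas = {}    # cell name -> (function, list of input names)
        self.values = {}      # cell name -> cached value
        self.dependents = {}  # cell name -> set of cells that read it
        self.dirty = set()
        self.evaluations = 0  # counts formula runs, to show what was skipped

    def set_value(self, name, value):
        self.values[name] = value
        self._mark_dependents_dirty(name)

    def set_formula(self, name, func, inputs):
        self.formulas[name] = (func, inputs)
        for dep in inputs:
            self.dependents.setdefault(dep, set()).add(name)
        self.dirty.add(name)
        self._mark_dependents_dirty(name)

    def _mark_dependents_dirty(self, name):
        # Transitively mark everything downstream of `name` as dirty.
        stack = list(self.dependents.get(name, ()))
        while stack:
            cell = stack.pop()
            if cell not in self.dirty:
                self.dirty.add(cell)
                stack.extend(self.dependents.get(cell, ()))

    def get(self, name):
        # Recompute only if dirty; clean cells are served from the cache.
        if name in self.dirty:
            func, inputs = self.formulas[name]
            self.values[name] = func(*(self.get(dep) for dep in inputs))
            self.dirty.discard(name)
            self.evaluations += 1
        return self.values[name]


sheet = Sheet()
sheet.set_value("A1", 2)
sheet.set_value("A2", 3)
sheet.set_formula("B1", lambda a, b: a + b, ["A1", "A2"])
sheet.set_formula("C1", lambda b: b * 10, ["B1"])
sheet.set_formula("D1", lambda a: a + 100, ["A2"])

first = sheet.get("C1")
sheet.get("D1")
runs_before = sheet.evaluations

sheet.set_value("A1", 5)  # dirties B1 and C1, but not D1
second = sheet.get("C1")
sheet.get("D1")           # served from cache, no formula run
runs_after = sheet.evaluations
```

        After the edit to A1, only B1 and C1 are re-evaluated. An engine that deviates from this in either direction, as the Excel anecdote describes, ends up with behaviour that existing spreadsheets come to depend on.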

        I think TeX and Fortran linear algebra are great examples as well. Just because they’ve lasted decades.

        I remember a boss who was surprised that basically all core R libraries use Fortran (and I think a lot of NumPy does too). He thought Fortran was a dead language, but no, it’s still there because all those linear algebra algorithms have properties it takes an applied math professor to understand, not just a programmer.

        A fairly interesting project at Google in 2010’s was improving/developing a HIGHLY templatized C++ linear algebra library for use in machine learning, because the state of the art at the time was Fortran:


        If you look at the code, it doesn’t look like normal C++ at all. But supposedly it is clean and modular compared to the Fortran equivalents.

        For a possibly negative example I think LuaJIT is pretty “alien” in many ways, and written by a very smart person. It’s still used, but I think it’s not as popular because it’s hard to evolve or adapt to your needs. People have other options and they use those. So it may not be one of the “alien artifacts”.

        1. 3

          A couple more open source examples:

          (1) Vim


          How can the community ensure that the Vim project succeeds for the foreseeable future?

          Keep me alive.

          I haven’t looked at the Vim source code, but my impression is that there is like a 10,000 line main loop state machine that handles all the keypresses, and it’s almost all written by its creator.

          And it’s burned into the fingers of millions of users.

          Though Neovim has successfully extended and maintained Vim.

          (2) GNU Bash and GNU readline.

          Another highly useful, widely deployed program burned into people’s brains that has been maintained by one (very productive) person for ~30 years.

          As has been mentioned a few times, there are no change logs – the maintainer doesn’t seem to use VCS, and just dumps private changes to git with a script.

          There have been patches accepted over time, but it seems like many of them are not well integrated or well functioning (e.g. unicode in bash).

          The 2014 ShellShock vulnerability famously existed for >20 years before it was discovered by a bash expert “just thinking about it”.

          As another example, after nearly 7 years of https://oilshell.org/ , someone reported a bug that

          [[ -v a[key] ]]

          tests if a key exists in an associative array. Also [ -v a[key] ] does that.

          This doesn’t appear to be documented anywhere in the bash manual, or the help builtin, under either [ or [[. But some people know about it and I guess rely on it …


          1. 4

            I have read (parts of) the GNU readline code, and while it may not be great, I certainly wouldn’t classify it as an “alien artifact”. It’s not that difficult to understand, if you’re familiar with the problem domain. But more importantly, GNU readline is extremely accessible – that is, it’s easy to poke at it in a testing environment and explore how it works, and the effects of changes on the code. Part of what makes an “alien artefact”, I would say, is the difficulty of exploring its interior workings. (Typically this happens because it’s embedded in a real-world business environment that lacks a testing framework or even a clear model of its requirements. There are many examples of these that you and I don’t know about, because they’re just some gnarly set of programs that keeps a part of some business running – until one day it doesn’t).

            1. 1

              Yeah it’s not that big (my memory is 30K-60K lines), but how many people have modified it outside the authors? (honest question)

              If anyone understands GNU readline and the problem domain, then it would be a nice contribution to implement the bind bash builtin in https://www.oilshell.org/

              I’m not sure who uses it, but I think the fzf integration does

              I don’t use either fzf or bind, but I know there are people who do, so that would be cool

      3. 4

        Linkers. Everyone depends on them, they’ve been around forever, there’s decent docs and even a big famous book written on the topic of linkers and loaders. Still, the major linkers in use today have been more or less unchanged for decades and hardly anybody dares touch them. When people encounter a limitation imposed by the linker, they never change the linker and always work around it - in the compiler, the build system, or elsewhere up the stack.

        Like for most alien artifacts, attempts to really change anything about linkers are usually complete rewrites, like LLD and mold.

        1. 1

          I remember when the original (?) GNU linker was rewritten as gold quite a while ago:

          And then documented quite well: https://lwn.net/Articles/276782/

          It’s been a while but not quite “decades”. Isn’t that what everyone uses now, e.g. on Debian / Arch / Nix? I’m not sure but I think so.

          And mold is very clean, modern code.

          It’s definitely a niche subject, but it seems like people besides the original authors definitely understand the code

          1. 1

            Alright, 15 years may not technically be “decades” but it predates my entire programming life, so it’s prehistoric to me.

      4. 3

        I’ve only come across them in medium to large companies that have been around for a while. One of the defining characteristics is that the original authors are not around any more, and they were probably never subject to many eyeballs. So I think it’s unlikely you’ll find many in the wild—certainly I can’t think of any.

        1. 1

          Makes sense, but that makes it all the more helpful to use an example as well as the general rule, since most people won’t have one that comes to mind.

          1. 6

            Having put a bit more thought into it, I imagine the A+ language created by Arthur Whitney while at Morgan Stanley (MS) qualifies. I remember looking at the sources, and they certainly felt very alien to me. I found something that claims to be a stripped down version (the A in A+), and “esoteric techniques” doesn’t even begin to cover it IMO. Take a look at arthur.h for example.

            I heard of A+ when I was working at Morgan Stanley over a decade ago, as part of the “origin story” for why we were using KDB and K or Q, Whitney’s commercial offerings. I don’t think those qualify as alien artefacts at MS because they would have had commercial support - and presumably Arthur is still around to support those at Kx systems.

            1. 6

              “We switched to KDB and K/Q because we needed a more accessible platform” was not on my bingo card for today. Thank you friend. :)