1. 21
  1. 7

    Hm interesting, but this architecture assumes that the whole system can be brought down and brought up atomically? (atomic upgrade)

    I mention that issue in my blog post yesterday: https://www.oilshell.org/blog/2021/07/blog-backlog-1.html#fallacies

    Fallacy: You Can Extend Your Type System Across the Network. Large distributed systems can’t be atomically upgraded. (Example: the entire Internet, although it’s also true at much smaller scales.) This conflicts with type safety as a global property of a program.

    And I go on to link this comment: https://old.reddit.com/r/ProgrammingLanguages/comments/nqm6rf/on_the_merits_of_low_hanging_fruit/h0cqvuy/

    Reading between the lines (with orderd), you’re working at a trading firm. I can see why they can take down their systems. One reason might be that some markets actually close; another is that it’s not a system with users outside the company. Or it could just be like my bank where their website goes down on Saturday nights :) (annoying)

    I’m thinking of stuff more like Google or the whole Internet – you can’t just take the whole thing down and restart it :) It’s up forever and has to be incrementally upgraded.

    So it would be nice to make this distinction in the article. I would say that only small distributed systems can be atomically upgraded. To be clear, this has a huge effect, and if you can take advantage of it, you should! I am not sure I can use it, but it’s interesting.


    FWIW I think the middle of the article was very lucid and clear, with a good example. However the intro was difficult to read and understand. I had to read it 2 or 3 times to figure out what you’re talking about. It’s a bit abstract, uses some vague non-standard terms (single conventional program, single-program system), and the style is a bit stilted. If you want to reach more people, I would suggest writing like you talk.

    I hope that is helpful; I say that because I noticed several people including myself had problems reading an earlier post: https://lobste.rs/s/glu4bw/write_code_not_compilers

    (And this is coming from somebody who has worked with all these things: Python, MyPy, C++, Unix processes, distributed systems. I’m used to reading jargon, but jargon can be put in simple / natural prose, which builds up crisp definitions, etc.)

    1. 4

      but this architecture assumes that the whole system can be brought down and brought up atomically? (atomic upgrade)

      No - there are many approaches here, like hot-code-reload of modules in the Python program, but one very simple approach is to just restart the main program and keep the processes running, and have the new version of the program resume monitoring all the child processes (through some Linux magic) and then start gradually killing and upgrading them.
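
      A minimal sketch of the gradual kill-and-upgrade step, assuming each node is a local child process started with `subprocess`. The names `start_worker`, `rolling_upgrade` and `stop_all` are hypothetical, and the sketch does not cover how a restarted main program re-attaches to children that are already running.

```python
import subprocess
import sys


def start_worker(version: str) -> subprocess.Popen:
    # Hypothetical worker: a child Python process that just sleeps.
    # The version string is passed as an argument so it shows up in argv.
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)", version]
    )


def rolling_upgrade(workers: dict, new_version: str) -> None:
    """Replace outdated workers one at a time; the others keep running."""
    for name in list(workers):
        version, proc = workers[name]
        if version == new_version:
            continue
        proc.terminate()  # ask the old worker to exit (SIGTERM on POSIX)
        proc.wait()       # reap it before starting the replacement
        workers[name] = (new_version, start_worker(new_version))


def stop_all(workers: dict) -> None:
    """Terminate and reap every worker."""
    for _, proc in workers.values():
        proc.terminate()
        proc.wait()
```

      Only one worker is down at any moment, so the system as a whole stays up while versions are mixed. A real deployment would presumably also wait for each replacement to report healthy before moving on.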

      Edit: To paraphrase your broader point though, you’re saying that this only works for systems where one entity ultimately controls the entire system, so for example this wouldn’t work for the internet at large (but it would work for Google). Yes, that’s true. Two points on that: 1. I think most systems have one entity ultimately controlling them, and 2. This is a general pattern in many places, where there’s a larger “decentralized” environment containing interacting centrally-controlled individual entities: Economies contain firms, for example. I (probably) wouldn’t want to centrally plan the entire economy, but centrally planning a firm inside the economy is very beneficial - it’s Coase’s theory of the firm. For the same reasons, my system might interact with other systems I don’t control, but I’d still want to be able to centrally control my system - and it’s practical to do so.

      However the intro was difficult to read and understand.

      Thanks for the feedback, I’ll work on it, for sure.

      1. 1

        So are you actually doing hot reloading of Python code at your job? I would be surprised because that’s a very limited mechanism and not well defined from what I remember (e.g. reloading native extensions, etc.).

        I think what you’re describing is interesting, but you’re overstating the applicability of it, e.g. in the first sentence:

        one can write a distributed system as a single conventional program

        This is too grand a claim. That’s why it’s important to be concrete rather than abstract in the intro. If you say you work at a trading firm, that’s going to give me a very different picture than if you say you worked on “web services” (Facebook, google, etc.) or games.

        All 3 of those domains use distributed systems in Python and C++, but they have very different architectures, based on the different nature of the problems.

        Are you talking about stateless subprocesses/services with all state in the central Python process that’s upgraded? I can see how that can be a very useful and simple pattern, but there are many other architectures of distributed systems. All of these have different implications for upgrade.

        • Stateless processes and a consistent database (you still have to deal with schema changes)
        • Stateless processes and an inconsistent distributed database
        • Ephemeral state in a single master, stateless or stateful workers
        • Stateful workers with immutable / append-only state
        • Stateful workers with mutable state (a hard one, lots of tradeoffs to make here)

        I’m about to link this survey in a sibling comment, so maybe that will give some color on why I’m talking about the upgrade problem:

        http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.9679&rep=rep1&type=pdf

        1. 3

          @washort and I cracked the problem of Python code reloading a decade ago; they created Exocet and I refined that into bravo.plugin. We used PEP 302 import hooks to customize how the Python interpreter imports modules. I used this in the Bravo Minecraft server to hot-load core functionality implemented as plugins. You’re entirely right that native extensions are not supported, and additionally Bravo configured Exocet to disallow plugins from importing threading, asyncore, etc.

          The code has rotted a little, and importlib is now standard, so a Python 3 version would be simpler and mostly concerned with the zope.interface API and topological sorting. Ultimately, though, this is part of why we created Monte; we needed better treatment of modules as immutable functions, or even just as plain old data objects.
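
          A minimal sketch of the “modules as plain data objects” idea, assuming plugins arrive as source strings. This is not how Exocet or bravo.plugin work internally: it skips import hooks and `sys.modules` entirely, and `PLUGINS` and `load_plugin` are hypothetical names.

```python
import types

# Hypothetical registry mapping plugin names to their current module object.
PLUGINS: dict = {}


def load_plugin(name: str, source: str) -> types.ModuleType:
    """Build a fresh module object from source and swap it into the registry."""
    module = types.ModuleType(name)
    # Run the source in the new module's namespace. Any previously loaded
    # module object is left untouched, so old references keep the old code.
    exec(compile(source, f"<plugin {name}>", "exec"), module.__dict__)
    PLUGINS[name] = module
    return module


old = load_plugin("greeter", "def greet():\n    return 'hello'\n")
new = load_plugin("greeter", "def greet():\n    return 'hello again'\n")
```

          Because each load produces a new module object instead of mutating the old one, callers choose when to pick up new code by looking the plugin up in the registry again.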

          1. 1

            So are you actually doing hot reloading of Python code at your job?

            Not necessarily, I just mentioned that as an example of how there’s many, many approaches to this problem. The simplest approach (which I can’t really say whether I’m using) is the one I outlined after that:

            one very simple approach is to just restart the main program and keep the processes running, and have the new version of the program resume monitoring all the child processes (through some Linux magic) and then start gradually killing and upgrading them.

            That’s a general and simple approach which requires no fancy technology and can be extended to work for any distributed system architecture.

            Are you talking about stateless subprocesses/services with all state in the central Python process that’s upgraded?

            No, not at all. There’s lots of state in the individual nodes. This approach is not at all tied to any particular distributed system architecture. Any of those architectures you listed could be managed just fine.

            1. 4

              OK well I’d like to read something about how you incrementally upgrade nodes in this architecture, where the state is, what the messaging formats are, which nodes message each other, etc. Without those details I’d still say you’re overstating the applicability of this pattern :)

        2. 4

          It’s fun to see that my slogan has come full circle; I started spreading this meme about type systems and networks years ago. Specifically, we cannot extend any value judgements across The Network. There’s two interlocking problems. First, we cannot enforce type judgements on serialized bytestrings. Second, we cannot assume control of Somebody Else’s Computer. This underlying logic is important because, for example, we cannot even enforce type judgements upon The Filesystem, since it has the same erasure to bytestrings!
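
          A minimal sketch of the first problem, assuming JSON as the wire format and a hypothetical `Order` record. The sender’s type annotations do not survive serialization, so the receiver has to re-check every judgement by hand.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Order:  # hypothetical message type, for illustration only
    symbol: str
    quantity: int


def encode(order: Order) -> bytes:
    # After this call the type is gone: the result is just bytes.
    return json.dumps(asdict(order)).encode("utf-8")


def decode(payload: bytes) -> Order:
    """Re-establish the type judgement at the boundary."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    symbol, quantity = data.get("symbol"), data.get("quantity")
    if not isinstance(symbol, str):
        raise ValueError("symbol must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool):
        raise ValueError("quantity must be an integer")
    return Order(symbol, quantity)
```

          Nothing stops another program from sending `{"symbol": "ABC", "quantity": "3"}`; the check in `decode` is the only place that judgement gets enforced.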

          1. 4

            If you view the filesystem as a network where you send packets across time instead of space then it also has the Someone Else’s Computer problem. ;)

            I’m sort of only half joking about that.

            The past-you who wrote down those files is gone, so the future-you who receives them can only deal with what was sent, not what should have been sent. Future-you can’t update the types and have everything retroactively comply (in a useful way, at least).
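
            A minimal sketch of what future-you ends up doing instead, assuming JSON records with an optional `version` field. The record layout and the `migrate_v1_to_v2` step are hypothetical.

```python
import json


def migrate_v1_to_v2(record: dict) -> dict:
    # Hypothetical change: v2 splits "name" into "first" and "last".
    first, _, last = record["name"].partition(" ")
    return {"version": 2, "first": first, "last": last}


# One migration per old version, applied in sequence.
MIGRATIONS = {1: migrate_v1_to_v2}
CURRENT_VERSION = 2


def read_record(raw: bytes) -> dict:
    """Upgrade whatever past-you wrote, one version step at a time."""
    record = json.loads(raw)
    version = record.get("version", 1)  # assume untagged data is version 1
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version = record["version"]
    return record
```

            The old bytes on disk never change; the reader carries the knowledge of every format that was ever written.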

            1. 2

              Did you write something about this? Where?

              This is a very old problem … It’s well known in certain circles but somehow not appreciated among current programmers – just like some of the other concepts in that blog post (e.g. policy vs. mechanism). Programming is a field with a short memory.

              I experienced problems caused by distributed system upgrade by working on a variety of systems at Google for 10+ years.

              I think around 2009 I read work on this from Sameer Ajmani, like this survey from 2002. The first sentence says the earliest work on distributed systems upgrade is from 1983 (which sounds plausible to me; I imagine the distributed systems back then were small and unreliable by today’s standards):

              http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.9679&rep=rep1&type=pdf

              As far as I remember, he’s a prominent Go contributor; I was following the internal go-nuts@ mailing list before the initial release in ~2009, and probably saw his name there, and followed the links to his work. That’s not where I encountered the problems, but the framing is useful.

              I’d be interested in what you said about it, if there is some unique take that’s not already been expressed, etc.

              1. 2

                I mostly synthesized the slogan, “type systems don’t work across The Network,” while I was deep into Haskell. It was around the time that I wrote gemstone in 2013, not long before I left Haskell altogether. This slogan was designed to refute several common claims which Haskellers often made about correctness of distributed systems. (To be clear, Haskellers were not ignorant of the problems with libraries like cereal, but had apologetics for minimizing the apparent severity of the issues. Memes are powerful.)

                I was at Google immediately prior to that, though, and I recall being a Haskeller while at Google, so it’s quite possible that I adapted some Googler’s slogan or memes.

              2. 1

                This underlying logic is important because, for example, we cannot even enforce type judgements upon The Filesystem, since it has the same erasure to bytestrings!

                Just serialize to structured objects in memory that persist to disk. Not a new solution.

            2. 4

              @catern:

              The alternative is writing ten programs and a thousand config files for configuration management, service discovery, orchestration, etc. I believe the advantages of the single-program approach are self-explanatory.

              🤔

              Turtles all the way down:

              After a lecture on cosmology and the structure of the solar system, James was accosted by a little old lady.

              “Your theory that the sun is the centre of the solar system, and the earth is a ball which rotates around it has a very convincing ring to it, Mr. James, but it’s wrong. I’ve got a better theory,” said the little old lady.

              “And what is that, madam?” inquired James politely.

              “That we live on a crust of earth which is on the back of a giant turtle.”

              Not wishing to demolish this absurd little theory by bringing to bear the masses of scientific evidence he had at his command, James decided to gently dissuade his opponent by making her see some of the inadequacies of her position.

              “If your theory is correct, madam,” he asked, “what does this turtle stand on?”

              “You’re a very clever man, Mr. James, and that’s a very good question,” replied the little old lady, “but I have an answer to it. And it’s this: The first turtle stands on the back of a second, far larger, turtle, who stands directly under him.”

              “But what does this second turtle stand on?” persisted James patiently.

              To this, the little old lady crowed triumphantly,

              “It’s no use, Mr. James—it’s turtles all the way down.”

              1. 3

                Ah, while I enjoy the quote for sure, I’m not sure how it’s related? What’s the infinite regress here? Seems like a single step: many programs to one program, and then you’re done.

              2. 1

                I found this interesting (along with a lot of the related posts and others on the site) but it felt like it was being presented as an alternative to something, and I had a hard time figuring out what that something was. “DSLs” seems vague. I think I finally got the idea from noticing the punny file name “caternetes.html”. The DSLs being referred to are at least in part things like k8s configurations or terraform etc.? Is that the idea? Analogously to the post about writing code not configuration?

                1. 2

                  Yes, that’s right. I kind of wanted to avoid being explicit about that, because I’m worried that if I call out any specific technology, fans of it will immediately dismiss this idea. :)

                  But, your comment made me finally add a little (partial) list to the conclusion: http://catern.com/caternetes.html#conclusion where hopefully it will be less threatening.