1. 19
    1. 4

      Very cool writeup, Ntietz. I think as more and more applications diverge from the old-school request -> DB-work -> render-output webapp model, we’ll find ourselves “breaking the rules” more often.

      This type of architecture makes me happy – Erlang/Elixir programs can often capitalize on this pattern (see, for example, caching user-local data in a Phoenix Channel for the duration of a socket’s existence).

      1. 1

        Elixir and the BEAM definitely make this easy to do and can be used to great effect. I’m really excited to see what comes about with Phoenix LiveView (and the similar projects in other languages) leveraging connection state and lots of backend processing.

    2. 3

      In the spirit of breaking rules and going pretty far with just one machine, I wonder if a single machine that locally ran PostgreSQL and used an in-memory cache directly in the monolith would be even better. Sure, take periodic off-site backups, but a single bare-metal box from a provider like OVH can have pretty good uptime.

      1. 3

        A single machine can definitely take you very far. The biggest instance (pun intended) I know of doing this is Lichess, which runs on one rather beefy machine, but I am sure there are others that are bigger or at least as well known.

        Unfortunately, that particular bet wasn’t one I could make for us ;)

    3. 2

      > The memory space or filesystem of the process can be used as a brief, single-transaction cache. For example, downloading a large file, operating on it, and storing the results of the operation in the database. The twelve-factor app never assumes that anything cached in memory or on disk will be available on a future request or job[.]

      > When a participant starts responding to a message, they open a WebSocket connection to the server, which then holds their exercises in the connection handler. These get written out in the background to BigTable so that if the connection dies and the client reconnects to a different instance, that new instance can read their previous writes to fill up the initial local cache and maintain consistency.

      Sounds like they are still following the rules by not relying on the state hehe

      I’m not surprised they had to. You can certainly run stateful things in Kubernetes, but the ease with which you can roll out new versions of containers means restarts are common. And even when running multiple replicas, restarts still kill open connections (terminationGracePeriodSeconds can help but still has limits).

      1. 3

        Well, you’re right, we’re kind of in the middle: we rely on per-connection state, but we don’t rely on it existing for a long time after the connection. We wanted to go there, too, but sticky routing was unfortunately not feasible for us.

    4. 2

      The Google Slicer paper (pdf) is a good read. I believe that many applications benefit greatly from an above-database stateful layer, especially at scale where hot rows and hot entity groups become a real concern, or when you find yourself doing things like polling a database for completion status.

      Stateful services aren’t right for every use, but when used well they greatly simplify your architecture and/or unlock really compelling use-cases.

      1. 1

        Oooh this looks great, adding it to my paper reading list.

    5. 2

      I wouldn’t call these stateful services. I’d call them services with built-in caching.

      So what if the word cache is different than the word webserver? They can be in the same stateless microservice! You don’t lose any data if one of these “stateful services” falls over and never gets back up, so it’s not really a traditional stateful service.

      You just remember not to go all the way to the techbro “5 lines of code per service” side. But nobody actually gets there, even if they preach it.

    6. 1

      This is a nice piece of work, and clarifies something that people forget about stateless services: the service can have state that requires warmup, just not authoritative state. If you’re using Hack or the JVM, your stateless service already has warmup from the JIT. Having a local read cache is a similar case. If you lose the host, a new host will have worse performance for users for some time until its cache is warm.

      I’d be curious to see a comparison of this approach for them vs trying VoltDB.

      1. 1

        I would also be curious to see usage of VoltDB compared with other options!
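
For readers skimming the thread, here is a rough sketch of the pattern several comments describe: per-connection state held in memory, written through to a durable store, and rebuilt from that store when a client reconnects to a different instance. It is only an illustration under stated assumptions, not the article's actual implementation. `DurableStore` is a hypothetical in-memory stand-in for the backing store (BigTable in the article), and `ConnectionHandler`, `update`, and `flush` are made-up names.

```python
class DurableStore:
    """Hypothetical stand-in for the durable backing store."""

    def __init__(self):
        self._rows = {}

    def write(self, participant_id, key, value):
        self._rows.setdefault(participant_id, {})[key] = value

    def read_all(self, participant_id):
        # Return a copy so callers cannot mutate the stored rows directly.
        return dict(self._rows.get(participant_id, {}))


class ConnectionHandler:
    """Holds one participant's state for the lifetime of one connection."""

    def __init__(self, participant_id, store):
        self.participant_id = participant_id
        self.store = store
        # Warm the local cache from previous writes (empty on first connect).
        self.cache = store.read_all(participant_id)
        self.pending = []  # writes not yet flushed to the store

    def update(self, key, value):
        # During the connection, reads and writes touch local memory only.
        self.cache[key] = value
        self.pending.append((key, value))

    def flush(self):
        # Assumption: a real service would run this in the background,
        # off the request path. Here it is called explicitly.
        for key, value in self.pending:
            self.store.write(self.participant_id, key, value)
        self.pending.clear()


store = DurableStore()

first = ConnectionHandler("p1", store)
first.update("exercise-1", "draft answer")
first.flush()

# The connection drops and the client reconnects to a different instance,
# which rebuilds its local cache from the earlier writes.
second = ConnectionHandler("p1", store)
assert second.cache == {"exercise-1": "draft answer"}
```

In this sketch, anything still sitting in `pending` when a connection dies is lost, which is why the in-memory copy behaves like a cache that needs warmup and not like authoritative state.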