1. 22
    1. 13

      I really like how they measured and defined if it was good enough:

      “Define your service level objectives (SLOs) and measure them via an observability system like Datadog. You usually want, at a minimum, an availability SLO (i.e., 99.9% of requests succeed) and a latency SLO (p99 latency is < 1s). Define your load objective. This is just the number of users you want to be able to support at a given time. If you’re launching a new product, ask marketing how much traffic they expect on launch day and double it. If there isn’t going to be a splashy launch, try to project out where you’ll be in, say, one year, and add a 10-20% buffer. Run a load test by spinning up a test cluster and writing some scripts to simulate real usage. Keep fixing bottlenecks and re-running load tests until you hit your objective.”
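
      As a rough illustration of the two SLOs in that quote, here is a minimal Python sketch that checks a batch of load-test results against them. The function names, the "status below 500 counts as success" rule, and the nearest-rank percentile are all assumptions made for this example; the article does not say how its SLOs were computed, and in practice an observability system would do this for you.

      ```python
      import math

      def availability(statuses):
          """Fraction of requests that succeeded (assumption: any status below 500)."""
          ok = sum(1 for s in statuses if s < 500)
          return ok / len(statuses)

      def percentile(values, pct):
          """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
          ordered = sorted(values)
          rank = max(1, math.ceil(pct * len(ordered) / 100))
          return ordered[rank - 1]

      def meets_slos(statuses, latencies_s, min_availability=0.999, max_p99_s=1.0):
          """True when both the availability SLO and the p99 latency SLO hold."""
          return (availability(statuses) >= min_availability
                  and percentile(latencies_s, 99) < max_p99_s)

      def load_objective(expected_peak_users, factor=2.0):
          """Launch-day rule of thumb from the quote: take the estimate and double it."""
          return math.ceil(expected_peak_users * factor)
      ```

      The load-test loop the quote describes would then be: run the test, call `meets_slos` on the collected samples, fix a bottleneck, repeat.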

    2. 6

      Kafka’s not a message queue, but ok. Message queues don’t suffer from head of line blocking.

      1. 1

        Yes, and what they were after is actually a persistent log and not a queue: “A logging system that aggregates structured and unstructured information about the pipelines and reports it to the user via a UI.”

    3. 4

      This is interesting, but I wish it went into a bit more depth on the actual implementation of publishers and subscribers. Are they using some off-the-shelf queue library that knows how to properly use Postgres as a queue, or did they roll their own?

      First, it mentions this change they needed to make during the load testing/validation phase:

      “We had three problematic queries that we addressed one by one. We were able to completely eliminate one of them (it was unnecessary), and the other two were fixed by reducing a polling interval and by adding additional columns to the index.”

      That seems to indicate that they were writing their own queries to use Postgres as a queue. I wish they’d share more details about the actual implementation and the Postgres-specific problems they had to solve!

      Later on, it mentions a Python queue library being an issue:

      “We weren’t able to use Postgres for everything. Specifically, we had introduced Redis for some of our less important, transient event streams because the Python queue library we used at the time did not support Postgres. Redis ended up causing a number of incidents, so we are likely going to migrate this transient queue to Postgres.”

      I guess this means they were using one queue library for these Redis-based transient queues and another (homegrown?) library for the Postgres-based queues.

      Compare this to the posts “Postgres Job Queues & Failure By MVCC” and “Transactionally Staged Job Drains in Postgres” on the same topic, which go into far more useful detail about implementation problems and tradeoffs. The former, especially, walks through the actual queries a library uses to take a message from the queue, how Postgres handles them at a low level, and how they can be improved.
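
      For readers who haven’t seen the pattern, here is a minimal sketch of what “a table as a queue” tends to look like. It is not the article’s implementation (they don’t share their queries). It uses SQLite from Python’s standard library as a stand-in so it runs without a server, and the `jobs` table, the `status` column, and the `enqueue`/`claim_next` helpers are hypothetical names. In Postgres the claim step is commonly written with `SELECT ... FOR UPDATE SKIP LOCKED`, so that concurrent workers skip rows another worker has already locked; SQLite has no equivalent, so this sketch takes a write lock on the whole database instead.

      ```python
      import sqlite3

      def open_queue():
          # isolation_level=None puts the connection in autocommit mode,
          # so the transactions below are started and ended explicitly.
          conn = sqlite3.connect(":memory:", isolation_level=None)
          conn.execute(
              "CREATE TABLE jobs ("
              " id INTEGER PRIMARY KEY AUTOINCREMENT,"
              " payload TEXT NOT NULL,"
              " status TEXT NOT NULL DEFAULT 'pending')"
          )
          # The polling query filters on status and orders by id,
          # so the index covers both columns.
          conn.execute("CREATE INDEX jobs_status_id ON jobs (status, id)")
          return conn

      def enqueue(conn, payload):
          conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))

      def claim_next(conn):
          """Mark the oldest pending job as running and return (id, payload), or None."""
          conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
          try:
              row = conn.execute(
                  "SELECT id, payload FROM jobs"
                  " WHERE status = 'pending' ORDER BY id LIMIT 1"
              ).fetchone()
              if row is not None:
                  conn.execute(
                      "UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],)
                  )
              conn.execute("COMMIT")
              return row
          except Exception:
              conn.execute("ROLLBACK")
              raise
      ```

      A worker would call `claim_next` on a polling interval, which is where the interval and index tuning mentioned in the quote above come in.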

    4. 4

      This is silly. ‘A message queue’ doesn’t equate to Kafka. That’s a gigantic leap of an assumption.

      More importantly: Does Postgres offer any queue functionality? Are they talking about just inserting into and querying a large table? That cannot possibly be a better message queue than any system that properly implements a queue with O(1) pushes and pops.

      1. 2

        “That cannot possibly be a better message queue than any system that properly implements a queue with O(1) pushes and pops.”

        They say that, given their requirements (SLOs), it costs less to use their existing Postgres database than to add new infrastructure (e.g. Kafka) for this.

        1. 3

          That’s not better by any stretch of the word. That is good enough for their use case.

          1. 6

            It was a better solution in their exact circumstance, taking the costs and effort into account. That’s the point of the article, and I buy it.

      2. 1

        “‘A message queue’ doesn’t equate to Kafka.”

        Totally true, but let’s take a moment to rejoice that ‘a message queue’ no longer equates to RabbitMQ.

    5. 3

      NATS seems to beat Kafka as well: https://nats.io/blog/matrix-dendrite-kafka-to-nats/

    6. 2

      Need better flags for this kind of post.

      1. 2

        What flags would you suggest?