1. 22

    1. 11

      A big part of our queue problem at Heroku was the design of the specific job system we were using, and Ruby deployment. Because Ruby doesn’t support real parallelism, it’s commonly deployed with a process forking model to maximize performance, and this was the case for us. Every worker was its own Ruby process operating independently.

      This produced a lot of contention and unnecessary work. Running independently, every worker was separately competing to lock every new job. So for every new job to work, every worker contended with every other worker and iterated millions of dead job rows every time. That’s a lot of inefficiency.

      This has not been the practice for a long, long time, at least not for the past decade. Sidekiq works in much the same way you describe your Go code working.

      Just heard from a long time @sidekiq@ruby.social customer who ran 22 billion jobs yesterday. 250,000 jobs/sec every second. 😮 https://x.com/getajobmike/status/1587965981200105474

      This was a problem of poor implementation, and lack of investment on Heroku’s part, not language choice.

    2. 5

      Welcome to the Postgres queue club! I like the Generic API you’ve come up with for jobs. I considered a similar one when I was designing neoq, and I still might go that route before v1!

    3. 3

      Congrats on the release!

      My team is also designing their next high-throughput work queue over Postgres. The main reservation people had was about PG operations on high-churn tables (costly schema migrations, optimize falling behind, and transaction ID wraparound). We will alleviate these risks by considering the table disposable from the start.

      Like messages in a Kafka topic, tasks in a queue have a known expiration timeline. I always recommend considering Kafka topics (and clusters!) as disposable: don’t try to change settings in place, just create a new topic with the desired topology and move traffic to that new topic. Similarly, moving writes away from an unhealthy Kafka cluster (by failing over to a new cluster) has a high chance of allowing it to recover, enabling consumers to finish processing what it holds.

      We’ll apply the same methodology to that PG table for the queue: instead of trying to “save” a table with degraded performance, we’ll create a new table with the latest schema and start writing tasks to the new table. Workers will read from both tables, and the old table can be discarded once all tasks there have been processed.
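The table-rotation idea in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not that team's implementation: SQLite stands in for Postgres so the snippet runs anywhere, and every name here (`tasks_v1`, `tasks_v2`, `enqueue`, `claim_next`, `retire_if_drained`) is hypothetical. A real Postgres version would presumably claim rows with something like `SELECT ... FOR UPDATE SKIP LOCKED` rather than the single-connection read-then-update used below.

```python
import sqlite3

# Hypothetical sketch: SQLite stands in for Postgres, and all table and
# function names are made up for illustration.


def create_queue_table(conn, table):
    # Table names are fixed constants in this sketch, so interpolating
    # them into the SQL string is safe here.
    conn.execute(
        f"CREATE TABLE {table} ("
        "id INTEGER PRIMARY KEY, "
        "payload TEXT NOT NULL, "
        "done INTEGER NOT NULL DEFAULT 0)"
    )


def enqueue(conn, write_table, payload):
    # Producers only ever write to the current (newest) table.
    conn.execute(f"INSERT INTO {write_table} (payload) VALUES (?)", (payload,))


def claim_next(conn, read_tables):
    # Workers read from every live table, oldest first, so the old
    # table drains before the new one is touched.
    for table in read_tables:
        rows = conn.execute(
            f"SELECT id, payload FROM {table} WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchall()
        if rows:
            task_id, payload = rows[0]
            conn.execute(f"UPDATE {table} SET done = 1 WHERE id = ?", (task_id,))
            return table, payload
    return None


def retire_if_drained(conn, table):
    # Drop the old table once it has no pending tasks left.
    pending = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE done = 0"
    ).fetchall()[0][0]
    if pending == 0:
        conn.execute(f"DROP TABLE {table}")
        return True
    return False


def demo():
    conn = sqlite3.connect(":memory:")
    create_queue_table(conn, "tasks_v1")
    enqueue(conn, "tasks_v1", "a")
    enqueue(conn, "tasks_v1", "b")

    # Rotation: create the new table and move all writes to it.
    create_queue_table(conn, "tasks_v2")
    enqueue(conn, "tasks_v2", "c")

    read_tables = ["tasks_v1", "tasks_v2"]
    processed = []
    while (claimed := claim_next(conn, read_tables)) is not None:
        processed.append(claimed)
        if "tasks_v1" in read_tables and retire_if_drained(conn, "tasks_v1"):
            read_tables.remove("tasks_v1")
    return processed


if __name__ == "__main__":
    for table, payload in demo():
        print(table, payload)
```

The point of the sketch is that writers and readers are decoupled: writes switch tables in one step, while reads span both tables until the old one is empty and can be dropped without ever being repaired in place.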