The third mode, with both strict ordering and guaranteed delivery, is obviously the safest. It’s also, in my experience, the most common.
In my experience, it’s very rare that you actually need strict ordering in your message queue. For instance, the delivery notifications example in the article is not strictly ordered end-to-end. Once your notifications are sent to the user, there is no guarantee that they will arrive in the same order (or on time, or even at all!). In that case, why maintain a strict ordering and the scaling limitations that come with it?
A global total order on messages (and guaranteed delivery in that order) can simplify reasoning about a system. For example, it can be useful to know that if you’ve seen a given message, you’ve already seen all previous messages (WRT the total order). Replication can also be easier with this guarantee: you can quantify how “far behind” one replica is WRT another, and if two replicas have the same “last seen message ID”, you know they’re identical.
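To make the “last seen message ID” point concrete, here is a minimal sketch. It assumes messages carry gap-free sequence numbers starting at 1; the `Replica` class and `lag` function are hypothetical names for illustration, not part of any particular queue.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A replica that only accepts messages in total order."""
    last_seen: int = 0                       # sequence number of the last applied message
    state: list = field(default_factory=list)

    def apply(self, seq: int, message: str) -> None:
        # Under a total order, the only acceptable next message is last_seen + 1.
        if seq != self.last_seen + 1:
            raise ValueError(
                f"gap or reorder: expected {self.last_seen + 1}, got {seq}"
            )
        self.state.append(message)
        self.last_seen = seq


def lag(ahead: Replica, behind: Replica) -> int:
    """How many messages `behind` still has to apply to catch up with `ahead`."""
    return ahead.last_seen - behind.last_seen
```

With this invariant, seeing message N implies having seen 1..N-1, and two replicas with equal `last_seen` hold the same state, provided applying a message is deterministic.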
Slightly off topic but I deal with this a lot when it comes to webhooks. So many systems will only attempt to send once and then the message is lost. So many junior developers write webhook endpoints that lose data on failure.
Queues across the internet are hard. That’s why I prefer to use periodic background polling in addition to webhooks. Polling is much slower and more resource intensive, but it is a lot easier to make self-healing and resumable. Downtime becomes a delay rather than a loss of critical messages.
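A rough sketch of the webhook-plus-polling idea, under some loud assumptions: the provider exposes a “list events after this ID” call (the hypothetical `fetch_since` callable below), event IDs increase monotonically, and processing is made idempotent by remembering which IDs have been seen. None of these names come from a real API.

```python
from dataclasses import dataclass, field


@dataclass
class EventStore:
    """Local state: processed event IDs plus a resumable polling cursor."""
    seen: set = field(default_factory=set)
    cursor: int = 0  # highest event ID confirmed by polling

    def apply(self, event: dict) -> bool:
        """Process an event once; return False if it was a duplicate."""
        if event["id"] in self.seen:
            return False
        self.seen.add(event["id"])
        return True


def handle_webhook(store: EventStore, event: dict) -> None:
    # Fast path: push delivery. It may never arrive, so it does not move the cursor.
    store.apply(event)


def poll_once(store: EventStore, fetch_since) -> int:
    """Slow path: fetch everything after the cursor and fill in whatever was missed."""
    applied = 0
    for event in fetch_since(store.cursor):
        if store.apply(event):
            applied += 1
        store.cursor = max(store.cursor, event["id"])
    return applied
```

Only the poller advances the cursor, so a dropped webhook or an outage on either side is repaired by the next poll rather than lost.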
Sidekiq has a lot of code to deal with both of these issues.
Sidekiq does not guarantee ordering within a queue, as I think that’s a terrible pattern. Developers don’t want total ordering of jobs within a queue; they want to know that Job A will fully execute before Job B. There might be 1000 other jobs in the queue that are completely independent of that ordering, but we’ve screwed ourselves by forcing total queue ordering. Instead, Sidekiq Pro provides a workflow API, Sidekiq::Batch, which lets the developer author higher-level workflows such as Job A -> Job B, and that provides the ordering guarantee.
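To illustrate the difference between total queue ordering and “A before B” ordering, here is a small single-process sketch. It is not the Sidekiq::Batch API; `run_with_dependencies` and its arguments are hypothetical, and a real system would run the independent jobs concurrently.

```python
from collections import deque


def run_with_dependencies(jobs, after):
    """Run each job only once its prerequisites have finished.

    jobs:  dict of job name -> zero-argument callable
    after: dict of job name -> set of job names that must finish first
    Jobs with no entry in `after` are unordered relative to everything else.
    """
    remaining = {name: set(after.get(name, ())) for name in jobs}
    ready = deque(name for name, deps in remaining.items() if not deps)
    finished = []
    while ready:
        name = ready.popleft()
        jobs[name]()
        finished.append(name)
        # Release any job that was waiting only on the one that just finished.
        for other, deps in remaining.items():
            if name in deps:
                deps.remove(name)
                if not deps:
                    ready.append(other)
    if len(finished) != len(jobs):
        raise ValueError("dependency cycle or unknown prerequisite")
    return finished
```

Only B is constrained to wait for A; every other job is free to run whenever a worker is available, which is the scaling property that total queue ordering gives up.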
For poison pills, we detect jobs which were running when a Sidekiq process died. If this happens multiple times to the same job, it is sent to the dead letter queue so the developer can deal with it manually. If the job was part of a Batch, the workflow will stall until the developer fixes the issue and executes the job manually to resume the workflow.
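The poison pill logic described above might look roughly like this. It is an illustrative sketch, not Sidekiq’s implementation: the `PoisonPillTracker` class is hypothetical, the threshold of 3 is an assumption (the comment only says “multiple times”), and the in-memory dicts stand in for state that would have to live in a durable store to survive a crash.

```python
class PoisonPillTracker:
    def __init__(self, max_interruptions=3):  # threshold is an assumed value
        self.max_interruptions = max_interruptions
        self.in_flight = {}      # job_id -> payload, for jobs currently running
        self.interruptions = {}  # job_id -> times the job was running at a crash
        self.dead = []           # jobs parked for manual inspection

    def start(self, job_id, payload):
        self.in_flight[job_id] = payload

    def finish(self, job_id):
        # A clean finish clears both the in-flight marker and the crash history.
        self.in_flight.pop(job_id, None)
        self.interruptions.pop(job_id, None)

    def recover_after_crash(self):
        """On restart, requeue interrupted jobs or park repeat offenders."""
        retry = []
        for job_id, payload in list(self.in_flight.items()):
            del self.in_flight[job_id]
            count = self.interruptions.get(job_id, 0) + 1
            self.interruptions[job_id] = count
            if count >= self.max_interruptions:
                self.dead.append((job_id, payload))
            else:
                retry.append((job_id, payload))
        return retry
```

A job that keeps taking the process down with it is eventually pulled out of circulation instead of crashing workers forever.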
Been using Sidekiq for years. Absolutely love it. Convinced my current employer to pay for Pro. :)
The ideal message delivery is a fiction, and building software that requires it is negligent.
What do you mean by “ideal message delivery”? Uniform atomic broadcast? If so, is every system built around Paxos or Raft “negligent”?