In my experience messaging platforms work in one of three ways:
“At least once delivery”
“At most once delivery”
“Exactly once delivery”
When building systems that communicate via messaging, eventual consistency becomes the trade-off. This is why building these systems with idempotency is crucial.
A message broker itself cannot guarantee exactly once delivery. A system can achieve exactly once, but it is complex and almost no system does it correctly.
It’s possible if and only if the whole system/chain supports idempotency.
Take emails for example. There is no support for idempotency in the protocol (to my knowledge), so even if you can guarantee that your system attempts to send the email exactly once, there is no guarantee that it will reach the user exactly once.
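A toy sketch of why the sending side alone can’t close that gap (the dedupe store and send function here are hypothetical stand-ins, not a real SMTP client): a crash in the window between performing the send and recording it means a retry re-sends.

```python
# Sketch: sender-side dedup for email, and the window where it fails.
# `sent_ids` and `outbox` are illustrative stand-ins, not a real mail API.

sent_ids = set()   # dedupe store: Message-IDs we believe we have sent
outbox = []        # stands in for what actually reached the mail server

def send_email(message_id: str, body: str, crash_after_send: bool = False) -> None:
    """Attempt to send at most once per message_id."""
    if message_id in sent_ids:
        return  # already sent (as far as we know) -> skip
    outbox.append((message_id, body))  # the side effect happens here
    if crash_after_send:
        raise RuntimeError("process died before recording the send")
    sent_ids.add(message_id)           # recorded only AFTER the side effect

# Happy path: a retry with the same id is deduplicated.
send_email("msg-1", "hello")
send_email("msg-1", "hello")
assert len(outbox) == 1

# Failure path: crash between send and record, then a retry -> duplicate email.
try:
    send_email("msg-2", "hi", crash_after_send=True)
except RuntimeError:
    pass
send_email("msg-2", "hi")  # retry after the crash
assert [m for m, _ in outbox].count("msg-2") == 2  # the user got it twice
```

Swapping the order (record first, send second) just flips the failure mode: a crash after the record means the email is never sent at all.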
You could, but the nodes can’t operate independently of each other, thus impacting throughput. They would have to form a quorum, and a node must suspend operation if it became partitioned.
The queue has no way to determine whether the Worker is dead or just late in delivering an ACK. So a reasonable approach is to time out and make the message available again. “At least once delivery” is how most of these queue solutions work, I have found. And in that pattern it’s on the Worker to decide whether a message should be processed at all if it has already been seen, especially if the action is not idempotent.
The alternative (within the same at-least-once pattern) would be for the Worker to manage the message visibility itself, but that breaks if the Worker dies during processing.
This queue pattern is only half of the solution: it works, but there is still some responsibility on the Worker not to blindly process a message when a duplicate would cause an issue.
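A toy model of the pattern described above (all names are illustrative, not a real queue client): the queue redelivers after a visibility timeout, and the Worker keeps its own record of processed message IDs, so a redelivered message is acknowledged but the non-idempotent effect is applied only once.

```python
import uuid

# Toy at-least-once queue: a delivered message stays invisible until it is
# either acked (deleted) or its visibility timeout lapses and it is redelivered.
class ToyQueue:
    def __init__(self):
        self.visible, self.in_flight = [], {}

    def send(self, body):
        self.visible.append((str(uuid.uuid4()), body))

    def receive(self):
        if not self.visible:
            return None
        msg = self.visible.pop(0)
        self.in_flight[msg[0]] = msg  # hidden from other workers
        return msg

    def ack(self, msg_id):
        self.in_flight.pop(msg_id, None)

    def expire_visibility(self):
        # Simulates the timeout firing for everything in flight (slow/dead worker).
        self.visible.extend(self.in_flight.values())
        self.in_flight.clear()

processed_ids = set()  # the Worker's dedupe record
account_balance = 0    # the non-idempotent effect we must not apply twice

def handle(queue):
    global account_balance
    msg = queue.receive()
    if msg is None:
        return
    msg_id, amount = msg
    if msg_id not in processed_ids:  # dedupe check before the effect
        account_balance += amount
        processed_ids.add(msg_id)
    queue.ack(msg_id)                # safe to ack either way

# First attempt: the effect is applied, but the worker dies before acking.
q = ToyQueue()
q.send(100)
msg_id, amount = q.receive()
account_balance += amount
processed_ids.add(msg_id)
# (worker dies here; no ack is ever sent)
q.expire_visibility()  # timeout fires, queue redelivers
handle(q)              # second worker sees the duplicate, skips the effect, acks
assert account_balance == 100  # applied once despite two deliveries
```

Note the remaining caveat: this only holds if `processed_ids` survives worker crashes, which is why real systems put the dedupe record in durable storage.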
The staple feature of SQS (and most queues) is exactly once delivery within a visibility timeout. It’s called a visibility timeout because no other workers can pull (or “see”) that message for some period after delivery.
From https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html:
"For standard queues, the visibility timeout isn’t a guarantee against receiving a message twice. For more information, see Amazon SQS at-least-once delivery."
From https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues-at-least-once-delivery.html:
"On rare occasions, one of the servers that stores a copy of a message might be unavailable when you receive or delete a message. If this occurs, the copy of the message isn’t deleted on the server that is unavailable, and you might get that message copy again when you receive messages. Design your applications to be idempotent (they should not be affected adversely when processing the same message more than once)."
What many people mean by exactly once delivery is instead exactly once processing: https://exactly-once.github.io/posts/exactly-once-delivery/. By results like the FLP theorem, exactly once delivery is impossible. But exactly once processing is not only possible but the name of the game!
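One standard way to get exactly-once processing on top of at-least-once delivery is to make the dedupe record and the side effect atomic, e.g. in a single database transaction (sometimes called an inbox pattern; the table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed (message_id TEXT PRIMARY KEY);
    CREATE TABLE balance (amount INTEGER);
    INSERT INTO balance VALUES (0);
""")

def process(message_id: str, amount: int) -> bool:
    """Apply the effect and record the message id in ONE transaction.
    Returns False if this message was already processed."""
    try:
        with conn:  # transaction: both statements commit, or neither does
            conn.execute("INSERT INTO processed VALUES (?)", (message_id,))
            conn.execute("UPDATE balance SET amount = amount + ?", (amount,))
        return True
    except sqlite3.IntegrityError:  # duplicate message_id -> already handled
        return False

assert process("m1", 100) is True
assert process("m1", 100) is False  # redelivery: no double charge
assert conn.execute("SELECT amount FROM balance").fetchone()[0] == 100
```

Because the dedupe insert and the balance update share a transaction, a crash at any point leaves them consistent: either both happened or neither did, so a redelivered message can always be safely retried.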
One can have very fun bugs this way.
yes. you have to design your app for it. SQS was designed with eventual consistency in mind.
I thought that no distributed system could really guarantee exactly once delivery?
See https://exactly-once.github.io/posts/exactly-once-delivery/
Kafka has exactly once delivery capabilities
See: https://docs.confluent.io/kafka/design/delivery-semantics.html#:~:text=Kafka%20supports%20exactly%2Donce%20delivery,processing%20data%20between%20Kafka%20topics.
If you work very carefully within certain boundaries, yep.
Right, but there are trade-offs with any approach.
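For reference, Kafka’s exactly-once support is opt-in through configuration (the "certain boundaries" above). A minimal sketch of the relevant settings, expressed here as confluent-kafka-style config dicts; the keys are real Kafka settings, the values are illustrative:

```python
# Producer side: idempotent, transactional writes.
producer_config = {
    "enable.idempotence": True,          # broker de-dupes producer retries
    "transactional.id": "orders-txn-1",  # illustrative; must be stable per producer
}

# Consumer side: only see messages from committed transactions.
consumer_config = {
    "isolation.level": "read_committed",
}

# Kafka Streams equivalent: end-to-end exactly-once between Kafka topics.
streams_config = {
    "processing.guarantee": "exactly_once_v2",
}
```

Note the boundary: these guarantees cover processing between Kafka topics; once a side effect leaves Kafka (an email, an HTTP call), you are back to needing idempotency downstream.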
I was looking for the bit where they implemented Paxos or Raft in a stored procedure but no dice :(
Interesting way to demonstrate the differing semantics though.