EDIT: Updated to simplify logic and remove additional table
We’ve run into this issue with tailing a table and are now experimenting with the following approach:
acquire a shared lock during inserts, so they don’t block each other
acquire an exclusive lock when reading the max id, to momentarily pause inserts
Details:
-- inserts hold a shared lock
BEGIN;
SELECT pg_advisory_xact_lock_shared(1);
INSERT INTO event_table ...
COMMIT;
Shared locks don’t block each other, so there is no impact on write throughput.
-- when reading the max safe id,
-- hold an exclusive lock to pause writes momentarily
BEGIN;
SELECT pg_advisory_xact_lock(1);
SELECT max(id) FROM event_table;
COMMIT;
The exclusive lock waits for the current shared locks to be released, and newer shared lock requests wait behind the exclusive lock. This ensures the in-flight inserts are committed and no new ones are started, so no new rows should appear with an id lower than the max(id) read above. We actually partition the lock space based on our use case and don’t have a single global lock 1.
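For example, instead of the literal 1 above, the lock key could be derived from a partition key. This is just a sketch to illustrate the idea; the hashtext call and the 'stream-a' key below are made up, not our actual scheme:

-- writers take a shared lock on their partition's key (inside their insert transaction)
SELECT pg_advisory_xact_lock_shared(hashtext('stream-a'));

-- the reader takes the exclusive lock for the same partition,
-- so it only pauses writers of that partition
SELECT pg_advisory_xact_lock(hashtext('stream-a'));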
Very interesting use of shared and exclusive advisory locks!
Doesn’t the update max_event_id set ... serialize inserts, as each insert needs to lock the row on max_event_id? I guess reading this, it looks like “Serializing writes with a sequence table” from the post, but I’m probably missing something?
Updates don’t lock rows in postgres because of MVCC. But I misreported what we are doing since I was confused - we’re not actually using the additional table, except to cache the max id. See the updated logic above. We just pause writes momentarily to read a max id that is guaranteed to not have new ids appear below it.
Also want to emphasize that we haven’t validated that it works - so we’re experimenting with it - but it seems sound in theory.
Late edit: To correct myself, updates can lock. MVCC helps with reads, which get a snapshot of the data, but writes can still use locking. Details: https://www.postgresql.org/docs/current/explicit-locking.html
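A toy illustration of that correction, assuming max_event_id is a single-row counter table (the value column is made up): two sessions updating the same row serialize, with the second blocking until the first finishes.

-- session 1
BEGIN;
UPDATE max_event_id SET value = value + 1;  -- acquires a row-level lock

-- session 2, concurrently
BEGIN;
UPDATE max_event_id SET value = value + 1;  -- blocks until session 1 commits or rolls back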
This seems to describe a phantom read, so what about running the reads inside a serializable transaction?
Good question!
The post doesn’t describe a phantom read. It would be a phantom read if the two queries described happened in the same transaction – in that case, I believe a serializable transaction would indeed avoid the issue. But the two queries described in this post happen in two separate transactions (for example, two separate GET requests to an API).
Serializable doesn’t fix this; sequences (such as serial) ignore transaction isolation entirely (somewhat).
Basically, when a transaction calls an insert, the sequence is incremented immediately. If the transaction aborts, it’s not rolled back. Serializable might make it even worse due to retries, since there would now be random gaps and reorderings across transactions.
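A toy table makes the gap easy to see (names made up):

CREATE TABLE t (id bigserial PRIMARY KEY, payload text);

BEGIN;
INSERT INTO t (payload) VALUES ('a');  -- consumes id 1 from the sequence
ROLLBACK;                              -- the row is gone, but the sequence increment is not undone

INSERT INTO t (payload) VALUES ('b');  -- committed with id 2; id 1 is a permanent gap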
Postgres sequences can also be gappy for reasons that don’t depend on transaction rollback. As I understand it (I haven’t actually read the code), to avoid having to update the sequence on every increment, the system pre-increments the sequence by 32, then stores the current value in the buffer cache, doling out new sequence values until they’re exhausted and it needs another 32.
If the system crashes or the sequence gets evicted from the buffer cache because of memory pressure, you’ll get a gap in observed values.
If you want to avoid gaps, you have to write your own sequence logic using some sort of upsert/locking logic.
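For example, a sketch of the usual single-row counter pattern (names made up): the row lock serializes writers, and because the counter is an ordinary row, a rollback undoes the increment too, so there are no gaps, at the cost of writers queuing behind one another on that row.

CREATE TABLE event_counter (last_id bigint NOT NULL);
INSERT INTO event_counter VALUES (0);

-- inside each inserting transaction: take the next id and use it for the insert
UPDATE event_counter SET last_id = last_id + 1 RETURNING last_id;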
Yep, and it’s pretty obvious - otherwise every insert into a table would have to be completely sequential.
10 github stars, I guess Sequin just started?
Correct! Just open sourced last week. We’re looking to help folks deploy, feel free to DM if you have a use case
I needed a bit of time to get my head around this, but I think I can sort of fit it into my head as a lease-based system based on advisory locks? That’s creative.
Idle thought: I think this delays the reader(s) until all earlier writers have committed. If your requirement is only to have a consistent tail, it may be interesting to consider serializing as a dedicated function. A toy sketch:
have writers dump entries into the table with no synchronization, with each row being written with reader_seq = NULL
periodically (every 100ms?) have a single “serializer” “fill in” reader_seq for all rows with reader_seq = NULL (a rough sketch follows the list)
have readers SELECT * FROM table WHERE reader_seq IS NOT NULL ORDER BY reader_seq, i.e. hide all not-yet-serialized rows from the readers.
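The serializer step could be a single statement along these lines (table, column, and sequence names are placeholders; within one batch the assignment order is arbitrary, so an ORDER BY id subquery would be needed if that matters):

CREATE SEQUENCE reader_counter;

-- assign reader_seq to every not-yet-serialized row, then commit
UPDATE events
SET reader_seq = nextval('reader_counter')
WHERE reader_seq IS NULL;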
A full solution may need to avoid updating the just-written row (for performance), and I must admit to not thinking through whether READ COMMITTED is sufficient for the above. But you may enjoy thinking about it; in any case, I wanted to get it out of my head. ;-)
Fun – this is another solution we landed on, I should have mentioned it in the post. The throughput is actually pretty decent, much better than using a sequence table. I like it quite a bit.
The biggest bummer is that it (1) relies on a process to always be running, (2) hammers the database, and (3) has no back pressure on writes (so this process can fall behind).
So reading the comments and remembering the docs, sequences exist to guarantee that you can draw a fresh number every time until the sequence overflows, and do not guarantee that there will be no gaps.
I don’t remember the recommended way to handle sequence wraparound.
As I understand it, SEQUENCE defaults to bigint, i.e. a 64-bit integer. Isn’t it sufficient to “just” buy ECC memory and ensure your CPU actually works? Or am I missing something?
What if you do more than 2^64 inserts on a table with an autoincrementing id?
It’s 2^63 because bigint/sequences are signed. Still, if you did 1M transactions per second:
2^63 / (1,000,000 * 60 * 60 * 24 * 365.25) => 292,271 years to exhaust
I can do 1M transactions per second in my sleep 😛