As someone with only tangential exposure to databases, the way offsets affect performance was new to me. (I thought the point of alternatives was simply a matter of offering some stability to the result set.)
When using a pointer to a specific row as the basis of the pagination, how do you normally handle having to remove the row? Do you soft-delete it, or just sacrifice bookmarked URLs referencing it?
As for using an ordinal aspect, such as a timestamp, one thing that can happen — and has coincidentally happened to me recently with a tool I’m using — is ending up with more than N records sharing a timestamp, where N is the number of results per page, thus compromising the pagination.
The row doesn’t need to exist, it’s just a boundary for a WHERE clause. This query will work correctly regardless of what exists in the DB:
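A minimal sketch of such a query, assuming an items table with an auto-increment id primary key and a page size of 20 (all names here are made up for illustration):

    -- Fetch the page that starts after the last row the client saw.
    -- :last_seen_id comes from the client's cursor; that row may have
    -- been deleted in the meantime, the comparison works either way.
    SELECT id, title
    FROM items
    WHERE id > :last_seen_id
    ORDER BY id
    LIMIT 20;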
This pattern is also applicable to non-SQL stores like S3, which provides the start-after parameter to ListObjectsV2 for the same functionality.
In this case you need a “tie breaker” column, so you order by timestamp, id or something like that, and then use the pair [timestamp, id] as the cursor.
Nice, I wish there was a benchmark as suggested elsewhere. The article confused me a little because of how it uses the word cursor (I thought it was referencing database cursors, which are an alternative to offset pagination as well, but it’s just talking about row IDs…). To me this is more keyset pagination, no?
I’ve used keyset pagination instead of simple offset with great results. Although it was tricky to implement in some cases… when you start adding multiple sorts and forward/reverse pagination, it gets a little complicated making sure the query is properly bounded by the “previous” result set.
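For the [timestamp, id] cursor mentioned above, a Postgres row-value comparison is one way to keep the bound correct even when many rows share a timestamp; the table and column names below are placeholders, not anything from the article:

    -- Order by (created_at, id) and resume strictly after the last
    -- (created_at, id) pair the client received.
    SELECT id, created_at, title
    FROM events
    WHERE (created_at, id) > (:last_created_at, :last_id)
    ORDER BY created_at, id
    LIMIT 20;

With a composite index on (created_at, id) this can stay an index range scan no matter how deep the client pages.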
Also, being able to use id as a bound for the next page only works sometimes (if you use v1 UUIDs for example, which have time encoded in them, it can work, but with v4 UUIDs it can break).
Here you go:
Direct link to the benchmark results embedded in that article: https://www.slideshare.net/MarkusWinand/p2d2-pagination-done-the-postgresql-way/42
Here’s a page about a similar pagination technique it calls “seek”, with a benchmark showing that it is much faster than “offset” pagination when visiting later pages: https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/
It is keyset pagination, somehow this term never made it into my consciousness. :)
Yeah, id in this case was just a simple example, I assumed a simple autoincrement integer field. :)
IME performance measurements vary wildly because you’re going from an O(n) operation to an O(log n) operation. If your clients never advance past page 2 it’s a wash, but it only takes one person reading pages 200-300 to take a LOT more server time than expected.
I think you can implement next & prev given a cursor. For next, advance the cursor by pageSize items. For prev, reverse the ordering in the DB query, fetch pageSize*2 items and discard the first pageSize.
I think it’s reasonable to mix this with limit/offset pagination to provide a mixed feature set. e.g. if a client asks for page 40 right off the bat then you can use an expensive offset for that. Then if they’re subsequently requesting pages 41 onward, use the cursors for those.
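A sketch of the “prev” direction, assuming the cursor is simply the id of the first row currently shown (this is the reversed-ordering trick, just without the pageSize*2 step; names are illustrative):

    -- Previous page: bound from the other side, scan in reverse,
    -- then flip the rows back into ascending order for display.
    SELECT * FROM (
        SELECT id, title
        FROM items
        WHERE id < :first_visible_id
        ORDER BY id DESC
        LIMIT 20
    ) AS prev_page
    ORDER BY id;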
And if anyone thinks nobody scrolls through 200-300 pages, they’re in for a surprise. On many sites some people will scroll through every single page to the bottom, be it fashion, e-commerce, photos, etc.
Why would anybody think that? I’ve done it myself! :)
Even if a human being would never do this, a search engine crawler definitely could.
People often advertise infinite scroll as a feature designed to retain user attention and increase engagement, but ignore that this UX implementation is a way to handle cursors transparently.
When you need high-availability guarantees, search and aggregation queries usually get distributed across different specialized systems and the results are federated for presentation. It’s impossible to offer the consistency guarantees that regular offset pagination can provide, like in a real database.
What are the actual gains in performance? A benchmark would be really interesting.
We ran benchmarks at my place of employment. The read time increases linearly in relation to the offset, and is problematic for some of our customers with large numbers of entities. We have a small minority of calls that take over 500ms due to large offsets, which is terrible. This ruins our p999 times. The benchmarks were run both sequentially and randomly; it didn’t seem to affect the performance much (Postgres).
On the other hand, read time with a cursor is constant regardless of the offset.
Unfortunately we’re going to have to go through a deprecation process now to sort this out :(
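Roughly, the two query shapes being compared look like this (illustrative only, not the real benchmark queries):

    -- Offset pagination: the database walks and discards the first
    -- :offset rows, so read time grows linearly with the offset.
    SELECT id, name FROM entities ORDER BY id LIMIT 100 OFFSET :offset;

    -- Keyset/cursor pagination: a single index seek to the boundary,
    -- so read time stays roughly constant however deep you page.
    SELECT id, name FROM entities WHERE id > :cursor ORDER BY id LIMIT 100;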
Here is more information on the topic including a reference to slides with benchmarks (see page 42 for the comparison).
I added a really simple comparison and a link to Markus’ article to the post.
Great! Thank you :)
I went to the article hoping to learn how to do pagination “correctly”.
Unfortunately, TFA makes points only about performance. The points are valid and good in that respect, though.
My hunch is that it’s quite impossible to have a reasonably performant (in time and space) paginated API design, because you would need to maintain a snapshot of the list of things to be paginated for every single client. In many systems you do not know whether your client is still live and still interested in that list, so you need to keep these snapshots around for a long time.
I think pagination only kinda sorta works for human consumption, especially when the update rate is slow, say for a blog. For machine-consumed lists, pagination is a recipe for disaster. As a client it’s typically impossible to get a coherent list; as a server it’s too expensive to provide such coherent views.
This approach always gives you a plausible list of items. It tends to not give you items repeatedly if someone is adding new items to the start of the result set.
If you want a snapshot, ask for a snapshot. Add data versioning to the API. Page through the list asking for the rows as they existed as of the time you started querying. This works fine for browsing, say, Wikipedia or an svn repo. :)
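One way to sketch that, assuming each row keeps valid_from/valid_to version timestamps and the client pins a snapshot timestamp on its first request (every name here is invented for illustration):

    -- Page through the list as it existed at :snapshot_ts,
    -- combined with an ordinary keyset cursor.
    SELECT id, title
    FROM article_versions
    WHERE valid_from <= :snapshot_ts
      AND (valid_to IS NULL OR valid_to > :snapshot_ts)
      AND id > :last_seen_id
    ORDER BY id
    LIMIT 20;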
That’s the thing that makes cursor pagination good if, and only if, you’re showing results in something like chronological or reverse-chronological order. Algorithmic ordering, like on Lobsters, really can’t give you good pagination without taking a full snapshot.
Cursors also make the URLs more permanent and fit for bookmarking.
Permanent? I haven’t used cursors, but thought they were generally closed after a timeout to avoid leaking memory and disk space. PostgreSQL’s documentation on Returning Cursors seems to support that idea: “The cursor can be closed by the caller, or it will be closed automatically when the transaction closes.” This means that URLs that reference cursors are impermanent.
Oh, I just noticed that what the article calls “cursors” are not database-native cursors, but row IDs inserted into the WHERE clause of an SQL query. Yes, that type of cursor does make paginated URLs more stable, you’re right.
Naming is confusing: DB cursors and “cursor pagination” are completely different and unrelated things.
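To make the naming distinction concrete: a database-native cursor in Postgres is declared inside a transaction and, as quoted above, closes with it, which is why it can’t back a bookmarkable URL, whereas the article’s “cursor” is just the WHERE id > :last_id trick from the rest of the thread. A minimal example of the former (names illustrative):

    BEGIN;
    DECLARE page_cur CURSOR FOR
        SELECT id, title FROM items ORDER BY id;
    FETCH 20 FROM page_cur;  -- first page
    FETCH 20 FROM page_cur;  -- second page
    CLOSE page_cur;
    COMMIT;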
Well, you can create a “previous” link quite easily; what is hard is allowing the user to jump anywhere they want within the table.