I’m mostly interested in why that SQL query was locking the table and spiking the CPU for 15 seconds more anything else. I’m not too surprised with the overall conclusion – I feel like Postgres is pretty amazing and oftentimes doesn’t get enough credit.
The author mentioned “PostgreSQL doing some linear scan through data.” This can happen for a few reasons, all of which boil down to one idea: writing a query planner is hard as fuck.
Since he mentions he had a correct index, I’ll take some stabs in the dark. But we would need to see the explain of the slow query to be sure.
Guess 1: the query was performing a fast join (like a hash join) that fit in working memory. But some production joins had enough estimated matches that the estimated hash size was greater than the connection’s working memory, so it fell back to a nested loop join (dog slow). Postgres usually actually does partitioned hash joins, so maybe it estimated the query would take too many partitions, and triggered the fallback. Forcing the query to use the index would make postgres do the faster join anyway, and just partition it more. This could be exacerbated by guess 2.
Guess 2: the table statistics were wrong, causing the planner to estimate it would match X rows when it really would match Y. This can easily happen with high churn data. As an example (that results in the inverse problem), suppose you have a timestamp key that’s set to insert time. Then your query planner makes a histogram of that key, for estimating how many values land in certain ranges. The tail end of that histogram will be now(). As time passes and more records are inserted, that one tail bucket becomes larger and larger, without the query planner realizing it until it updates statistics. So it continues to think “rows from <analyze_time> until now()” is X, when it’s really increasing constantly. This particular issue doesn’t cause too many problems though, it’s just an easy example of how stats can be wrong.