Hi! I wrote this a couple years ago—I still stand by most of it though I’d say my stance on this stuff has (as one might hope) softened/become a bit more nuanced since. I have another post that’s something of a spiritual successor to this one, if that’s something that interests anyone: Understanding Cost Models.
Pretty good overview of why databases work the way they do and how queries are planned. This is the kind of post I wish I had read years ago when I just got started with databases.
What is the future of this, you think? It’s interesting to see the ratio of “queries” our analysts write these days - anything non-trivial they generally move off of SQL and write in Python instead..
I think a significant portion of the value we get out of query planners at work isn’t “answer this query efficiently”, but “answer it while others concurrently change the data and maybe even aspects of the schema” - eg dealing with the concurrency on the OLTP side.
But there are downsides - the black box nature makes a lot of devs never learn what happens under the hood, leaving them stranded when it comes to debugging time (“what do you mean ‘heap scan’, I’m doing a join?!”).
Plans changing suddenly causes operational unpredictability - I had a query in Postgres today where if you gave it a 1-entry array as a parameter it’d take the index and finish in milliseconds, but if someone sent an array of 2 entries or more it’d opt to heap scan a billion-row table, saturating IOPS and sinking the database..
It’s interesting to see the ratio of “queries” our analysts write these days - anything non-trivial they generally move off of SQL and write in Python instead
Why do you think this is? A simple matter of comfort level in Python vs. SQL? Are they doing things that are difficult or impossible in SQL but doable in Python? Better debugging experience? More consistent performance?
For me it’s mostly a variant on the first of those reasons. I’m reasonably comfortable with complex SQL thanks to a stint working on Oracle data warehouses in a past life, but I find myself pulling logic out into application code sometimes too. Not because it can’t be done in SQL, but because the code will need to be maintained by other people on my team whose SQL knowledge is more at the “CRUD and an occasional join” level. Several times now, for example, reviewing a PR of mine has been the first time a reviewer has ever seen PARTITION BY, and I then have to point them to documentation about it or explain it. Which is fine on one level (I learn new things from reviewing other people’s code too!) but I’d rather not saddle the team with code that nobody else really understands.
I assume other people reach similar conclusions, and the SQL in their codebases thus trends downward in complexity over time.
My experience is the opposite - I find that colleagues are often curious about unfamiliar SQL constructs, and in reviews thankful if I point out that it can be done faster or with less code if they use SQL more effectively.
Overall this causes the team to learn as a whole and our collective SQL skills to increase, which allows us to tackle more hairy problems down the road.
Of course, some things are done more elegantly in the programming language, and we will use it in such cases. I tend to draw the line at stored procedures (unless they provide a measurable benefit of course) or views. That’s usually better done (more maintainable and discoverable) by pulling some logic into the programming language.
Hi! I wrote this a couple years ago—I still stand by most of it though I’d say my stance on this stuff has (as one might hope) softened/become a bit more nuanced since. I have another post that’s something of a spiritual successor to this one, if that’s something that interests anyone: Understanding Cost Models.
Pretty good overview of why databases work the way they do and how queries are planned. This is the kind of post I wish I had read years ago when I just got started with databases.
Thanks for the kind words!
What is the future of this, you think? It’s interesting to see the ratio of “queries” our analysts write these days - anything non-trivial they generally move off of SQL and write in Python instead..
I think a significant portion of the value we get out of query planners at work isn’t “answer this query efficiently”, but “answer it while others concurrently change the data and maybe even aspects of the schema” - eg dealing with the concurrency on the OLTP side.
But there are downsides - the black box nature makes a lot of devs never learn what happens under the hood, leaving them stranded when it comes to debugging time (“what do you mean ‘heap scan’, I’m doing a join?!”).
Plans changing suddenly causes operational unpredictability - I had a query in Postgres today where if you gave it a 1-entry array as a parameter it’d take the index and finish in milliseconds, but if someone sent an array of 2 entries or more it’d opt to heap scan a billion-row table, saturating IOPS and sinking the database..
Why do you think this is? A simple matter of comfort level in Python vs. SQL? Are they doing things that are difficult or impossible in SQL but doable in Python? Better debugging experience? More consistent performance?
For me it’s mostly a variant on the first of those reasons. I’m reasonably comfortable with complex SQL thanks to a stint working on Oracle data warehouses in a past life, but I find myself pulling logic out into application code sometimes too. Not because it can’t be done in SQL, but because the code will need to be maintained by other people on my team whose SQL knowledge is more at the “CRUD and an occasional join” level. Several times now, for example, reviewing a PR of mine has been the first time a reviewer has ever seen
PARTITION BY
, and I then have to point them to documentation about it or explain it. Which is fine on one level (I learn new things from reviewing other people’s code too!) but I’d rather not saddle the team with code that nobody else really understands.I assume other people reach similar conclusions, and the SQL in their codebases thus trends downward in complexity over time.
My experience is the opposite - I find that colleagues are often curious about unfamiliar SQL constructs, and in reviews thankful if I point out that it can be done faster or with less code if they use SQL more effectively.
Overall this causes the team to learn as a whole and our collective SQL skills to increase, which allows us to tackle more hairy problems down the road.
Of course, some things are done more elegantly in the programming language, and we will use it in such cases. I tend to draw the line at stored procedures (unless they provide a measurable benefit of course) or views. That’s usually better done (more maintainable and discoverable) by pulling some logic into the programming language.