I always find complex joins like this hard to read.
Here’s an alternative approach using a cascade of where … in expressions:
select CustomerID, Name
from Customers
where CustomerID in (
  select CustomerID
  from Orders
  where OrderID in (
    select OrderID
    from OrderDetails
    where ProductID in (
      select ProductID
      from Products
      where CategoryID in (
        select CategoryID
        from Categories
        where Name = 'Beauty'
      ))))
I actually think this is easier to follow.
I’d be interested to compare query plans and see if this can be as efficient as the solution proposed in this article.
Part of this is cultural, I think. Joins feel more “relational” and more “correct” than a cascade. I also don’t think you see as many examples of cascade queries as you learn SQL, which most people do in an ad-hoc way. Joins also work everywhere, and some databases might have limitations with cascades (I can’t think of one but I don’t know - is it ANSI SQL?)
Yeah, this pattern doesn’t seem very common.
I found my enjoyment of writing SQL improved substantially after I gave myself permission to lean more on tricks like this rather than try to figure out the right joins for everything.
You could pop this cascade off of a stack and turn it into a CTE.
I write a bunch of ad-hoc queries for my job and I do it in this style all the time. It’s great!
I especially find it easier to exploratively build up a query from parts this way.
It’s definitely a “try it and look at the query plan” thing, but I’ve run into cases where PostgreSQL’s optimizer did a worse job on “in (subquery)” than on an equivalent join.
I think in theory the optimizer should have enough information to be able to detect when the two styles are semantically equivalent, but in practice, it has a limited CPU budget and can’t cover every possible case. (Or maybe it’s just that nobody happens to have added the detection logic to the optimizer yet.)
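For anyone who wants to check this themselves, a sketch along these lines (using the article’s hypothetical Customers/Orders tables, not any particular production schema) shows whether PostgreSQL picks different plans for the two styles:
-- Both queries should return the same set of customers that have at least one order;
-- explain analyze runs each one and prints the plan the optimizer actually chose.
explain analyze
select CustomerID, Name
from Customers
where CustomerID in (select CustomerID from Orders);

explain analyze
select distinct c.CustomerID, c.Name
from Customers c
join Orders o on o.CustomerID = c.CustomerID;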
I do this but with cascading CTEs.
Yeah, I really like CTEs for making this kind of thing easier to follow.
with category_ids as (
  select CategoryID
  from Categories
  where Name = 'Beauty'
),
product_ids as (
  select ProductID
  from Products
  where CategoryID in (select CategoryID from category_ids)
),
order_ids as (
  select OrderID
  from OrderDetails
  where ProductID in (select ProductID from product_ids)
),
customer_ids as (
  select CustomerID
  from Orders
  where OrderID in (select OrderID from order_ids)
)
select CustomerID, Name
from Customers
where CustomerID in (
  select CustomerID from customer_ids
);
My DRY instinct wonders: why factor out the subqueries as CTEs rather than views?
My understanding is that in most database engines views and CTEs are effectively the same thing under the hood - or at least they result in identical query plans.
One way of thinking about CTEs is that they are a way of defining a view that exists only for the duration of the current query.
I ran a quick experiment and the view and CTE queries did both appear to produce identical query plans, at least in SQLite https://chat.openai.com/share/93aaca2f-4769-48b8-83fd-4ccf7efa14f9
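Something like this (illustrative names from the article’s schema, not necessarily the exact queries in that transcript) reproduces the comparison in SQLite - explain query plan prints the plan without running the query:
-- View version of the innermost lookup.
create view beauty_category_ids as
select CategoryID from Categories where Name = 'Beauty';

explain query plan
select ProductID from Products
where CategoryID in (select CategoryID from beauty_category_ids);

-- Equivalent CTE version, for comparing the two plans.
explain query plan
with category_ids as (
  select CategoryID from Categories where Name = 'Beauty'
)
select ProductID from Products
where CategoryID in (select CategoryID from category_ids);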
Why doesn’t the server consider all three plans for the original DISTINCT query, and then choose a non-sorting plan as the cheapest?
Is it hard for the server to recognize that the queries are equivalent?
It seems like you could “push down” the distinct operator into the Customers scan (because it only mentions Customer fields) and then eliminate it because CustomerID (the PK) is already distinct.
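Concretely, the elimination step would rely on the fact that a select list containing the primary key can never produce duplicate rows - a minimal illustration:
-- CustomerID is the primary key of Customers, so every row is already unique,
-- the distinct is a no-op, and these two queries return identical results.
select distinct CustomerID, Name from Customers;
select CustomerID, Name from Customers;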
Planning also costs CPU cycles, so that might be one reason it’s not evaluating more options. But more likely this is just a missing optimization. It happens to be pretty obvious from our perspective, but it might be harder in the general case.
I’ve seen this so many times over the years that I consider distinct in any query with a join to be a giant red flag. It’s almost always being (mis)used this way… I fixed one of these the other day. Solved it the exact same way.
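For anyone following along, this is the shape of the fix being described, sketched against the article’s example tables (the exact queries will obviously depend on your schema):
-- The red-flag shape: a join that fans out rows, patched up with distinct.
select distinct c.CustomerID, c.Name
from Customers c
join Orders o on o.CustomerID = c.CustomerID;

-- The same question asked as a semi-join with exists - nothing to de-duplicate.
select c.CustomerID, c.Name
from Customers c
where exists (
  select 1 from Orders o where o.CustomerID = c.CustomerID
);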
Great example with the reordered query plan at the end. Query plans are very useful, and something that I’ve found many engineers are not at all comfortable with.
Do you happen to have a good resource on how to read query plans and how to act on them? Like maybe a library of common optimizations.
I know how to generate a query plan when I suspect a slow query, but I’m always a bit confused when I have to read one, even with visual tools. I understand full scans should be avoided and higher values are worse, but it’s not always clear what to do in response.
For example I know I have queries with DISTINCT (which I will rewrite on Monday), but with just a query plan I wouldn’t have thought of using EXISTS as a fix.
Maybe a query plan linter would help? In this case returning something like « suspicious usage of DISTINCT with JOINs, you might want to consider using EXISTS instead <link to details + examples> »
I don’t unfortunately - I think it might be difficult to write something both simple and generic. The execution plan is going to depend on the specific SQL engine, and ultimately will require some familiarity with those details beyond just what the query itself says.
I have used SQL Prompt Pro from Redgate in the past, which gives some SQL code analysis features - that’s the closest thing I’m aware of to the SQL linter you’re thinking about.
I’ll give that a look, thanks!
When I worked on a research database I came up with the idea of storing 128-bit hash codes (instead of the actual keys) to optimize DISTINCT, but I’m not sure if any production database uses this approach.
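At the SQL level you can approximate the idea with a 128-bit digest such as md5 - purely illustrative (PostgreSQL syntax and the article’s column names), not how any engine does it internally:
-- De-duplicate on a fixed-size 128-bit digest of the key columns
-- instead of comparing the (potentially wide) columns themselves.
select count(distinct md5(c.CustomerID::text || '|' || c.Name))
from Customers c
join Orders o on o.CustomerID = c.CustomerID;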