Articles like this always seem to talk past the point.
Assuming you have enough data that you actually need horizontally-scaled data storage (a very big assumption – many situations would truthfully be fine without it), then doing a join requires coordination over the network, and that’s always a trickier proposition to scale.
Really, this article is an argument against the blanket statement “well, we can’t use joins in our RDBMS because it’s expensive” as well as the follow-up, resume-driven “we can’t use joins because when we get to such-and-such a scale we’ll need to horizontally scale anyway” – both of which are notably false.
So the article has a generally correct message, but it’s not really about NoSQL itself so much as not needing it in the first place. Which, I mean, fair enough, but it’s a bit of a clickbait title.
It also ignores some join use cases that can really get messy fast – in fact, all the examples use a constraint on the primary join key, which is kinda cheating. There are some joins that are just frustrating to scale at all (fanouts, and selectivity through join tables), so yeah, keep your schemas simple, folks.
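To illustrate the join-table case (these table names are hypothetical, not from the article): the selective predicate lives on one side while the fanout happens through the middle table, so the planner can’t bound the intermediate result up front.

-- hypothetical many-to-many schema: products <- order_items -> orders
select p.product_id, count(distinct o.user_id) buyers
from products p
join order_items oi on oi.product_id = p.product_id
join orders o on o.order_id = oi.order_id
where p.category = 'books'  -- the only selective predicate sits on one end of the chain
group by p.product_id;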
> Assuming you have enough data that you actually need horizontally-scaled data storage (a very big assumption – many situations would truthfully be fine without it), then doing a join requires coordination over the network, and that’s always a trickier proposition to scale.
At that level of scale, if you need joins, isn’t the choice between “coordination over the network” at the database layer or at the code layer?
Or is the choice to avoid joins entirely and just have denormalized data with the issues that come along with that?
Sort of. Joins at the database layer have to be stricter, and partial failure modes have to be treated more rigorously. Joins at the application layer have more flexibility: you can decide to show a placeholder for that section, use stale data, or report an error for just that section (see the sketch below). You can also choose to denormalize at write time.
From that perspective, joins at the database layer lose too much context about how to handle failures and have to rely on a stricter coordination mode, which can be expensive.
Of course, this is all from the perspective of designing non-critical / consumer applications.
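A minimal sketch of that flexibility, assuming the users/orders schema from the article – the app runs the pieces independently and composes them in code:

-- piece 1: the user record
select user_id, last_name from users where user_id = 42;

-- piece 2: that user's monthly order rollup; if this query fails or times out,
-- the app can render a placeholder for the section or fall back to stale/cached data
select date_trunc('month', order_date) order_month, sum(amount), count(*)
from orders
where user_id = 42
group by order_month;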
To make this point they would need to increase the table sizes enough that they no longer fit in memory, or on disk, on a single machine. I guess scale means different things to different people, but this is the definition I’d expect someone to mean in this context.
As per barakmich’s comment, I agree the “joins don’t scale” argument isn’t appropriately answered with a query that fits on a single machine. I also believe you could rewrite the query to remove the join:
select date_trunc('month',ORDER_DATE) order_month, USER_ID, sum(amount), count(*)
from ORDERS
where USER_ID=42
group by order_month, USER_ID
;
Essentially, yes, this is correct. Or at least that’s what the query optimizer deduced it could do.
There’s one slight difference between what you wrote and what the join as written in the article returns – if USER_ID=42 does not exist in the users table, the join returns nothing.
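For reference, the article’s joined query was presumably something along these lines (my reconstruction, not the exact query from the article):

-- reconstruction for illustration only
select date_trunc('month', o.order_date) order_month, o.user_id, sum(o.amount), count(*)
from orders o
join users u on u.user_id = o.user_id
where o.user_id = 42
group by order_month, o.user_id;

With USER_ID pinned on both sides, the users side contributes nothing to the output except the existence check.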
So it is an according-to-Hoyle relational algebra join, but one that’s really easy to optimize. You can see it in the EXPLAIN output in the article:
Index Only Scan using users_pkey on public.users (cost=0.29..4.30 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1)
  Output: users.user_id
  Index Cond: (users.user_id = 42)
Which is the side check that verifies the existence of the key. You’d get pretty much the same performance even if you asked for some details from that table (say, selecting last_name), since it’s pulling that row anyway.
A somewhat more indicative join would be WHERE last_name = 'Smith' (with appropriately loaded data – Smith being statistically prevalent in North America, for example) – so we wouldn’t know a priori how many USER_IDs match, and there’s potential fanout.
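A sketch of what that could look like, reusing the tables from the article (the last_name filter is my hypothetical, per the above):

select date_trunc('month', o.order_date) order_month, u.user_id, sum(o.amount), count(*)
from users u
join orders o on o.user_id = u.user_id
where u.last_name = 'Smith'  -- unknown number of matching users, so the join fans out into orders
group by order_month, u.user_id;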
Still a one-to-many join – many-to-many is still undiscussed – but at least it’d be something that made the join do some real lifting.