Nice hack! One has to wonder, though, how it would work with Postgres.
We had this case, albeit years ago on an olden version of MySQL, where a slightly more complex JOIN with aggregate action, but less data than in the post, would not be “web-scale”.
I tested it rogue with Postgres, for which it was a piece of piss, but because we were stuck with MySQL in production, we had to implement denormalization through the app :(
It’s bad too, last I tried. Postgres has a hard time lifting the JOIN above the DISTINCT. I had to use subqueries with PG10 when I tried a similar “select distinct and join on the distinct rows” at work. Took the join down from a few hundred thousand tuples to less than 100. :-(
No. There are no SQL operations that are “not optimal” in SQL. A query is a description of a result, not the execution strategy. Any execution strategy can be chosen by a SQL engine, and obviously optimizers aim to choose the optimal one.
There is only one situation where using text streams is objectively better: the operation you want to do isn’t implemented efficiently by your SQL engine, which is a problem with your engine, not SQL itself. There are even SQL engines that operate on text streams directly, e.g. Hive, or q.
The author has a different problem, they don’t know enough about SQL / their SQL engine to make this query work. Which the author actually states in the post.
As experimenting with each suggestion could easily take at least half a day, I proceeded with a way I knew would work efficiently and reliably.
This query is an “optimal” case where SQL typically excels: most SQL engines are perfectly capable of using this exact execution plan. Given that personal constraint, they made a reasonable choice to run the query outside of their database using an approach they understand.
Nice hack! One has to wonder, though, how it would work with Postgres.
We had this case, albeit years ago on an olden version of MySQL, where a slightly more complex JOIN with aggregate action, but less data than in the post, would not be “web-scale”.
I tested it rogue with Postgres, for which it was a piece of piss, but because we were stuck with MySQL in production, we had to implement denormalization through the app :(
It’s bad too, last I tried. Postgres has a hard time lifting the JOIN above the DISTINCT. I had to use subqueries with PG10 when I tried a similar “select distinct and join on the distinct rows” at work. Took the join down from a few hundred thousand tuples to less than 100. :-(
Is there a list of SQL operations that are optimal in SQL and that are not?
What are the operations that SQL truly excels? What are the operations better handled with commands for text streams/files?
No. There are no SQL operations that are “not optimal” in SQL. A query is a description of a result, not the execution strategy. Any execution strategy can be chosen by a SQL engine, and obviously optimizers aim to choose the optimal one.
There is only one situation where using text streams is objectively better: the operation you want to do isn’t implemented efficiently by your SQL engine, which is a problem with your engine, not SQL itself. There are even SQL engines that operate on text streams directly, e.g. Hive, or q.
The author has a different problem, they don’t know enough about SQL / their SQL engine to make this query work. Which the author actually states in the post.
This query is an “optimal” case where SQL typically excels: most SQL engines are perfectly capable of using this exact execution plan. Given that personal constraint, they made a reasonable choice to run the query outside of their database using an approach they understand.
Reading posts like this one make me want to learn more about the available Unix/Linux commands.