1. 43

Note that I’m not asking why the relational model is successful; instead, I wonder why almost every implementation of the relational model takes SQL as their input.

Let me explain why I think SQL shouldn’t be successful. As illustrated in an article I shared eariler, the abstraction barrier laid by SQL is so leaky that any serious DBA must examine the query plan of their SQL statements to avoid writing hopelessly inefficient queries. Ultimately, while SQL claims to be a declarative language, contemporary RDBMS’s are far away from being smart enough to deliver that promise, so programmers/DBA still have to process a good understanding of their data and worry about details in the imperative execution of their queries, mostly by “reverse engineering” the query plan they want to craft a SQL statement, or even worse, stirring the SQL until the query plan start looking right.

I’m not an expert in databases, but I have been tortured by SQL recently, and I naively think an imperative DSL for writing queries would help improve the situation. As an extreme example, perhaps an RDBMS can allow programmers to compose query plans explicitly? Surely the initial learning curve would be steep, but getting rid of an unnecessary indirection should boost productivity in the long term. However, I’m not aware of any system like this. It looks like people just use SQL even if it has long become a hairy beast.

So why is SQL so successful despite its shortcoming? I hope this is not a stupid question :)

  1.  

  2. 23

    The abstraction barrier laid by SQL is so leaky that any serious DBA must examine the query plan of their SQL statements to avoid writing hopelessly inefficient queries.

    In 95% of cases, it’ll do a Good Enough (tm) thing until data gets large enough…and the tooling is there to debug and optimize those cases. The really gnarly queries I’ve seen tend to be caused by application-level wrappers over SQL that do dumb things.

    Let me explain why I think SQL shouldn’t be successful.

    “The difference between theory and practice…” as the old joke used to run.

    I believe that SQL is successful for the following reasons:

    • It’s sufficiently well specified–none of the problems of C that helped encourage new dialects.
    • It had and has many competing implementations.
    • It models the problem domain of munging sets with declarative language “well enough”.
    • It recognizes useful datatypes for its end users (business folks)–sure, no binary trees or monads or whatever, but if you need to handle money, or time, or time ranges, or more exotic stuff (hi Postgres!), SQL has your back.
    • It is a refinement of a problem space over half a century, going back to the original paper. It’s theoretical roots are strong.
    • It benefits from bigger machines (vertical scaling), clever programming (density scaling), and in most implementations additional machines (horizontal scaling).
    • The amount of knowledge around how to make a query work and make it fast is immense.

    As an extreme example, perhaps an RDBMS can allow programmers to compose query plans explicitly?

    If we wanted to do that we’d just write the data management in our application. Frequently, that’s an acceptable answer.

    Surely the initial learning curve would be steep, but getting rid of an unnecessary indirection should boost productivity in the long term.

    The massive productivity hit for switching to some as-yet-undefined-and-unproven language seems wrong. I can get an MBA or a senior customer service person to write SQL reports for themselves to scratch itches…and if the reports run slow, I run it on a larger machine. If it still doesn’t work, I can rewrite the query. This is an extremely productive escalation strategy in a business context.

    It looks like people just use SQL even if it has long become a hairy beast.

    A chisel, axe, or hammer from Roman times looks a great deal like what I’d buy at a hardware store today. Sometimes, we’ve gotten a tool “good enough”, and additional fiddling is wasted effort.

    1. 14

      I can get an MBA or a senior customer service person to write SQL reports for themselves to scratch itches…and if the reports run slow, I run it on a larger machine. If it still doesn’t work, I can rewrite the query. This is an extremely productive escalation strategy in a business context.

      This is the best answer, and frankly the only one that really matters. SQL wasn’t designed for application programmers or (what would become) DBAs. It was designed for business people who were willing to learn spreadsheets and would be willing to learn (some) SQL. The fact that we hackers mostly use it too is just a testament to how suitable it is for most uses most of the time.

      1. 12

        SQL is awesome, I replaced 9k lines of a Java finance system with an albeit only-understandable-by-a-haskell programmer 200 line sql query. Two days for the SQL including tests. The productivity of SQL can be nearly infinite in the right hands.

    2. 20

      SQL is a narrow waist. Almost all databases support SQL. Almost all ORMs and data analysis tools emit SQL. If you make a new database that doesn’t support SQL it’s very hard to get adoption because users would have to abandon all their existing tools. If you make a new data analysis tool that doesn’t emit SQL it’s very hard to get adoption because users won’t be able to use their existing databases.

      an imperative DSL for writing queries would help improve the situation

      This is essentially what Spark and Flink are. Both of them have also recently added a SQL layer on top.

      The upsides of having a declarative layer are that the database does a lot of the optimization work for you, in many cases can do a better job than a human while still letting you write readable code, and when the data changes it can reoptimize without you having to rewrite all your code.

      This is pretty similar to the situation with assembly and structured languages. They will often generate much worse assembly then you would write by hand, but they allow much more productivity and make it easy to port software between different ISAs. Occasionally you might want to spot-check a hot loop or maybe even use a little inline assembly.

      SQL has a lot of deficiencies as a language, but whatever replaces it will probably have the same separation between query and plan. The only thing I can see changing is better hinting and debugging tools.

      (It would be nice to be able to submit plans directly, but that would also constrain the database to never changing the plan interface and that has been frequently necessary over the last few decades to take advantage of changes in hardware. There is some research work on the subject but it doesn’t feel like a solved problem yet.)

      1. 3

        Adding to this, most traditional RDBMS’s had a more imperative API via cursors. I don’t have a link ready to backup my position, but every opinion I’ve heard throughout my 14 year career is to avoid cursors whenever possible. I’ve replaced a lot of cursor usage with set operations (e.g. INSERT...SELECT) for tremendous speed gains (and readability, imo).

        One reason for the performance of standard set operations is that a lot of thought has gone into them. The engine makes decisions with a lot of knowledge about data structures, storage hardware, and profile of current data (e.g. cardinality of a join match). OTOH in Spark, it isn’t uncommon for me to reach for RDD operations because the SQL query engine isn’t as developed (this is changing, of course).

      2. 12

        SQL maps very closely to the underlying set theory, which makes it very easy to reason about.

        1. 5

          I do not think it maps closely to set theory at all. There’s the Null value, rows can be duplicated, etc. People have been criticising SQL since the 80s because it’s not close to set theory.

          1. 2

            There are variations on set theory that account for duplicates.

        2. 8

          I’d say that debugging SQL queries requires bypassing the abstraction barrier.

          I don’t see why that iindicates a problem or limitation with SQL, though. If I singlestep a Kotlin program, that doesn’t indicate a defect in the language Kotlin, at least not in my eyes. Why should examining a query plan indicate a defect in SQL? Rather, looking at what’s happening in different ways and correlating the intent with ≥1 ways of seeing the effect seems to me to be an indispensable part of debugging (and of iterative development generally).

          This doesn’t mean that I will defend e.g. Oracle. In my experience, oracle queries grow messy. Someone would improve the query by adding specific instructions to the query planner, and over time the queries grew less readable. Layer violations have no place in source code. I’ve also used two other SQL databases for which the same was true.

          But I’ve also used Postgres, and there, I found that the process of examining the query plans and optimising the queries generally led to improved queries. To queries that more clearly, more succinctly expressed the original intention.

          1. 5

            I too hated SQL when I first started with it. It grew on me though and I think it actually gets a lot of undeserved flak.

            any serious DBA must examine the query plan of their SQL statements

            A lot of people who aren’t DBAs write sql. And anyone who writes sql should know how to view and read a query plan. I would also suggest that knowing the database your working with (oracle, mysql, postgres, etc.) and how to write effective and efficient queries for it is no different, IMO, than telling me that I should know how to effectively write and debug efficient python programs. Every programming language has its own quirks and ins and outs. Frankly, again IMO, sql is fairly easy comparatively. Query plans are honestly pretty easy to read. For example, compare them to using gdb.

            I naively think an imperative DSL for writing queries

            If you look around the net enough you’ll find various people have come up with such things. They do exist though almost no one uses them. At this point sql is very entrenched and unseating it is going to take a genuine revolution in how to do it versus a couple of steps better.

            So why is SQL so successful despite its shortcoming?

            A large part of this, IMO, is it’s age. sql is old. All the OGs of the database world used it. And all the upstarts used it because the OGs did.

            1. 3

              A lot of people who aren’t DBAs write sql. And anyone who writes sql should know how to view and read a query plan.

              I think in most businesses these points are in conflict. The only people looking at query plans are developers and DBAs (where they still exist.) Many of the people writing SQL are in business areas. They’ll possibly be using the SQL tools that developers have, but more likely they’ll be running some tool focused on report development. Smart DBAs will setup replicas and restrict access so there is risk is reduced. It’s only when these queries get moved into a more formal setting that developers and DBAs start optimizing.

              That for me, is the crux of why SQL is successful. You have a language that flourishes is a relatively non-technical environment outside of IT, yet can be directly translated into a high performance system most of the time. Much software development involves throwing it all away when these transitions happen, but SQL (and the implication of RDBMS use) allows a relatively smooth transition. For this reason, use of anything other than a relational database should be exposed to reasonable scrutiny in a business setting. Startups, ML, search, etc. are all special areas compared to the regular businesses which is all around us. It’s often hard to appreciate that when we work in tech, like optimizing, and are enthusiastic about learning new things.

            2. 5

              A lot of comments here are drilling into the specifics of SQL, and maybe they have a point, but imho the reason SQL keeps getting chosen is that all the potential clients already know SQL and are used to it. Also, inventing a new language is hard, and full of potholes, and database companies are usually not ambitious in that way.

              1. 4

                For quite a while there were competing languages for RDBMS, the most famous was QUEL, which was the original language of the father of Postgres: Ingres. More info here: https://www.holistics.io/blog/quel-vs-sql/

                1. 4

                  any serious DBA must examine the query plan of their SQL statements to avoid writing hopelessly inefficient queries.

                  Well, that very much depends on the quality of the data model (not only on that, of course, but a bad data model is like an original sin, which makes everything else crack and crumble). Defining a bunch of tables and using SQL does not automagically make you a Relational database designer (let alone a “designer”).

                  Serious DBAs must of course read query plans, but only to optimize queries that are amenable to optimization. “Hopeless inefficiency” is not typically a SQL problem: it has much deeper roots. SQL is just a programming language, after all (and an expressive one!).

                  I naively think an imperative DSL for writing queries would help improve the situation.

                  If I may: yes, that’s naive. Go read this: https://www.mcjones.org/System_R/SQL_Reunion_95/sqlr95.html. If not all, at least the anecdote about the “Query Game” (The birth of SQL > Prehistory). Then, consider the parabola of “NoSQL” systems. Take Hadoop, for example (anyone remember it?): look how the queries were written to begin with (spoiler: pages of Java code); look what the did next (spoiler: “invent” declarative languages); find how those languages were designed (spoiler: based on Relational principles—more or less). Not to mention transactions….

                  There is inherent complexity in data management problems, as for any other algorithmic problem. That can’t be bypassed by choosing a different toolset. But it can be tamed with the right mindset :-)

                  1. 4

                    As a SQL lover, I have a public service announcement: most people limit themselves to some old dialect of SQL, which gives it a bad rep. https://modern-sql.com/video

                    1. 3

                      Put me in the same category of people who had this complaint for 10+ years, but then just learned to suck it up (not that I’m an expert in SQL, but I’m learning). It’s especially annoying that some queries require superlinear behavior by their semantics but they don’t look very different from well-behaved queries.

                      This is exactly why NoSQL was a trend for many years (when “scalability” was the buzzword), although users have realized what they’ve lost (NewSQL, etc.)

                      For offline analytics, I view SQL as more of an extraction language (to select the data you want), and R or Python can be the analysis language. That can be a good split and it can also speed things up incredibly. Doing analytics logic in SQL often leads to very long and opaquely slow queries.

                      Code Comparison: What Is a Data Frame? (In Python, R, and SQL)

                      On the other hand the authors of this paper view things from an SQL lens and have the opposite viewpoint. So there are problems everywhere. It is kind of funny this is one of the oldest subfields in computer science and it still has pretty big usability problems in practice.

                      Towards Scalable Dataframe Systems

                      The complexity of the [pandas] API and evaluation semantics also make it difficult to apply traditional query optimization techniques. Indeed, each operator within a pandas “query plan” is executed completely before subsequent operators are executed, with limited optimization, and no reordering of operators or pipelining (unless explicitly done so by the user using .pipe). Moreover, the performance of the pandas.DataFrame API breaks down when processing even moderate volumes of data that do not fit in memory …


                      Related thread: https://lobste.rs/s/r5qaap/introducing_preql_new_relational

                      It seems like optimizing across CTEs is a new thing in Postgres and it’s not necessarily done in other engines (sqlite, etc.) That is unfortunate as CTEs make the code more composable. So you have to trade off the structure of the code for performance, which I’d say is another “smell”.

                      I also agree it’s a “narow waist” – on the one side you have applications, and on the other you have storage engines, query planners, etc. The compatibility isn’t perfect, but many “big data” systems are SQL compatible.

                      1. 3

                        I don’t think going back to imperative DB access and writing the query and the plan together would help. That’s what was done before SQL. I would love relational NoSQL though.

                        SQL was ahead of its time being a declarative relational language, but it’s also massively of its time. It was critiqued then too; RMv2 by Codd is worth a read to see what it got right and wrong (and I didn’t always agree with Codd).

                        1. 3

                          Obviously I don’t know about your particular situation, so please don’t take offense at this, but most of the times I’ve seen people have major problems with SQL, the real problem was that they did a half baked job with the data model and assumed they could toss everything in a RDBMS and just run SQL without thinking about it.

                          In other words, as the programmer/DBA, you SHOULD have a good understanding of your data and how it maps into a relational model, just like you need a good understanding of your problem domain in order to write code in other languages. You have to understand the tools you’re using.

                          Not to say a good model magically solves all of the problems, but it definitely helps to have a good starting place.

                          Personally, after using MongoDB for a while and having a little experience with other NoSQL databases, I’ll take SQL any day of the week. Reading through a book on SQL or data modelling helps a lot.

                          1. 3

                            SQL is supported by everything and, I think especially at its inception, works very well compared to other options.

                            That being said, I am very surprised at how to this day there isn’t more demand for “let me directly write the query plan tuple”. You wouldn’t ship a programming language that can’t do C interop (or like… direct to native mechanisms), but somehow we have gotten this far without anyone saying “here’s a special syntax. If you provide this we will run exactly the query plan you provide”.

                            I feel like whoever gets this will quickly find an audience among programmers coming from a more FP paradigm.

                            To be clear, I think that, for example, the innards of Postgres are super good. I just don’t feel I need the query planner, and with the right DSL, describing queries as “traverse this index” rather than “traverse this table and please use this index pretty please” would feel more natural and would remove a class of bad SQL.

                            1. 3

                              Often the people writing SQL are not database experts. They’re people like business analysts, data scientists, and even domain experts like accountants and supply chain experts who don’t know or care about the performance of the database. Then DBAs can come in and fine-tune the SQL as needed. As such, SQL is a lingua franca between different disciplines. I definitely agree it could be improved – for example, if there were a standard way of specifying those optimizations that DBAs do as extensions to a query – but it’s at least decent for its purpose.

                              1. 2

                                and I naively think an imperative DSL for writing queries would help improve the situation.

                                That’s because SQL is a bad DSL.

                                The entire point of a DSL is to abstract away how something works, reducing cognitive burden. That is not true of SQL except for the simplest of queries. It is instead just an abstraction layer hiding how things function, leading to nothing but your aforementioned problems.

                                Meaning, I do agree a simple imperative DSL would work better in the vast majority of complex cases.

                                All that SQL ever does is force me to think “How would SQL actually pull this information, check indexes, join them, etc?” as I write my SQL. That’s actually an increased cognitive load of translating my desired goal into the SQL that would result in it.

                                1. 2

                                  A significant benefit to the abstraction of SQL from the query plan that you don’t seem to be keen on is that the ideal plan is often different for different datasets, even with the same schema. Postgres, for instance, will usually give you sequential scans on small tables versus index scans on large ones. On your developer machine, the sequential scan performs wonderfully because the table is actually in memory in the disk cache. On the server, the sequential scan performs badly because there are millions of rows. But you don’t have to worry about this because Postgres is using statistics about the actual tables to inform the planner, so on the server it chooses an index scan. This can be irritating the first time you see it, because you just made an index on your machine and you don’t know why Postgres isn’t using it. But in general you have to diagnose performance issues in situ rather than in another database with the same schema.

                                  I guess I’m saying, I don’t think I’ve ever seen Postgres come up with a bad plan for no reason. If the plan is inefficient, there’s usually a reason, and usually it doesn’t take a lot of work to figure out why: a missing index is probably 80% of the cause, followed by an overly-procedural query or a configuration issue like insufficient work_mem or out-of-date statistics. In the ~18 years I’ve been using Postgres professionally, I can say it was definitively a configuration issue about twice. So I don’t see anything to get worked up about, personally.

                                  1. 1

                                    I don’t know, I guess it means the production query planners are doing a good (-enough) job. I do like the idea of a manual query planner, I feel like its language would be more or less an assembly language for SQL. If you could write your own statistics-based decision trees, that’d be even cooler. Not to beat the planner, necessarily (it should be unlikely, right?), but to explore more possibilities over the same basic query.

                                    I wonder, though, if a new language were to come along, would/should it still resemble SQL? I’m partial to languages that look like Elm. Maybe a lisp will suffice. Just not XML.

                                    1. 1

                                      It’s in the box, there is no choice.

                                      1. 2

                                        SQL is as old as C, give or take a year.

                                        C was also in the box, there was no choice, butt hat didn’t prevent dozens of other comparable language from taking over most of C’s use.

                                      2. 1

                                        why is SQL so successful despite its shortcoming?

                                        Human and machine efficiency:

                                        1. Human efficiency: If I learn e.g. MongoDB, FaunaDB, or Datomic’s proprietary languages, and they go out of business, I have literally wasted my time. If I learn SQL, I have an industry-standard skill I can re-use, probably for decades. Multiply that across thousands of people working with “data stuff” across the entire world.

                                        2. Machine efficiency: We have decades of research papers and industrial experience in optimizing SQL query performance. Some one-off DB with its own query language may or may not be able to take advantage of all that has been learned over the past 40+ years about SQL query optimization. The fact that someone has invented their own query language does not mean they don’t have this knowledge, but it’s a signal that they may not. (Which feeds back into point 1: I need to think your DB is worth spending my time learning, so you need to signal to me in various ways that you know what you are doing. SQL is one of those signals, since it is data’s lingua franca).

                                        1. 1

                                          SQL is (more than) good enough in most cases. Sometimes you can hit the limits of the particular query planner and confuse it, but this is quite rare compared to the cases where SQL server very well.

                                          any serious DBA must examine the query plan of their SQL statements to avoid writing hopelessly inefficient queries

                                          DBAs maintain the database instances and they can (and usually do) point to the slow queries and maybe give some hints – but it is developer’s job to write efficient queries and designer’s job to create reasonable structures in the database. The work starts with designing the schema and processes around it – writing the queries is just the last step.

                                          P.S. When I saw the title of this question, I expected some critique from relational purist’s point of view. SQL really can be seen somehow impure or a compromise. But it is rather a philosophical question than a performance issue.

                                          1. 1

                                            SQL is modeled on top of relational algegrab, which underpins relational databases - and then they put a ton of new features on top as part of their implementation. There is a solid mathematical foundation for how things should work, and then there’s 50+ years of experience developing software on top of this foundation.

                                            SQL is not pretty or slick, but it’s powerful. There are a lot of ORMs and other database engines/layers that can simplify it, and 95% of the time it’s perfectly performant because you do not need crazy queries. You can have entire, fully-fledged web platforms that have nothing but straightforward queries.

                                            But when you do need the power, to pull out a bunch of data and compare it and manipulate it at the source - lest you pull in 2GB of data that needs to be aggregated in-memory for a single HTTP request - it really shines. And I would argue that any developer worth their salt should know SQL well, at least if they have more than a few years of experience. Good data modeling and knowing how your database works (not all relational databases are created equal) can go a long way, even with sub-par code architecture. I also think that’s the same reason SQL is still going strong after all these years - a lot of other query languages fall short at a certain complexity or are specialized for very specific use cases.

                                            I have seen too many N+1 queries or other completely insane queries in otherwise nice codebases, where the developer has pulled out all the stops to make the code efficient, when it could be fixed by altering the query with a 99% performance increase.

                                            1. 1

                                              The improvements I would like to see over and above what SQL offers just are not compelling enough for anyone except hackers to attempt or even care about. Unless it was delivered to me on a silver platter, I wouldn’t even care.

                                              1. Full support of the relational algebra. I think some implementations (PostgreSQL?) do implement relational division. MS SQL does not…you have to write some hacky SQL.
                                              2. A native query language that is ordered like the order of operations the server actually implements. I believe LINQ is closest to this, but I would never default to using LINQ because of the overhead.
                                              3. A more granular approach to nulls. I haven’t wrapped my head around this in a while, but I believe one can make the case for 2 logical nulls. Better yet would be eliminating nulls for option type.
                                              4. Improved type support…better support of user defined types, for instance.