1. 17
  1.  

    1. 14

      They are absolutely compilers. :-P

      But that’s ok, compilers are databases too.

      1. 3

        Yikes, a wild nerdsnipe attacks!

        Most compilers are batch programs, which is rather different from the kind of SQL OLTP or OLAP systems we think of as databases.

        But then I think back to the early decades of computing and realise that most database programming back then was batch processing: the database was stored on a stack of cards or a mag tape that was mounted by an operator who had to carry it across the room from the library to the tape drive before the job could run. Fun fact! Perl was designed to do the same kind of things as COBOL - talking to databases, producing reports, running as a daily batch cron job, etc. I suppose from that point of view, libraries and archives (.a files) are specialized database files.

        There’s a funny episode in the history of Ada. Its traditional (1980s) compilation model was to keep its library of compiled module interfaces in something a lot more like a database than trad unix compilers. Then along comes gnat and applies the trad unix model to ada, and finds that it’s cheaper to extract the module interface from the source than from a library database. https://dl.acm.org/doi/abs/10.1145/197694.197708 I do not know if the designers of C++ modules avoided ada’s mistake.

        Anyway, compiler-as-database is much more true in a modern sense for query-oriented compilers like Roslyn and tsc and LSP language servers in general.

        And (recalling the recent discussion about the boundary between compilers and interpreters) in dynamic languages it’s often the case that the compiler is updating a database (hash table) of global bindings.

        1. 2

          And (recalling the recent discussion about the boundary between compilers and interpreters) in dynamic languages it’s often the case that the compiler is updating a database (hash table) of global bindings.

          It’s also often the case that they’re JIT compilers, whose databases are a lot more complex than just a hash table of global bindings.

          Also, they can be seen as similar to database query planners, in the sense that they use stats to decide how to execute code. (Profile-guided batch compilers do that too!)

      2. 3

        This is very unhelpful

        1. 6

          But it is funny and there’s some truth in it.

        2. 3

          Sorry.

    2. 11

      Maybe I am way too deep in legacy/bespoke code, but the vast majority of the SQL I deal with (and have for many years) was absolutely written by humans.

      1. 5

        I found that line a bit dubious myself. I’m sure the amount of SQL being churned out by popular ORMs is absolutely huge and probably growing much more quickly than the amount of human-written SQL, but to assert that the human-written SQL is “vanishingly” small seems like a stretch.

    3. 11

      First of all, SQL, from the get-go, doesn’t have a notion of the execution order of things.

      Like Haskell! Or Prolog!

      A vanishingly small amount of SQL is written by humans.

      Like Postscript!

      1. 1

        Haskell

        Why not? Apart from IO, there are guards, that depend on ordering. The compiler can’t in general check if guards overlap.

        ghci> :{
        ghci| f | True == True = 42
        ghci|   | False == False = 43
        ghci| :}
        ghci> 
        ghci> f
        42
        

        If this is not execution order, what would you call it?

        1. 2

          Well, the point of my comment is that SQL isn’t as special a language as the author seemed to think. In that sense, a couple exceptions don’t invalidate my point. I bet there are a couple places in SQL where subexpressions are guaranteed to be executed in some order too.

          There’s a way of viewing Haskell where even do and guards don’t have “a notation of execution order”, though. The guard above is equivalent to:

          if True == True then f 42 else (if False == False then f 43 else undefined)
          

          And

          do
              x
              y
              z
          

          is defined to be equivalent to (not sure if I’m getting this exactly right, but you get the idea):

          x >>= \x -> (y ==> \y. z y)
          

          In both cases, the required “execution order” can actually be viewed as purely functional code with nesting.

          1. 2

            Oh ok, I understand. I suppose that the assertion is that if expressions don’t have side effects, and most things are expressions, then the language mostly doesn’t have side effects. I guess I misunderstood it because I think of expression evaluation as embedding an order of operations. Which is apparently where I went wrong.

            1. 1

              Another way to think of it is in terms of equivalences. The blog post gave some equivalences between SQL expressions. In Haskell, these expressions are always equivalent, for any subexpressions X and Y:

              X + Y === Y + X
              f(X, Y) === f(Y, X)
              

              This is unusual! Very few languages have these equivalences.

              1. 1

                On the other hand, haskell loses a lot of equivalences because of laziness and in particular the use of syntactic dependencies. (Fast and loose reasoning isn’t actually correct.)

                1. 1

                  I don’t know if this. Could you give an example of an equivalence lost to laziness? And what’s a “syntactic dependency”?

                  1. 1

                    Suppose we define:

                    times Z _ = Z
                    times (S x) y = plus y (times x y)
                    

                    In a strict language, this commutes (assuming acyclicality), but in haskell, times Z bot is Z, but times bot Z is bot.

                    1. 1

                      Ooh, interesting! Thanks for the example.

    4. 7

      I’m not sure I agree. The front end of a database is very much like a compiler, with lexing, parsing, a re-write system, optimizer, and outputting either a lower level language interpretable by the database, or even machine code (like was done in System R).

      Understanding compilers, will absolutely help you here, especially declarative or functional programming languages that don’t have a notion of execution order, but where it’s still possible to encode data dependencies.

      Databases do persistently store data and allow retrieval, so they aren’t the same as compilers, but they have enough compiler tech in them that you could accurately say they contain a compiler!

      1. 3

        Yeah exactly … I always thought this meme was silly.

        Yes there is overlap, because most databases contain a programming language – SQL. But some don’t, I think Mongo.

        When two things overlap, it doesn’t mean they’re the same.

        1. 3

          A programming langage does not have to be turing complete, to say nothing of being good.

          If a mongo query was a direct list of concrete operations on storage it might be argued otherwise, but a mongo query compiles town to a plan (or several which then get pruned out).

          1. 1

            I mean I’d say a database is a compiler as much as a shell is a compiler.

            Sure they have some things in common, but it’s not really an interesting analogy IMO

            Or at least people should say exactly why they think a database is a compiler, not repeat unhelpful memes without references

            If it were “a database HAS a compiler”, sure that’s less objectionable

    5. 4

      but… wait…

      I see the point being made here and it’s a good point, the heuristics by which compilers decide what kinds of transformations are okay are not the same as the heuristics by which databases should decide that. I think that honestly I come down on the side of treating databases more like compilers rather than less, but it’s a legit topic to discuss

      but the example doesn’t shine light on that; if anything it obscures it

      for a start, neither of those transformations is valid and not for a deep reason but because they do have different semantics

      the semantics of an ON clause are clear and well-defined. the “when” of it is not, so the article’s technically right, but that’s only because time is a bad metaphor for SQL’s execution model

      the author suggests that the confusion is caused by random() not being a pure function, ie. depending on things other than its inputs. I think that’s a red herring. I think the confusion is because random() doesn’t care about the values in the rows at all!

      therefore, I’d like to work through an example with a function that does care about the values, and is also pure. I think when doing that, it’s obvious that these transformations are disallowed. consider:

      SELECT count(*) FROM one_thousand AS a
      INNER JOIN one_thousand AS b ON gcd(a.id, b.id) = 1
      

      (this is a close match to the original query in the article, I just changed random to gcd and broke it into two lines for readability)

      it should be obvious that there’s no way to turn that into something in the form

      SELECT
        count(*)
          FROM
        (SELECT * FROM one_thousand AS a WHERE gcd(a.id, ????) = 1)
          INNER JOIN
        one_thousand AS b ON true
      

      because the join “hasn’t happened yet” in the subquery, so b isn’t visible to it. that’s clearly defined from the spec.

      similarly, you also can’t transform it to something with structure analogous to

      SELECT
        count(*)
          FROM
        (SELECT * FROM one_thousand AS a WHERE gcd(a.id, ???) = 1)
          INNER JOIN
        (SELECT * FROM one_thousand AS b WHERE gcd(???, b.id) = 1)
          ON true
      

      because, again, neither subquery can see the other. the semantics here ARE well-defined.

      I don’t have time to go through it in detail right now but I am pretty sure from my math background, and from working through it in my head, that if you were to actually run the various versions of the query with random() a few thousand times, all three of the author’s queries would give results drawn from different possibility distributions. (I can’t quite prove to my own satisfaction that the third one is different from the first, but it feels different. the second definitely is not the same as either of them.)

      probability math is its own complex subject which trips people up a lot, but the weird behavior of the author’s examples is precisely due to the semantics of where in the process the operation happens - the part that SQL specifies quite explicitly! it’s exactly analogous to why my three gcd examples don’t mean the same thing.

      now, if the value of random() were constant - depending neither on other values in the query nor on side-effects outside the query’s control - then the transformations would be valid. for example, if the original ON clause were just ON true, all three queries would mean the same thing, both to anyone observing their output and to the formal semantics of SQL. but that’s not the case, so the transformation is disallowed.

      hope this actually helps, it can be hard to tell with these things whether I’m making it better or worse :)

      1. 1

        just replying to myself to add: I figured out the probability distributions, more or less. I think they’re normal distributions but I couldn’t swear to that; at any rate…

        the first one gives values centered around 1000*1000*.05, ie. 50,000.

        the second one also gives values centered around 1000*1000*.05 but, interestingly, it only gives values that are divisible by a thousand. this is because rows are discrete, you can’t have part of a row, so you get a spreading-out effect like when you’ve gotten too aggressive with the levels tool in photoshop.

        the third one gives values centered around (1000*.05)*(1000*.05), ie. 2,500

        again, the intuition for this is that in both the random() example and the gcd() example, it matters whether you’re making a choice about a pair of rows that have already been put into correspondence, or whether you’re going to put them into correspondence afterwards at some point.

        (edit: inadvertent markdown syntax)

    6. 3

      First of all, SQL, from the get-go, doesn’t have a notion of the execution order of things. Impure functions mess with that notion a bit, but the problem here is the impure functions—they’re the culprit here.

      Ok, we can blame random() instead of the optimizer or the SQL language. But then what behavior should we expect from random()?

      • Evaluate once for the whole query.
      • Evaluate once per batch of rows.
      • Evaluate once per partition.
      • Evaluate once if there are fewer than 1000 rows; otherwise evaluate once per row.
      • Always return 0.4.

      It wouldn’t be useful to permit all of these, so how do we define it in a way that’s both useful and permits optimizations?

      1. 9

        I think the argument in this post is dubious at best. Obviously the optimizations that a database query engine performs should be semantics-preserving in the same way that compiler optimizers are. In the case of an ON clause in a join, the semantics are that the expression is evaluated on the cartesian product of the rows from the tables that are being joined. In the previous article, SQLite and CockroachDB optimize incorrectly because they assume that a function that doesn’t depend on the contents of one row is effectively constant for any row from that table. This isn’t true for random(), but I don’t think there are many functions that are weird in a similar to random() so it’s not a good example from which to generalize like this article does. The date_part() example is not an example of a side effect: it’s a straightforward dependency on the contents of a row, which is the thing that random() lacked, so it won’t break the push-down optimization in the same way.

        1. 3

          If you push down date_part you might evaluate it on rows that it would not get evaluated on if you do not push it down because the other side of the join might be empty.

          1. 3

            Oh right, now I get the point about empty joins - I thought it was performance, I missed the effect on correctness. There are lots of partial functions … I am dismayed that the Postgres docs for mathematical functions don’t appear to say what happens if you divide by zero or take the logarithm of a negative number (maybe I am looking in the wrong place?). Anyway, I was wondering if they raise an error or return NULL. In the context of a join, it would make the planner’s job easier if errors became NULL; maybe there’s a wrapper function to NULLify errors.

            1. 1

              They raise errors. You could easily write your own safe_div and safe_log functions, and more importantly declare them immutable the same as would be for PG-native versions, but in a where context you could even more easily check in another predicate before proceeding to divide or take the log since ((x <> 0) and (n / x > 1)) only needs to fail the first branch.

        2. 2

          Obviously the optimizations that a database query engine performs should be semantics-preserving in the same way that compiler optimizers are.

          If the semantics aren’t defined (different databases don’t agree on the them), what’s there to preserve? This is essentially undefined behaviour. It’s not as harmful as UB in C, but very similar in that the optimizer can produce different results because the results you got the first time around weren’t guaranteed anyway.

          Similarly, if you use LIMIT 1 without an explicit ordering, you might get different results at different times, even with the same query and the same data because the optimizer might select a different plan (after a VACUUM or an ANALYZE, for example).

          One could say these are cracks in the declarational facade that databases try to put up. The abstraction leaks and the underlying implementation/execution strategy shines through.

          1. 2

            You can have a nondeterministic semantics. That is a perfectly fine thing to do. The specification describes a range of legal behaviours and intermediate states, instead of a single legal behaviour; and a well-formed program should ensure that, under all legal behaviours, it ends up in the same final state (or, rather, that all allowable final states maintain application invariants; for example, two newly added rows could get added in different orders and so have different ids, but that probably doesn’t matter).

          2. 1

            I genuinely think it’s clearly defined as-is, per my long post elsewhere in the thread.

      2. 4

        I guess my point there, that I should have made more explicitly, was that I think answering this question in a coherent way is incompatible with a declarative language. Or at least the ones we are stuck with today. I could imagine a language designed to have a sane answer to this question and still have a lot of the good properties that a language like SQL has, but I don’t know what it would look like and I’m not sure SQL is it.

        1. 4

          Yeah, I agree random() doesn’t fit well into the language. :(

          But I bet you could rescue it! You could find a definition that explains the behaviors people expect, but is still declarative.

          For example here are some things I expect about random():

          • multiple random() calls in the same expression are independent
          • random() calls on separate rows are independent
          • When a subquery uses random(), the subquery result for different rows of the outer query are independent.
          • multiple requests are independent (even though the query text is the same).

          So you could define random() as a function that takes a whole bag of rows, and marks each row with a value such that the overall distribution comes out right. Or you could think of it as a function that maps each unique row ID to an independent random value–even though these dependencies aren’t visible in the user syntax.