1. 6

  2. 2

    This is a long article and I didn’t digest all of it, partly because I’m not great at reading Ruby (or SQL for that matter).

    This bit jumped out at me, and seems similar to the points in the recent Against SQL about the SQL language being non-compositional:

    Sequel’s join does not correspond to a algebraic join of its operands. Instead, its specification looks like “adds a term to the SQL query’s FROM clause”

    To be fair… There is a way to use SQL (and, sometimes, those libraries) so as to avoid the problem described here. It amounts at using SQL in a purely algebraic way. Unfortunately, that way is not idiomatic and leads to complex SQL queries, that may have bad execution plans (at least in major open-source DBMSs

    I wonder if @jamii is famliar with this work and has any comments on it?

    Related past posts:

    Successor to Alf which still seems active: https://github.com/enspirit/bmg

    1. 2

      The interface is roughly similar to spark, flink, kafka streams, differential dataflow etc. It seems like the main contribution here is being able to compile that to sql without introducing crazy nested queries that the planner handles poorly. I’d be interested to see how that’s done.

      This kind of query-as-a-library interface is nice because you get to pawn a lot of the work off on to the host language. The ‘Reconciling heterogeneous type systems’ section near the bottom hints at the downsides. Functions in the host language are totally opaque, so they can’t be optimized or sent to the db. This also means that you can’t really do subqueries, which I find are often the easiest way to write many queries.

      I still haven’t decided whether it’s worth it in the long run to have to build a whole language just to get subqueries though. I’m experimenting with both approaches - language-integrated in https://github.com/jamii/dida and a full language in https://github.com/jamii/imp.

      1. 1

        Thanks for the info, the language vs. library issues make sense. I was more interested in how they avoid composition pitfalls of other DB access layers / SQL compilers.

        AFAIK spark and kafka don’t compile to SQL and instead implement their own subset. (There were several quirky SQL subsets at Google too for big data, and at some point people tried to unify them under a single language. Not sure if that succeeded.)

        But probably the barrier to that is my Ruby and SQL knowledge, as mentioned.

        BTW I loved the “Against SQL” article – here was my comment, pointing to some resources about dplyr, Tidy Data, and Data Frames, including critiques by the database community (the article mentioned pandas but R is where those ideas originated):


        I only use dplyr with the “native” engine; the SQL engine does seem to be a bit hacky. They don’t really deal with the semantic issues.


        Perfect translation is not possible because databases don’t have all the functions that R does. The goal of dplyr is to provide a semantic rather than a literal translation: what you mean rather than what is done. In fact, even for functions that exist both in databases and R, you shouldn’t expect results to be identical; database programmers have different priorities than R core programmers. For example, in R in order to get a higher level of numerical accuracy, mean() loops through the data twice. R’s mean() also provides a trim option for computing trimmed means; this is something that databases do not provide.

        BTW Oil will likely grow a type for table-like data: https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil

        However I would view it as “pre” structured or “pre” relational – i.e. a way of cleaning/preparing/filtering data. Analysis can be done by “proper” relational systems or R / Pandas.

        1. 2

          here was my comment,

          Ah, it’s impossible to read hn comments without the ‘unread’ marker that lobsters has. All the people who comment before reading get in there first and clutter up the comments and then it’s too late to find the thoughtful comments.

          I was more interested in how they avoid composition pitfalls

          That does seem like the rub for compiling to sql. For my own projects I think it will be easier to write my own execution engine but target eg sqlite storage. But I can see how compiling to sql is appealing.

          AFAIK spark and kafka don’t compile to SQL and instead implement their own subset.

          Yeah, the library version is roughly equivalent to directly specifying a query plan, and their sql subsets compile down to that. I think the library versions are interesting, because they are so much lower effort both to implement and adopt compared to a new query language. But flink et al are limited to jvm languages. I’m playing around in dida with trying to be a good citizen in many runtimes - there are zig, node and wasm<->js bindings so far. I’m not sure how practical this is compared to starting an entire new language but then at least getting to share more code between projects.

          pointing to some resources about dplyr, Tidy Data, and Data Frames

          I’ve used r dataframes a tiny amount, but I’m much more familiar with pandas. I have seen the “Is a dataframe just a table” and your “What is a dataframe” before - both were very useful.

    2. 1

      The upstream gem hasn’t had releases for quite a few years. More recently, the relational pipes tools can interoperate with many formats and also embody relational algebra. This sort of tooling is powerful and I wish that we had more of it.

      1. 1

        I linked to this successor which looks active: https://github.com/enspirit/bmg