1. 19

abstract:

I describe the circumstances in which I obtained, in 1978, a good answer to a burning question: how can E.F. Codd’s relational model of data of 1970 [2] be properly embraced by a computer language? Considering that the answer to that question (“An Algebra of Relations for Machine Computation”) was publicly available in 1975, I wonder why it all went wrong and suggest some possible reasons.

But this paragraph on page 11 is perhaps a better abstract, or at least one that explains the title better:

That situation therefore prompts us to ask, why isn’t SQL truly relational? And, given that it isn’t, why did the industry at large so readily embrace it? Well, we can easily forgive the industry at large, I suppose, because in spite of all its defects, SQL was a hugely significant advance on what we had before—its implementations did offer a fairly intuitive declarative query language without the need for loops or pointer chasing, and they did offer a modicum of physical data independence. But why didn’t its inventors come up with a truly relational language? Why did they neglect the obvious need for each of a table’s columns to have a name? Why did they allow a table to have two or more columns with the same name? Why did they decide to allow more than one copy of the same row to appear in a table? Why did they invent NULL and introduce that third truth value, UNKNOWN?

  1. 7

    I’ve found lecture slides, also by Hugh Darwen, that explain how SQL is not relational: http://www.dcs.warwick.ac.uk/~hugh/TTM/TTM-TheAskewWall-printable.pdf

    I recommend reading the lecture slides first: knowing the painful ways in which SQL is not relational made me a lot more invested in the question "Why are there no DBMSs which are relational?"

    1. 1

      For those wondering "if he dislikes NULL so much, what does he propose to replace it with?", see https://www.dcs.warwick.ac.uk/~hugh/TTM/Why-Are-There-No-Relational-DBMSs.pdf

      I don’t find this particularly compelling given that

      • NULLs are far more convenient
      • You can organize data like he proposes and still keep NULL around if you want

      A more compelling argument would be to make it easy to extend the types to add things like Maybe or Either. But I suspect he would not be in favor of that because it isn’t relational enough…

      1. 3

        It has some overlap with some of the “data-oriented design” ideas of not having NULLs – https://www.dataorienteddesign.com/dodbook/node4.html.

        I think, yeah, it depends on the situation. If a type / domain has a natural sentinel value, that helps. In general, though, it's felt to me like if some key is not always related to some attribute, it makes sense to pull that attribute out into its own relation / table at the design level (sketched below). If the data is quite normalized, then that's likely the only non-key attribute, so you could ask whether the tuple / row should just be removed.
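
        A minimal SQL sketch of that decomposition, with invented table and column names:

          -- With NULL: one table, discount is NULL for most suppliers.
          CREATE TABLE suppliers_nullable (
            sid      TEXT PRIMARY KEY,
            name     TEXT NOT NULL,
            discount NUMERIC            -- NULL when no discount was negotiated
          );

          -- Without NULL: the optional attribute gets its own table, and a
          -- supplier with no negotiated discount simply has no row in it.
          CREATE TABLE suppliers (
            sid  TEXT PRIMARY KEY,
            name TEXT NOT NULL
          );
          CREATE TABLE supplier_discounts (
            sid      TEXT PRIMARY KEY REFERENCES suppliers (sid),
            discount NUMERIC NOT NULL
          );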

        In what way do you think NULLs are convenient? In many cases they're inconvenient because you have to make sure to check for NULL, or watch out for what happens to NULL values when doing sums etc. But without NULLs the logic of all the operators is clear.

        1. 3

          I think I agree with both of you: on one hand nulls can be convenient and the correct way to code things, but on the other hand SQL seems to have some rather muddled operator logic around nulls.

          I can completely get behind NULL as missing data, the sort of value you deliberately insert to indicate 'not known' rather than 'empty-ish', and then handle in your analysis. But that fondness may be because I come from R: there, missing values (NA) are an inescapable fact of data analysis, so a lot is known about how to handle them (use methods robust to missing data, impute missing values, or drop incomplete observations), and the tools are present.

          I think SQL is not as good about this, because it uses NULL to indicate different things:

          • Missing data, which is why 3 + null = null.
          • Empty sets, or "this represents zero rows", arising from a join: select sum(t.column1) from (values (3), (null)) as t returns 3.
            • And now SUM(X) + SUM(Y) is not the same as SUM(X+Y). Ouch. (From Lex de Haan's collection, cited in Darwen's slides; see the sketch after this list.)
            • (Longer example of a row with null representing zero rows: start with a supplier table and outer-join it to supplied parts. That produces a table with (a) one row like ('s1', 4.25) per (supplier, delivered part price), but also (b) rows like ('s5', null) for suppliers who delivered no parts. SUM then assumes any NULLs indicate 'this row represents no rows' instead of 'this data is missing'. So SQL skips NULLs in a sum, effectively treating them as 0, yet treats them as 'NA' during addition.)
          • SQL might use NULL for even more purposes than this? I don't know. This is plenty of pain to be going on with.
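
          Here is that SUM anomaly as a self-contained query (invented values; PostgreSQL syntax). SUM ignores the NULL produced by x + y, so the two columns disagree:

            SELECT SUM(x) + SUM(y) AS sum_then_add,  -- 4 + 2 = 6
                   SUM(x + y)      AS add_then_sum   -- (3, NULL) yields NULL and is skipped: 1 + 2 = 3
            FROM (VALUES (1, 2), (3, NULL)) AS t (x, y);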

          On nulls that indicate empty sets / "this group has zero rows": The Third Manifesto suggests that 'group by supplier' should not be a modifier, but an operation that returns a grouped table with one row per supplier, where each row contains cells (supplier: a string, supplier_data: a nested table). Then it is obvious how to handle suppliers who delivered no parts: they have one row like every other supplier, and their supplier_data is a table with zero rows. So I think the NULL problem is tangled with SQL's lack of support for 'relation/table/set-of-rows valued cells'.
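
          SQL proper has no relation-valued cells, but here is a rough approximation in PostgreSQL (schema invented) contrasting the NULL-producing outer join with an unambiguous empty 'nested' value:

            CREATE TABLE suppliers  (sid TEXT PRIMARY KEY);
            CREATE TABLE deliveries (sid   TEXT REFERENCES suppliers (sid),
                                     price NUMERIC NOT NULL);

            -- The outer-join route: supplier 's5' surfaces as ('s5', NULL), and
            -- SUM must read that NULL as 'this row represents zero rows'.
            SELECT s.sid, SUM(d.price)
            FROM suppliers s LEFT JOIN deliveries d USING (sid)
            GROUP BY s.sid;

            -- Approximating a relation-valued attribute with an array: 's5'
            -- gets an unambiguous empty array instead of a NULL.
            SELECT s.sid,
                   ARRAY_REMOVE(ARRAY_AGG(d.price), NULL) AS supplier_data
            FROM suppliers s LEFT JOIN deliveries d USING (sid)
            GROUP BY s.sid;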

          I would be very interested in a DBMS that supports nested tables, even if only during analysis! In R, they have been a pleasure to work with.

          Final thought: I suspect tidyr + dplyr might satisfy Darwen's requirements for a relational language "D", albeit imperative rather than declarative. It supports grouped/nested data frames, and row-binding tibble(a=1, b=2) and tibble(b=20, a=10) does the right thing. If it does count as a "D", it is probably the most widely used implementation.

          1. 2

            FWIW, in C. J. Date's writings it usually seems like he's into relation-valued attributes being a thing.

            I don’t personally have strong opinions on any of these design decisions yet. Just been exploring the space because I want to build a collaborative UI tool for relational data + queries + triggers at some point and have been thinking about the design + tech of that. It’s also very related to ideas about data layout for game engines which is something I’ve worked on for a while.

            1. 2

              Thanks for the pointer at Date’s writing; so far I had only turned up writings of Darwen’s. I’m curious what he has to say about relation-valued attributes. (Table-valued attributes?)

              Me, I put my head into this rabbit hole for two reasons:

              • Hopes of understanding SQL better (Learning the model SQL ought to have, and how best to approach it with SQL is still a good outcome. This has worked before: knowing Mercurial’s consistent model helps me a lot when I work with Git’s decidedly more footgunny interface.)
              • Prolog is oops! all relations, I’m having fun with Prolog, this research is about relations, ergo I expected to have fun. (This logic may not be watertight, but it’s been fun so far.)

              I want to build a collaborative UI tool for relational data + queries + triggers at some point and have been thinking about the design + tech of that.

              *chinhands* If you feel like it, tell us about your hopes and musings?

              1. 2

                I haven’t dug too far into the relation-valued attributes side of Date’s writings, but it seems like he thinks it’s essential to his formulation of the relational model given that attributes can contain values of any type, and relations are values of relation type. “Updateable tables” are really just variables of a particular relation type (he calls them “relvars” for short). Date’s version of “GROUP” produces relation-valued attributes, I think.

                And yeah I feel you on the ideal / consistent model. I think just the core set of operators: project (i.e., "select columns", including rename and extend), restrict (i.e., "where" or filter) and natural join seems pretty complete (the "SPJR algebra" on page 56 here: http://webdam.inria.fr/Alice/pdfs/Chapter-4.pdf – also a book worth checking out; the rest of it touches on Datalog-y (and maybe some Prolog-y?) things too). Plus key and foreign key constraints. Re: logic, I think the idea of the "relation predicate" (each relation having an associated predicate, with its tuples denoting all instantiations of the predicate that evaluate true) may be relevant. There are some good connections between the relational model and logic.
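
                For concreteness, here's roughly how those three operators are spelled in SQL (tables invented; DISTINCT keeps the projection set-like, since SQL tables are bags):

                  -- Invented tables for illustration.
                  CREATE TABLE parts (pid TEXT, color TEXT, city TEXT, weight NUMERIC);
                  CREATE TABLE stock (pid TEXT, warehouse TEXT, qty INT);

                  -- Project: keep a subset of columns.
                  SELECT DISTINCT color, city FROM parts;

                  -- Rename is column aliasing; extend is a computed column.
                  SELECT pid AS part_id, weight * 2 AS double_weight FROM parts;

                  -- Restrict: keep a subset of rows.
                  SELECT * FROM parts WHERE weight > 10;

                  -- Natural join: match rows on the shared column name (pid).
                  SELECT * FROM parts NATURAL JOIN stock;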

                Date’s “SQL and Relational Theory” has decent coverage of a lot of this including that logic part. The first appendix has a pretty neat summary of the model, arguing for it / building it up from first principles.

                Re: the tool – I was imagining a nodes / dataflow UI, like Blender's geometry nodes editor or Max/MSP, but where the values flowing through the pipes are relations and the operators are those three core relational algebra ones. The operators are closed over relations, so everything flowing is a relation. Nodes could visualize relations as tables at any point. An issue with these dataflow UIs is often the representation of "containers" of data and multiplicity, and I think the relational model seems like a good fit for addressing that.

                1. 2

                  An issue with these dataflow UIs is often the representation of “containers” of data and multiplicity, and I think the relational model seems like a good fit for addressing that.

                  Gods, yes, multiplicity/cardinality is so essential to understanding data flows/analyses, and surprisingly tricky to communicate clearly, especially once nesting gets involved. When wrangling and analyzing data, I keep coming back to these two questions:

                  • What is the unit of observation? What does each row in this table represent? E.g. 'each row represents an (observer, 10-minute interval) combination'.
                  • And what is the scope? Which rows are in/not in this table? E.g. ‘table contains all rows where at least 1 species was observed’.
                  • An optional third annotation concerns columns and types. Properly annotating that the 'species' column contains an array per cell, or that 'timestamp' is still a string, can save so much time.

                  How were you thinking of showing multiplicity? I’ve only ever seen it visualised once, and that might be very far from what you have in mind.

                  The visualisation I’m thinking of: WinBUGS (software for Bayesian modeling) needs to represent that some bits of the model have a variant per (grouping variable). E.g. some params are different for each (patient); some for each (patient, risk); some for each (risk) and some for each (patient, day). WinBUGS does this by drawing rectangles with annotations like for(k in 1 : Q) in the corner. Here’s an example WinBUGS graph from Deslandes and Chevret: “Joint modeling of multivariate longitudinal data and the dropout process in a competing risk setting: application to ICU data”. I can’t quite figure out all the letters, but AIUI

                  • Z[i] is per-patient longitudinal data, so the for(i in 1 : N) plate/rectangle encompasses all per-patient params.
                  • for(k in 1 : Q) might be ‘one variant per failure type’?
                  • and for(j in 1 : M) might be ‘one variant per censoring mechanism’?

                  Speaking as a statistician, 1-letter variables and subscripts are grand for compactness, but in publications I much prefer word-named variables for at least the domain entities.

                  P.s. the INRIA paper looks v. interesting, I’ll come back when I’ve had a chance to read it.

                  1. 2

                    WinBUGS is a cool reference, I’ll have to check that out. Looking at the image it’s piqued my interest already. Generally open to looking at references of all types! Probably the closest thing I’ve found is DFQL (https://apps.dtic.mil/sti/pdfs/ADA246086.pdf) – but even that is a bit different because it seems more about visualizing the query AST / grammar itself; while I’m kind of more interested in visualizing the tables and dataflow, with query elements spread out across nodes or all within one node however the user sees fit (I’m imagining the nodes are just text boxes to start, so you could write a complex query within one box – I have more to say on the text query language itself).

                    For multiplicity I was actually thinking to start with the very simple route of: just display each relation as a table node in the UI. So you'll see 5 rows in the node if there are 5 tuples in the relation. It will probably get paginated if there are a lot of rows. And I'm expecting the whole UI to be very 'real-time', so if what you're modeling is a simulation / game you'll see values get updated and rows added / removed on the fly while the simulation is running. The tables / pipes themselves always have a multiplicity of 1: users create tables and pipe them into queries manually (or open existing documents / receive edits from other collaborators). That is the aspect of the relational model that made it seem well suited to this – a constant number of tables, queries and triggers can handle any multiplicity of the domain entities.

                    This is in contrast to needing a "dynamically changing" number of pipes and nodes, e.g. needing N pipes to model N entities in the simulation and thus needing to see pipes added and removed dynamically while it runs. The pipes carry relations and can thus represent data with multiplicity while being a single pipe. If you wanted to, e.g., model and render multiple scenes as separate windows where each scene has many actors, you could have an actors table, a scenes table, and an actorId-sceneId pairing table, and use joins / aggregation to construct the relations to feed into graphics nodes for rendering (sketched below). But the number of nodes and pipes is still constant.
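
                    A minimal SQL sketch of that actors-and-scenes setup (all names invented):

                      CREATE TABLE actors (actorId TEXT PRIMARY KEY, sprite TEXT);
                      CREATE TABLE scenes (sceneId TEXT PRIMARY KEY, title TEXT);
                      CREATE TABLE actorScenes (
                        actorId TEXT REFERENCES actors (actorId),
                        sceneId TEXT REFERENCES scenes (sceneId),
                        PRIMARY KEY (actorId, sceneId)
                      );

                      -- One query per graphics node: everything to draw in a scene.
                      -- However many actors or scenes exist at runtime, the number
                      -- of tables, queries and pipes stays constant.
                      SELECT a.actorId, a.sprite
                      FROM actors a NATURAL JOIN actorScenes ax
                      WHERE ax.sceneId = 'scene-1';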

                    I don’t think this is the maximum extent of it, but starting here, and then adding things in context as they show up for particular applications. I have some I’m hoping to try – starting with notes / mind-mapping, maybe simple chat / messaging and ranging to graphics visualizations or simulations if things get there. Usability is a primary concern and designing the UI in the context of people trying to do concrete things with it seems key.

                    BTW, I’ll shoot you a DM so we can continue discussing this. :)

    2. 2

      The author, Hugh Darwen, co-authored with C. J. Date a design for a query language to replace SQL, called Tutorial D, based on relational algebra. There is also an open source implementation, Rel: https://reldb.org/c/ So now there is a relational DBMS.

      1. 1

        There is also Project M36.

      2. 1

        Whoa, it’s weird to see Hugh Darwen come up after I so recently learned about their research. For those who are interested: I was exploring the connection between the Data Oriented Design book’s “Existential Processing” pattern and “6th normal form” (Wikipedia, helpful StackOverflow). Both which seem to be related to avoiding NULLs.

        1. 1

          I’m not very well versed in the theory so can someone please confirm/deny that the whole issue with SQL is that it’s kinda a superset of the theoretical relational algebra? From a cursory reading, it looks to me like (concrete modern implementations, e.g. PostgreSQL) SQL is looser but in principle has everything needed to implement a relational model in a particular database and the whole issue is that there are parts that break the theoretical (mathematically rigorous?) model.

          1. 1

            I really enjoyed Knowledge Representation in Bicategories of Relations as a summarizing overview of relational algebras within a unified framework. It is oriented around ologs but can be appreciated on its own merits too.

            I would expect a modern programming language built around these concepts to replace functions with relations and allow users to specify tables as custom types. But note that the paper directly shames my expectation:

            [Treating functions as relations and using the mathematical theory of relations] obviously isn’t computationally feasible as the cardinality of the relation…is far too big, even for modern machines.

            Emphasis added. I routinely write miniKanren queries which should converge, but take too long; if my query takes more than a second or two, I assume that I won’t get an answer and go back to the editor.

            To directly answer your question: There are issues with SQL besides being a superset of relational behaviors. Non-trivial queries must be written with the database engine in mind; we have to consider how intermediate results will be materialized and collated. This is exactly analogous to the difficulty in composing relations in other settings: we have to consider the possibly-infinite range of intermediate values! This issue might be inherent to computing in relational settings, and might generalize beyond bicategories of relations to allegories.
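
            As a contrived illustration of "written with the database engine in mind": the two queries below are logically equivalent, and a good optimizer rewrites the first into the second; when it doesn't, you end up doing that rewrite by hand (table names invented).

              -- Restriction applied after the join: the intermediate result can be
              -- every (supplier, shipment) pairing before qty is ever checked.
              SELECT s.name
              FROM suppliers s
              JOIN shipments sp ON sp.sid = s.sid
              WHERE sp.qty > 100;

              -- Restriction pushed down: only qualifying shipments are joined.
              SELECT s.name
              FROM suppliers s
              JOIN (SELECT sid FROM shipments WHERE qty > 100) big
                ON big.sid = s.sid;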

            1. 2

              Non-trivial queries must be written with the database engine in mind

              Isn’t this the case with ISBL and BS12, which, presumably, are relational?