abstract:
I describe the circumstances in which I obtained, in 1978, a good answer to a burning question: how can E.F. Codd’s relational model of data of 1970 [2] be properly embraced by a computer language? Considering that the answer to that question (“An Algebra of Relations for Machine Computation”) was publicly available in 1975, I wonder why it all went wrong and suggest some possible reasons.
But this paragraph on page 11 is perhaps a better abstract, or at least one that explains the title better:
That situation therefore prompts us to ask, why isn’t SQL truly relational? And, given that it isn’t, why did the industry at large so readily embrace it? Well, we can easily forgive the industry at large, I suppose, because in spite of all its defects, SQL was a hugely significant advance on what we had before—its implementations did offer a fairly intuitive declarative query language without the need for loops or pointer chasing, and they did offer a modicum of physical data independence. But why didn’t its inventors come up with a truly relational language? Why did they neglect the obvious need for each of a table’s columns to have a name? Why did they allow a table to have two or more columns with the same name? Why did they decide to allow more than one copy of the same row to appear in a table? Why did they invent NULL and introduce that third truth value, UNKNOWN?
I’ve found lecture slides, also by Hugh Darwen, that explain how SQL is not relational: http://www.dcs.warwick.ac.uk/~hugh/TTM/TTM-TheAskewWall-printable.pdf
I recommend reading the lecture slides first: knowing the painful ways in which SQL is not relational made me a lot more invested in the question Why are there no DMBSs which are relational?
For those wondering “if he dislikes NULL so much, with what does he propose to replace it with?” see https://www.dcs.warwick.ac.uk/~hugh/TTM/Why-Are-There-No-Relational-DBMSs.pdf
I don’t find this particularly compelling given that
A more compelling argument would be to make it easy to extend the types to add things like Maybe or Either. But I suspect he would not be in favor of that because it isn’t relational enough…
It has some overlap with some of the “data-oriented design” ideas of not having NULLs – https://www.dataorienteddesign.com/dodbook/node4.html.
I think yeah it depends on the situation. If a type / domain has a natural sentinel value that helps. In general though it’s felt to me like if some key is not always related to some attribute it seems to make sense to pull that attribute out into its own relation / table at the design level. If the data is quite normalized then that’s likely the only non-key attribute so you could ask if the tuple / row should just be removed.
In what way do you think NULLs are convenient? In many cases it’s inconvenient bc. you have to make sure to check for NULL or watch out for what would happen for NULL values when doing sums etc. But without NULLs the logic of all the operators is clear.
I think I agree with both of you: on one hand nulls can be convenient and the correct way to code things, but on the other hand SQL seems to have some rather muddled operator logic around nulls.
I can completely get behind NULL as missing data, the sort of value you deliberately insert to indicate ‘not known’ rather than ‘empty-ish’, and then handle in your analysis. But that fondness may be because I come from R: there, missing values (NA) are an inescapable fact of data analysis, so lots is known about how to handle it (use methods robust to missing data, or impute missing values, or drop incomplete observation), and the tools are present.
I think SQL is not as good about this, because it uses NULL to indicate different things:
3 + null = null
.select sum(t.column1) from (values (3), (null)) as t
— returns 3.SUM(X) + SUM(Y)
is not the same asSUM(X+Y)
. Ouch. (From Lex de Haan’s collection, cited in Darwen’s slides).('s1', 4.25)
per (supplier, delivered part price), but also (b) rows('s5', null)
for suppliers who delivered no parts. And then SUM assumes any NULLs indicate ‘this row represents no rows’ instead of ‘this data is missing’. So SQL treats NULLs as 0 in a sum, and treats them as ‘NA’ during addition.)On nulls that indicate empty sets / “this group has zero rows”: the third manifesto suggests that ‘group by supplier’ should not be a modifier, but an operation that returns a grouped table with one row per supplier, where each row contains cells (supplier: a string, supplier_data: a nested table). Then it is obvious how to handle suppliers who delivered no parts: they have one row like every other supplier, and their supplier_data is a table with zero rows. So I think the NULL problem is tangled with SQL’s lack of support for ‘relation/table/set-of-rows valued cells’.
I would be very interested in a DBMS that supports nested tables, even if only during analysis! In R, they have been a pleasure to work with. More reading:
tidyr
vignette on working with nested tablesFinal thought: I suspect tidyr + dplyr might satisfy Dewars requirements for a relational language “D”, albeit imperative rather than declarative. It supports grouped/nested data frames; and rowbinding
tibble(a=1, b=2)
andtibble(b=20, a=10)
does the right thing. If it does count as a “D”, it is probably the most widely used implementation.FWIW in C. J. Date’s writings usually it seems like the author is into relation-valued attributes being a thing.
I don’t personally have strong opinions on any of these design decisions yet. Just been exploring the space because I want to build a collaborative UI tool for relational data + queries + triggers at some point and have been thinking about the design + tech of that. It’s also very related to ideas about data layout for game engines which is something I’ve worked on for a while.
Thanks for the pointer at Date’s writing; so far I had only turned up writings of Darwen’s. I’m curious what he has to say about relation-valued attributes. (Table-valued attributes?)
Me, I put my head into this rabbit hole for two reasons:
chinhands If you feel like it, tell us about your hopes and musings?
I haven’t dug too far into the relation-valued attributes side of Date’s writings, but it seems like he thinks it’s essential to his formulation of the relational model given that attributes can contain values of any type, and relations are values of relation type. “Updateable tables” are really just variables of a particular relation type (he calls them “relvars” for short). Date’s version of “GROUP” produces relation-valued attributes, I think.
And yeah I feel you on the ideal / consistent model. I think just the core set of operators: project (i.e., “select columns” – including rename and extend), restrict (i.e., “where” or filter) and natural join seems pretty complete (the “SPJR algebra” on page 56 here: http://webdam.inria.fr/Alice/pdfs/Chapter-4.pdf – also a book worth checking out – the rest of it touches on Datalog-y (and maybe some Prolog-y?) things too). And key, foreign key constraints. Re: logic I think the idea of the “relation predicate” (each relation as having an associated predicate, with its tuples denoting all instantiations of the predicate that evaluate true) may be relevant. There are some good connections between the relational model and logic.
Date’s “SQL and Relational Theory” has decent coverage of a lot of this including that logic part. The first appendix has a pretty neat summary of the model, arguing for it / building it up from first principles.
Re: the tool – I was imagining a nodes / dataflow UI, like Blender’s geometry nodes editor or Max/MSP; but where the values flowing through the pipes are relations and the operators are those three core relational algebra ones. They form a closure so everything flowing is a relation. Nodes could visualize relations as tables at any point. An issue with these dataflow UIs is often the representation of “containers” of data and multiplicity, and I think the relational model seems like a good fit for addressing that.
Gods, yes, multiplicity/cardinality is so essential to understanding data flow/analyses, and surprisingly tricky to communicate clearly, especially once nesting gets involved. When data wrangling and analyzing, I especially keep coming back to these two questions:
How were you thinking of showing multiplicity? I’ve only ever seen it visualised once, and that might be very far from what you have in mind.
The visualisation I’m thinking of: WinBUGS (software for Bayesian modeling) needs to represent that some bits of the model have a variant per (grouping variable). E.g. some params are different for each (patient); some for each (patient, risk); some for each (risk) and some for each (patient, day). WinBUGS does this by drawing rectangles with annotations like for(k in 1 : Q) in the corner. Here’s an example WinBUGS graph from Deslandes and Chevret: “Joint modeling of multivariate longitudinal data and the dropout process in a competing risk setting: application to ICU data”. I can’t quite figure out all the letters, but AIUI
Speaking as a statistician, 1-letter variables and subscripts are grand for compactness, but in publications I much prefer word-named variables for at least the domain entitities.
P.s. the INRIA paper looks v. interesting, I’ll come back when I’ve had a chance to read it.
WinBUGS is a cool reference, I’ll have to check that out. Looking at the image it’s piqued my interest already. Generally open to looking at references of all types! Probably the closest thing I’ve found is DFQL (https://apps.dtic.mil/sti/pdfs/ADA246086.pdf) – but even that is a bit different because it seems more about visualizing the query AST / grammar itself; while I’m kind of more interested in visualizing the tables and dataflow, with query elements spread out across nodes or all within one node however the user sees fit (I’m imagining the nodes are just text boxes to start, so you could write a complex query within one box – I have more to say on the text query language itself).
For multiplicity I was actually thinking to start with the very simple route of: just display each relation as a table node in the UI. So you’ll see 5 rows in the node if there are 5 tuples in the relation. Probably will get paginated if there’s a lot of rows. And I’m expecting the whole UI to be very ‘real-time’ so if what you’re modeling is a simulation / game you’ll see values get updated or rows added / removed on the fly while the simulation is running. The tables / pipes themselves always have a multiplicity of 1 (i.e., users are creating tables and piping them into queries manually (or opening existing documents / receiving edits from other collaborators)) which is the aspect of the relational model that made it seem well suited to this – a constant number of tables, queries and triggers can handle any multiplicity of the domain entities.
This is in contrast to needing a “dynamically changing” number of pipes and nodes eg. needing N pipes to model N entities in the simulation and thus needing to see pipes be added and removed dynamically while it runs. The pipes carry relations and can thus represent data with multiplicity while being a single pipe. If you wanted to eg. model and render multiple scenes as separate windows where each scene has many actors, you could have an actors table, a scenes table, and an actorId-sceneId pairing table; and use joins / aggregation to construct the relations to feed into graphics nodes for rendering. But the number of nodes and pipes is still constant.
I don’t think this is the maximum extent of it, but starting here, and then adding things in context as they show up for particular applications. I have some I’m hoping to try – starting with notes / mind-mapping, maybe simple chat / messaging and ranging to graphics visualizations or simulations if things get there. Usability is a primary concern and designing the UI in the context of people trying to do concrete things with it seems key.
BTW, I’ll shoot you a DM so we can continue discussing this. :)
The author, Hugh Darwen, co-authored with CJ Date a design for a query language to replace SQL, called Tutorial D, based on relational algrebra. Darwen also created an open source implementation: https://reldb.org/c/ So now there is a relational DBMS.
There is also Project M36
Whoa, it’s weird to see Hugh Darwen come up after I so recently learned about their research. For those who are interested: I was exploring the connection between the Data Oriented Design book’s “Existential Processing” pattern and “6th normal form” (Wikipedia, helpful StackOverflow). Both which seem to be related to avoiding
NULL
s.I’m not very well versed in the theory so can someone please confirm/deny that the whole issue with SQL is that it’s kinda a superset of the theoretical relational algebra? From a cursory reading, it looks to me like (concrete modern implementations, e.g. PostgreSQL) SQL is looser but in principle has everything needed to implement a relational model in a particular database and the whole issue is that there are parts that break the theoretical (mathematically rigorous?) model.
I really enjoyed Knowledge Representation in Bicategories of Relations as a summarizing overview of relational algebras within a unified framework. It is oriented around ologs but can be appreciated on its own merits too.
I would expect a modern programming language built around these concepts to replace functions with relations and allow users to specify tables as custom types. But note that the paper directly shames my expectation:
Emphasis added. I routinely write miniKanren queries which should converge, but take too long; if my query takes more than a second or two, I assume that I won’t get an answer and go back to the editor.
To directly answer your question: There are issues with SQL besides being a superset of relational behaviors. Non-trivial queries must be written with the database engine in mind; we have to consider how intermediate results will be materialized and collated. This is exactly analogous to the difficulty in composing relations in other settings: we have to consider the possibly-infinite range of intermediate values! This issue might be inherent to computing in relational settings, and might generalize beyond bicategories of relations to allegories.
Isn’t this the case with ISBL and BS12, which, presumably, are relational?