1. 13

  2. 17

    Let me see if I understand correctly: SQL is insecure because mainstream programming languages don’t have good interfaces to SQL databases?

    1. 13

      I read it as, SQL is insecure because bad developers will end up using it incorrectly.

      1. 17

        Bad programmers will write bad code using any tools, but that’s really beside the point. OP argues that SQL is insecure because:

        • Prepared statements are difficult to use correctly, which is an unarguable fact.
        • Not using prepared statements leads to SQL injection, which isn’t true.

        Raw SQL strings, prepared statements and ORMs are all interfaces between databases and programming languages. Unfortunately, none of them is perfect:

        • Raw SQL strings are insecure for obvious reasons.
        • Prepared statements are a chore to use and they don’t buy you that much security, because you’re still manually supplying strings.
        • ORMs are normally safer than either SQL strings or prepared statements, but they give up much of the expressive power of relational algebra.

        The real problem is types. Most programming languages don’t have sophisticated enough type systems to model the operations of relational algebra. (But some do!) In particular, nominal types don’t help. For example, if you have two classes Customer and Order, there is no type-level operation that can produce a third class corresponding to select * from Customer C join Order O on C.CustomerID = O.CustomerID. This is a very sorry state of affairs, and it absolutely isn’t SQL’s fault.
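
        A runtime sketch in Python of what’s missing (the Customer and Order names are taken from the example above; everything else is invented for illustration): the join is trivial to compute, but its result is typed as a bare dict, because the type system has no operation that builds a new record type out of two existing ones.

```python
from dataclasses import dataclass, asdict

# Customer and Order are nominal types, but the result of joining them
# is a third shape that no declared class names.
@dataclass
class Customer:
    CustomerID: int
    Name: str

@dataclass
class Order:
    OrderID: int
    CustomerID: int
    Total: float

customers = [Customer(1, "alice")]
orders = [Order(10, 1, 9.99), Order(11, 1, 5.00)]

# select * from Customer C join Order O on C.CustomerID = O.CustomerID
joined = [
    {**asdict(c), **asdict(o)}
    for c in customers
    for o in orders
    if c.CustomerID == o.CustomerID
]

# The rows exist at runtime, but their static type is just "dict":
# there is no type-level operation Customer ⋈ Order yielding a new type.
print(sorted(joined[0]))   # → ['CustomerID', 'Name', 'OrderID', 'Total']
```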

        1. 6

          Prepared statements are a chore to use and they don’t buy you that much security, because you’re still manually supplying strings.

          How so? The strings you do supply to prepared statements are incapable of changing the pre-prepared parse tree, which is the big insecurity of smashing random strings together.

          I agree I’d rather use an ORM with correct-by-design types, though.
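
          The parse-tree point above is easy to demonstrate with Python’s standard-library sqlite3 driver (a sketch; the table and payload are made up):

```python
import sqlite3

# With parameter binding, the classic injection payload can only ever be
# bound as a value; it cannot alter the already-parsed statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

payload = "alice'; DROP TABLE users; --"

# String concatenation would splice this into the statement text;
# a bound parameter is just a string to compare names against.
rows = conn.execute("SELECT name FROM users WHERE name = ?", (payload,)).fetchall()
print(rows)    # → [] (no user literally has that name)

# The table survives: the payload never reached the SQL parser.
count, = conn.execute("SELECT count(*) FROM users").fetchone()
print(count)   # → 1
```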

          1. 4

            If a new maintainer of your software needs to add a field of a weird type to the query, will they learn how to add that flavor of field to the prepared statement, or will they interpolate a string in just this once?

            1. 1

              I know what I would do with Go+Postgres: add a type conversion from string ($1::json, etc.) in the SQL, then marshal the data to a string right before the query and give a string to the driver in Exec() or Query().

              I’m actually not sure what the “correct” way to do that would be. One that popped out from the documentation is to implement sql/driver.Valuer on a local typedef or something like that. But that’s a massive pain in the behind and also depends on driver internals.
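
              For what it’s worth, the same marshal-right-before-the-query idea looks like this in Python with the standard-library sqlite3 driver (an analogue of the Go+Postgres trick described above, not the actual code; table and data are invented):

```python
import json
import sqlite3

# Serialize the structured value to a string immediately before the
# query, so the driver only ever sees an ordinary string parameter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")

data = {"kind": "login", "attempts": 3}
conn.execute("INSERT INTO events VALUES (?)", (json.dumps(data),))

raw, = conn.execute("SELECT payload FROM events").fetchone()
restored = json.loads(raw)
print(restored == data)   # → True
```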

            2. 1

              Nobody forces you to supply a string that is consistent with the database schema.

              1. 1

                That’s … not something you’re going to solve short of banning users and their dang input.

                1. 9

                  That isn’t really true. Ur/Web rules out invalid queries at compile time. But this requires two things:

                  • The type checker must be aware of the database schema. This is the easy part.
                  • The type system must be capable of expressing the structure of arbitrary derived relations (in the relational model’s sense of “relation”). This requires row polymorphism at the very least.
                  1. 1

                    Oh you’re talking about the strings submitted for the prepared statements, not the user input filling in the ?s. I misunderstood and was talking about runtime input.

                    1. 2

                      Yes, I was primarily talking about the strings submitted for the prepared statements. However, even the user input filling in the ?s is often less statically checked than it could be. Will Java’s type checker complain if you attempt to supply an int where the database would expect a varchar? Ur/Web’s will.

            3. 3

              How exactly can you argue in quantitative terms the difficulty of using prepared statements?

              Isn’t the difficulty of a thing somewhat subjective?

              This whole post seems like… satire

              1. 11

                Nobody guarantees that the result of preparing a statement will be meaningful according to the database schema. That’s the difficulty.

                Contrast with Ur/Web, where the type-checker makes sure that your SQL statements make sense.

                1. 6

                  You beat me to it. I was going to add Opa language, too, as it raises the bar vs common options. One could throw in memory-safe languages like Component Pascal or concurrency-safe languages like Eiffel or Rust. Like the web languages, these simply don’t allow specific classes of problems to occur unless the developer goes out of their way to make it happen. Always good to design languages to knock out entire classes of common problems without negative impact on usability if possible.

              2. 2

                The real problem is types. Most programming languages don’t have sophisticated enough type systems to model the operations of relational algebra.

                Types, yes. Type systems, no.

                Q has tables, and operations that work on tables. There’s no reason a lesser language like PHP couldn’t do this, it’s just that PHP programmers don’t do this.

                Prepared statements are a chore to use and they don’t buy you that much security, because you’re still manually supplying strings.

                If you move your authentication into the database (like with row-level security) then your web-layer can simply authenticate against the database and run the prepared queries like an RPC. The biggest problem I see people have with prepared statements is if they have inadequate tooling and don’t invest in it. (Migrations are a dumb and painful way to program, and while commercial offerings are much better, open source is very popular)

                1. 3

                  My day job is to maintain a rather large ERP system. You know, the kind where the typical table has 40-50 fields and the typical primary key has 5 fields. The kind where people are afraid of altering existing tables, because who knows what queries might be affected, so they create another table with the same exact primary key, whose rows are intended to be in 1-to-1 correspondence with the original table, even though that will only make things harder in the long run and we know it.

                  This tremendous amount of pain is the price of the lack of coordination between language and database. If there were an automatic, convenient way to determine what parts of our application have to be changed in response to a given change in the database, I estimate that we could be twice as productive, while at the same time creating less technical debt. This is precisely the problem type systems solve.
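
                  As a toy illustration of the kind of coordination meant here (all names are invented): if each query declares the columns it reads, a schema change can be checked against every registered query before anything runs, which is essentially what a database-aware type checker automates.

```python
# Each query declares which (table, column) pairs it reads.
QUERIES = {
    "open_orders": {"orders": {"order_id", "status"}},
    "customer_report": {"customers": {"customer_id", "name"},
                        "orders": {"order_id", "customer_id"}},
}

def check_schema(schema, queries):
    """Return the queries broken by the given schema, with what they miss."""
    broken = {}
    for name, used in queries.items():
        missing = {
            (table, col)
            for table, cols in used.items()
            for col in cols
            if col not in schema.get(table, set())
        }
        if missing:
            broken[name] = missing
    return broken

schema = {
    "customers": {"customer_id", "name"},
    "orders": {"order_id", "customer_id"},   # "status" was just dropped
}
print(sorted(check_schema(schema, QUERIES)))   # → ['open_orders']
```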

                  1. 3

                    This tremendous amount of pain is the price of the lack of coordination between language and database.

                    I’m not disagreeing with that: Having the business logic in the same language as the database is another way to obtain that coordination, and it offers far more benefits:

                    A large amount of pain is had in synchronising the continuous single history of “the business database” with the many branches of modern software development. Building directly on top of the database, and solving the problems you need to in order to do that, eliminates pain that you never thought possible, like writing migrations or having to maintain test databases. A type system doesn’t help me get there.

                    I estimate that we could be twice as productive

                    Using the same language for your database and your application wins much more than 2x. I would say it wins 10x or even 100x.

                    The kind where people are afraid of altering existing tables, because who knows what queries might be affected, so they create another table with the same exact primary key

                    Really the goal should be to have the data in the correct shape. KDB is column-based, and column-based data stores are useful here because you don’t usually want to alter the existing table. You want to hang another column on there, or you want another rollup/index somewhere. That’s cheap (microseconds) in KDB.

                    Having the database contain your program also means you can easily do analytics on which queries touch which columns, which increases bravery significantly (and safely!).

                    My day job is to maintain a rather large ERP system.

                    I have a similar database, although in addition to those fat business data tables that are ingested from a bunch of Oracle/Siebel databases, it also contains very tall analytics data growing at a rate of around 300m web events per day and around 60k call records per day.

                    KDB also has the advantage of being quite a bit faster than other database engines, so it wouldn’t surprise me if I’m dealing with more data than you.

                    If you don’t know KDB/Q, you should look into it. Ur/web+postgresql is great, but it has nothing on commercial offerings.

                    1. 2

                      A large amount of pain is had in synchronising the continuous single history of “the business database” with the many-branches of modern software development.

                      Right. We need a notion of “time-evolving schema”, allowing new data to have a different structure from old data, while at the same time allowing queries to be meaningful across schema versions. As far as I know, that problem hasn’t been satisfactorily solved yet.

                      Building directly on top of the database, and solving the problems that you need in order to do that eliminates pain that you never thought possible, like writing migrations or having to maintain test databases. A type system doesn’t help me get there.

                      You piqued my curiosity. Let’s say you have a language where tables are first-class values. Altering the structure of a table amounts to changing its type. (As opposed to inserting, updating or deleting rows, which amounts to constructing a different value of the same type.) How do you validate that every part of your application that depends on this table is compatible with the new version, without type checking?

                      KDB is column-based, and column-based data stores are useful here because you don’t usually want to alter the existing table.

                      This is a physical implementation detail. I don’t want to worry about that.

                      KDB also has the advantage of being quite a bit faster than other database engines, so it wouldn’t surprise me if I’m dealing with more data than you.

                      I’m not too worried about the amount of data I need to process. I’m worried about the complexity of the logical constraints the data must satisfy in order to make sense. Logical errors can manifest themselves even with modest amounts of data.

                      If you don’t know KDB/Q, you should look into it.

                      I will.

                      1. 2

                        Right. We need a notion of “time-evolving schema”, allowing new data to have a different structure from old data, while at the same time allowing queries to be meaningful across schema versions. As far as I know, that problem hasn’t been satisfactorily solved yet.

                        Tooling can help a lot, though, and may be good enough. There is commercial tooling (like Control for Kx) which is basically an IDE for your database, complete with multi-user version control. It has the disadvantage of being an online tool, but it provides hints of what the correct solution might look like to me.

                        This is something I’ve been thinking about for a while.

                        Let’s say you have a language where tables are first-class values. Altering the structure of a table amounts to changing its type.

                        However, adding a column doesn’t affect code that doesn’t use the column.

                        How do you validate that every part of your application that depends on this table is compatible with the new version, without type checking?

                        Static analysis remains possible without type systems provided you don’t learn column names from the network (and if you do, your type system would be incomplete anyway).

                        This is a physical implementation detail. I don’t want to worry about that.

                        I know you don’t, but removing abstraction reduces program size (and therefore bugs) and increases program speed so much that I think it’s often worth thinking about the fact that we are meat programming metal. Bugs mean fixes, which is programming we didn’t plan for, and slowness generates heat that harms the environment. And so on.

                        If you want to change the type of a column from a 64-bit unix-seconds timestamp to a 32-bit time and a 32-bit date (KDB has native date types, btw), you have to decide:

                        • Do you want to rewrite (potentially) multiple terabytes? What about the degradation in service? How long do you need both fields?
                        • Can you create an alias/helper for the other type so that accessing the variable containing the date will actually use the 64-bit value for old records, but the 32-bit value for new records? What’s the performance impact?
                        • Can you modify the existing code to support both versions?

                        And so on. These are real considerations that affect a real system. If we could only sit in our purely-software universe and have enough abstraction, we could make our decisions on what makes better software (asking for a date and getting a date is probably better than doing arithmetic on seconds – and what happens when the calendar changes, anyway) but someone has to solve them, and unfortunately a type system doesn’t actually solve these problems.

                        A type system only helps with the same part of the problem that tooling solves: static analysis can find the code, and having a real table “type” means you just use a couple of in-memory copies of some of the rows that you, the programmer, believe are representative, which then form your tests for regression tracking.

                        However having views and a real table type (i.e. doing the database in your programming language) means (performance) testing is easier, there’s a migration path for the data, and you’ll have a good handle on what the real user-impact is.

                        I will.

                        Awesome. It is not easy to get into without a commercial need, but the #kq channel on freenode contains people willing to help answer questions. It’s not as high-volume as #ocaml so you might have to wait for the earth to turn and someone in the right timezone to wake up :)

              3. 11

                That’s a good statement of an important point. If the simplest, most obvious way to use a tool isn’t secure, we must consider the system fundamentally insecure, because that’s what will happen in practice. The programmer’s UX of security concerns is vital.

              4. 6

                That seems mostly reasonable to me. Using SQL in PLs where the default way to use it is by passing in ordinary strings that contain code is indeed insecure. Imagine if mainstream PLs had us defining and calling functions by calling eval() on strings all over the place: I would expect that to lead to terrific quantities of horrid security problems too. I’d argue that passing a string to sqlite3_exec() or mysql_query() or PQexec() is just as scary as passing a string to eval(), because RDBMS query languages are either powerful enough to execute arbitrary code or complicated enough to inevitably have bugs that can be leveraged into arbitrary execution.

                I’ve seen an interesting alternative in one of C J Date’s older books, “An Introduction to Database Systems”. He has examples of relational queries embedded directly into a language that looks like PL/1, where the queries are actually fully parsed at compile time. I think they had all the niceties, like references to ordinary lexical scoped variables in the queries turning into code that does all the correct binding at runtime and everything.

                I’m thinking that one could make a much safer language be just as convenient as doing broken string concatenation is in current PHP, by using quasiquoting, reader macros or just straight up embedding SQL’s entire grammar into the PL’s own grammar in an expression context. I’d identify “PHP with mysql_query() replaced by quasiquoting” as a safer PL than “current PHP”.

                Another strategy for making SQL injection harder to write by accident that I’ve seen is in the postgresql-simple library for Haskell. The query execution functions accept a string-like type called Query for which there is an IsString instance, so you can switch on the OverloadedStrings pragma and write code like execute connection "INSERT INTO dogs VALUES (?, ?);" (name, cuteness), so the correct, parameterised-query pattern is easy and convenient to write. At the same time, the incorrect string-concatenation code is still possible but much less convenient, so you’re much less likely to write it. While you can build Query objects from strings, the syntax to actually do that is longer and involves looking up more stuff than the syntax for putting parameters in your queries.

                IIRC there are also quasiquoters that let you write that as something looking like [sql|INSERT INTO foo VALUES (${name}, ${cuteness});|] as an expression and automatically turn it into the above parameterised query.

                In all of the above, anywhere I refer to “PHP” you may instead read “any PL in which you use SQL by passing an ordinary string to a function or method with query or execute in the name”, i.e. very nearly all of them. PHP only does slightly worse than average here because mysql_query() comes bundled with the runtime but you have to install an ORM on purpose, whereas plenty of other PLs come with neither SQL bindings nor an ORM so it’s almost equally difficult to install the ORM or the raw SQL binding.

                1. 5

                  Imagine if mainstream PLs had us defining and calling functions by calling eval() on strings all over the place: I would expect that to lead to terrific quantities of horrid security problems too.

                  This gets to the heart of my position. Very well said.

              5. 9

                I use Haskell SQL libraries (Persistent + Esqueleto) that makes these problems impossible, through a well typed interface.

                There’s also a Haskell library that sets up prepared statements for you automatically: http://hackage.haskell.org/package/hasql

                1. 4

                  Every language has a library or tool that makes it easy to avoid SQL injection. This has been true for decades. Yet here we are, still paying the high cost of a system that’s insecure by default.

                  1. 4

                    Persistent and Esqueleto make building and running SQL queries by concatenating raw String values more inconvenient than they make running safe queries. So do nice ORMs like Python’s sqlalchemy. For example, in sqlalchemy, you are unlikely to write:

                    query = query + " WHERE foo='" + bar + "'"
                    

                    with strings because it’s more characters than writing:

                    query = query.where(Table.foo == bar)
                    

                    with Query objects, and the documentation deliberately de-emphasises the former way.

                    The point isn’t only that there’s a way to avoid SQL injection, but also that the way that avoids SQL injection is strictly more convenient than the way that permits SQL injection.

                2. 6

                  Posted a comment on the blog but since the author is answering here…

                  “Abandon SQL” and do what? As long as there is a “query language” there will be concatenating strings.

                  If you’re saying, use one of those JSON-as-QL things, to what degree do they avoid the problem? If it’s just a matter of making group_by, where and other clauses keys in a dictionary… well, now “and” has to be a key in a dictionary, too. As long as everything in the query is represented in an AST-like form, it’s safe; but as soon as there is an expression language it’s back to zero. When you’ve got an AST in JSON, the underlying language doesn’t matter all that much. I don’t think we want to ask analysts to type or read things like:

                      {"select": {"from": ["a_table", "b_table", {"select": { ...sub-select... }}], "where": {"and": [{"equals": [...expression..., ...other-expression...]}]}, ... } }
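
                  For what it’s worth, the safety half of that claim is easy to see in a toy Python evaluator for such an AST (the operator names and rows are invented): user input only ever appears as a literal leaf, so it can select rows but never add operators.

```python
# A minimal evaluator for an AST-as-data query over in-memory rows.
ROWS = [
    {"name": "alice", "admin": True},
    {"name": "bob", "admin": False},
]

def matches(row, expr):
    """Evaluate one {operator: arguments} node against a row."""
    op, args = next(iter(expr.items()))
    if op == "equals":
        column, literal = args
        return row[column] == literal
    if op == "and":
        return all(matches(row, sub) for sub in args)
    raise ValueError(f"unknown operator: {op}")

# Even a hostile "value" is just compared, never interpreted as syntax:
query = {"and": [{"equals": ["name", "x' OR '1'='1"]}]}
print([r["name"] for r in ROWS if matches(r, query)])   # → []

query = {"equals": ["admin", True]}
print([r["name"] for r in ROWS if matches(r, query)])   # → ['alice']
```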

                  1. 3

                    Pasting my reply here because there’s a lot of good thought going on here…

                    The issue is that the default way to interact with a relational database is SQL strings. The SQL is parsed on the data plane, so developers commonly concatenate strings to form queries. This is the most obvious way to do SQL, and that’s the problem.

                    If the executable code was loaded out of band and we only sent data parameters this wouldn’t be an issue. This is exactly what stored procedures and prepared statements are. These tools have existed for decades, yet we’re still getting hacked with easy-to-prevent vulnerabilities.

                    This is where I think, “why are we even using SQL?” I mean, we’ve proven that we can be productive with RethinkDB, DynamoDB, and many other NoSQL databases. I’d love to keep using relational databases, but is there anything inherently tying relational databases to SQL? No, not really, other than history. Maybe it’s time to make some changes.

                    1. 5

                      RethinkDB is very similar to the so-called “SQL expression languages”, DSLs – like arel for Ruby and SQLAlchemy for Python – that let one express queries with method chaining and then take care of the SQL for you.

                      Is that something like what you had in mind?

                      One thing I wonder about is the inclination to build brand new databases along with these new approaches to data access. A modern relational database offers consistent and good performance, reasonable efficiency with regard to memory and CPU, and stability. In principle, ReQL could be used to drive Postgres or MySQL. Adopting ReQL in this way could really benefit the teams at any company where I’ve worked. Adopting yet another remix of how to put rows and columns on disk, on the other hand…

                      1. 4

                        Datalog and Prolog come to mind for your last question, given that they surpass it:

                        http://stackoverflow.com/questions/2117651/comparing-sql-and-prolog

                        Datomic is getting a lot of mileage out of one, too.

                        1. 3

                          When you use any programming language, you also write strings. You most certainly don’t write abstract syntax trees directly, although hopefully those strings are relatively easy to convert to ASTs.

                          However, in any sensible programming environment, there’s a separation between a validation phase, where your program might fail for arbitrary reasons (because humans can’t reliably write meaningful programs), and an execution phase, where ideally failure may only be triggered by external factors (bad user input, dropped network connection, etc.), not internal ones (bugs). In statically typed languages, the validation phase usually coincides with “compile time”, but all that is required is that it strictly precedes the execution phase.

                          The problem with raw query string manipulation is that it unnecessarily creates more things to validate at runtime, defeating the purpose of the phase separation. But this problem is by no means exclusive to SQL, or even caused by it. Raw string manipulation isn’t the only way to embed SQL (or any other DSL) into a general-purpose programming language.

                      2. 5

                        Kinda like any other type of complaining, it’s easy to say that SQL is inherently insecure and should be eliminated. It’s useless, though, without the much harder part of figuring out exactly what we replace it with and why that’s better. And it’s even harder to get people to actually transition to a new thing, particularly the junior, inexperienced developers who are the most likely to write SQL injections in the first place.

                        1. 5

                          This is a bit of a tangent, but I’ve always wanted to interface to Postgres and other DBs through something other than SQL. Like I would like to be able to skip the planner and directly pipe in some tree structure that describes the internal query.

                          I could easily see such a tool mapping better to languages than SQL does.

                          1. 3

                            I’ve been thinking along similar lines: http://nick.zoic.org/sql/postgres-without-sql/ looks like it’d be fairly easy in Postgres. I figure that gives you a much more direct mapping to ORMish operations without all the tedious syntax construction.

                            1. 1

                              I’ve actually been reading through the Postgres executor code to figure out the scope of this. I’m not much of a C ninja, so having a bit of trouble getting something out but I’m going to try something.

                              1. 1

                                Yeah, it’s pretty neat in there. I haven’t got the code handy (was playing with it on a laptop, and don’t think I ever got around to pushing it anywhere) but I think I hooked into the TCP listener to accept a new command type ‘X’ … the intention was to pass in a serialized version of the structure output by the query planner … then write a python library to emit that … updates I hadn’t really thought through yet, but I think it could work similarly. Anyway, get in touch if you’re interested in discussing further.

                          2. 8

                            I get where he’s coming from but it’s still bad hyperbole. Who says that bad developers won’t write insecure code using other DBs?

                            If a given website has SQL injection errors there’s a really good chance it has a whole bunch of other available attack vectors.

                            The real issue, to me, is that jobs like plumber and electrician require licenses before you can sell your services as one. Programming does not, and I would argue it’s a much more complicated job, whose problems are harder for lay people to see.

                            1. 2

                              This is a great reaction to the post. Honestly, I don’t expect “programming licenses” to appear tomorrow. In my experience, that’s going to take decades. Yet in the meantime we still have a huge problem.

                              The bigger point that I wanted to make (not sure how well I did) was that sometimes a tool’s problems are so disastrous that they outweigh all the value it offers. If you tell me that “all we need to do is be more careful with user input” I’m going to fucking flip shit. We’ve been doing it for decades and we still can’t do it. Elections are being compromised. Identities are being stolen. People are going broke. When does it end? When do we finally decide that SQL injection is so disastrous a problem that it’s not even worth using SQL?

                              1. [Comment removed by author]

                                1. 3

                                  If the lock truly was faulty and provably inadequate you could absolutely go after the locksmith and/or the lock manufacturer for their culpability in not doing their job sufficiently.

                                  1. 3

                                    I can get locks right now that nobody knows how to bypass and others that most thieves don’t know how to bypass. There should be no break-in if the lock design works as advertised. I’d absolutely blame the artisan using a defective lock. I wouldn’t blame them if a new class of attack made a previously-good lock breakable. See how that works?

                                    Besides, there’s correct-by-construction techniques for doing many common things in programming. “pyon” and I mention a few in another part of this thread. Using broken methods when not necessary is negligence.

                                    1. 3

                                      There’s a big difference: physical systems can always be broken, if you have enough resources and try hard enough. You can’t make your house impossible to break into. You can only make it not worth a thief’s time to try. And law enforcement officers will probably still be able to break into your house if they deem it necessary. (And not because locks can magically distinguish between thieves and LEOs.)

                                      On the other hand, SQL injection is the kind of attack that is outright impossible in a properly designed software system.

                                      1. 1

                                        But if your village had two artisans, one with a licence and the other without, I would blame myself for picking the wrong one.

                                        1. 2

                                          When the US government hires the wrong developer to write an app for the elections agency and he uses SQL in the simplest, most obvious way, who do you blame?

                                          1. 4

                                            Both he and the government. DOD & NSA have standards for highly-assured systems. Medium, too, if cost or complexity prohibits high. There are also lots of solutions in GOTS and COTS to these problems. They could negotiate them free or at cost for budget operations as a term of their other lucrative contracts. They just don’t care. That’s what it boils down to. Neither does Congress in its policies. So, I blame Congress and the Executive branch primarily if we’re talking bad INFOSEC in government.

                                            Contractor was incompetent, too, but wouldn’t have been hired if the government was acting competently.

                                      2. 2

                                        The problem is that’s just kicking the can down the road.

                                        You can say the same thing about a million things; people still don’t handle credit cards securely, even big companies. Does that mean we stop using credit cards?

                                        No, there’s no easy solution (not that abandoning SQL is an easy solution but it is a simplistic one.)

                                        I agree licensed programmers are not happening any time soon but that’s really what’s needed. We need to form a union and have a licensing process.

                                        1. 1

                                          In a very wide range of applications - perhaps not all, to be sure - SQL is overkill and more limited alternatives that are better integrated with mainstream programming languages are perfectly adequate, and reduce the rate of security vulnerabilities.

                                          1. 3

                                            So, in other words, “let’s dumb down databases until they’re just as inexpressive as most programming languages”?

                                            1. 5

                                              I definitely support simplifying error-prone systems until they are safer, in general. I wouldn’t have said it how you did, but I’m willing to agree with that statement of intent.

                                              SQL is already a DSL though, and it’s about as simple as it can be for the problem it solves. What we can do on the technical front is build type-safe wrapper APIs and heavily discourage or prohibit the use of direct string concatenation to build queries, but getting people to use them is still hard.
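
                                              To make the concatenation hazard concrete, here is a minimal sqlite3 sketch (the table and the malicious input are invented for illustration) of why a bound parameter closes the hole that string-building opens:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"

# Unsafe: concatenation lets the input rewrite the query's parse tree.
unsafe = "SELECT role FROM users WHERE name = '" + user_input + "'"
print(conn.execute(unsafe).fetchall())  # [('admin',)] -- injection matched every row

# Safe: the placeholder binds the input as a value after parsing,
# so it can never become SQL syntax.
safe = "SELECT role FROM users WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())  # [] -- no such user
```

                                              A type-safe wrapper API is essentially a way of making the second form the only one that compiles.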

                                              I’m gradually coming around to the idea that licensing has the ability to imbue a sense of responsibility, but it’s hard to imagine… perhaps because we’ve never had it in this industry.

                                              1. 4

                                                You won’t find a stronger proponent than me of disallowing raw string concatenation as a means to build queries, except perhaps as a low-level implementation detail that application programmers shouldn’t concern themselves with. As my other posts on this thread make patently clear, my favorite take on how to safely build queries is Ur/Web, mostly because it’s based on typed metaprogramming, which makes error detection happen as early as possible during the build process. But code generation (Lisp macros, Template Haskell, etc.) is of course another perfectly valid approach.

                                                However, I have to vehemently oppose limiting what can be expressed in a database schema. Types and database schemata are how we make sure that our programs and our data make sense, respectively. The ability to fearlessly refactor an arbitrarily complex piece of database code comes from the certainty that any nonsense in your queries will be located and properly flagged. If programming languages “don’t grok” the structure of your database, they have to be smartened up until they do.

                                              2. 1

                                                Yes. Ideally I’d dumb down the datastore to, at least, a non-Turing-complete level. Just as I’d dumb down e.g. web templates. Having two different ways to express business logic is asking for trouble - sooner or later you’ll get the same logic expressed in both, and sooner or later it’ll get out of sync.

                                                Either the datastore should be dumb, or it should be language-integrated. If SQL were really a first-class general-purpose programming language then it would make sense to write complete systems in it, but I don’t think I’ve ever heard of that approach succeeding.

                                                1. 3

                                                  I’m not a huge fan of the one feature that makes SQL Turing-complete (recursive common table expressions), but the database is the natural place to express many business rules, which are often declarative in nature.

                                                  The inability to embed SQL into a metalanguage (namely, your programming language of choice) is a failure of the metalanguage, not SQL. Java’s type system is incapable of typing the relational algebra operators. GHC Haskell and Scala probably can, but only by explicitly using type-level maps (to keep track of the names and types of a relation’s attributes), which makes this approach unusable in practice. But Ur/Web, whose type system has just the right features to type derived relations (row polymorphism, disjointness assertions), makes using SQL as an embedded DSL a breeze.

                                                  1. 1

                                                    In principle I’m sympathetic. But if these features can only be expressed in an obscure/immature language, are they really that critical to business functionality? How have non-database programs managed without them so far?

                                                    1. 2

                                                      Data integrity is nonnegotiable in the kind of applications I write. The question is not whether to enforce it, but whether to enforce it automatically in the database or manually in the application. At first sight this is just a matter of personal preference, and, given how annoying SQL can be, it’s hard not to prefer the application side. But when you take transactions, concurrency control and error recovery into account, the advantages of doing it on the database side become clear.
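
                                                      As a concrete illustration of enforcing integrity automatically in the database: a declarative constraint travels with the data, so every writer is checked no matter which code path issues the write. A minimal sqlite3 sketch (the orders table is invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The rule lives in the schema: no application bug can sneak in a
# non-positive quantity, because the database itself rejects it.
conn.execute("""
    CREATE TABLE orders (
        id       INTEGER PRIMARY KEY,
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    )
""")

conn.execute("INSERT INTO orders (quantity) VALUES (3)")       # accepted
try:
    conn.execute("INSERT INTO orders (quantity) VALUES (-1)")  # rejected
except sqlite3.IntegrityError as e:
    print("database refused the write:", e)
```

                                                      Doing the same check manually in the application means re-doing it in every code path that writes, under concurrency, and remembering to roll back on failure.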

                                                      As for how non-database programs have managed so far, I don’t know, because my professional experience is primarily with database applications. But in my admittedly biased opinion, database applications are a common enough use case that they deserve special attention from programming language designers. Database applications tend to manipulate data in ways not previously foreseen by the database designer, so you want to optimize for flexibility (which is precisely what the relational model does!). By contrast, if you write, say, a compiler or a text editor, you can (and most likely want to) plan ahead what data structures you will use, and if you ever decide to change data structures, it will trigger a Major Refactoring Event ™.

                                              3. 2

                                                …more limited alternatives that are better integrated with mainstream programming languages…

                                                Like what?

                                                1. 1

                                                  SQL is great if you really need ad-hoc querying with partial indices maintained on insert, can fit your data model into square tables, and really need full ACID. IME that’s actually quite rare. What tends to happen is that either people use SQL as a key-value store and do their aggregation in an explicit userspace batch process (in which case they would be better off with a key-value store designed as such, plus map-reduce-style batch processing or CQRS/lambda-architecture-style near-realtime-but-nonblocking aggregation), or people need truly ad-hoc querying, put indices on every column, and then have write-load problems caused by the transactional model, in which case they would be better off with an analytics-oriented store that is explicitly more lossy. Needing the ad-hoc model and strict transactionality at the same time is quite rare (for exploratory programming you rarely need precise answers; if you need a precise answer, you normally know the precise question), and you’re better off picking one or the other.
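
                                                  The “key-value store plus userspace batch aggregation” pattern can be sketched in a few lines of plain Python (the event records and the sum-per-user rule are invented, but the shape is the general one: map emits key/value pairs using ordinary business logic, reduce folds the values per key):

```python
from collections import defaultdict

# Hypothetical event records as they might come out of a key-value store.
events = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

# Map: emit (key, value) pairs -- this is ordinary application code,
# so the same business logic the rest of the program uses is reusable here.
def map_event(event):
    yield event["user"], event["amount"]

# Reduce: fold the values for each key.
def reduce_sums(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = (pair for e in events for pair in map_event(e))
print(reduce_sums(pairs))  # {'a': 17, 'b': 5}
```

                                                  A real batch framework distributes the map and reduce steps, but the contract with the programmer is exactly this pair of functions.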

                                                  1. 2

                                                    But to be concrete, which datastore are you recommending?

                                                    Typically businesses need to see ad-hoc querying, analysis, and OLTP performed on “the same data”. Not the same datastore, but the same dataset. Having all the data in one integrated datastore makes that really easy for even moderately large sites. Replacing a SQL database with multiple datastores and pipelines between them is a hard sell purely on its merits, not even considering the current state of the industry.

                                                    1. 1

                                                      If you make me pick just one, Cassandra.

                                                      In the current industry it’s easier than ever to run multiple datastores and have a replication pipeline or regular batch job. You can even keep the traditional SQL database interface for ad-hoc BI-type work - the transactionality issues don’t matter if it’s read-only and the security issues matter less if it’s internal-use-only.

                                                      1. 1

                                                        So coming back to your original comment, with regards to Cassandra as a more limited alternative that is “better integrated with mainstream programming languages” – what are the big advantages that you see with regards to the API that it exposes?

                                                        1. 1
                                                          • Easier to make a value look like a programming-language value (in particular, collection types rather than having to map collections to tables)
                                                          • Explicit distinction between looking up by key and queries that will result in a full table scan - these operations have very different behaviour but in SQL they look the same.
                                                          • “External” map-reduce aggregation possible - so you can reuse your business logic when doing aggregations
                                                          • Key-value model covers the most important cases of the “arbitrary subset of columns” model that SQL supports, but you can write the type of it easily
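
                                                          The second bullet — making cheap key lookups and expensive scans look different at the API level — can be illustrated with a toy in-memory store. This is not Cassandra’s real driver API, just the shape of the distinction (analogous to CQL making you write ALLOW FILTERING for non-key queries):

```python
# A toy key-value store whose API makes the cost difference explicit,
# instead of hiding both behind an identical-looking SELECT.
class Store:
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        """Cheap: a single lookup by key."""
        return self._rows.get(key)

    def scan(self, predicate):
        """Expensive: visits every row -- and the name says so."""
        return [v for v in self._rows.values() if predicate(v)]

s = Store()
s.put("alice", {"role": "admin"})
s.put("bob", {"role": "user"})
print(s.get("alice"))                         # one lookup
print(s.scan(lambda v: v["role"] == "user"))  # visibly a full scan
```

                                                          In SQL, `WHERE id = ?` and `WHERE some_unindexed_column = ?` are syntactically identical; here the caller cannot confuse them.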
                                                          1. 2

                                                            “External” map-reduce aggregation possible - so you can reuse your business logic when doing aggregations

                                                            I’m understanding this to mean, having the app map over rows; but that doesn’t seem like a feature of Cassandra.

                                                            Key-value model covers the most important cases of the “arbitrary subset of columns” model that SQL supports, but you can write the type of it easily

                                                            If I understand this right, Cassandra is focused on storage and retrieval of whole objects; the type of every result is very likely a type that already exists in your application; whereas a SQL query can return any product of the columns of these types or any subset thereof.

                                                            1. 1

                                                              I’m understanding this to mean, having the app map over rows; but that doesn’t seem like a feature of Cassandra.

                                                              It exposes a hadoop-compatible API directly. You can of course build the same thing for an SQL database but it’s harder: popular querying APIs are not streaming-oriented (I once saw an SQL server brought down because someone had visited a web page 23 days previously - it had been chugging through figuring out the result set for those 23 days, and then started trying to actually stream the results and stopped responding to any other queries), and you have to be very careful with your transaction isolation levels if you’re going to run a long-running query. (And it can be difficult to test, since there’s a “fun” failure mode where the results start off streaming fine, but get slower and slower the longer the query runs as the “snapshot” diverges from the live database).
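
                                                              The streaming point can be shown even with in-process sqlite3 (a toy stand-in — a networked database with server-side cursors is where it really matters): iterating the cursor pulls rows incrementally, instead of materializing the whole result set the way fetchall() does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT)")
conn.executemany("INSERT INTO hits VALUES (?)", [("/home",)] * 1000)

cur = conn.execute("SELECT page FROM hits")

# Non-streaming: rows = cur.fetchall() builds the entire result in memory.

# Streaming: iterate the cursor and handle one row at a time, so a huge
# result set never has to exist in memory all at once.
count = 0
for (page,) in cur:
    count += 1
print(count)  # 1000
```

                                                              Many popular query layers only expose the fetchall-style interface, which is exactly the failure mode described above.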

                                                              If I understand this right, Cassandra is focused on storage and retrieval of whole objects; the type of every result is very likely a type that already exists in your application;

                                                              Yeah. You end up meeting in the middle - you split your datatypes up to be more storage-friendly, or else you have a separate DTO layer with a storage-oriented representation. Which may sound like more overhead, but I find it’s a lot more practical (in terms of testability etc.) to have an explicit transformation in plain old code before a network boundary than to have complex mapping commingled with the remote call.

                                                              1. 1

                                                                I’m understanding this to mean, having the app map over rows; but that doesn’t seem like a feature of Cassandra.

                                                                It exposes a hadoop-compatible API directly. …

                                                                I gather this is all client side.

                                                                I hear what you are saying about streaming – relatively few query APIs are built using the async APIs exported by databases or build off of cursors.

                                                                I am not sure the transaction snapshot handling of Cassandra is a selling point, since it seems to be possible to see old rows or mixed writes in a variety of scenarios. Please correct me if I am wrong.

                                                                1. 1

                                                                  I gather this is all client side.

                                                                  Query prioritization/scheduling is the server’s responsibility - the 23-day query should never have been permitted to drown out all the usual business queries. Using the correct transaction isolation level is the client’s responsibility up to a point, but if getting it right is difficult and error-prone then that’s a system failure.

                                                                  I am not sure the transaction snapshot handling of Cassandra is a selling point, since it seems to be possible to see old rows or mixed writes in a variety of scenarios.

                                                                  Depends on the use case. IME in the case of big, aggregating, ad-hoc queries (usually BI-type work), you would rather have inaccuracy than deadlocks. And I think Cassandra’s limitations are more visible, which drives better data design, whereas SQL databases tend to paper over the problems with your data model more.

                                                                  1. 1

                                                                    I gather this is all client side.

                                                                    Query prioritization/scheduling is the server’s responsibility …

                                                                    I am sorry to belabor the point but the map/reduce – it is just a client-side façade? It’s not actually distributed or anything?

                                                                    1. 1

                                                                      map/reduce on Cassandra is, or at least can be, genuinely distributed (assuming your Cassandra instance is). Sorry if I was unclear. I don’t think distributed vs. not is the important property, though.

                                        2. 0

                                          A fool and his relational database are soon parted.