I think it’s a success story for ActiveRecord - it is such a powerful, flexible tool that users don’t feel bothered to learn the underlying database concepts.
And I think this kind of article prompting folks who only know the higher-level abstraction to start learning the database is very welcome. Just this week I fixed a bug in this (anonymized) method on User, which has_many Groups through Memberships, and Groups might be in a Category that’s visible or not:
def first_visible_group
  memberships.collect { |m| Group.find(m.group_id) if Group.find(m.advisor_id).category.visible? }.first
end
The author wasn’t comfortable with a complex join, so instead fetched every Membership, looped over them to retrieve each Group, selected each Group’s Category and, if visible, returned the Group. That’s not just a 1 + n query problem, that’s 1 + 2n + 1 queries… and it doesn’t work.
The curious can take this moment to spot the bug if they want.
The method often returns nil even when the user has a visible group. The error is that the memberships are mapped (collect is a synonym for map) to an array of elements where each element is either a visible group or nil. The method returns the first element of this array rather than the first non-nil group in the array.
The bad fix is to call .compact on the array, but that doesn’t change the fact that this method loads every membership, group, and category in the most inefficient way possible just to get one group. The correct fix is a single query that joins through to Category and filters on visibility.
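That single-query shape can be sketched in plain SQL (here via Python’s sqlite3; the table and column names are assumptions based on the description above, not the original app’s schema):

```python
import sqlite3

# Assumed schema: User has_many Groups through Memberships, and each
# Group belongs to a Category that is visible or not.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE categories  (id INTEGER PRIMARY KEY, visible INTEGER);
    CREATE TABLE "groups"    (id INTEGER PRIMARY KEY, category_id INTEGER);
    CREATE TABLE memberships (id INTEGER PRIMARY KEY, user_id INTEGER, group_id INTEGER);
    INSERT INTO categories  VALUES (1, 0), (2, 1);
    INSERT INTO "groups"    VALUES (10, 1), (11, 2);
    INSERT INTO memberships VALUES (100, 7, 10), (101, 7, 11);
""")

# One query instead of 1 + 2n + 1: join memberships -> groups -> categories
# and let the database return the first visible group.
row = db.execute("""
    SELECT g.id
    FROM memberships m
    JOIN "groups" g   ON g.id = m.group_id
    JOIN categories c ON c.id = g.category_id
    WHERE m.user_id = 7 AND c.visible = 1
    ORDER BY m.id
    LIMIT 1
""").fetchone()
print(row)  # (11,) - the first visible group's id
```

The same query is what ActiveRecord would emit for a joins-through-category scope; the point is that it is one round trip no matter how many memberships exist.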
The point of all this is to say that we have tools that are so powerful that developers can write lots of ignorant code to get their job done. This is how the “Every problem can be fixed by another layer of abstraction” aphorism works. And yes, every abstraction leaks, but in this case it meant a successful business started a year or so earlier than it would have if the junior dev founder had needed to know SQL cold.
…it also means that if you are a consultant who knows the implementation of (at least) one level of abstraction more than most people, you will never want for satisfied customers.
At work, I was tasked with fixing a process that was taking on the order of two hours to do a bit of work which, by my hand-crafted sanity-check query, should have taken about thirty seconds. Sure enough, the ORM was emitting multiple queries (to the tune of 200k) when it should have done a single join. Worse, this contradicted what the ORM documentation claimed should have happened given how it was configured. (Thus, I don’t think it was the original author’s fault; he coded it appropriately, but either an edge case or a bug in the ORM caused an explosion in runtime once the dataset grew significantly. In any case, I know him and he’s not stupid.)
My solution was to bypass the ORM altogether and use manual queries mapping result rows to a convenient object representation. Frankly, I have yet to encounter a better pattern for dealing with databases from code. Trying to guess how an ORM will interpret your code and perturb it into behaving vaguely sensibly is a painful and stupid waste of time.
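A minimal sketch of that pattern, using only Python’s standard-library sqlite3 with an invented schema: hand-written SQL, each result row mapped into a small, convenient object.

```python
import sqlite3
from dataclasses import dataclass

# Invented names for illustration: one explicit query that aggregates
# in the database, then a cheap row-to-object mapping in app code.
@dataclass(frozen=True)
class OrderSummary:
    customer: str
    total: float

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('ada', 10.0), ('ada', 5.0), ('bob', 3.0);
""")

def order_summaries(db):
    rows = db.execute("""
        SELECT customer, SUM(amount) FROM orders
        GROUP BY customer ORDER BY customer
    """)
    # The "mapping" layer is one comprehension; no ORM guesswork.
    return [OrderSummary(customer, total) for customer, total in rows]

print(order_summaries(db))
```

The query stays visible and tunable, and the objects stay dumb, which is the whole appeal of the approach described above.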
I like SQLAlchemy’s approach. It’s really just SQL and gives you some flavor-specific additions you can drop in if you need them, but you are just writing Python-ish SQL statements (this is great when you can throw a case statement you’re going to use in 10 places into a variable so your SQL statements look more readable). The ORM bit only really comes into play for the referencing pieces (I want this table, and I want it to guess which indexes to join on, unless I specify).
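The reuse trick doesn’t even require SQLAlchemy’s expression objects; as a rough stand-in, here is the same idea with a SQL fragment held in a plain Python variable (schema invented):

```python
import sqlite3

# A CASE expression kept in one variable and dropped into several
# statements, so each query stays readable. This is the plain-string
# version of what SQLAlchemy lets you do with composable expressions.
size_bucket = "CASE WHEN amount < 10 THEN 'small' ELSE 'large' END"

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (amount REAL);
    INSERT INTO orders VALUES (3.0), (12.0), (25.0);
""")

# The same fragment reused in two different queries.
buckets = [r[0] for r in db.execute(f"SELECT {size_bucket} FROM orders ORDER BY amount")]
per_bucket = db.execute(
    f"SELECT {size_bucket} AS bucket, COUNT(*) FROM orders GROUP BY bucket ORDER BY bucket"
).fetchall()
print(buckets)     # ['small', 'large', 'large']
print(per_bucket)  # [('large', 2), ('small', 1)]
```

SQLAlchemy’s real expression objects add quoting and dialect handling on top of this, but the maintainability win, one definition used in many statements, is the same.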
My experience with other ORMs, Entity Framework for example, is that ORMs are great until they’re not, and if they acknowledge this and let you drop down to raw SQL for the bits they can’t handle, then at least they’re not dictating your limitations for you.
I second this. Three jobs and about a decade ago I was working on a project
where we discovered late in development that one of our queries was insanely slow,
racking up something like 1201 SELECTs. (I can no longer remember the
specifics, but I think it was a naïve news feed trying to implement a
flickr-style permission model.) One of my colleagues came in late one day
because he’d been up to 4am the previous night bringing it down to 2 queries.
(1 for accurate count of items, 1 for returning the first page.) This required
forking the ORM, and—since we were good guys—trying to feed patches back.
Another approach would have been to rewrite the entire app, and you could
argue that would have been the correct approach, but for various reasons I
don’t remember—if I were to guess: time pressure, fallacy of sunk cost—we
didn’t do that.
The project was eventually canned for reasons unrelated to our ORM woes, but
at some point since then I’ve ended up eyeing ORMs suspiciously; they make it
too easy to paint yourself into a corner that is difficult to get out of. I
too have seen many heinously inefficient queries or algorithms arise from
idiomatic use of ORM that probably worked alright in testing, but turned out
to be terrible when throwing real data sets at them. In my previous job we had
regular downtime due to this, because someone had written a piece of code
that, when adding to a wishlist, pulled all a user’s wishlist entries out of
the DB, added an entry at the end, then attempted to save the entire list.
This worked fine with small wishlists, but one user in particular had over
26,000 entries in their wishlist and we had an outage every time they
interacted with their wishlist.
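The fix for that shape of bug is a single INSERT that appends one row, rather than round-tripping the whole list; a sketch with sqlite3 and invented names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wishlist (user_id INTEGER, item TEXT)")
# The pathological user with a huge existing wishlist.
db.executemany("INSERT INTO wishlist VALUES (1, ?)",
               [(f"item-{i}",) for i in range(26000)])

def add_to_wishlist(db, user_id, item):
    # One INSERT, regardless of how many rows the user already has:
    # no loading the list, no re-saving the whole thing.
    db.execute("INSERT INTO wishlist VALUES (?, ?)", (user_id, item))

add_to_wishlist(db, 1, "one more thing")
count = db.execute("SELECT COUNT(*) FROM wishlist WHERE user_id = 1").fetchone()[0]
print(count)  # 26001
```

The append is constant cost no matter the list size, which is exactly what the load-modify-save pattern above loses.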
The ORM made it difficult, or at least non-idiomatic, to add pagination. In my
experience they also make it far too simple to retrieve a list of objects and
map over it to get a single value rather than just write a query to get that
value. Perhaps some ORMs are getting smart enough to recognise that now, but
I’d rather have it explicit than implicit. Though perhaps this feeling, too,
shall pass. As I say this I remember that I used to sneer at automatic
memory management, advocating manually managing memory to stay “close to the
metal”. Yet, the last few years all my code has been running on various VMs,
and I’m quite happy using managed memory. The closest I’ve been to using
manual memory management this last decade is probably reference counting in
Objective-C.
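The “map over a list of objects to get a single value” pattern mentioned above is easy to make concrete; a small sqlite3 sketch of the two approaches (schema invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE products (name TEXT, price REAL);
    INSERT INTO products VALUES ('a', 2.0), ('b', 9.0), ('c', 4.0);
""")

# The tempting ORM-ish shape: pull every row, then reduce in app code.
max_price_mapped = max(row[1] for row in db.execute("SELECT name, price FROM products"))

# The explicit query: the database returns just the one value.
max_price = db.execute("SELECT MAX(price) FROM products").fetchone()[0]

print(max_price, max_price == max_price_mapped)  # 9.0 True
```

With three rows the difference is invisible; with three million, the first version ships the whole table over the wire to compute one number.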
That’s a good look at the other side of the coin, but also it should be taken as a difference in values. Active Record is terrible if your values state that you want performance and safety, but it’s nice if you want fast prototyping and a low cost developer workforce.
So we might also say “right tools for the job”.
The question is always that these are only proxy measures for the success of a product or a technology. A low-cost developer base becomes a liability unless you invest in training. Performance and correctness are a heavy cost if you don’t yet know the product direction.
SQL, I think, could see the biggest wins if it had an overhaul as a language. LINQ is a good look at a direction this could go (but not, e.g., something like jOOQ, I think, where the DSL gets caught up in syntax instead of semantics). Composability would increase reusability and make SQL maintainership less burdensome and its creation faster—perhaps.
(Author here.) This was slightly tongue in cheek. The original paper (see the link about feral concurrency) has much more to say about this… they studied 100 open source Rails apps and found their consistency guarantees pretty lacking.
I know a fair number of people who write applications without thinking about these sorts of things; they, along with those in the paper, are my target audience.
I didn’t mean to come off as hostile; this is a good post, given the parlous state of software development these days. I was more bemoaning the fact that Rails developers are happy to assume the complexity of an RDBMS but then, having paid the cost, throw away all of what makes the database useful. It’s symptomatic of the general world of building programs, in my experience, and it makes me sad.
I’ve long wondered why developers who prefer the ORM style of working even use an RDBMS at all, instead of just some framework that does object persistence (and maybe indexing). Paul Graham’s web framework that runs HN does take that approach, but elsewhere the ORM style seems to be more common, even though, as you note, you don’t really get many of the advantages of the RDBMS that way. Or maybe it is just that: ORMs are just the easiest path to object persistence + indexing?
That’s a useful question. Object databases are actually somewhat more difficult to implement than relational ones. There are a few, like the Franz, Inc. Allegro tools, which use the same pricing model as Oracle, and are only useful if your software is written in Common Lisp. Or, since I was involved with it a long time ago, I should mention Elephant, an open-source system, also in Common Lisp, which uses existing relational databases as its backend; its internal schema is nothing like a table structure, but it does work.
From the papers I’ve read about implementing object databases, I actually suspect that it isn’t so much that they’re harder to implement in absolute terms; rather, getting one off the ground would require a lot of the same expertise, and people who understand relational databases well enough to create new engines are generally already working on existing engines, and have no interest in object databases.
Also, in recent years, sharding has become quite important. In relational databases, a domain expert who understands data modeling really has to get involved and figure out a strategy for sharding this particular data in a way that produces good locality and avoids most locks that span multiple data centers. In object databases, the reference graph is much more arbitrary in shape; someone would have to invent good techniques to shard them at all. It would need a serious commitment.
Also, object databases want to be able to perform garbage-collection passes. In a relational database, notionally a table owns its rows, so if a row hasn’t been deleted, it still has a reason to exist. One could certainly design an object database that only ever performs explicit deallocation, but being able to do it automatically is a significant advantage. Or would be, except that it doesn’t work at all well in a distributed context. There are papers suggesting techniques that should work, but most of them involve global locks… not great for performance. So that would be unexplored territory as well.
None of that means that a small site that doesn’t need multiple servers couldn’t do well with an object database. So it comes back to, there just aren’t any inexpensive, simple ones. An equivalent of sqlite for objects would be quite something.
An equivalent of sqlite for objects would be quite something.
I think I’d use that! I’m especially interested in the case of smallish amounts of data (tens of GB or less) and smallish number of simultaneous users (<100). Then a lot of the scaling issues are not so relevant. A huge portion of webapps fit into that category, including most of what I’m likely to write, and yet people are using RDBMSs (and ORMs on top of RDBMSs) for that case, too. I mean, HN runs on one server without a real DB, and not that many webapps realistically are going to get more than HN levels of traffic.
That the authors, in all honesty, can call basic DB design practice “weird tricks” tells you all you need to know about the dire state of software.
I considered it more a failure of modern ORMs (and by extension Rails, in this case).
Might want to add a bit on how the CHECK clause is parsed but ignored by all storage engines in MySQL. I don’t mean to nitpick, but I ran face-first into that specific issue in my recent past.
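For contrast, here is CHECK doing its job in an engine that does enforce it (sqlite3); on the MySQL versions being described, the same constraint would parse cleanly and then enforce nothing:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance REAL CHECK (balance >= 0)
    )
""")
db.execute("INSERT INTO accounts VALUES (1, 10.0)")  # fine

try:
    db.execute("INSERT INTO accounts VALUES (2, -5.0)")
    rejected = False
except sqlite3.IntegrityError:
    # sqlite enforces the CHECK; historically MySQL would have
    # accepted this row without complaint.
    rejected = True

print(rejected)  # True
```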
Oh wow I did not know that!! I’ve been using Postgres at work & examples were geared toward that :(
Thanks!
Pleasure is all mine! I think the rest of the article provides nice tricks but by now I’m pretty much convinced that consistency in modern apps is pretty much pure luck in most cases. People need to learn about the issue especially if they are doing Web.
Tell me about it. In my first job my boss would refuse to put unique indices on a table that relied on keys being unique (because they were user sessions), because it would slow down inserts. They preferred to deal with the uniqueness checks in the PHP userland code. We had numerous issues where this userland uniqueness did not hold and we ended up showing users data from someone else’s session.
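A unique index turns exactly that failure into a hard error at the database instead of a best-effort check in application code; a sketch with sqlite3 (names invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (session_key TEXT, user_id INTEGER)")
# The constraint the boss refused: with this index, duplicate keys are
# impossible even under concurrent inserts, because the database enforces it.
db.execute("CREATE UNIQUE INDEX sessions_key ON sessions (session_key)")

db.execute("INSERT INTO sessions VALUES ('abc123', 1)")
try:
    # A second insert with the same key (say, a race between two
    # requests) fails loudly instead of leaking another user's session.
    db.execute("INSERT INTO sessions VALUES ('abc123', 2)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False
```

Application-level checks can always race between the SELECT and the INSERT; the index closes that window for good.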
Yes, ActiveRecord and MySQL are locked in a spiral of codependence and misfeature. Which bad idea came first is probably unresolvable at this point.