1. 32

The paper even mentions our very own https://lobste.rs


  2. 11

    Neat to see the “Lobster” codebase get used this way. And that explains some strange and useless issues I recently got. I had thought they were from some new, half-baked commercial product or service that wanted to contribute to Lobsters as a marketing effort (this has happened a couple times and so far been a complete waste of time). This is the only contact I’ve had with the study authors (I have no idea if they contacted jcs), though I see they’re also local to Chicago.

    Looks like this analysis is at least a few months old; when I added rubocop it caught all the opportunities to use find_by.

    1. 3

      :/ about the useless issues opened.

      There was a long history of automated analysis at Google and a takeaway was, more or less, “you don’t really have a useful analyzer ’til it mostly finds fixes coders think are useful.”

      I wonder what folks would come up with given incentives like that; currently, there’s probably more pressure to maximize your claimed number of bugs to get published.

    2. 5

      This is why I much prefer not to use an ORM. Database vendors have spent decades optimizing their engines, and the engines themselves are very well-suited to manipulating data in situ. ORMs encourage very generic operations and tend to require or encourage data manipulation on the client side, meaning more round-trips, more serialization, etc.

      If you’re going to use an SQL database…use an SQL database. Write queries and statements that manipulate your data. Take advantage of your database’s ability to manipulate data in ways that aren’t just “return this list sorted”.

      As a very simple example, say you need a list of MD5 sums of a field in a database. A common ORM approach would be to fetch all the field values and then calculate the MD5 sums client-side, or you could just do “SELECT MD5(foo) FROM bar” and have the database do it (assuming, of course, that your database supports MD5 calculation). You might be able to trick your ORM into doing it with some additional mechanism for applying functions to the returned result set, but at that point you’ve abandoned the database-agnostic part of the ORM and/or have to learn basically a whole new language on top of SQL when you could just…write the SQL you’re gonna end up writing anyway.
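      A minimal Ruby sketch of the contrast, reusing the `bar`/`foo` names from the example above (the fetched rows are simulated, since there’s no live database here):

```ruby
require "digest"

# Simulated result of "SELECT foo FROM bar" -- the generic approach
# pulls every raw value over the wire, then hashes client-side:
rows = ["alice@example.com", "bob@example.com"]
client_side = rows.map { |v| Digest::MD5.hexdigest(v) }

# The database can do the same work in one pass, with only the
# 32-character digests crossing the wire:
#   SELECT MD5(foo) FROM bar;   -- MySQL syntax; Postgres has md5() built in
```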

      The only time, to me, ORMs really make sense is when your product needs to support multiple database backends, but for things like web apps, how often do you really migrate your infrastructure to a different DB?

      (Shameless plug: if you need lightweight database abstraction in Python but want to use your database as a database and not a hidden-away database serialization layer, try dpdb )

      1. 12

        Huh, I see performance as one of the best reasons to use an ORM. There are a lot of techniques AR uses, like ‘select 1 from’ for testing existence, that most developers don’t know, and it precludes a lot of “eh, I don’t understand why this isn’t working so I’ll add distinct”. It gets the 99% case right better than developers do. I know the answer is git gud, but I find humans aren’t great at considering the non-obvious implications of every line of code they write, or they reuse existing code that does almost what they want, etc. The overwhelming majority of queries are boring, and an ORM helps people focus on the couple that aren’t.
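        A sketch of the existence-check trick mentioned above, written as SQL strings since there’s no database in play here. The exact SQL ActiveRecord’s `exists?` emits varies by Rails version and adapter; the `users`/`email` names are just illustrative:

```ruby
# Roughly what ActiveRecord emits for User.exists?(email: "a@b.c") --
# the database can stop at the first matching row:
exists_sql = <<~SQL
  SELECT 1 AS one FROM users WHERE users.email = 'a@b.c' LIMIT 1
SQL

# What a hand-rolled check often looks like -- forces the database to
# count every matching row before answering:
count_sql = <<~SQL
  SELECT COUNT(*) FROM users WHERE users.email = 'a@b.c'
SQL
```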

        1. 1

          Not only that; once you know AR well you can quickly generate relatively well-optimized queries (eg Foo.where(id: Bar.select(:foo_id)) generates one query whereas using pluck would generate two).

      2. 3

        Is there any research showing whether an ORM-based application is more riddled with these problems than applications written without ORMs but still using SQL? …

        A good example in a project I have written: I do Select 1 from foo where x = 4 rather than a count to check for the existence of something …. But I could see every entry-level programmer arguing “I counted the results – it was bigger than one” … It’s not wrong, just an inefficient approach until you learn better.

        So are ORMs helping us learn better by avoiding these mistakes?

        1. 1

          *checks GitHub* Nope, no PRs for any of these as far as I can tell.

          Although, I suppose our server load is already pretty low anyways. Still, it never hurts to cut response times where we can.

          1. 2

            They opened two issues, but neither was usable (linked in my other comment). We do have a couple hotspots, but in general serving a couple hundred records to a few thousand people per day is hard to get wrong.

          2. 1

            This analogy is a stretch, but ORMs and DBs face a problem a little like what dynamic language runtimes face: with more understanding of how the code really runs, you could make it run faster (x is almost always a float in this JS function, or we almost always/never retrieve obj.related_thing after this ORM query retrieves some objs), but that info isn’t readily available when the code is first run.

            JITs deal with this by recording specifics of what happens at each callsite, then making an optimized path for what usually happens. You could imagine ORMs tagging query results with the callsites they came from, and figuring out things like “this bulk query should probably retrieve this related object” or “looks like this query is usually just an existence check” from how the result was actually used.
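            A toy sketch of that idea in plain Ruby (all names here are hypothetical, not any real ORM’s API): tag each result set with the callsite that produced it, and count how it was actually consumed so a hot callsite could later be specialized:

```ruby
# Toy sketch: remember where each result set was created, and record
# whether that callsite only ever existence-checks its results (a
# candidate for SELECT 1 ... LIMIT 1) or reads them in full.
class TrackedResult
  STATS = Hash.new { |h, k| h[k] = Hash.new(0) }

  def initialize(rows)
    @rows = rows
    # The file:line that asked for this result set
    @callsite = caller_locations(1, 1).first.to_s
  end

  def any?
    STATS[@callsite][:existence_check] += 1  # usage was just an existence test
    @rows.any?
  end

  def each(&blk)
    STATS[@callsite][:full_read] += 1        # usage consumed the rows
    @rows.each(&blk)
  end
end

result = TrackedResult.new([{ id: 1 }, { id: 2 }])
result.any?  # logged as an existence check at this callsite
```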

            An inherent challenge with this kind of thing is that you need to make sure that tracking isn’t so expensive it eats up any advantage it brings. We take for granted that JVMs, V8, etc. do magic with our code, but that’s with big teams of experts working on them over years. Perhaps a more achievable thing is more like profile-guided optimization, where you do test runs in a slow stat-tracking mode and some changes get suggested to the code.

            That’s sort of a big dream and there is much lower hanging fruit; pushcx notes rubocop was able to find a chunk of things with static analysis, and stuff like “this query scans a table” or just “profiling shows this line is empirically one of our slowest” probably does a lot for big hotspots out there with a lot less trickiness.

            1. 1

              Inefficient Rendering (IR). This one was a surprise to me in the measured impact. It’s very common in Rails to call link_to within a loop to generate an anchor for each item.

              Replacing that with one call to link_to outside the loop, plus a gsub within the loop, is faster.
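              A hedged sketch of the pattern (this is not the paper’s actual patch, and `link_to` here is a trivial stand-in for the Rails helper, just enough to show the shape):

```ruby
# Stand-in for Rails' link_to helper
def link_to(text, url)
  %(<a href="#{url}">#{text}</a>)
end

users = [["alice", "/u/alice"], ["bob", "/u/bob"]]

# Helper call inside the loop -- one link_to invocation per row:
per_row = users.map { |name, url| link_to(name, url) }

# One helper call outside the loop, then a cheap gsub per row:
template = link_to("NAME", "URL")
cached = users.map { |name, url| template.gsub("NAME", name).gsub("URL", url) }

per_row == cached  # same HTML either way, fewer helper invocations
```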

              I wonder which is causing the bigger speed up: inlining the partial view, or caching the link_to and using gsub?

              1. 1

                Inefficient data accessing (ID) results in slow data transfers by either not getting enough data at once (resulting in the familiar N+1 selects problem) or getting too much data. This can be due to inefficient lazy loading (still prevalent in this study) or too-eager eager loading. One patch in Lobsters, for example, replaced 51 database queries with one.
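                A simulated sketch of that N+1 shape (the table and association names are hypothetical, and this is not the actual Lobsters patch; a query log stands in for the database):

```ruby
queries = []
stories = [{ id: 1, user_id: 1 }, { id: 2, user_id: 2 }, { id: 3, user_id: 1 }]

# Lazy loading: one user query per story -- N extra queries
# on top of the initial stories query:
stories.each do |s|
  queries << "SELECT * FROM users WHERE id = #{s[:user_id]}"
end

# Eager loading: one batched query for every needed user id,
# roughly what Story.includes(:user) arranges in ActiveRecord:
ids = stories.map { |s| s[:user_id] }.uniq
batched = "SELECT * FROM users WHERE id IN (#{ids.join(', ')})"

queries.size  # the lazy approach issued one query per story
```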

                1. 3

                  There’s some really interesting theoretical work being done to compile template files into optimal queries.

                  In theory a templating language can statically determine what data needs to be fetched ahead of time, but integrating that into an application framework is a significant job.