1. 17
  1. 7

    I was doing a code review and was about to comment that COUNT(*) is an antipattern, but decided to do some investigation whether that’s (still) the case and provide some background info. Then I found this post, so I was completely wrong here! I wonder where that “knowledge” came from - perhaps misguided handed down information? Or MySQL-specific info that I mistakenly took with me when I switched to Postgres all those years ago?

    1. 2

      Interesting, I have always “known” that count(1) is the correct way, but no idea where I got that from either.

      Though if count(1) were actually slower that would be bonkers. I should hope Postgres can be very confident 1 is not NULL and not check it for every row…

      1. 2

        Postgresql can compute count() amazingly quickly in some/many cases, but it remains a sign of bad code.

        When you use a good database (and postgresql is the best database in my neutral, modest and well-informed opinion), then you should use it by issuing queries that reflect your goal as closely as the query language allows. If your use can be described as “something something all the rows something something small result”, then you have a detour via all the rows. That detour is long and may be costly for the database. (Or may not: Postgresql has many tricks to to count without actually looking at all the rows). Conceptually count() is a about all the rows, though.

        It may also show that the way you express your task isn’t quite right, and maybe that you don’t really understand what you’re trying to do with the database or the best way to express it in code.

        EDIT: count() is usually used to count all the rows in a table, and my argument above is based on that. However, it can also be used to count a smaller set for rows, and my argument holds for that case too. In that case, the detour is via a number. Databases operate on sets, and if your code leaves set domain that’s often a sign of a detour, which in turn is a sign that the queries aren’t as clear as they could be, and a lack of clarity is a sign of bugs.

        1. 5

          For the review in question, it was a COUNT(*) with a WHERE condition, so essentially it meant “how many rows match this condition?” which is indeed expressed perfectly with a COUNT.

          1. 2

            Then we get into the issue of count(*) versus count(thingy), where the former counts all rows while the latter excludes rows where that column is null. Which one is a more precise expression of the thing to count? Whichever is more precise is (warning: more designy-touchy-feely verbiage follows) has ≥51% chance of being faster too.

            Part of what postgresql will do to avoid counting rows in the slow way is to look for a usable index. So if you have a suitable index on the table and the query is a suitable expression of the code’s intent and harmonious with your other use of the table, the ducks often line up naturally and postgresql counts quickly.

            1. 1

              It’s been ages since I did any serious database work, but it feels like an antipattern to me if it’s in a frequently used thing because the rule of thumb for databases should be ‘don’t compute on the hot paths’. If you need the number of things that match a particular condition frequently then your normalised database should have that list in a separate table and can then either update a count on insert or query the count dynamically depending on how you want to trade concurrency of inputs (updating a count is a serialising operation) with speed of queries (reading a count entry is faster than counting the elements in a table).

              More generally though, there’s a point that the article hints at but doesn’t explain the detail of: If you’re querying the count and actually want anything other than an approximation then you’re almost certainly doing something wrong. The reason for this is one word: concurrency. Unless the count is part of a transaction, you’re racing against other updates, so the count that you get is correct when you query it but isn’t necessarily right when you do whatever you’re doing based on the end. If your count is part of a transaction, then you’re doing an expensive serialising operation as part of a transaction and so are probably introducing bottlenecks. If you don’t have multiple concurrent writers, then using an RDBMS is probably overkill.

        2. 2

          I implemented the equivalent of that trigger solution for SQLite in my sqlite-utils Python library back in January: https://sqlite-utils.datasette.io/en/stable/python-api.html#python-api-cached-table-counts

          1. 3

            SQLite special-cases “count(*)” to avoid a full scan. It still has to read each page of the table, but it just adds up the row-count stored in each leaf page’s header, so the CPU cost is near zero.

            1. 2

              Huh, I hadn’t seen that! My experience is that a count(*) on a table with a few million rows in it still takes long enough that’s its useful to have an optimization though.

              1. 1

                I drew a blank looking it up in the SQLite docs just now, but I swear I read about this in some docs on the query optimizer.

                (If it kept a count field with the pointers on the first-level interior nodes it would be much faster because it’d only have to read about 1/100 as many pages, but it probably wasn’t worth allocating ~100 bytes of the page for this specific optimization.)