1. 15
  1. 3

    What I find interesting about this outage is that I’m not sure how to avoid it. This particular issue is literally something I have seen everyone I know running PostgreSQL at scale run into. It is something that is documented but I have yet to see anyone learn about it through any path other than an outage. What is insidious is that it violates most humans mental model.

    1. It violates an inductive view of the world. Transaction X + 1 all of a sudden because terribly performing.
    2. It is difficult to simulate, you have to run a lot of traffic to hit it in a simulation.
    3. It takes awhile to show in production and it will happen infrequently giving people a false sense of security. It doesn’t “fail fast” if you aren’t handling it correctly.
    4. It involves understanding a fair amount about the database to predict its effect.

    IMO, it feels like this autovacuum stuff is a design defect in PostegreSQL and outages related to it won’t go away until it goes away. This whole class of defects is very fascinating, how do you make something like this visible much earlier, and consistently, or get rid of it completely?

    1. 2

      Interesting but mildly scary: patching out a sleep call from a running process.

      a PostgreSQL function … was explicitly sleeping as part of the cost-based vacuum delay mechanism, which attempts to limit resources used by vacuum operations. In this case, with all database activity effectively blocked on this operation, this delay was only hurting us. Since these tunables are not dynamically adjustable, we manually patched the running process to avoid sleeping in this case. This improved performance considerably.

      1. 7

        “Mildly scary” is certainly correct; here’s the log from our chat at the time.

      2. 1

        I am a little puzzled about the file prefetching. How many files does Postgres create?

        1. 2

          I believe Postgres creates a directory per database, and a file per table. I think each index is a separate file too?

          EDIT: reference

          1. 2

            Ya, that’s what I thought. For somebody to notice that there’s a pause each time a new file is opened, that’s a lot of tables.

            1. 8

              Postgres chunks its data across files – so it’s not that it was a lot of tables, just a lot of data. And there wasn’t a pause when it was opened, it was that the files themselves were on the platters and we could see Postgres coming off CPU and blocking on I/O. Using DTrace, we easily determined which files were inducing this – and could see that it was marching through them in numerical order. Based on this, we prefetched manually to stay ahead of it (and then ran into the second issue).