In my opinion, the biggest problem with Postgres is that people try to run it on memory-starved cloud servers. Even the example that the article gives shows ludicrously small numbers:
Suppose our movies table has 10 million live and 40 million dead tuples, making 80% of the table obsolete data. Assume also that the table has many more columns than what we are showing and that the average size of each tuple is 1KB. With this scenario, the live tuples occupy 10GB of storage space while the dead tuples occupy ~40GB; the total size of the table is 50GB.
We run Postgres in production with at least 256 GiB of RAM, and usually 512 GiB or more. If you have physical hardware, this actually adds very little to the cost. We handle databases containing terabytes of data on fast NVMe storage with a substantial page cache. This works perfectly fine, even with relatively high write load.
Even if you’re using something like OVH or Hetzner, more RAM is just not that expensive. It is only in the “traditional” clouds like GCP, AWS, Azure, DigitalOcean, Linode, etc., where you get charged an arm and a leg for RAM.
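FWIW, the kind of bloat in the article’s example is easy to spot from the stats views. A minimal sketch, using the article’s movies table (any table name works):

    SELECT relname,
           n_live_tup,
           n_dead_tup,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
           last_autovacuum
      FROM pg_stat_user_tables
     WHERE relname = 'movies';

In the article’s scenario this would show roughly 10M live vs 40M dead tuples against ~50GB of table size, which is exactly the signal that autovacuum isn’t keeping up.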
Of course, by throwing more money at the problem you can basically solve it, and that’s a valid approach. It is still true that PostgreSQL could just do better in this regard, though. Even so, it is still my favorite database system by far.
I am not disagreeing. However, regarding “throwing more money at the problem”: spending less than EUR 60 for a 256GiB instance (Hetzner server auction), or EUR 134 (this one), for 10 million live rows doesn’t seem like an unreasonable amount of money. At ~1KB per tuple, 10 million live rows is only about 10GB of hot data, so you could probably do with less than those 256GiB; 128GiB and 2x 512GiB NVMe here are EUR 56.
Anecdotally, I worked at 2 different companies that used PG, and vacuuming legitimately never came up. One was a very read-heavy application, with very few writes and updates. The other had many more writes, but I legitimately didn’t know about vacuuming, and there were never any performance incidents related to it.
My current job is very write-heavy and at a higher scale than the other companies, and probably 15-20% of incidents have to do with vacuuming or a lack of it. So it certainly feels like you can be fine for years with certain workloads, but when vacuuming does enter the picture, it’s a pretty serious issue.
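If anyone ends up debugging this class of incident: since 9.6 there is a progress view for vacuums that are currently running, which at least tells you whether they are moving. A rough sketch:

    -- one row per VACUUM / autovacuum worker currently running
    SELECT pid,
           relid::regclass AS table_name,
           phase,
           heap_blks_scanned,
           heap_blks_total
      FROM pg_stat_progress_vacuum;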
I also haven’t run into it, after working at 4 companies using it. I think it’s only on update-heavy workloads that you’d typically run into it. It sounds like the kind of thing where, if you do run into it, it’s very difficult to work around and likely to strongly skew your perception of Postgres in the negative direction. Battle scars and all that. I can definitely relate: most of my scars are MySQL-induced and I don’t ever want to touch that db again with a 10-foot pole.
The only time I had a vacuum issue was when we decided to trim a large table that had been accumulating data for years; we decided there was no reason to keep around anything older than a month.
After issuing the DELETE statement that removed the billions of rows of old data in bulk, every query hitting that table slowed to a crawl: apparently Postgres was trying to gather some basic statistics for query planning (I don’t remember the details very well), but since almost all the rows were dead, it was taking forever scanning, trying to run into some live rows.
I tried triggering a VACUUM, but we decided to cancel it after 2 days or so, since it didn’t seem to be making much progress. The solution was to copy the remaining data to a new table and swap it with the old table, which took (if I remember correctly) a few tens of minutes.
After that, we were worried that Postgres’ periodic vacuum might not be able to catch up as the application deletes stale data periodically, but that was never an issue, so the problem was the bulk delete. We had some other production environments where we had to perform the same trimming, but having learnt our lesson, we performed the deletions over a number of batches with vacuums in between and it all went smoothly.
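For anyone who wants the batched version, it is roughly the pattern below (table and column names are made up; VACUUM cannot run inside a transaction block, so the pair is driven from a script and repeated until the DELETE affects 0 rows):

    -- delete one bounded slice of old rows, using the primary key
    DELETE FROM events
     WHERE id IN (SELECT id
                    FROM events
                   WHERE created_at < now() - interval '1 month'
                   ORDER BY id
                   LIMIT 50000);

    -- reclaim the dead tuples before the next round
    VACUUM events;

If the retention policy is time-based anyway, partitioning by month and dropping old partitions sidesteps the whole problem, since dropping a partition creates no dead tuples at all.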
I tried triggering a VACUUM, but we decided to cancel it after 2 days or so, since it didn’t seem to be making much progress.
Not relevant to your overall point, but if anybody else finds this: VACUUM can be blocked by certain locks held open by other transactions. I’ve definitely seen the situation where somebody was unintentionally blocking their own vacuum.
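In case it helps anyone hitting this: long-open transactions are worth checking first, since they can hold locks that conflict with VACUUM and, even when they don’t, they stop it from actually removing tuples. A quick sketch for finding them:

    -- sessions with the oldest open transactions
    SELECT pid,
           state,
           now() - xact_start AS xact_age,
           left(query, 60)    AS query
      FROM pg_stat_activity
     WHERE xact_start IS NOT NULL
     ORDER BY xact_start
     LIMIT 10;

    -- if the VACUUM itself is stuck waiting on a lock, this shows who it is waiting for
    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           wait_event_type,
           wait_event
      FROM pg_stat_activity
     WHERE wait_event_type = 'Lock';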
Counter-anecdotally: circa 2011 I was using PostgreSQL for an ecommerce SaaS solution, and some (many) of our customers had “bloat” in the terabytes due to vacuum being less effective than it could have been. Occasionally we would need to completely dump/restore databases because it was quicker than a forced full vacuum.
This was back in the days of PostgreSQL 7.x; every single version of PostgreSQL since then has improved vacuums hugely, and I strongly suspect that if we ran the same workloads today it wouldn’t be a problem. SSDs would help too, I suspect.
And yes, terabytes of data was very difficult to deal with in 2011, when the average max-capacity enterprise drive was 200GiB.
It’s really interesting how well this validates my current team’s append-only approach to postgres. Obviously not viable on all workloads, but man, if you can avoid overwriting historical data in favor of a functionally pure approach it sure does solve a lot of problems!
Ironically for in-memory data, FP approaches tend to have worse GC characteristics, but in the database they do dramatically better, largely because “never collect” can actually be a viable “garbage collection” strategy for the database.
Append-only is definitely good with respect to dead tuple accumulation, but of course watch out for data growth overall since that also can slow everything down.
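One footnote on the append-only pattern: before Postgres 13, insert-only tables were hardly autovacuumed at all, which tends to surface later as one giant anti-wraparound vacuum and as index-only scans that never see all-visible pages. On 13+ you can make inserts alone trigger autovacuum; the table name and numbers below are only illustrative:

    -- per-table tuning so autovacuum also runs on an insert-only table (Postgres 13+)
    ALTER TABLE events SET (
        autovacuum_vacuum_insert_threshold    = 100000,
        autovacuum_vacuum_insert_scale_factor = 0.05
    );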
I liked the technical detail on why PGSQL has its famous VACUUM problems. The post gives good insight into the design.
But I have to be a little snarky: I couldn’t help noticing that the further you read, the more ads for their own tuning solution appear.
Of course; after all, you sell the maintenance tools.
They used to. OtterTune closed last year.
Why did they close, btw?
I’m not sure, but I believe they only optimized configuration settings, which don’t change much over time. So it didn’t seem like a product that made sense to subscribe to business-wise.