If you know all this - great! I’ve worked with a few databases that haven’t and figured I’d share.
From context I assume the JSON advice here is only intended for JSON used inside relational databases? In that case the advice seems reasonable. But I’d recommend not using custom compression schemes like that for JSON output used as data storage, where just running it through a standard compression algorithm is typically as good, and much less error-prone. From some quick experimentation, it seems that gzip compresses a stream of random UUIDs to about 21 bytes each, roughly the same as the 22 bytes per UUID you get if you manually base64 encode the field.
Huh. Do you know if Postgres compresses JSONB columns before saving to disk?
We recently got to try InnoDB compression with some of our data, in Percona MySQL 5.6. Packed down about 3 to 1, and we’ve yet to see problems, but it’s the early days for us so we’ll see(/don’t take it as a promise that it’s great). I recall we had more trouble with 5.1 compression, but I’m not the primary DBA and have forgotten any details I once knew. We haven’t tried other algorithms yet (see https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/); lz4 or zstd may help us if CPU use becomes an issue.
We’re also kind of interested in partitioning; right now, certain queries wind up bringing lots of historical data into RAM when we only care about, say, particular users' recent activity. Having a small “recent” partition could help with that it (relatively transparently from the app’s perspective even), but we would love to know more about people’s experiences with the partitioning and its effectiveness/quirks/gotchas before diving in.
Edit: On Kevin’s theme of dumb tricks to shrink data, one we use is memoizing/interning/normalizing often-repeated values. Like, instead of storing a User-Agent or Referer raw for each hit, maintain another table with each unique agent/referrer and store a user_agent_id. Of course that carries its own issues, e.g. values used only once are more expensive to store, and getting rid of values only used in old data is kind of tricky. At our size in our use case it’s been handy; probably not equally so at smaller or larger scales.
At one job we used partitioning to good effect. It worked quite nicely with our date bound data (metrics). We had data that we wanted to “age out” periodically, and we partitioned on timeboxes. This not only made queries for recent data faster (query pruning would often result in hitting only a single partition, smaller indexes, etc), but we could also quickly drop the aged out data just by dropping the oldest partition (and creating a new one for the next timebox). We also had another vary large dataset, that we partitioned on some arbitrary key, that one was mostly to keep index size down a bit.
As far as caveats, if you end up querying by something other than your non partitioning keys, you can make things MUCH slower, as you now have to load data from multiple partitions, which can significantly increase IO, and likely hoses caches. So I would recommend you use it very judiciously.
We also used compression, and it seemed to work fairly well. We had some super fast ssd raid boxes, which after time ended up being a bit space constrained. We flipped on compression on one of our slaves and saw good results, then rolled it out to other nodes. The data we were compressing was mostly json blob data, so it compressed quite well. Some queries also ended up being faster, because with compression we were doing less IO, and we had quite a bit of spare cpu on those boxes.
Alas, I don’t recall what we had set for KEY_BLOCK_SIZE or any of the other tunables.
I know binary columns save space, but I have seen/experienced some significant pain when doing it with mysql – mostly due to the padding rules, dealing with truncated data and accidental collisions, newer devs on the team not understanding it resulting in people trying to use hex/unhex (and getting null results sometimes), incoming data that was just…garbage (mobile device identifiers at the time), combined with codebases and dev-teams that changed over time. It made dealing with the binary fields a huge mess, and made adhoc queries / debugging significantly harder.
If you need it, then of course use it, but I recommend putting it off until you really need it, and probably only then for “internal” data (not user supplied) and nothing you might ever need to manually introspect or troubleshoot.