For reference gcc + glibc (a somewhat complete platform for building web and mobile apps and games and kernels and databases in pure C/C++) is 140MB.
I’m glad somebody said this so I didn’t have to. :) If the goal is to save bits, switching compression provides only modest, incremental savings.
gcc + glibc is very incomplete. Does it include any kind of HTTP support? Any kind of RPC? Any kind of serialization? Any GUI support, even just the API? Compression format support? Config format support? Heck, any kind of parsing library? Support for any kind of structured datastore (beyond the filesystem)?
600MB does sound like too much, but the comparison is off; it’s likely doing a lot more than gcc + glibc.
From a quick look at the tarball, it doesn’t seem to include the runtime (Node.js), so it’s incomplete in the other way.
I’m not caremad about people using gzip for things.
But! If you want to see the fruits of some of the work done on compression over time, the Squash benchmark is super cool. This guy’s analysis of it approximately agrees with mine.
(The initial file type and machine for the benchmark results are randomly picked, so it might load showing compression results for Protocol Buffers on a Raspberry Pi. You might want to choose options more like your use case.)
If there’s anything I’m oddly obsessed with, it’s where we might be able to use compression but don’t, or could make cheap compression better. OS X, Android, and Chrome OS compress RAM before swapping to disk. Are there advances to be made there, in terms of hardware assists, better algorithms, or clever tricks to decide what to compress (or to choose among algorithms, e.g. tighter packing for data likely to stay packed longer)? Samsung apparently put a simple hardware memory compressor in their Exynos chips, and Intel’s QuickAssist gzip acceleration and the planned compression coprocessor in the AMD ARM A1100 server chip (delayed but still kicking) are other (different) examples. And sometimes transparent compression while writing to disk would be a net speed win (i.e., when we’re I/O bound), but we don’t do it.
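The "tighter packing for longer-lived data" trade-off is easy to see with stock gzip's levels; here's a rough sketch (the input is synthetic and the exact numbers depend entirely on your data):

```shell
# Structured, highly compressible sample input (stand-in for real data).
seq 1 500000 > /tmp/sample.txt

# Fast, loose packing vs. slow, tight packing of the same input.
gzip -1 -c /tmp/sample.txt > /tmp/sample.fast.gz
gzip -9 -c /tmp/sample.txt > /tmp/sample.tight.gz

# Compare the resulting sizes.
ls -l /tmp/sample.fast.gz /tmp/sample.tight.gz
```

The same idea scales up to choosing between whole algorithms: spend more CPU up front only on data you expect to stay compressed for a long time.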
I think folks are requesting the .br extension for Brotli from IANA, and Chrome and Firefox support it (or will) as an HTTP Content-Encoding. It’s part of the WOFF2 font standard already. There are also zlib patches improving speed but maintaining backcompat. Zstd, though not finalized yet, is interesting. I would love all those to get handy command-line utilities so they’re easy to deploy when, you know, I have a dozens-of-GB backup to archive at work.
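One nice property of pipeline-style backups is that the compressor is pluggable, so those utilities could drop straight in once they exist. A sketch, using gzip today (the alternative commands in the comment are assumptions about future CLIs, not tools I've verified):

```shell
# Pluggable compressor: swap COMPRESS for e.g. "zstd -19" or "brotli"
# once those command-line tools are deployed on your systems.
COMPRESS="gzip -6"

# Make a small directory to archive, then tar + compress it.
mkdir -p /tmp/bk_src && echo "hello backup" > /tmp/bk_src/file.txt
tar -C /tmp -cf - bk_src | $COMPRESS > /tmp/backup.tar.gz

# Round-trip sanity check: list the archive's contents.
gunzip -c /tmp/backup.tar.gz | tar -tf -
```

Because only the `COMPRESS` variable changes, the rest of the backup tooling doesn't care which codec is behind it.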
Came here to mention Squash, too. You might also be interested in my heatshrink project, which is a C library for doing data compression and decompression in severely memory constrained and/or hard real-time systems. (It’s LZSS-based.)
I decided to do some tests on my backups. I generally dump a postgres instance with the following command:
time pg_dump -U postgres koparo_head | gzip > /home/database/koparo_head_$(date +%Y-%m-%d_%H-%M-%S).gz
Here are the timings with gzip, xz and gzip -9:
Doing a dump and compressing it with gzip:
Dump performed and compressed with xz:
gzip -9 the db dump:
-rw-r--r-- 1 database database 643M Dec 12 19:49 koparo_head_2015-12-12_19-48-00.gz
-rw-r--r-- 1 database database 477M Dec 12 20:07 koparo_head_2015-12-12_19-50-00.xz
-rw-r--r-- 1 database database 641M Dec 12 20:11 koparo_head_2015-12-12_20-09-40.gz9
In my use case, gzip seems like the sanest approach. I doubt waiting 17m57s for a DB backup is viable :)
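For anyone wanting to repeat this kind of comparison, a small harness like the following works; the dump file here is a synthetic stand-in for real pg_dump output (prefix each compressor with `time` interactively to see the CPU cost alongside the size):

```shell
# Synthetic stand-in for a real pg_dump output file.
seq 1 200000 > /tmp/dump.sql

# Run whatever compressors are installed against the same input.
for c in gzip xz; do
    command -v "$c" >/dev/null || continue   # skip tools that aren't installed
    "$c" -c /tmp/dump.sql > "/tmp/dump.sql.$c"
    printf '%s: %s bytes\n' "$c" "$(wc -c < /tmp/dump.sql.$c)"
done
```

Running each tool on the identical input is what makes the size (and timing) numbers comparable.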
I happen to have an actual potential use case I tried throwing this at. I’m regularly taking snapshots of a bit of server filesystem with tar (it happens to be a Minecraft server, but I doubt that massively skews the compression performance characteristics of the data). I have snapshots easily to hand, so I grabbed one and checked it. These are all done with a warmed cache and are profoundly unscientific, but here we go:
$ time cat snapshot.tar | cat | wc -c
$ time cat snapshot.tar | gzip | wc -c
$ time cat snapshot.tar | bzip2 | wc -c
$ time cat snapshot.tar | xz | wc -c
So using gzip saves about 40% on my original most-of-2GB. Not bad, especially in only a minute and a half. bzip2 saves an additional half a percent, at the cost of another five minutes of processing; and xz saves 0.8% over gzip, at the cost of almost thirteen minutes of additional processing. There’s no way that’s worth it, especially since the server needs to be doing things other than compressing snapshots while this is going on. So I guess I’ll keep using gzip.
How many cores do you have available and did you try xz with -T 0? For apples-to-apples, you could try pigz, as well.
I regularly run personal backups and disk images through xz, they both end up about half the size of gzip. Also, having threaded compression built into the primary executable (-T 0) is nice compared to having to download a separate pigz package.
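When neither pigz nor a multithreaded xz is handy, a crude way to use several cores is to compress independent files in parallel with xargs -P (the file layout below is made up for illustration):

```shell
# Four independent chunks to compress; real backups might already be split.
mkdir -p /tmp/par
for i in 1 2 3 4; do seq 1 50000 > "/tmp/par/part$i"; done

# Run up to four gzip processes at once, one file each. Unlike pigz or
# xz -T0, this only helps when you have several files, not one big stream.
find /tmp/par -name 'part*' ! -name '*.gz' -print0 | xargs -0 -P 4 -I{} gzip -f {}

ls /tmp/par
```

The obvious limitation is granularity: one multi-GB file still gets a single core, which is exactly the case pigz and xz -T0 were built for.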
I guess it’s less of an issue nowadays, but what is the CPU overhead for xz vs. gzip, compared to the compression ratio?
From the numbers in the post, roughly 2x compression for 5x CPU.