1. 8
  1.  

  2. 4

    A single table where both output sizes and compression times differ is not a valid way to compare compressors, except in cases that show a Pareto improvement (i.e. both smaller and faster at the same time).

    For most algorithms the compression ratio is variable, so they can be as fast or as slow as you tell them to be, which makes it too easy to create inconclusive or outright misleading results. The -1..9 flags are not equivalent across tools, so even running them all with the same command-line flag is not fair.

    When you have two variables (speed and compression) and you don’t equalize one of them, the results are not comparable. Valid comparisons are “smallest file given the same time” and “fastest to compress to a specified size”. But if you compare two cases and one gave a larger file while the other finished faster, it’s inconclusive.

    The proper way to compare compression algorithms is to graph the curves formed by the (time to compress, output size) points of all their speed settings. Such a graph can then be used to answer questions like “of the slow compressors, which one gives the smallest file”, “is compressor A better than compressor B across all of their speed settings”, or “which one is the fastest to compress better than a given ratio”.
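    As a minimal sketch of collecting those (time to compress, output size) points, here with gzip levels standing in for different compressors/settings (file names and the sample data are placeholders):

    ```shell
    set -e
    sample=/tmp/bench-sample
    seq 1 200000 > "$sample"          # ~1.4 MB of compressible text
    for level in 1 6 9; do
        start=$(date +%s%N)
        gzip -c -"$level" "$sample" > "$sample.$level.gz"
        end=$(date +%s%N)
        # one (time, size) point per speed setting, ready for plotting
        printf 'level=%s time_ms=%s size=%s\n' \
            "$level" "$(( (end - start) / 1000000 ))" "$(wc -c < "$sample.$level.gz")"
    done
    ```

    Plot one such curve per compressor and the Pareto frontier falls out visually.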

    BTW: this is orthogonal to choosing what data to compress in the benchmark. Picking relevant and representative data is a whole other issue.

    1. 2

      Fairly recently I did some benchmarks with xz and zstd for an embedded disk image on Ubuntu 18.04 (meaning zstd isn’t the most recent version available, though it’s not horribly outdated). The winner was “still” xz because it compressed more, slightly faster, and the bottleneck was the transfer of the compressed file. I know that Arch Linux has switched to zstd, but xz decompression (especially at high compression levels, since those make decompression faster, as mentioned in another comment of mine) is probably able to saturate storage I/O.

      There’s something that zstd has and xz doesn’t, however: lots and lots of steam. The most interesting recent development in zstd is the ability to train it on one file and then compress another file. This makes it possible to generate binary diffs which are almost as good as bsdiff’s but much more economical to generate and use.
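      If I understand the feature right, this is zstd’s --patch-from mode. A rough sketch, assuming a reasonably recent zstd (the file names are made up, and the script skips itself if zstd isn’t installed):

      ```shell
      set -e
      command -v zstd >/dev/null || { echo "zstd not installed, skipping"; exit 0; }
      old=/tmp/v1.bin
      new=/tmp/v2.bin
      head -c 1048576 /dev/urandom > "$old"
      { cat "$old"; printf 'one small change\n'; } > "$new"
      # compress "new" using "old" as the reference: the output is effectively a diff
      zstd -q -f --patch-from="$old" "$new" -o /tmp/v1-to-v2.zst
      # applying the patch needs the same reference file on the other side
      zstd -q -f -d --patch-from="$old" /tmp/v1-to-v2.zst -o /tmp/v2.restored
      cmp "$new" /tmp/v2.restored && echo "round-trip OK"
      ```

      Since the two files share almost everything, the patch is typically far smaller than a standalone compression of the new file.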

      edit: one configuration of xz that is often very interesting is a low compression preset with a regular or even large dictionary: --lzma2=preset=0,dict=64M is often surprisingly fast and efficient.

      1. 2

        Here are the comparisons between xz and zstd that were done for the Arch Linux migration. They might be worth a look if you are curious how different settings affect the packages.

        https://lists.archlinux.org/pipermail/arch-dev-public/2019-March/029520.html

        https://lists.archlinux.org/pipermail/arch-dev-public/2019-March/029542.html

        1. 2

          If we are benchmarking unnecessarily slow compression algorithms, why no zopfli?

          1. 2

            gzip -9 is rarely (perhaps even never) a good setting. In my experience it provides almost no increase in compression ratio, even for large files, while being much slower. Plain gzip on this data would probably come very close to 141M as well while being much faster.

            Fun story: years ago someone put gzip -9 in an hourly backup script, and at some point we ran into problems because it took more than an hour to compress the data. It was fixed by just removing the -9, which reduced compression time to 10 minutes or so. The files were maybe 500k larger, or some similarly comically small number.

            I did some tests with zstd to get the best compression/performance trade-off for compressing SQL backups and settled on zstd -11; beyond that, compression time increases quite a bit with diminishing gains in compression ratio. I can’t really find the script/results from that now (probably rm’d it?) but it shouldn’t be too hard to reproduce.
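            Reproducing that sweep would look roughly like this (a sketch: the “dump” is synthetic, the levels are arbitrary picks, and it skips itself if zstd isn’t installed):

            ```shell
            set -e
            command -v zstd >/dev/null || { echo "zstd not installed, skipping"; exit 0; }
            f=/tmp/backup-sample.sql
            # stand-in for a real SQL dump: ~2 MB of repetitive statements
            yes "INSERT INTO t VALUES (1, 'abc');" | head -c 2097152 > "$f"
            for lvl in 3 7 11 15 19; do
                t0=$(date +%s%N)
                zstd -q -f -"$lvl" "$f" -o "$f.$lvl.zst"
                t1=$(date +%s%N)
                printf 'zstd -%s: %s ms, %s bytes\n' \
                    "$lvl" "$(( (t1 - t0) / 1000000 ))" "$(wc -c < "$f.$lvl.zst")"
            done
            ```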

            1. 1

              Debian’s lintian requires gzip -9 (or gzip --best) for gzipped files in binary deb packages. It’s motivated by the goal of reproducible builds.
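              For what it’s worth, the level alone doesn’t make gzip output byte-for-byte reproducible; GNU gzip also embeds the input’s mtime unless you pass -n. A small sketch:

              ```shell
              set -e
              printf 'reproducible?\n' > /tmp/repro.txt
              gzip -9nc /tmp/repro.txt > /tmp/repro.1.gz
              touch /tmp/repro.txt                 # change the input's mtime
              gzip -9nc /tmp/repro.txt > /tmp/repro.2.gz
              # with -n the mtime field is zeroed, so the outputs match exactly
              cmp /tmp/repro.1.gz /tmp/repro.2.gz && echo "identical with -n"
              ```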

              1. 3

                Can’t you get reproducible builds with gzip -6?

                1. 1

                  I don’t know. I just know that lintian will complain about any compression level other than -9. I also don’t think that Debian packages are large enough, or built frequently enough, for it to be a problem.

                2. 2

                  Sounds more like reproducible packages, than reproducible builds?

                  1. 2

                    Depends on what you’re building. As a developer, the thing you’re probably building most often is executable binaries. From Debian’s perspective, however, what they’re building is packages: they take a source package and produce a binary package. It’s important that this is reproducible in full (not just the executable binaries within it) for two reasons:

                    1. Attack vectors aren’t limited to code included in binaries.
                    2. It’s easier to check that the whole package builds the same than to track which parts of packages are supposed to be reproducible, and in the end probably miss parts, which may lead to an exploit (and a consequent loss of trust) anyway.
              2. 1

                Don’t see an xz -e example to use extreme compression, or a comparison with xz -eT0 to use all cores; that’s how I use xz. With threads, xz compresses faster with -e than gzip does without (unless you use, say, pigz) and produces vastly smaller output.

                1. 1

                  xz’s -e usually doesn’t gain you much, however, but makes compression much, much slower. That being said, lzma has this interesting property that decompression speed is expressed in compressed bytes per second, which means that improving compression also improves decompression speed.

                  By the way, even with that, I find that xz -e is not all that useful in practice: large dictionaries tend to have a larger effect, and xz makes them easy to use. Its command line supports the following: xz -9vv --lzma2=preset=6,dict=1024M (only the dictionary size changes between levels 6, 7, 8 and 9), but you need 10*$dict of RAM (lzma absolutely hates swap), and actually 10*$dict*$n_threads if you use threads (decompression requirements are only $dict, no matter the other settings). Also worth keeping in mind: it’s not useful to have a dictionary larger than the amount of data to compress (and with threads, the limit would be the amount of data to compress divided by the number of threads).

                  1. 1

                    xz’s -e usually doesn’t gain you much, however, but makes compression much, much slower.

                    Not in my experience. Yeah its slower but…. who cares? Example on a roughly 20GiB file of mixed compressibility, the majority of the end being super compressible (lots of zeros), actually this is an actual data file that will get transferred a lot after build so saving any size at the expense of a bit of cpu time up front only makes sense to me.

                    $ for x in 3 6 9 e; do xz -${x} -vkT0 -c input.data > input.data.${x}.xz; done
                    input.data (1/1)
                      100 %      1,800.2 MiB / 19.5 GiB = 0.090   132 MiB/s       2:31
                    input.data (1/1)
                      100 %      1,650.2 MiB / 19.5 GiB = 0.083    74 MiB/s       4:29
                    input.data (1/1)
                      100 %      1,532.0 MiB / 19.5 GiB = 0.077    44 MiB/s       7:31
                    input.data (1/1)
                      100 %      1,648.6 MiB / 19.5 GiB = 0.082    54 MiB/s       6:10
                    

                    The fun part here is that -e is closer to -6 than to -9 on this random bit of data I snagged, though normally -e wins out; for this specific instance I now kinda want to change our CI to do 6/9/e and pick the smallest of the three. -e is also faster than -9 on this input data, so I’m not sure I buy your “much much slower” argument without some sort of examples.

                    And to be fair-ish, comparing with pigz -9 because I can’t be bothered with single-threaded compression of a 20GiB file:

                    $ time pigz -9v -k input.data
                    input.data to input.data.gz
                    
                    real	1m43.304s
                    user	23m44.951s
                    sys	0m15.329s
                    $ du -hm input.data.gz
                    2289	input.data.gz
                    

                    A few minutes more CPU time (less on actual non-laptop hardware) for a 600-900MiB reduction seems well worth it to me. Your needs may vary, but every time I’ve looked at the “cost” in time of compressing with xz at higher levels, it has paid off in either network transfer or space savings later.

                    1. 1

                      Yup, definitely agreed that xz often provides wins in I/O or space overall. When I do the math and compare to gzip or even zstd (let alone bzip2), xz is often the best choice.

                      Have you tried playing with sparse files for your dataset? I’ve been using tar --sparse for quite some time, and I’ve recently combined it with fallocate --dig-holes to automatically turn zeroes into sparsity. xz is not fast on sparse data, unfortunately, but since tar preserves sparsity, xz doesn’t get to see the zeroes at all. It almost always gives me nice speed boosts.