1. 38
  1.  

  2. 10

    This page is really painful to read: it’s quite aggressive towards the author of xz. The tone is really needlessly nasty. There are only points against xz/lzma2, nothing in favor; it’s just criticism whose conclusion is “use lzip [my software]”.

    Numbers are presented in whichever way makes them look bigger: a “0.015% (i.e. nothing) to 3%” efficiency difference is turned into “max compression ratio can only be 6875:1 rather than 7089:1”, but that’s over 1 TB of zeroes and only 3% relative to the compressed data, which amounts to roughly a 4*10^-6 difference on the uncompressed data! (And if you’re compressing that kind of thing, you might want to look at lrzip.)
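
    (A quick sanity check of those numbers, as purely illustrative arithmetic assuming the article’s 1 TB-of-zeroes case:)

        # Purely illustrative arithmetic for the ratios quoted above,
        # assuming 1 TB (10^12 bytes) of zeroes as in the article.
        uncompressed = 10**12
        lzma_out = uncompressed / 7089    # ~141 MB at a 7089:1 ratio
        lzma2_out = uncompressed / 6875   # ~145 MB at a 6875:1 ratio
        extra = lzma2_out - lzma_out      # ~4.4 MB of additional compressed output
        print(extra / lzma_out)           # ~0.03, the "3%" relative to the compressed data
        print(extra / uncompressed)       # ~4.4e-6, negligible relative to the original data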

    The author fails to understand that xz’s success has several causes besides compression ratio and the file format. It’s a huge improvement over gzip and bzip2 for packages. The documentation is really good and helps you get better results with both compression ratio and speed (see “man xz”). It is ported pretty much everywhere (that includes OS/2 and VMS iirc). It is stable. And so on.

    As a side-note, this is the only place where I’ve seen compression formats being used for archiving and expecting handling of potential corruption. Compression goes against archiving. If you’re doing archiving, you’ll be using something that provides redundancy. But redundancy is what you eliminate when you compress. What is used for archiving of audio and video? Simple formats with low compression at best. The thing with lzip is that while its file format might be better suited for archiving, lzip itself as a whole still isn’t suited for archiving. And that’s ok.

    Now, I just wish the author would get less angry. That’s one of the ways to a better life. Going from project to project and telling them they really should abandon xz in favor of lzip for their source code releases is only proof of frustration and a painful life.

    1. 6

      The author fails to understand that xz’s success has several causes besides compression ratio and the file format.

      But the author doesn’t even talk about that? All he has to say about adoption is that it happened without any analysis of the format.

      Compression goes against archiving. If you’re doing archiving, you’ll be using something that provides redundancy. But redundancy is what you eliminate when you compress.

      This sounds like “you can’t be team archiving if you are team compression, they have opposite redundancy stat”. It’s not an argument, or at least not a sensical one. Compression makes individual copies more fragile; at the same time, compression helps you store more individual copies of the same data in the same space. So is compression better or worse for archiving? Sorry, I’m asking a silly question. The kind of question I should be asking is along the lines of “what is the total redundancy in the archiving system?” and “which piece of data in the archiving system is the weakest link in terms of redundancy?”

      Which, coincidentally, is exactly the sort of question under which this article is examining the xz format…

      What is used for archiving of audio and video? Simple formats with low compression at best.

      That’s a red herring. A/V archiving achieves only low compression because it eschews lossy compression and the data typically doesn’t lend itself well to lossless compression. Nevertheless it absolutely does use lossless compression (e.g. FLAC is typically ~50% smaller than WAV because of that). This is just more “team compression vs team archiving”-type reasoning.

      The thing with lzip is that while its file format might be better suited for archiving, lzip itself as a whole still isn’t suited for archiving.

      Can you actually explain why, rather than just asserting so? If lzip has deficiencies in areas xz does well in, could you step up and criticise what would have to improve to make it a contender? As it is, you seem to just be dismissing this criticism of the xz format – which as a universal stance would result in neither xz nor lzip improving on any of their flaws (in whatever areas those flaws may be in).

      As a side-note, this is the only place where I’ve seen compression formats being used for archiving and expecting handling of potential corruption.

      Juxtaposing this with your “author fails to understand” statement is interesting. Should I then say that you fail to understand what the author is even talking about?

      This page is really painful to read: it’s quite aggressive towards the author of xz.

      I saw only a single mention of a specific author. All the substantive statements are about the format, and all of the judgements given are justified by statements of fact. The very end of the conclusion speaks about inexperience in both authors and adopters, and it’s certainly correct about me as an adopter of xz.

      There are only points against xz/lzma2, nothing in favor; it’s just criticism whose conclusion is “use lzip [my software]”.

      Yes. The authors of xz are barely mentioned. They are certainly not decried nor vilified, if anything they are excused. It’s just criticism. That’s all it is. Why should that be objectionable? I’ve been using xz; I’m potentially affected by the flaws in its design, which I was not aware of, and wouldn’t have thought to investigate – I’m one of the unthinking adopters the author of the page mentions. So I’m glad he took the time to write up his criticism.

      Is valid criticism only permissible if one goes out of one’s way to find something proportionately positive to pad the criticism with, in order to make it “fair and balanced”?

      Frankly, as the recipient of such cushioned criticism I would feel patronised. Insulting me is one thing and telling me I screwed up is another. I can tell them apart just fine, so if you just leave the insults at home, there’s no need to compliment me for unrelated things in order to tell me what I screwed up – and I sure as heck want to know.

      1. 2

        The author fails to understand that xz’s success has several causes besides compression ratio and the file format.

        But the author doesn’t even talk about that? All he has to say about adoption is that it happened without any analysis of the format.

        Indeed, this is more a comment about what appears to be bitterness from the author. This isn’t part of the linked page (although the tone of the article is probably a consequence of it).

        Compression goes against archiving. If you’re doing archiving, you’ll be using something that provides redundancy. But redundancy is what you eliminate when you compress.

        This sounds like “you can’t be team archiving if you are team compression, they have opposite redundancy stat”. It’s not an argument, or at least not a sensical one. Compression makes individual copies more fragile; at the same time, compression helps you store more individual copies of the same data in the same space. So is compression better or worse for archiving? Sorry, I’m asking a silly question. The kind of question I should be asking is along the lines of “what is the total redundancy in the archiving system?” and “which piece of data in the archiving system is the weakest link in terms of redundancy?”

        Agreed. I’m mostly copying the argument from the lzip author. That being said, one issue with compression is that corruption of compressed data is amplified, with no chance of reconstructing the data, even by hand. Intuitively I would expect the best approach for archiving to be compression followed by adding “better” (i.e. more evenly distributed) redundancy and error recovery (within the storage budget). Now, if your data has some specific properties, the best approach might be different, especially if you care more about some parts (for instance, with a progressive image you might value the coarse data more, because losing the finer data only costs you image resolution).
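
        (To make that intuition concrete, here is a minimal, hypothetical sketch in Python: compress first, then add evenly spread redundancy over the fragile compressed stream. A real setup would use a proper error-correcting code such as Reed-Solomon or par2 recovery files; the toy XOR parity below can only rebuild a single lost block.)

            import lzma

            BLOCK = 4096  # size of the parity/recovery blocks

            def xor_blocks(a: bytes, b: bytes) -> bytes:
                return bytes(x ^ y for x, y in zip(a, b))

            def compress_with_parity(data: bytes):
                compressed = lzma.compress(data)          # the fragile, redundancy-free part
                padded = compressed + b"\0" * (-len(compressed) % BLOCK)
                blocks = [padded[i:i + BLOCK] for i in range(0, len(padded), BLOCK)]
                parity = b"\0" * BLOCK
                for blk in blocks:
                    parity = xor_blocks(parity, blk)      # RAID-5 style parity over all blocks
                return blocks, parity, len(compressed)

            def recover_block(blocks, parity, lost_index):
                # XOR of the parity with every surviving block gives back the lost block.
                out = parity
                for i, blk in enumerate(blocks):
                    if i != lost_index:
                        out = xor_blocks(out, blk)
                return out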

        Which, coincidentally, is exactly the sort of question under which this article is examining the xz format…

        What is used for archiving of audio and video? Simple formats with low compression at best.

        That’s a red herring. A/V archiving achieves only low compression because it eschews lossy compression and the data typically doesn’t lend itself well to lossless compression. Nevertheless it absolutely does use lossless compression (e.g. FLAC is typically ~50% smaller than WAV because of that). This is just more “team compression vs team archiving”-type reasoning.

        If you look at what archivists themselves publish, FLAC isn’t one of the preferred formats. It is acceptable, but the preferred one still seems to be WAV/PCM.

        Sources:

        The thing with lzip is that while its file format might be better suited for archiving, lzip itself as a whole still isn’t suited for archiving.

        Can you actually explain why, rather than just asserting so? If lzip has deficiencies in areas xz does well in, could you step up and criticise what would have to improve to make it a contender? As it is, you seem to just be dismissing this criticism of the xz format – which as a universal stance would result in neither xz nor lzip improving on any of their flaws (in whatever areas those flaws may be in).

        I had intended the leading sentences to explain that. The reasoning is simply that compression by itself is mostly at odds with long-term preservation. As discussed above, proper redundancy and error recovery can probably turn that into a good match, but then the qualities of the compression format itself don’t matter that much, since the “protection” is done at another layer that is dedicated to it and also provides recovery.

        As a side-note, this is the only place where I’ve seen compression formats being used for archiving and expecting handling of potential corruption.

        Juxtaposing this with your “author fails to understand” statement is interesting. Should I then say that you fail to understand what the author is even talking about?

        You’re obviously free to do so if you wish to. :)

        This page is really painful to read: it’s quite aggressive towards the author of xz.

        I saw only a single mention of a specific author. All the substantive statements are about the format, and all of the judgements given are justified by statements of fact. The very end of the conclusion speaks about inexperience in both authors and adopters, and it’s certainly correct about me as an adopter of xz.

        Being full of facts doesn’t make the article objective. It’s easy to leave some things unmentioned, and while the main author of xz/liblzma could technically answer, he doesn’t really wish to do so (especially since it would cause a very high mental load). That being said, I’ll take the liberty of quoting from IRC, where I basically only lurk nowadays (nicks replaced by “Alice” and “Bob”). This is a recent discussion; there were more detailed ones earlier, and I’m not relying only on the most recent one.

        Bob : Alice the lzip html pages says that lzip compresses a bit better than xz. Can you tell me the technical differences that would explain that difference in size ?

        Bob : Alice do you have ideas on how improving the size with xz ?

        Alice : Bob: I think it used to be the opposite at least with some files since .lz doesn’t support changing certain settings. E.g. plain text (like source code tarballs) are slightly better with xz --lzma2=pb=0 than with plain xz. It’s not a big difference though.

        Alice : Bob: Technically .lz has LZMA and .xz has LZMA2. LZMA2 is just LZMA with chunking which adds a slight amount of overhead in a typical situation while being a bit better with incompressible data.

        Alice : Bob: With tiny files .xz headers are a little bloatier than .lz.

        Alice : Bob: In practice, unless one cares about differences of a few bytes in either direction, the compression ratios are the same as long as the encoders are comparable (I don’t know if they are nowadays).

        Alice : Bob: With xz there are extra filters for some files types, mostly executables. E.g. x86 executables become about 5 % smaller with the x86 BCJ filter. One can apply it to binary tarballs too but for certain known reasons it sometimes can make things worse in such cases. It could be fixed with a more intelligent filtering method.

        Alice : Bob: There are ideas about other filters but me getting those done in the next 2-3 years seem really low.

        Alice : So one has to compare what exist now, of course.

        Bob : Alice btw, fyi, i have tried one of the exemples where the lzip guy says that xz throws an error while it shouldn’t

        Bob : but it is working fine, actually

        Alice : Heh
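
        (To make the pb=0 and BCJ remarks above concrete, here is a hedged sketch using Python’s lzma bindings to liblzma; the file name is hypothetical and the size differences will of course depend on the input.)

            import lzma

            with open("example.tar", "rb") as f:      # hypothetical input
                data = f.read()

            default_xz = lzma.compress(data, preset=9)

            # roughly "xz --lzma2=pb=0": plain text often does slightly better with pb=0
            text_tuned = lzma.compress(
                data, filters=[{"id": lzma.FILTER_LZMA2, "preset": 9, "pb": 0}])

            # x86 BCJ filter in front of LZMA2, as used for executables
            bcj_chain = lzma.compress(
                data, filters=[{"id": lzma.FILTER_X86},
                               {"id": lzma.FILTER_LZMA2, "preset": 9}])

            print(len(default_xz), len(text_tuned), len(bcj_chain))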

        Two main points here: the chunking, together with the view that the differences are very small; and the fact that one of the complaints seems to be wrong.

        If I look for “chunk” in the article, the only thing that comes up is the following:

        But LZMA2 is a container format that divides LZMA data into chunks in an unsafe way. In practice, for compressible data, LZMA2 is just LZMA with 0.015%-3% more overhead. The maximum compression ratio of LZMA is about 7089:1, but LZMA2 is limited to 6875:1 approximately (measured with 1 TB of data).

        Indeed, the sentence “In practice, for compressible data, LZMA2 is just LZMA with 0.015%-3% more overhead.” is probably absolutely true. But there is no mention of what happens for incompressible data. I can’t tell whether that omission was deliberate or not, but it makes this paragraph quite misleading.
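
        (One hedged way to check the incompressible-data case yourself, assuming Python’s lzma module: FORMAT_ALONE is a plain LZMA stream while FORMAT_XZ uses LZMA2, which can store chunks it cannot compress in unencoded form.)

            import lzma, os

            data = os.urandom(1 << 20)   # 1 MiB of essentially incompressible data

            plain_lzma = lzma.compress(data, format=lzma.FORMAT_ALONE)  # LZMA (.lzma)
            lzma2_xz = lzma.compress(data, format=lzma.FORMAT_XZ)       # LZMA2 inside .xz

            # Expectation (to be verified on your machine): the .xz output stays very
            # close to the input size thanks to LZMA2's raw chunks, while the plain
            # LZMA stream grows somewhat more.
            print(len(data), len(plain_lzma), len(lzma2_xz))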

        Note that xz/liblzma’s author acknowledges some of the points of lzip’s author, but not the majority of them.

        There are only points against xz/lzma2, nothing in favor; it’s just criticism whose conclusion is “use lzip [my software]”.

        Yes. The authors of xz are barely mentioned. They are certainly not decried nor vilified, if anything they are excused. It’s just criticism. That’s all it is. Why should that be objectionable? I’ve been using xz; I’m potentially affected by the flaws in its design, which I was not aware of, and wouldn’t have thought to investigate – I’m one of the unthinking adopters the author of the page mentions. So I’m glad he took the time to write up his criticism.

        Is valid criticism only permissible if one goes out of one’s way to find something proportionately positive to pad the criticism with, in order to make it “fair and balanced”?

        I concur that writing criticism is a good thing, but the article is not really objective and probably doesn’t try to be. In an ideal world there would be a page with rebuttals from other people. In the real world, that would probably start a flamewar, and the xz/liblzma author does not wish to get involved in that.

        I’ve just looked up the author’s name + lzip and the first result is: https://gcc.gnu.org/ml/gcc/2017-06/msg00044.html “Re: Steering committee, please, consider using lzip instead of xz”.

        Another scary element is that neither “man lzip” nor “info lzip” mentions “xz”. They mention gzip and bzip2 but not xz (“Lzip is better than gzip and bzip2 from a data recovery perspective.”). Considering the length of this article, not seeing a single mention of xz makes me think the lzip author does not have a peaceful relationship with xz.

        You might think that the preference for lzip in https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html would be a good indication, but the author of that manual is also lzip’s author!

        And now, scrolling down my search results, I see https://lists.debian.org/debian-devel/2015/07/msg00634.html “Re: Adding support for LZIP to dpkg, using that instead of xz, archive wide”, and the messages there again make me think he doesn’t have a peaceful relationship with xz.

        I don’t like criticizing authors, but with this one-sided article with surprising omissions and incorrect elements (no idea whether that’s because things changed at some point), I think more context (and an author’s personality and history are context) helps decide how much to trust the whole article.

        Frankly, as the recipient of such cushioned criticism I would feel patronised. Insulting me is one thing and telling me I screwed up is another. I can tell them apart just fine, so if you just leave the insults at home, there’s no need to compliment me for unrelated things in order to tell me what I screwed up – and I sure as heck want to know.

        Yes, it’s cushioned because, as I said above, I don’t like criticizing authors; I’m uncomfortable doing it and try to avoid it, but sometimes that can’t be separated from the topic or article, so I still ended up doing it at least a bit (you can see that I did it as little as possible in my previous message). That being said, I don’t think the author needs to be told all of this, or at least I don’t want to start such a discussion with an author who seems able to keep it going for years (and tbh, I’m not sure that’s healthy for him).

        edit: fixed formatting of the IRC quote

      2. 3

        As a side-note, this is the only place where I’ve seen compression formats being used for archiving and expecting handling of potential corruption. Compression goes against archiving. If you’re doing archiving, you’ll be using something that provides redundancy.

        This is not true at all. [Edit: Most of the widely used professional backup and recovery software that was specifically designed for long-term archiving also included compression as an integral part of the package, and advertised its ability to work in a robust manner.]

        BRU for UNIX, for example, does compression, and is designed for archiving and backup. This tool is from 1985 and is still maintained today.

        Afio is specifically designed for archiving and backup. It also supports redundant fault-tolerant compression. This tool is also from 1985 and is still maintained today.

        [Edit: LONE-TAR is another backup product I remember using from the mid 1980s; it was originally produced by Cactus Software. It’s still supported and maintained today. It provided a fault-tolerant compression mode, so it would be able to restore (most) data even if there was damage to the archive.]

        As to all your other complaints, it seems you are attacking the document’s “aggressive tone” and you mention that you find it painful (or offensive) to read, but you haven’t actually refuted any of the technical claims that the author of the article makes.

        1. 1

          Sorry, I had compression software in mind when I wrote that sentence. I meant that I had never seen compression software that made resistance to corruption such an important feature.

          Thanks for the links! I’m not that surprised that some pieces of software already exist and fit in that niche (I would have had to build a startup otherwise!). I’m quite curious about their tradeoff choices (space vs. recovery capabilities), but since two of them are proprietary, I’m not sure that information is available, unfortunately.

          As to all your other complaints, it seems you are attacking the document’s “aggressive tone” and you mention that you find it painful (or offensive) to read, but you haven’t actually refuted any of the technical claims that the author of the article makes.

          Indeed. Part of that is because comments are probably not really a good place for it, since the article itself is very long. The other part is because xz’s author does not wish to get into that debate and I don’t want to pull him in by publishing his answers on IRC on that topic. It’s not a great situation and I don’t really know what to do, so I end up hesitating, which isn’t perfect either. I think I mostly just hope to get people to question the numbers and facts on that page a bit, and not to forget everything else that goes into making a file format useful in practice; the absence of a rebuttal doesn’t mean the article is true, spot-on, unbiased and so on.

        2. 2

          I agree about the tone of the article, but I’m not sure that archiving and compression run counter to each other.

          I’ve spent a lot of time digging around for old software, in particular to get old hardware running, but also to access old data. Already we are having to dig up software from 20+ years ago for these things.

          In another 20 years, when people need to do the same job, it will be more complicated: if you need to run one package, you may find yourself needing tens of transitive dependencies, or worse. If you’re looking in some ancient Linux distribution mirror on some forgotten server, what are the chances that all the bits are still 100% perfect? And certainly nobody’s going to mirror all these in some uncompressed format ;-)

          This is one case where being able to recover corrupted files is important. It’s also helpful to be able to do best-effort recovery on these; in any given distro archive you can live with corruption in some proportion of the resulting payload bytes - think of all the documentation files you can live without - but if a bit error causes the entire rest of the stream to be abandoned then you’re stuffed.

          I’d argue that archival is something we already practice in everyday release of software. The way people tend to name release files as packagename-version.suffix is a good example: it makes the filename unique and hence very easy to search for in the future. And here, picking one format over another where it has better robustness for future retrievers seems pretty low-cost. It’s not like adding parity data or something that increases sizes.

          1. 2

            Agreed. :)

            Makes me think of archive.org and softwareheritage.org (which already has pretty good stuff if I’ve understood correctly).

        3. 3

          Despite the thorough explanation of why XZ should not be used for long-term archiving (and why the format seems to be badly specified), the article fails to mention any alternative. What should we use instead?

          Update

          From the front page:

          This article describes the reasons why you should switch to lzip if you are using xz for anything other than compressing short-lived executables.

          1. 4

            Judging by the website, lzip - http://lzip.nongnu.org

            I cannot find it right now, but there was a dissection of this article with a list of the problems it has.

            1. 3

              I’d appreciate reading that if you can locate it. Also, the article was updated very recently (2019-05-17), perhaps in response to previous feedback?

              I am aware of some previous discussion and concerns that the critique might be “politically motivated”, but I’ve not seen any convincing counter arguments made that actually debunk the technical claims.

              Interesting: some projects (such as wget) are now using .lz for distribution, while others (such as emacs) are using .xz.

              1. 1

                It would be hard to find, as this pops up on HN and Reddit every quarter, so it is hard even to google.

            2. 2

              I’d personally recommend lzip’d POSIX.1-2001 archives (pax, or GNU tar `--format=pax'), or afio, which has fault-tolerant compression and makes an excellent archive format.
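
              (A minimal sketch of producing such an archive with Python’s tarfile module, assuming an lzip binary on the PATH; the file and directory names are illustrative.)

                  import subprocess
                  import tarfile

                  # Write a POSIX.1-2001 (pax) format archive...
                  with tarfile.open("release.tar", "w", format=tarfile.PAX_FORMAT) as tar:
                      tar.add("release/")   # hypothetical directory to archive

                  # ...then compress it with lzip, producing release.tar.lz
                  subprocess.run(["lzip", "-9", "release.tar"], check=True)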

            3. 3

              Glad to see this article brought to light again. I’ve long since switched to .tar.lz privately, but the new greatness now is … Zstd, I suppose? Is Zstd suitable for long-term archiving?

              1. 2

                It is an RFC now (RFC 8478), so probably. Zstandard is pretty nice since decompression is basically the same regardless of compression level, so you only pay the cost at creation time.

                1. 2

                  It is an RFC now, so probably.

                  Just because it’s an RFC doesn’t inherently mean anyone analysed the format under the same criteria as examined in this article. The question is, has anyone?

                  1. 1

                    There’s zchunk (“A file format designed for highly efficient deltas while maintaining good compression” - https://github.com/zchunk/zchunk ), whose description makes me think errors in a zstd stream might “kill” large blocks of the output. That’s only an uninformed supposition on my end, though.