Compression is, in a way, equivalent to prediction. Given all of the bits you’ve seen so far, predict the next bit of the file. If you are very confident about your prediction, then you can use many bits to encode the unlikely alternative, and less than one bit to encode the likely alternative. If you know nothing, then 0 and 1 are equally likely, and you have to spend an equal number of bits on each possibility (which can’t be less than 1).
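An arithmetic coder realizes exactly this trade-off: a symbol the model assigns probability p costs an ideal -log2(p) bits. A quick sketch of the cost function (my toy illustration, not any particular codec):

```python
import math

def code_length_bits(p: float) -> float:
    """Ideal (Shannon) code length for a symbol the model assigns probability p."""
    return -math.log2(p)

# Confident model: next bit is 1 with probability 0.99.
print(code_length_bits(0.99))  # likely alternative costs well under 1 bit
print(code_length_bits(0.01))  # unlikely alternative costs over 6 bits

# Know-nothing model: both alternatives cost exactly 1 bit.
print(code_length_bits(0.5))
```

The two costs average out: no model ever gets something for nothing, it only shifts bits from likely inputs onto unlikely ones.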

So the compressor has some kind of model of what it expects to see. Predictable inputs produce small outputs, and surprising inputs produce large outputs. The better the model, the harder it is to surprise. Some algorithms like LZMA and PPM make this really explicit, but every algorithm does it in some way. The simplest Huffman codes say “I think that the next byte is drawn from some hard-coded distribution, just like every byte” or “I think that the probability of the next byte being X is proportional to how many Xes I’ve seen in the file so far”. LZ-type algorithms say “I think there’s a good chance that the next several bytes will be identical to a run of several bytes that I’ve already seen”.
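The second Huffman-style model above ("probability proportional to counts so far") can be sketched in a few lines. This is my own toy version with Laplace smoothing so unseen bytes still get nonzero probability; it only totals ideal code lengths rather than emitting actual codes:

```python
import math
from collections import Counter

def adaptive_cost_bits(data: bytes) -> float:
    """Total ideal code length (bits) under an adaptive order-0 model:
    P(next byte = x) = (count(x) + 1) / (bytes seen + 256)."""
    counts, seen, total_bits = Counter(), 0, 0.0
    for b in data:
        p = (counts[b] + 1) / (seen + 256)
        total_bits += -math.log2(p)
        counts[b] += 1
        seen += 1
    return total_bits

# A predictable input costs far less than its raw 16 * 8 = 128 bits...
print(adaptive_cost_bits(b"a" * 16))
# ...while an input that keeps surprising the model costs over 8 bits per byte.
print(adaptive_cost_bits(bytes(range(16))))
```

Same input length, very different costs: the whole difference is how well the model predicted.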

Indeed, in audio one way to compress is to use a polynomial as a predictor of the wave and then only encode the difference between the polynomial and the actual value, which requires fewer bits to represent (and can later be further compressed, say using Huffman coding).
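A minimal sketch of that idea, using the fixed second-order predictor (linear extrapolation from the previous two samples) that codecs like FLAC offer — this is my toy code and example waveform, not any codec's actual implementation:

```python
def residuals(samples):
    """Predict x[n] = 2*x[n-1] - x[n-2]; store prediction errors instead of samples."""
    out = list(samples[:2])  # first two samples stored verbatim as warm-up
    for i in range(2, len(samples)):
        pred = 2 * samples[i - 1] - samples[i - 2]
        out.append(samples[i] - pred)
    return out

def reconstruct(res):
    """Invert residuals(): re-run the predictor and add the stored errors back."""
    out = list(res[:2])
    for r in res[2:]:
        out.append(r + 2 * out[-1] - out[-2])
    return out

wave = [0, 20, 39, 56, 70, 80, 85, 86]  # a smooth curve with large sample values
res = residuals(wave)                    # residuals stay tiny: cheap to entropy-code
assert reconstruct(res) == wave          # and the round trip is lossless
print(res)
```

The residuals of a smooth signal cluster near zero, so a follow-up entropy coder (Huffman, Rice, etc.) can represent them in far fewer bits than the raw samples.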

A simplified proof:

Assume you have an algorithm that compresses any input to a smaller size, reversibly. Just feed the output back into the input, i.e. keep compressing the file over and over. Eventually it must compress down to a single byte … which can still be decompressed (in multiple steps) back to the original. 🤨

(This still invokes the pigeonhole principle, but it’s much more intuitive with only one byte, i.e. 256 pigeonholes.)
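The same pigeonhole count, spelled out: there are 2^n distinct inputs of exactly n bits, but only 2^n − 1 bit strings of every shorter length combined, so a lossless compressor cannot shrink them all. A trivial check:

```python
def shorter_strings(n: int) -> int:
    """Number of bit strings strictly shorter than n bits (lengths 0 .. n-1)."""
    return sum(2**k for k in range(n))  # geometric series = 2**n - 1

for n in range(1, 9):
    # One fewer pigeonhole than pigeons, at every length n.
    print(n, 2**n, shorter_strings(n))
    assert shorter_strings(n) == 2**n - 1
```

Off by exactly one at every length, which is why "compresses *any* input" is the impossible part, not "compresses *some* inputs".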

The discussion of hashing is interesting. I’m not sure what the context for their discussion was, but I have wondered about such things in the context of BitTorrent in the past. Essentially, you start with just the hash of the file. As discussed in the article, this hash isn’t reversible: there are simply too many files with the same hash. Then you download a couple of blocks (e.g. the first and last block). Now you have the second situation they discuss, with 2^(m-k) possible files given a file of length m bits, of which we currently have k bits. The interesting follow-on questions for me would be:

Do we ever reach a point where we can realistically generate one or two missing blocks instead of downloading them? (probably not, even for a relatively small block size)

What if we apply other heuristics to check each of the possible files to see if it’s ‘the one’:

Some file formats have internal checksums

The real file will probably follow patterns which humans (or possibly machines) could distinguish from noise. For example, if we know the file is an image (and that it is an image of something, not random noise in an image file), then could a human viewing the possible files as images pick out the one that appears to be a complete, correct, image?
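A toy version of the first question, with names and data entirely my own: given the file’s SHA-256 and all but a tiny missing chunk, you can brute-force the gap. It only works here because the gap is 2 bytes (65,536 candidates); a real block of even a few kilobytes means 2^(8·4096)-odd candidates, which is hopeless:

```python
import hashlib
from itertools import product

original = b"hello, compressed world!"
digest = hashlib.sha256(original).hexdigest()

# Pretend bytes 10..11 never arrived; keep the surrounding blocks.
prefix, suffix = original[:10], original[12:]

# Try every possible 2-byte chunk until the whole-file hash matches.
for candidate in product(range(256), repeat=2):
    trial = prefix + bytes(candidate) + suffix
    if hashlib.sha256(trial).hexdigest() == digest:
        print("recovered:", trial)
        break
```

The cost doubles with every missing bit, which is exactly why the answer to the first question is "probably not" for any realistic block size.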

If I understand you correctly, there are just too damn many “possible files” to be able to check them. Even if there are only 32 unknown bytes in the file, that’s 2^256 possibilities, which is on the order of 10^77 … in the same ballpark as the number of atoms in the observable universe, and vastly more than the number of nanoseconds that have passed since the Big Bang.
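A quick sanity check on those magnitudes (the comparison figures are rough order-of-magnitude estimates, not exact values):

```python
n = 2**256
print(len(str(n)))  # 78 decimal digits, i.e. roughly 1.16 * 10**77

# ~13.8 billion years since the Big Bang, in nanoseconds: about 4 * 10**26.
ns_since_big_bang = int(13.8e9 * 365.25 * 24 * 3600 * 1e9)

# Commonly quoted estimate for atoms in the observable universe.
atoms_in_observable_universe = 10**80

print(n > ns_since_big_bang)             # vastly more
print(n > atoms_in_observable_universe)  # same ballpark (a few orders below)
```

Even checking one candidate per atom per nanosecond since the Big Bang wouldn’t make a dent.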

counterpoint: lossy compression

I think that’s ruled out (implicitly) a priori for the purpose of this paper. If you are willing to accept infinite information loss then you can compress anything down to nothing. Practically, of course, if you’re willing to throw information away, you can reduce the size of any body of information. This does make the interesting point that, while you can’t guarantee compression in theory, in practice you can often get around the problem by doing things the theory doesn’t contemplate (deciding that some information isn’t important).

There is obviously a limit, though, at which that trick ceases to be beneficial. You start writing pesky iotas which are elided in pronunciation as subscripts to save space, then someone stops writing them altogether to save time, and then you get the Great Schism, then “Istanbul was Constantinople…” (mostly joking here; obviously the political and historical impulses were very significant in that, probably more than the Homoousion, but I couldn’t pass it up).

in practice you can often get around the problem by doing things the theory doesn’t contemplate

This is a great point. Some of the best engineering developments have been from carefully reconsidering the assumptions that the theory made. (“What if we radically redefine what information loss means?”)

In case anyone’s curious, I got the date out of this site’s RSS feed. https://web.archive.org/web/20090123072348mp_/http://matt.might.net/articles/feed.rss

It doesn’t seem to be anywhere on the page itself.
