  1. 4

    For an update, see On the Information Bottleneck Theory of Deep Learning.

    The article itself asks:

    It remains to be seen whether the information bottleneck governs all deep-learning regimes, or whether there are other routes to generalization besides compression.

    The paper I linked answers:

    Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa.

    I think it was a great idea. But the evidence suggests that it’s just not true.

    1. 2

      My intuition has been that discovering sparse representations is usually necessary for any kind of generalization – a model learning speech-to-text, for instance, will necessarily have somewhere inside it an understanding of the individual vowel/consonant sounds and utterances, which are then building blocks for generating text.

      “Compression” ~= “sparse representation”, right? So the paper refutes that idea?

      1. 1

        thank you kindly for the link! having cursorily looked at it and the arguments raised by Tishby et al., it seems that the information bottleneck might still be relevant…

        1. 1

          Why do you think information bottleneck might still be relevant? I am curious. (I consider the theory mostly failed at this point.)

          1. 2

            In that link @sanxiyn posts, there seems to be a very vigorous back and forth between Tishby et al. (the IB theory crew) and the article criticizing IB (EDIT: with neither side conceding defeat). The program committee accepting the paper to the conference may only mean they thought it worthy of a much broader discussion in the community than their review process could provide.

            Since that was 2 years ago, perhaps other papers or discussions have advanced the understanding of IB or its critique. I think the link and its publication are non-conclusive, even of community opinion, never mind the fact of the matter.

            One kind of “obvious” point about “compression” and “generalization” is that they are almost semantically the same. To find a well-generalizing representation means to have a representation that has been “properly de-noised”. Representing noise takes up space (probably a lot, usually, but that is problem-specific). This applies to all fitting, from multi-linear regression on up, and maybe to all “induction”. (The transduction case, more analogous to “interpolation”, is different.)

            That is just one piece of the puzzle, of course, and there may be controversy over how to define/assess “compression” (e.g. a neural net with a bunch of near-zero weights may take up computer memory, but be the same as one without those weights at all), and also over the specific quantitative relationships between compression, however assessed, and out-of-sample error rates.

            TL;DR - I think @sanxiyn has more work to do in order to establish “mostly failed” or “debunked” status.
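The parenthetical above about near-zero weights can be made concrete. Here is a minimal NumPy sketch (not from the thread; the layer shape, sparsity level, and the 1e-3 pruning threshold are arbitrary choices for illustration) showing that dropping a layer’s near-zero weights removes most of the stored parameters while leaving the layer’s outputs essentially unchanged – which is why raw parameter count is a poor proxy for “compression”:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 64x32 weight matrix where ~90% of the entries are tiny noise around
# zero and ~10% carry actual signal (assumed proportions for the sketch).
noise_mask = rng.random((64, 32)) < 0.9
W = np.where(noise_mask,
             rng.normal(0.0, 1e-6, (64, 32)),   # near-zero "noise" weights
             rng.normal(0.0, 1.0,  (64, 32)))   # significant weights

x = rng.normal(size=64)  # a random input vector

# Prune: zero out every weight below a magnitude threshold.
W_pruned = np.where(np.abs(W) < 1e-3, 0.0, W)

dense_out = x @ W
pruned_out = x @ W_pruned

kept = np.count_nonzero(W_pruned) / W.size
max_diff = np.max(np.abs(dense_out - pruned_out))

print(f"fraction of weights kept: {kept:.1%}")
print(f"max output difference:    {max_diff:.2e}")
```

The pruned matrix keeps only the significant weights (roughly a tenth of the entries here), yet the layer computes nearly the same function, so any assessment of “compression” has to look at the effective representation rather than stored parameters.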

            1. 2

              @cblake, i don’t think i could have said it better than you did. thank you !

              1. 2

                You’re welcome. The Wikipedia entry on the Information Bottleneck Method covers some of this controversy in the “Information theory of deep learning” section (today’s version; future folk may someday have to go back to that in the wiki history). It also has more references.

      2. 3

        This is from 2017.