    This is an interesting post, but I want to point out just one part of it:

    > Massive data sets are not at all what humans need to learn things, so something is missing.

    This is a great point. NNs are effective, but they have some serious limitations.

      Massive data sets are required to fit models with many parameters, and it appears that there are qualitative changes in the predictions of models as the number of parameters grows. Relevant to the recent llama.cpp work is the ability to quantize very large models to 8 bits per component or less. One of the authors of that work has a summarizing blog post in which they opine:

      > There are two types of transformers, and you should not generalize from one to the other. From these findings it is clear that transformers after the phase shift at 6.7B parameters behave very differently from transformers before the phase shift. As such, one should not try to generalize from <6.7B transformers to beyond 6.7B parameters.

      As regards humans, we might consider the typical human as not especially well-learned. This neatly bridges the gap between the author’s position and the Bitter Lesson without contradicting either.
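To make the quantization point concrete: the basic idea of mapping floats to 8 bits can be sketched with simple absmax scaling, where each value is scaled so the largest magnitude lands on the int8 limit. This is a minimal illustration only; the cited work uses a more elaborate mixed-precision scheme, so treat the function names and scheme here as assumptions, not the authors' actual method.

```python
def quantize_absmax_int8(xs):
    """Absmax quantization: scale values so the largest |x| maps to 127."""
    scale = max(abs(v) for v in xs) / 127.0
    # Round each scaled value and clamp to the int8 range [-128, 127].
    q = [max(-128, min(127, round(v / scale))) for v in xs]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original floats."""
    return [v * scale for v in q]

weights = [0.1, -0.5, 2.0, -1.25]
q, scale = quantize_absmax_int8(weights)  # q holds integers in int8 range
restored = dequantize(q, scale)           # each entry within scale/2 of the original
```

The largest-magnitude entry (here 2.0) is reconstructed exactly; every other entry comes back within half a quantization step, which is why outlier values in large transformers (the "phase shift" discussed above) make naive absmax scaling lossy.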

        Thank you very much, that’s very interesting. I am looking for more information about this phase shift.