1. 17

  2. 4

    With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

    John von Neumann

    1. 4

      Has anyone done analysis on the scaling of these models? Is a 10x model 10x better? More? Less?

      1. 8

        Yes. See Scaling Laws for Neural Language Models. It follows a power law, and the scaling is pretty exact.

        10x more parameters results in a 16% reduction in loss; the exponent is 0.076.
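
        As a quick arithmetic check, here is a minimal sketch in Python; alpha_N is the fitted parameter-scaling exponent reported in the paper, and everything else follows from it:

        ```python
        # With L(N) proportional to N ** (-0.076), scaling parameters by 10x
        # multiplies the loss by 10 ** (-0.076).
        alpha_N = 0.076  # fitted exponent from "Scaling Laws for Neural Language Models"

        ratio = 10 ** (-alpha_N)              # loss(10 * N) / loss(N)
        print(f"loss ratio: {ratio:.3f}")     # ~0.839
        print(f"reduction: {1 - ratio:.1%}")  # ~16.1%
        ```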

        1. 3

          Important caveat: provided your dataset is large (and clean) enough, see Figure 9 from the paper.

          “For large [dataset size] D, performance is a straight power law in model [parameter count] N. For a smaller fixed D, performance stops improving as N increases and the model begins to overfit.”
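
          To make the caveat concrete, here is a small sketch of the joint law the paper fits for loss as a function of parameters N and tokens D, with the fitted constants as reported there; with a small fixed D the data term dominates and extra parameters stop helping:

          ```python
          # Joint scaling law from "Scaling Laws for Neural Language Models":
          #   L(N, D) = ((N_c / N) ** (alpha_N / alpha_D) + D_c / D) ** alpha_D
          alpha_N, alpha_D = 0.076, 0.095
          N_c, D_c = 8.8e13, 5.4e13  # fitted constants (non-embedding params, tokens)

          def loss(N, D):
              return ((N_c / N) ** (alpha_N / alpha_D) + D_c / D) ** alpha_D

          # Small fixed D: loss barely moves as N grows 100x (overfitting regime).
          print([round(loss(N, 1e9), 1) for N in (1e8, 1e9, 1e10)])   # [3.0, 2.9, 2.8]
          # Large D: the same parameter growth keeps paying off.
          print([round(loss(N, 1e12), 1) for N in (1e8, 1e9, 1e10)])  # [2.8, 2.4, 2.0]
          ```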

          1. 1

            Thank you for the link to the paper. I need to read up on the topic a bit to understand it better.

            Maybe there is a market-related reason why training on larger data has not happened, e.g.:

            • the solution has been kept in-house - this happened with GPT-3, as I understand, so that is probably not the case
            • the increase in model performance is no longer attractive given the cost of achieving it
            • there is no demand on the market at the price a GPT-N+ solution would command

            Out of curiosity: what interesting applications have come out of GPT-3-based technology?

            1. 1

              That’s fantastic. Thank you much!

          2. 2

            As the post itself says, this is an MoE (Mixture of Experts) model, so its parameter count is not really comparable to the parameter count of dense models like GPT-3 (175B). Google already did GShard in 2020, a 600B MoE model.
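
            To illustrate why headline MoE counts overstate what actually runs, here is a toy sketch (the sizes below are made up, not the actual GShard or GLaM configuration): only the top-k experts fire for each token.

            ```python
            # Toy MoE accounting: only top_k experts run per token, so the
            # "total parameters" headline overstates per-token compute.
            num_experts = 64
            expert_params = 8e9    # parameters per expert (hypothetical)
            shared_params = 20e9   # attention, embeddings, router (hypothetical)
            top_k = 2              # experts activated per token

            total = shared_params + num_experts * expert_params
            active = shared_params + top_k * expert_params
            print(f"headline: {total / 1e9:.0f}B params")    # 532B
            print(f"active:   {active / 1e9:.0f}B / token")  # 36B
            ```

            A dense model with the same headline count would run all 532B parameters on every token, which is why the two numbers aren't comparable.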

            It is a bit surprising to me that more than a year after GPT-3, apparently no one(!) has trained a larger dense model. What’s going on?

            1. 1

              It is a bit surprising to me that more than a year after GPT-3, apparently no one(!) has trained a larger dense model. What’s going on?

              It’s computationally expensive and conceptually quite difficult to train such a large, dense model. Difficult enough that GPT-3 still hasn’t been replicated reliably.

              1. 1

                I understand it is expensive and difficult, but there are lots of players with more money and more talent than OpenAI. My conclusion is that no one is willing to invest, which is in itself surprising.

                1. 1

                  It is not that surprising. Academia at large is not interested in replication; it doesn’t help with publication. Industry replicates when something has real-world usage or helps promote something. At the moment, GPT-3 is in a sort of “jack of all trades, master of none” situation in real-world usage. Google hasn’t disclosed the parameter size of its conversation bot (LaMDA, not the previous-gen Meena); it is probably close to GPT-3 in parameter size and far better at conversation.

                  The best bet is that in the next 2 years some hardware players will tout this to show off their hardware capabilities. It has happened later than it could have, probably due to dataset issues (I don’t think OpenAI has the dataset that trains GPT-3 available ready-to-download?). A rack of NVIDIA’s latest GPU servers could probably train GPT-3 in a month. It will probably show up once we can train such a model in under a week on a rack or two.
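
                  For estimates like this, the usual back-of-envelope is C ≈ 6·N·D training FLOPs; a rough sketch follows (the rack size and sustained per-GPU throughput are assumptions, not measured numbers):

                  ```python
                  # Rough training-time estimate via the common C ~= 6 * N * D rule.
                  N = 175e9       # GPT-3 parameters
                  D = 300e9       # training tokens (approx., per the GPT-3 paper)
                  C = 6 * N * D   # ~3.15e23 FLOPs

                  gpus = 64               # hypothetical rack: 8 nodes x 8 GPUs
                  flops_per_gpu = 150e12  # assumed sustained (not peak) FLOP/s per GPU
                  days = C / (gpus * flops_per_gpu) / 86400
                  print(f"{days:.0f} days")  # ~380 days; a 1-month run needs ~12x this throughput
                  ```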