1. 15

The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

  1. 4

    Relevant to the title, these huge language models are in effect part complex new technology and part…search index. They memorize phone numbers and addresses from the Internet. GPT-3 generates plausible-sounding news articles about gay marriage in the United Methodist Church, but it also seems to have ingested actual news articles about the topic (not the best source but see here). Typing questions into Google search also gets you plausible-looking text, but there you (hopefully) know where it came from!

    I’m not saying there is nothing new or remarkable with big language models, just that how impressive the results are is not a direct measure of how much generalization they’re doing, as separated from having relevant stuff in the training data and doing a good job stringing it together. Although it’d be a weird loop-around in the history of AI, a general DNN that had built up the ability to do arithmetic (other than from memory), or answer the kinds of questions old-school expert systems did (‘are camels generally bigger than ants?’), could be impressive in a totally different way from what we have.

      1. 3

        OP seems to be taken from a journal. The PDFs are different but the content is the same.

        1. 1

          This is the ACM’s copy from the conference it was released at.

          Probably need a mod to merge the two stories.

          1. 2

            You’re right. Merged. Apologies for missing that earlier, I was in a hurry.

        2. 4

          Commenting on my own post to add:

          This is the paper which led to Timnit Gebru’s termination by Google, and which is currently, apparently, causing the termination of Margaret Mitchell.

          1. 2

            For context, this is the research that Timnit Gebru was fired from Google for refusing to redact.