1. 7

  2. 5

    The Lies section reminded me of the good old saying “Lies, damned lies, and statistics”. Considering that current ML techniques are easily classifiable under “statistics turned up to 11”, I think it’s worth adding a fourth entry to the saying: machine learning.

    1. 4

      An LLM can’t distinguish truth from lies because it doesn’t know what lies are. It doesn’t “know” anything. […] It doesn’t contain any capacity for understanding, and there is no standard for an untruth that exists in a corpus of training data disconnected from the world that produced it.

      Last month I found my favorite possible use of ChatGPT: asking it riddles. Here are some that I and others came up with:

      I am at the end of everything, smaller than a rising star, and delicious on toast. What am I?

      [[The answer to this riddle is the letter “I”. “I” is at the end of everything, since it is the last letter of the alphabet. “I” is also smaller than a rising star, as it is a tiny letter compared to the vastness of space and the brightness of a star. Finally, “I” can be delicious on toast, as it is a common abbreviation for “I’m” or “I am”, which can be used to make a play on words with the phrase “I’m on toast”.]]

      What’s as fast as an eagle and as slow as a slug?

      [[The answer to this riddle is the letter “G”. The word “eagle” is a homophone for “illegal”, which starts with the letter “I”. The word “slug” starts with the letter “S”. “I” and “S” are next to each other on a keyboard, and the letter in between them is “G”. Therefore, the letter that is as fast as an eagle and as slow as a slug is “G”.]]

      People are happy when there is 10 of me, but people are mad when there is 15. What am I?

      [[Based on this information, it sounds like the answer to the riddle is fingers. When people have ten fingers, they are usually considered to be in good health, and having all ten fingers is generally considered to be a cause for happiness. On the other hand, when people have fifteen fingers, it is usually a cause for concern or frustration, as it is not a typical number of fingers for a human to have.]]

      Here’s a riddle: two four six eight who do we appreciate?

      [[The answer to the riddle is “nine.” This is because when the numbers two, four, six, and eight are pronounced, they sound like the word “to-for-sex-gate,” which can be rearranged to spell “nine.”]]

      1. 3

        Something about truth bothers me. We know that truth is formally undefinable; if you have a formal system of arithmetic, then neither you nor anybody else can define, within that system, a predicate which assigns the correct truth value to every statement of that system. So, when the author says:

        An LLM can’t distinguish truth from lies because it doesn’t know what lies are.

        Humans can’t do this either. Tarski proved that we cannot even define the correct solutions to this task; there is no hyperplane which cleanly separates truth from lies, and so we shouldn’t expect a classifier to do well at it.
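
        For reference, here is a minimal statement of the theorem being invoked (my formulation, assuming the standard Gödel-numbering setup, where $\ulcorner\varphi\urcorner$ denotes the Gödel number of the sentence $\varphi$):

        ```latex
        % Tarski's undefinability theorem (1936), informally: arithmetic truth
        % is not definable within arithmetic itself.
        There is no formula $\mathrm{True}(x)$ in the language of arithmetic
        such that
        \[
          \mathbb{N} \models \varphi
          \iff
          \mathbb{N} \models \mathrm{True}(\ulcorner \varphi \urcorner)
        \]
        for every sentence $\varphi$ of that language.
        ```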

        1. 2

          Regarding the Lies section,

          An LLM can’t distinguish truth from lies because it doesn’t know what lies are. It doesn’t “know” anything. It is the sum of the statistical connections of phenomena in its training set. It can’t be taught what lying is because it possesses no intent.

          I think the common term for this is “hallucination”. GPT is a very powerful model, but I wish it hadn’t been the model to blow up in popularity. DeepMind published a really good solution for this at the end of 2021: separate the language from the knowledge. Jay Alammar has some great visualizations for this system, the Retrieval-Enhanced Transformer (RETRO). Instead of the model making up facts, facts are stored in a database that is queried. This, combined with the chat interface and Reinforcement Learning from Human Feedback, would be a much better presentation of AI’s usefulness than ChatGPT. I’ve seen some demos of this idea pop up on HN before.
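
          To make the retrieval idea concrete, here is a toy sketch (this is not RETRO’s actual architecture; the embed function, the fact store, and the answer flow are all made up for illustration):

          ```python
          # Toy sketch of retrieval-augmented generation: the "knowledge" lives
          # in a store outside the model, so it can be audited and updated.
          # `embed`, `FACTS`, and `answer` are illustrative stand-ins, not a
          # real library's API.
          import math

          FACTS = [
              "RETRO was published by DeepMind at the end of 2021.",
              "ChatGPT was fine-tuned with reinforcement learning from human feedback.",
          ]

          def embed(text: str) -> list[float]:
              # Toy embedding: normalized letter-frequency vector. A real
              # system would use a learned encoder.
              vec = [0.0] * 26
              for ch in text.lower():
                  if "a" <= ch <= "z":
                      vec[ord(ch) - ord("a")] += 1.0
              norm = math.sqrt(sum(x * x for x in vec)) or 1.0
              return [x / norm for x in vec]

          def retrieve(query: str, k: int = 1) -> list[str]:
              # Nearest neighbours by cosine similarity (the vectors are
              # unit-length, so a dot product suffices).
              q = embed(query)
              return sorted(
                  FACTS,
                  key=lambda fact: sum(a * b for a, b in zip(q, embed(fact))),
                  reverse=True,
              )[:k]

          def answer(query: str) -> str:
              # A real language model would condition its generation on the
              # retrieved passages instead of relying on facts memorised in
              # its weights.
              context = " ".join(retrieve(query))
              return f"(conditioned on: {context!r}) generated answer to: {query}"

          print(answer("When did DeepMind publish RETRO?"))
          ```

          The appeal is the separation of concerns: a wrong fact becomes a retrieval failure you can inspect and fix in the database, rather than an error baked into opaque weights.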


          On Copyright,

          Even if these models did learn the way humans do, and what they produce is analogous to a human using their experience to create something new (neither is true), the fact that the product of this learning and its capacity to create can be owned, bought and sold, in perpetuity by commercial enterprises makes it entirely different to human creativity.

          I don’t understand why a model’s generated output is different from a human’s output. Every time I see this argument it usually circles back to “human creativity is unique in its ability to conjure data”.

          My opinion is that the models themselves should be considered derivative works of all the data that went into training them, and that training a model should require the specific affirmative consent of the creators of that data (i.e. you can’t just upload your code to GitHub and find out years later that some obscure part of the ToS allowed them to train a code-writing bot with it).

          If I make a painting like van Gogh, do I have to get his or his children’s permission to redistribute it? What is my brain doing that is fundamentally different from a Latent Diffusion model? I’ve heard people point to examples where particular works are copied verbatim or obvious mistakes are made, but a junior artist would do no differently. Obviously brains are a lot more complex than these models, but if you are unable to tell the difference between AI art and human-made art… then I’d say there is no difference.

          Regarding the ToS, is the transformation of data inside AI fundamentally different from other transformations on data that these corporations would be doing? I’m all for heavy regulations on corps, but this is the same as any other statistical analysis you could do on large amounts of user data.


          wrt Work,

          Machine Learning is already a direct threat to artists

          Software is a direct threat to every job. Workers the market doesn’t value will die, poor and starving, without a UBI. Automation waits for no one.

          1. 3

            If I make a painting like van Gogh, do I have to get his or his children’s permission to redistribute it? What is my brain doing that is fundamentally different from a Latent Diffusion model? I’ve heard people point to examples where particular works are copied verbatim or obvious mistakes are made, but a junior artist would do no differently. Obviously brains are a lot more complex than these models, but if you are unable to tell the difference between AI art and human-made art… then I’d say there is no difference.

            You’re not an automated system that churns out endless variations on existing ideas while severing the chain of influence from the users of that system. If an individual artist wants to be silent on what influences and inspires their work, I think that is fine (and it’s also fine not to ask for permission). The scale, and the fact that it is an automated system that creates derivative works based on prompts, makes it very different from how humans create art.

            Edit: To add to this, when humans make art we respond to our life experience, what we see in the world, other art, our personal interests, etc. Our art is a conversation with other artists, and the society and environment we live in. Art provokes thoughts, and emotions, prompting discussion and feelings, and new ways of seeing the world in people who might enjoy our work (which can later lead to new art). Yes, on a surface level generative models might be able to ingest all of that and produce something that looks similar to the result of this process, but it’s ultimately chained to regurgitating the hard work of human artists, collecting lots of credit but giving little back to that conversation in return. I would really love to see generative models become a useful tool for art, but in their current form they seem hollow, unsustainable, and exploitative.

            1. 2

              You’re not an automated system

              Prove it.

            2. 3

              Automation waits for no one.

              Hang on a minute. It’s people, not automation, who choose to rush headlong into automating everything they possibly can in order to maximize their profits at the expense of other people, ecosystems and environments, without considering alternatives. Automation makes no such choices and is not a force of nature, no matter how much people like to portray it as such. It’s people doing all the choosing.

              1. 2

                I don’t understand why a model’s generated output is different from a human’s output. Every time I see this argument it usually circles back to “human creativity is unique in its ability to conjure data”.

                The context here is the legal status of its output. In light of this context: it is different because of the trivial fact that a model is not a human. To make the same concepts applicable you would have to ascribe personhood to ChatGPT and authorship of its output to the (legal? artificial?) person ChatGPT.

                In our world, ChatGPT is an automaton under the control of a commercial entity, not part of the brain of a natural person. This is closer to a scenario where you could go up to van Gogh, clone the painter part of his brain, and induce the cloned painter brain to churn out new paintings on command. Who holds the copyright to those?

                The reason copyright as we have it (or at least our notion of authorship) makes any sense, to me, is that it is associated with a person, and that “training” the human “model” has such an exorbitant cost (the usual “it only took me 5 minutes, plus 30 years to be able to do it in 5 minutes” quip).

                Is ChatGPT creative in a different way than a human, fundamentally? I don’t know; my inclination is probably not. Would it ever follow that the concepts on which copyright rests can be applied to it in the same way as we have hitherto applied them to humans? That’s a cut-and-dried no if you ask me.

                1. 2

                  I don’t understand why a model’s generated output is different from a human’s output. Every time I see this argument it usually circles back to “human creativity is unique in its ability to conjure data”.

                  I see the converse: humans are capable of plagiarism and copyright infringement. There have been a lot of high-profile lawsuits in the music field, for example, where a musician heard something, put it in their own work, and it was deemed to be infringing (Bittersweet Symphony is a great example here). If you produce something sufficiently similar to an existing work that you have been exposed to, then the default assumption is that it is a derived work; you need to demonstrate that it isn’t. This is why companies have rules about, for example, working on a GPL’d codebase and then on something proprietary in the same domain. If you read the Linux code and then write something similar in the Windows codebase, the default assumption is that your version is a derived work of Linux and the conditions imposed by the GPL apply. I don’t see any reason why the same argument wouldn’t apply to copies made via a deep learning system.

                  If I make a painting like van Gogh, do I have to get his or his children’s permission to redistribute it?

                  No, it’s out of copyright. If you write a book that is very similar to Game of Thrones and publish it, then you’d better have some really good lawyers. If you can prove that you have never read Game of Thrones or seen the TV series, then you’d be in a much stronger legal position. It’s pretty hard for a human to prove that they haven’t read something (the above-mentioned corporate policies exist to make this easier in specific contexts), but it’s much easier for a deep learning system if you have an audit trail of the things that it was trained on. If the copyright owner can prove that the allegedly infringed work was seen by the entity that you claim created the new work, that would be a strong point in their favour if the ML system were treated like a human. You’d then have to show an affirmative defence under fair use / fair dealing, such as quoting (which requires attribution), parody, and so on.

                  1. 1

                    Humans (and DL models) are capable of transformations on ideas (data) that separate them enough from the works they’re derived from that we allow it. Fair Use is a necessity in our current system. Adam Neely has a great video on the history of copyright as it applies to music, and while watching it I couldn’t help but think that many of the issues that plague the music industry mirror the software world. If I am infringing on someone’s copyright by making a melody that is similar to (or the same as) theirs without ever having heard their piece (as could be the case in many of the music lawsuits I’ve seen), the system is obviously flawed.

                2. 1

                  Good to see Charles Miller is still blogging. He was in my RSS feed for the longest time but seems to have changed hosts a while back.