If a regular person published a confession about their piracy and how they’re storing thousands of pirated works, the copyright goon squad would be knocking down their doors!
I thought that we already went through this a couple of decades ago, and the general understanding is that the copyright cartels only go after people who profit from piracy, because the amount of copyright infringement committed by the “regular person” is vast and unprosecutable. Even Nintendo and Disney are known to have given up on preventing private non-commercial copies of their video games and films from being made; they focus on commercial pirates. The main reason is simple cost versus benefit: cases like Capitol v. Thomas-Rasset and Sony BMG v. Tenenbaum were cartel victories, but unprofitable ones. Following the failure of high-profile fishing expeditions like Arista v. Does (there are several of these from Arista alone!), the RIAA gave up on suing individuals in 2008.
Let me make a slight analogy. Suppose that I wanted to train an LLM on legal corpora. I may legally source documents from RECAP, but not PACER; why not? The general understanding is that this incoherence in copyright law is a fundamental issue, and that copyright law is incommensurate with information theory. Further, the law as written has killed at least one data archivist, in United States v. Swartz, but has not harmed any legal professionals. Does this seem like a defensible status quo?
But I don’t want to live in a future which is built by ripping off people against their will.
Then show support for universal basic income, art funds, independent publishing techniques, open access, copyright reform, prosecution of wage theft, or anything else which undermines copyright cartels. Copyright reform alone could do the job; imagine a world where copyright only lasts seven years and all LLMs are seven years out-of-date, as an example compromise.
If I read a book and then summarize it to someone else, is that a violation of copyright?
Similarly, visual artists learn by looking at other artists’ work, studying it, copying it for practice… Does that mean they violate copyright when they produce original art pieces afterwards?
We need a way to define when an algorithm just “learns” from examples, and then clearly state that its operation afterwards does not violate copyright, like Japan did.
If I read a book and then summarize it to someone else, is that a violation of copyright?
That’s a complicated topic. In the US, it is copyright infringement, but fair use may be an affirmative defence (which is not quite the same thing as it not being a violation of copyright, in ways you’d need a real lawyer to explain). Whether fair use is a valid affirmative defence depends in part on the amount that is quoted. There’s a lot of case law for making this determination, as well as some rules of thumb (quantity of quotation, context, and so on).
Outside the US (and, because the US legal system is a total mess, in some US states) there is also a notion of moral rights in copyright law. You might see something in the inside cover of a book along the lines of ‘the right of the author to be associated with this work…’ that asserts explicitly that the moral rights are upheld. If you quote without attribution, then you may be violating the author’s moral rights. Very few ML systems can tell whether they’re quoting something in their input data set.
Similarly, visual artists learn by looking at other artists’ work, studying it, copying it for practice… Does that mean they violate copyright when they produce original art pieces afterwards?
Again, there’s a lot of case law here too. It often boils down to whether the new work is sufficiently creative to count as an independent work. For a machine-learning system, the user provides a prompt but most of the output is derived from the inputs. As such, I would expect a judge to rule that it is a derived work of the training set and not an independent copyrightable work (unless the prompt is large enough to qualify as a creative work - again, this is somewhat subjective and there’s a load of fun case law here with a few surprising examples).
We need a way to define when an algorithm just “learns” from examples, and then clearly state that its operation afterwards does not violate copyright, like Japan did.
That looks like an absolutely terrible idea. How many examples does it need? In the degenerate case, if I train an ML system on one input, it will reproduce that input. Do I now have a machine for laundering copyrighted works in Japan? Any neural network can act as a lossy compression engine for storing and reproducing its inputs. If I take a set of copyrighted images and train an ML system to produce any of them on demand with roughly the same loss of quality as an 80%-quality JPEG, can I now strip copyright from them?
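To make the lossy-compression point concrete, here is a minimal sketch (all data and names are made up for illustration): a single weight matrix trained by gradient descent to emit any one of its training “images” when given a one-hot prompt. Nothing about it generalises; the trained model is simply an archive of its inputs, retrievable on demand.

```python
import numpy as np

# Hypothetical demo: a "model" that memorises its training set.
# The "images" are random 64-dim vectors standing in for copyrighted works;
# the prompts are one-hot "which work do you want?" queries.
rng = np.random.default_rng(0)
n_examples, dim = 8, 64
images = rng.uniform(0.0, 1.0, size=(n_examples, dim))
prompts = np.eye(n_examples)

# The entire network is one weight matrix, trained with plain gradient
# descent on mean-squared reconstruction error.
W = rng.normal(scale=0.01, size=(n_examples, dim))
lr = 0.5
for _ in range(500):
    out = prompts @ W                                 # forward pass
    grad = prompts.T @ (out - images) / n_examples    # MSE gradient
    W -= lr * grad

# After training, prompting with example i reproduces image i:
# the weights have converged to (a lossy copy of) the training data.
reconstructions = prompts @ W
err = np.max(np.abs(reconstructions - images))
print(f"max reconstruction error: {err:.2e}")
```

A real diffusion model or LLM is vastly larger and trained on far more data than it can store exactly, so memorisation is partial rather than total; but the degenerate case above is why “it only learns” is not, by itself, a defence.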
It will be darkly humorous to me if the much-ballyhooed AI apocalypse is derailed by a bunch of greedy, rent-seeking authors.
This is one case of many similar ones. The SFLC’s case against Microsoft, over training on GitHub data without providing accurate attribution of derived works, will cite it as precedent. There’s a stock photo company that noticed that DALL-E seems to be trained on their photos and captions, to the degree that it even reproduces their watermarks.
It seems a bit unfair to characterise this as ‘greedy, rent-seeking authors’.
While I’m not a fan of the push to put LLMs in everything as a miracle cure-all, it seems that US copyright law does not provide any “reach-through” rights[1] that would allow injunctions or recovery of damages for the resulting model, even if the creators of the model infringed copyright by producing it, provided the use of the work to produce the model was deemed transformative. I assume at most you could pursue action against the creators of the model for direct violations, but it doesn’t seem likely you could recover much in the way of damages, and as mentioned, it presumably wouldn’t allow you an injunction against the distribution of the LLM.
Sounds like an NFT enthusiast getting angry at right-clickers.
I wonder how Microsoft would respond if someone trained a model on the leaked Windows source from a few years ago?
Terence Eden has a broken moral compass. SAD!