1. 30
    1. 48

      Scraping copyrighted content to then spend the electricity budget of Panama to regurgitate it back to you via LLMs: legal

      Carefully curating content such that only one person can borrow it at a time, as libraries have been doing for centuries: illegal

      If a techbro can’t make money off the commons, it’s illegal, is what I’m learning.

      1. 3

        Intentionally distributing full verbatim copies of copyrighted material when you do not have permission from the copyright owner, illegal.

        Transforming vast amounts of copyrighted material into an entirely different new form of technology, (probably) legal. There is a reason that “transformativeness” is one of the fair use tests in US copyright law.

          1. 2

            It demonstrates that you can prompt the LLM to reproduce (some) texts. Whether that is copyright infringement is debatable. I can ask someone to recite The Hollow Men from memory, but that doesn’t make their memory a violation of copyright.

            The question of transformativeness is about whether the LLM itself is sufficiently different from the copyrighted texts in its inputs to be considered a work of fundamentally distinct purpose and value. That’s not the only question at issue, but it’s an important one that clearly comes out in favor of the LLM creators. Google won a major lawsuit on this issue regarding Google Books even though it was a much less transformative use than LLMs are.

            1. 8

              I can ask someone to recite The Hollow Men from memory, but that doesn’t make their memory a violation of copyright.

              Their memory isn’t a violation of copyright, but their reciting The Hollow Men may be (seriously though, 1925, isn’t that thing public domain by now!?)

              I imagine this varies with jurisdiction but where I’m from requesting permission from the publisher or the author is a standard requirement in a case like this. You’re free to learn poetry or music by heart but if you’re going to reproduce it verbatim in public, for reasons other than satire, commentary or academic study, you’ve got to ask for permission, no matter how transformative your brain is :-).

              1. 1

                Yep. The fact that LLMs can reproduce copyrighted text is the strongest argument against them. But the question is who is responsible for the reproduction, the LLM creator or the LLM user asking for a copy of The Hollow Men. :)

                1. 2

                  Context: people go to jail if they allow others to use their servers and those others use them for sharing warez, even if the server owner cannot see what has been shared because the data is e2e encrypted. Or teenagers went to jail because police found CDs with copies of MS Windows, Photoshop, AutoCAD, or some films and music. The judge just multiplied the list price by the number of copies found and declared that the amount of damages.

                  I do not see much reason why a (not much) transformed (rather, anonymized) LLM should be more legal than a verbatim copy. LLMs are even worse, because they cannibalize the content and give no credit to authors, while verbatim copies, even warez, retain the name of the original author and the publisher. People often buy paper books, optical discs, etc. of works that they first found in warez and liked. But if you get a response generated by an LLM, you do not even know which original work you should buy or which author to be grateful to. This is a purely parasitic business model.

                  1. 1

                    That is probably the kind of thing that will have to be settled by a new ruling, and examining existing precedents by analogy is really treacherous. E.g. even if they superficially look like they’re the same thing, reciting poetry in general and reciting poetry from a dramatic work are treated very differently by copyright law in many Western countries. Similarly, an artist reciting poetry and distributing a recording of them reciting poetry are treated differently.

                    With the obvious jurisdiction caveat, where I’m from, assuming that storing copyrighted material in the form of LLM weights would qualify as a breach of copyright law, the debate would revolve around three major issues:

                    1. Who (as in which person with agency) included copyrighted material in their work?
                    2. Who is distributing the work that includes copyrighted material?
                    3. Did that person claim to own the right to distribute that work, even though they didn’t, and did they make enough information available to the user who asked for a copy so that the user can verify whether it’s being legally distributed?

                    If we’re talking about the simple case (user asks the LLM to write him a poem, LLM replies with The Hollow Men), that would mostly point at the LLM creator.

                    1. 1

                      You may be right about the jurisdiction you live in. But on balance, US precedent points the other way.

                      1. 1

                        How so? I was under the impression that US legislation also puts the onus for copyright compliance on the party that’s distributing the copyrighted material.

                        1. 1

                          A general purpose LLM doesn’t behave like an archive of copyrighted material. It behaves like a general purpose language manipulation machine.

                          It would thus be analyzed as a new type of work with the possibility of reproducing copyrighted material, not as itself a reproduction of copyrighted material.

                          1. 1

                            Right, but that’s why I qualified it with this:

                            assuming that storing copyrighted material in the form of LLM weights would qualify as a breach of copyright law

                            “Behaves like a general purpose language manipulation machine” is one way to look at it. Another is “behaves like an archive of copyrighted works scrambled with a very long key”.

                            I’m not describing the latter point specifically because I agree with it, but because it’s the only one where direct precedents exist. The former is definitely a whole other story, indeed.

            2. 7

              If I compress a film it’s still the film. If a model can reproduce ingested training data to… I dunno, whatever degree makes my argument work, then is the thing not in there?

              1. 1

                Whether “the thing is in there” doesn’t answer the legal questions at stake in this case. See my comment here for more context.

              2. 4

                How does training an AI model differ from e.g. compilation of a source code to an executable binary or converting one image format to another?

                1. 1

                  Taking a bunch of novels and turning them into an LLM produces something that is not a novel. Add RLHF on top and you’ve got something even less like a novel.

                  The novel is not a specification for program behavior. It is used as input into a statistical model whose useful properties come from the fact that it is not a novel.

                  1. 4

                    It is a derived work that would not exist without the original one. Whether it „is a novel“ or not, is not much important, however these „models“ actually sometimes vomit parts of the original works. If you do this with Windows source code (it is available on the internet) and copy a method or a block of code from their sources to your program, corporations like Microsoft will send an army of lawyers after you and sue you to death.

                    AI is not a human being, it is the property of someone, it serves its owner, it is a tool like a compiler, data converter or any other software. Analogies like „I learned something from a book and then use that knowledge, so AI can also learn something from a book and use it“ does not apply.

                    1. 1

                      Whether it „is a novel“ or not, is not much important,

                      Not important to whom? To the US legal system, whether the LLM is the same type of thing as its inputs is an important part of the legal analysis.

                      however these „models“ actually sometimes vomit parts of the original works.

                      And that is the strongest case against the LLM companies. But generally speaking, a user would have to knowingly prompt the LLM in a certain way to get more than snippets of copyrighted material. So there is a decent argument to be made that any copyright infringement is by the user requesting a copy of Harry Potter, not by the company that created the LLM.

                      Analogies like „I learned something from a book and then use that knowledge, so AI can also learn something from a book and use it“ does not apply.

                      Analogy is one of the primary ways that existing law gets applied to novel situations. Analogy may not matter much in software, but it matters a great deal in law.

                      To be clear, I’m not arguing about whether making LLMs from copyrighted material without author permission should be legal. I’m just pointing out that under current US legal precedent it probably is.

                      The conspiratorial tone of the top level comment in this thread really rubs me the wrong way, because the likely legality could already be correctly evaluated without regard to who the actors are in each case.

            3. 13

              I have what the DRM-inclined would call a radical view on copyright, so this news is personally rather depressing, but ultimately not at all unexpected.

              1. 8

                Paywall protected. Can’t read.

                1. 8

                  Heckin’ ironic, isn’t it?

                  1. 4

                    Apologies for the paywall, I have an extension which bypassed it automatically. Thanks for posting an unpaywalled version @adamshaylor.

                      1. 5

                        Relevant content is typically taken down if it’s paywalled e.g. by the NY Times. The existence of a workaround, while helpful in a practical sense, doesn’t seem like a good test for whether an article is topical.

                        1. 2

                          what?

                          1. 3

                            Like court rulings IRL, I look to the precedent set by moderator actions here on lobste.rs as a kind of case law that augments our guidelines and helps me determine whether articles I would like to share are topical. At the time I wrote the above comment, I thought Wired had a blanket paywall. (It turns out it’s an article limit, which, although a kind of “soft” paywall, seems to put it in a bit more of a gray area.) Given that otherwise relevant content from the New York Times is consistently removed because of its paywall, I thought it would make sense to apply that precedent consistently, without preference to Wired or any other publication, in spite of the existence of helpful workarounds. Now I’m less sure.

                            1. 1

                              obviously paywalls have nothing to do with whether the content is topical or relevant. I think the reason for disallowing NYT content is that it’s inaccessible, so if some other content is more accessible it makes sense that it would be treated differently, no?

                      2. 2

                        Agreed, not the ideal source. I posted another version of the same story here. Perhaps the two posts can be merged with the non-paywalled Verge story favored over the paywalled Wired one.

                      3. 7

                        (IANAL)

                        If you’re a proponent of Controlled Digital Lending (https://controlleddigitallending.org) by libraries, this previous court ruling in the case is worth reading:

                        Memorandum & Opinion – #188 in Hachette Book Group, Inc. v. Internet Archive (S.D.N.Y., 1:20-cv-04160) –

                        https://www.courtlistener.com/docket/17211300/188/hachette-book-group-inc-v-internet-archive/

                        I think the Internet Archive didn’t help the cause when they went beyond a strict 1:1 correspondence. Perhaps they were trying to provoke the suit tho

                        According to other laws cited in the opinion it doesn’t sound like even a 1:1 ratio would have mattered. But it does appear that going way beyond that ratio is what prompted the suit back in 2020. Again, perhaps intended.

                        I’d like to see CDL become the law of the land but the opinion cites a lot of fair use case law that makes it seem like CDL via lawsuits isn’t gonna get it done. I think Congress would have to pass legislation, which seems unlikely

                        See also https://controlleddigitallending.org/ which states

                        “Through CDL, libraries use technical controls to ensure a consistent “owned-to-loaned” ratio, meaning the library circulates the exact number of copies of a specific title it owns, regardless of format, putting controls in place to prevent users from redistributing or copying the digitized version”

                        The Internet Archive did not appear to have such controls in place with partner libraries, nor did they maintain the 1:1 ratio for a while in 2020, so it’s hard to see how what they were doing was really “controlled digital lending” according to the definition.

                        I’m not surprised to see they lost the appeal, and I honestly don’t know why they thought this was a good idea strategically

                        1. 5

                          There are links to a great online library here: https://en.wikipedia.org/wiki/Library_Genesis

                          1. 5

                            This is arguably a duplicate of another post, but it’s behind a paywall. Verge articles get flagged as off-topic more than most domains, but this seems like a decent summary of both the context and the judgement as well as an embed of the judgement itself.

                            1. 2

                              Happy for my post to be merged into this one - turns out the downside of having your browser automatically bypass paywalls is not realising that stuff you post is paywalled *facepalm*

                              1. 1

                                Yarp. I’ve started mainly sharing archive.today links specifically because of that. One of my relatives asked me if I was trying to be a jerk on our group text thread because she kept getting caught in paywalls I didn’t know were there for a while…

                              2. 1

                                Sorry, which is behind a paywall? I can’t repro one for Verge or for Wired in a stock browser.

                                1. 3

                                  I don’t know about The Verge, but Wired has a limited number of free articles.

                                  1. 2

                                    That’s it. I assumed Wired had a blanket paywall, but indeed it’s actually an article limit.

                                    @pushcx, what do you recommend as the most topical source for stories like this? Is the official court ruling preferable, or is a non-paywalled mainstream news source suitable?

                                    1. 5

                                      Ah, thanks to you both. After some tinkering I was able to repro it. It seems to be based on the number of stories you view but doesn’t count or activate on the affiliate advertising “reviews” that make up a substantial part of Wired’s homepage. Having grown up with the magazine I feel pretty bad about it, but I’ve banned the domain because paywalled articles prompt really poor discussion as people riff on the headline.

                                      For legal topics a court ruling is good, but our discussions are kind of prone to playing lawyer in an embarrassing way. We don’t seem to remember that legal writing has its own technical terms and interpret their language poorly, or make naive assumptions of legal construction and procedure. (Spending a couple years on a product with significant legal considerations made this into a small pet peeve for me.) Ideally we’d link to a lawyer giving a legal analysis intended for a lay audience; those tend to start much better-informed and less-inflamed discussions. A legal analysis intended for other lawyers also tends to be obvious enough about jargon that we approach the topic with significantly less overconfidence. And continuing down the ranking, a primary source still produces better discussions than the usually-clickbaity “judge SMASHES the copyright cartel”-style news sources that imply every minor procedural filing is dispositive or precedent-setting.

                                      1. 3

                                        Often, when I want to discuss a thing Wired or NYT would cover with my nontechnical friends, I’ll share an archive link like this:

                                        https://archive.is/9irhZ

                                        because the soft paywall makes it so hard to be sure everyone can read otherwise. In the event a similarly soft-paywalled article is on-topic here, is that kind of link OK to share here?

                                        It probably almost never matters for me here, as I’m nearly always sharing and discussing things from sources that just aren’t subjected to this kind of shitty soft paywall treatment. But Wired frequently used to be and is still occasionally worthwhile (I’d say I ‘grew up with’ it too, in some sense), and I do often use archive.today as a courtesy to people with whom I share Wired links.

                                        Is it OK to use those links here, without making you feel like I’m evading a ban?

                                        1. 4

                                          I do the same thing. I don’t want to say it’s OK for three reasons:

                                          1. There’s usually a primary source or a more-focused writeup intended for a technical audience available.
                                          2. As you note, it blurs the line on ban evasion in a way that’d send exactly the wrong message to anyone who sees a story posted that way. There’s currently a lot of value in the new user restrictions and domain banning: having a couple unambiguous lines that are not terribly hard to circumvent means that when people do so it’s strong evidence of bad intent that would otherwise have to get addressed over months of bad behavior, excuses, and judgement calls before a clear picture formed.
                                          3. These outlets made considered choices to have paywalls. While I doubt they mind too badly that individuals privately share stories using these links, putting one in front of this site’s ~250k daily visitors is a whole different scale and use. I’ve respected authors’ wishes to not have their links appear on the site and it seems like a very similar respect to not end run these sites’ paywalls.

                                          1. 1

                                            Thanks for the clarity. To be sure, (1) is normally the thing that stops me.

                                            These outlets made considered choices to have paywalls. While I doubt they mind too badly that individuals privately share stories using these links, putting one in front of this site’s ~250k daily visitors is a whole different scale and use. I’ve respected authors’ wishes to not have their links appear on the site and it seems like a very similar respect to not end run these sites’ paywalls.

                                            That’s a good point. Most of the sites with these oddly inconsistent soft paywalls have ways to share with an audience like this, because it benefits them, and I tend to choose the archive route just because I’m too lazy to figure out where they moved the free share link today. But I should probably just ignore them or not be lazy about finding their preferred sharing method instead of finding my own when it comes to things like lobste.rs or fedi posts.

                                            1. 2

                                              Some hassles with ‘free share links’ are: they almost always have caps on usage that our traffic would immediately hit, they expire after a day or two (very unpleasant for our non-daily readers), and often they’re used for marketing attribution, which I try to break as part of a long-term strategy to reduce content marketing by making it hard to include our traffic in individual and team metrics (for example, Medium).
