1. 29
  1. 24

    The idea that models looking at copyrighted images is unethical does not just fly in the face of current copyright law, it is downright dangerous and unsustainable.

    Copyright is not absolute for a reason. We use copyrighted materials in remixed form all the time. Society just wouldn’t be possible without this. Only 30 years after the advent of copyright an exemption was carved out, the fair abridgment doctrine, all the way back in 1740. This led to the fair use doctrine that we have in the US today: a use is ok if it transforms the original into something new. It’s hard to see what could possibly be more transformative than a model that generates new images that have never existed before.

    If a court were to decide that models can’t look at copyrighted material, it would likely be the end of progress in AI for the foreseeable future. The burden of having to document every single image and every single piece of text we use to train models would make large datasets impossible to collect. We just couldn’t do ML/AI research anymore. It would also end indexes, so you couldn’t have Google search anymore.

    But it’s far worse than that. Declaring that models can’t look at copyrighted material would end many nascent AI applications, ones you wouldn’t expect would be impacted. Collecting data while models run and tuning them is critical. Copyrighted images easily sneak into such datasets. For example, the freedom of panorama is an issue in the EU right now. Some countries allow it, some don’t. There are literally images you can take in public that you aren’t allowed to use for commercial purposes; the person who made the artwork owns the copyright to them. How can you possibly create datasets and models of cities under these conditions? If such laws were enforced against models and datasets it would bar many robotics, autonomous driving, drone and other AI applications in the EU.

    If the idea that it’s unethical for models to learn from copyrighted images takes hold it will essentially halt human progress. Just like our predecessors discovered that copyright can’t be absolute in the 1700s, so we have to codify the notion that models are a transformative use.

    1. 17

      The core controversy here, which is best shown by lots of these models outright outputting the watermarks, is that the output result is so clearly derived from the input. It feels akin to sampling in music, where people do end up just … getting permission from the owners. If you look at these transformations as “I layered a bunch of images that matched your search queries”, the “strict” interpretation of everything clearly lands this in the territory of laundering authorship.

      Remixed materials are not fair game as-is, and fair use isn’t even a concept in every copyright regime! The right to panorama is another example where different parts of the world have come to different conclusions.

      I think there is a lot of room to maneuver here, and there is an interpretation of copyright that would basically only cause issues for generative art, while leaving loads of other ML applications completely fine. I’m not a copyright maximalist, but models that can seemingly spit out their input material are clearly a non-theoretical problem at this point.

      1. 6

        I didn’t think any of the diffusion image models had been caught spitting out exact copies of input material.

        I would expect them to be able to produce watermarks because they’ve been trained on thousands of images that incorporate watermarks - so they have plenty of examples they can use to generate a realistic copy of one.

        My mental model of how they work is that they start from random noise and work backwards - the models don’t incorporate any pixel patterns that could be pasted back into an image, they’re all floating point weights in a giant matrix.

        Stable Diffusion for example managed to compress data from hundreds of millions of images into just 9GB of model - there’s nothing left of the original images. They just wouldn’t fit!
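
        To make that mental model concrete, here’s a toy sketch of the kind of reverse-diffusion loop I mean. The tiny untrained network, the noise schedule and the image size are all placeholder assumptions rather than Stable Diffusion’s actual internals - the point is just that sampling starts from pure noise and repeatedly subtracts predicted noise, with no step that pastes stored pixels back in.

        ```python
        # Toy DDPM-style reverse loop: start from Gaussian noise and work backwards.
        # The conv layer is an untrained stand-in for a real denoiser.
        import torch
        import torch.nn as nn

        steps = 50
        betas = torch.linspace(1e-4, 0.02, steps)              # assumed noise schedule
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        denoiser = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # placeholder model

        x = torch.randn(1, 3, 64, 64)                          # pure noise, copied from nowhere
        for t in reversed(range(steps)):
            with torch.no_grad():
                predicted_noise = denoiser(x)                  # real models also see t and a text prompt
            coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
            x = (x - coef * predicted_noise) / torch.sqrt(alphas[t])
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)

        image = x.clamp(-1, 1)                                 # whatever the weights produce
        ```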

        1. 8

          It’s more the fact that you type “Joe Biden” into txt2img and you clearly get composites of 20 different pictures of him (down to one option I got that was a split between his face and Trump’s, clearly a header picture from a news website). You can also of course just type any major Pokemon name and get those.

          The giant matrix is what it is. But the training sessions really are “here’s some description and an image, try to commit that to memory in a way that makes sense”. This is perhaps the same as what a human does! But humans also get in trouble for copying other people’s work too closely.

          I don’t have a policy answer, but I’m not the one trying to build out the magical art machine that happens to know what all the Pokemon look like. I barely have a problem statement, really.

          1. 4

            Where’d you get 9GB from? The most recent model file is only 4.27GB.

            1. 1

              Huh, even smaller than I thought!

        2. 10

          What’s ethical and what’s legal are different things though. I don’t feel qualified to make declarations about legality, but my current intuition is that the legal side of things won’t be a problem for these AI models.

          Ethics is another thing: that’s not a black-or-white answer. That’s why I’m so keen on the analogy to veganism: plenty of people think it is unethical to consume animal products, but it remains legal to do so and plenty of people have made the ethical decision that they are OK with it.

          1. 14

            I think it’s more interesting that communities which have traditionally been copyright-minimalist are suddenly turning copyright-maximalist over “AI” tools. For example, look at the reactions to GitHub’s Copilot.

            1. 12

              I’m fairly certain that those communities have traditionally been copyright-minimalist because that makes it easier for individuals to have equal access. But “AI” tools are essentially impossible for individuals to make because of the immense computational costs, and as such are only accessible to large entities. As such, those communities are now looking into using copyright to keep the playing field level between corporations and individuals. And in this case, level doesn’t mean that everyone can train a model, because that would benefit the corporations alone.

              1. 11

                This is one of the things I find so interesting about Stable Diffusion: they are giving away the whole model! They seem to be living the original values hinted at by OpenAI far more closely than OpenAI themselves.

                All of these models support transfer learning too, which means that if you want a custom model you can often take an existing large model and train it with just a few hundred more items (and in a way that is much more affordable in terms of time and hardware costs).
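
                As a rough illustration of that transfer-learning pattern (the model choice, the made-up data and the hyperparameters below are placeholder assumptions, not any particular project’s recipe): freeze a pretrained backbone and fine-tune a small new head on a few hundred of your own examples.

                ```python
                # Sketch of fine-tuning: freeze a pretrained model, train only a new head.
                import torch
                import torch.nn as nn
                from torch.utils.data import DataLoader, TensorDataset
                from torchvision import models

                backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
                for param in backbone.parameters():
                    param.requires_grad = False                       # keep the expensive part fixed
                backbone.fc = nn.Linear(backbone.fc.in_features, 5)   # new head for 5 custom classes

                # Stand-in for "a few hundred items" of your own labelled data.
                images = torch.randn(300, 3, 224, 224)
                labels = torch.randint(0, 5, (300,))
                loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

                optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
                loss_fn = nn.CrossEntropyLoss()

                for epoch in range(3):                                # a few passes is often plenty
                    for x, y in loader:
                        optimizer.zero_grad()
                        loss = loss_fn(backbone(x), y)
                        loss.backward()
                        optimizer.step()
                ```

                Because only the small head is being trained, the time and hardware costs stay modest compared to training the original model.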

                1. 5

                  I’ve said similar in another comment, but having a model does not help me if I want to try and explore building a model with a different architecture on my own. In similar fields, modifications mostly cost mental labor, but with AI I am limited in what kinds of modifications I can make without paying the majority of the cost in compute.

                2. 4

                  Free Software/open-source communities have the resources to train an “AI” on a ton of code if they want to. Community projects have trained some pretty impressive models in other fields – see Leela in the chess world for an example.

                  And honestly, if you had a time machine and could go back to frustrated-at-the-jamming-printer Richard Stallman and tell him “in the future, a magic box will exist that launders any code, under even the most restrictive license, into something you can inspect and hack on and redistribute at will, and the only ‘downside’ is everyone else can do the same to code you write”, do you really think he would have had a problem with that?

                  1. 4

                    community != individual. If I wanted to try out a modification of Leela on my own, I would need to spend a hefty amount of money to do that. Community requires a consensus. Trying to go against it in, e.g., programming doesn’t have such high immediate costs as trying to do so with AI.

                    1. 1

                      So, do you have a moral problem with Leela? Or with other community-compute-supported projects like SETI@Home?

                      Because the position I’m taking away from your comments is that you do, or at least that as a logical consequence of your stance you should, have a problem with them. And if so that’s where we have to disagree – there have always been projects that a random individual can’t replicate in full due to lack of infrastructure or other resources. Many of them are Free Software projects. I can’t see any way to have a consistent position that takes the non-replicability as grounds to condemn them.

                      1. 2

                        SETI@Home is very different from Leela. SETI@Home actually does distributed computing. Leela does distributed training data collection.

                        Actual training with Leela still happens on a centralized server, though one that is sponsored by the community. And that server has limited resources. If you can only test out 10 changes in a month, chances are the community will decide that only changes from “trusted community members” are worth testing. The expensive training process is required to test the changes out, and it is effectively gated behind a social game of getting into the community and “proving your worth”.

                        Meanwhile, for SETI@Home, or other similar distributed computing projects, I can fairly easily make a change, compile the code, and test it out on a small number of problems. It does not require me to go through a social game for my changes to be considered for larger scale testing, because I can give some preliminary results by myself, without spending a ton of compute to produce them.

                        Essentially, my problem with AI community projects is that testing out and contributing your changes becomes a game of either having the social connections or having the compute power. Machine learning raises the cost of getting your software ready for testing to such high levels that “compiles” become a limited resource. And with limited resources, those who control them tend to limit their distribution to those they think will use them well, even if that seems like the wrong approach.

                    2. 3

                      Are you also telling young Stallman that the magic box is owned by a company that for a while was a huge enemy of open source, that the code to the magic box is completely proprietary, that you have to pay to use it, and that the magic is derived from the free labor of millions of other hackers?

                  2. 7

                    Over the past few years I’ve learned not to be overly surprised by philosophical incongruities and internal contradictions exhibited by our colleagues. In a decade or two I may even have the wisdom to no longer be annoyed. Start your journey today if you haven’t already.

                    FWIW, I think that the reason for the about-face is, as usual, economic. “They’ll never replace engineers with AI!” rings a little hollower every year.

                    1. 3

                      I also suspect this is why the about-face is so acute with the advent of AI art. I think many knowledge workers never thought AI could replace their output/jobs.

                      1. 2

                        Tbh I think the main reason this hasn’t happened is the lack of a corpus of project descriptions and outputs. I suspect the first application will come from mega consulting firms which do have the corpus of inputs and outputs.

                        Who will benefit and lose out is too hard to predict. It’s possible that these advances will increase the productivity of software engineers and therefore increase demand; or do so for some subset at the cost of demand for others.

                        My hunch is that this will benefit people in relatively low cost places by allowing them to really pump out the kind of thing that now is done by two underpaid early career engineers, and similarly to get a jump on eg Shopify themes. It’s also going to drive revenue for the kind of saas offerings that can make use of that combined human + ai configuration. CRM and e-commerce seem like obvious arenas; and fairly static websites, which can also have copy and art generated as well.

                      2. 3

                        I suspect that part of the problem there is that it’s pretty easy to get Copilot to spit out full phrases (including copyright statements and suchlike) from existing source code - because the potential options for sequences of text tokens are so much slimmer.

                        An image generator that starts with Gaussian noise is vanishingly unlikely to come up with anything that looks like it has fragments directly copied in from an existing artwork, even while being able to imitate individual styles to an incredible degree.

                        Also: the nature of open source requires that people engage far more deeply with the specific details of their chosen licenses - and so are more likely to call out anything that seems not to follow the letter or spirit of them.

                        1. 3

                          It’s not really so surprising that people who care about preserving the rights of users aren’t equally interested in preserving the rights of programs.

                          1. 3

                            Please look at the context before making a dismissive comment like this. Specifically, please look at my other reply to someone where I pose the hypothetical of using a time machine to tell past Richard Stallman about the future development of a magic box to launder away restrictive licensing from code, and what he would think about that.

                            And a huge amount of objection to Copilot has been that it can “launder away” the GPL/AGPL. Which I don’t understand: if you have a magic license-laundering box, you no longer need copyleft license enforcement. If someone took your code and ran it through the magic license-laundering box to gain the freedom to incorporate it into non-Free software, then you can run their code through the magic license-laundering box and get back a copy which you have the freedom to inspect/modify/distribute/etc.

                            1. 4

                              I’d appreciate it if you’d elaborate on your second paragraph, because it does not align with my understanding of the reality of who has access and in which ways to that magic box. That “someone” that you mention administers the box and has control over its inputs, but they are in no way obligated to then submit their own code to it as an input. To be transparent, I have some real ethical hang-ups re Copilot, but I don’t think those are relevant to the fact that you’re imputing a power symmetry between GitHub/Microsoft and individual users of that platform that does not exist, as far as I can tell.

                              1. 1

                                If someone truly believes that Copilot is a magic license-laundering box, then the only consistent position is for them to believe they could train their own model on whatever corpus of code they feel like throwing at it and get the same result. As I’ve already pointed out in other replies, open-source communities have trained some impressive models in other fields, so I don’t see why a “community Copilot” keeps being treated as an impossibility.

                                Nor do I see why it’s necessary for every individual to have the resources (computing power, corpus, etc.) to train their own. For example: I don’t have the resources, in multiple ways, to build a competitor to the Linux kernel from scratch, but nobody seems to be demanding that Linus stop using his superior resources to make and release Linux in order to be fair to potential competitors like me. In fact, precisely the opposite: the whole point of Linux as an open-source success story is that many people banding together had more resources and capability than any one of them alone.

                        2. 4

                          Except that not eating animals has positive effects on the environment. So convincing people about this ethical position is a long-term improvement.

                          The consequences of convincing people that models looking at copyrighted material is unethical are simply catastrophic. Not just for AI/ML, but for medicine, the environment, science and society in general, and for the progress of humankind.

                          It’s exactly the opposite of being vegetarian.

                          1. 4

                            Why is the idea that the producers of copyrighted material be compensated so catastrophic? If the economic benefits are so great, surely a small fraction of the profits can be put towards paying the people who made the model possible.

                            1. 2

                              Why is the idea that the producers of copyrighted material be compensated so catastrophic? If the economic benefits are so great, surely a small fraction of the profits can be put towards paying the people who made the model possible.

                              If I as a scientist release a paper publishing a model, who would I pay? And where does that money come from? Do I pay based on the value of the model when I made it? Zero. Do users of my model pay? Based on what? It’s an endless nightmare. But that’s not even the beginning of the nightmare.

                              We have no registry of copyrights. So it’s impossible to determine who we have to pay. Or what amounts.

                              But it gets so, so much worse. Roomba would need to pay if you left a newspaper on the ground, its camera saw it, and that data was included in its training set. How would Roomba’s software even know that this is copyrighted? What if you have an image on the wall that you don’t have full rights to? Even something as trivial and basic as your Roomba would become such a copyright nightmare as to be impossible. Never mind more advanced things like autonomous cars, etc.

                              The simplest bread-and-butter ML models and applications would be impossible under this regime.

                              That’s why I say, without any hyperbole or exaggeration, the line is stark. Either models get to freely look at copyrighted materials and we have ML/AI/progress or they don’t, and we stop ML/AI/progress.

                              1. 5

                                If I as a scientist release a paper publishing a model, who would I pay?

                                The creators of the data that was used to train it.

                                And where does that money come from?

                                The research funding.

                                We have no registry of copyrights. So it’s impossible to determine who we have to pay. Or what amounts.

                                And yet, YouTube somehow manages.

                                That’s why I say, without any hyperbole or exaggeration, the line is stark. Either models get to freely look at copyrighted materials and we have ML/AI/progress or they don’t, and we stop ML/AI/progress.

                                If ML/AI doesn’t produce enough benefit to support paying people for training data, perhaps ML/AI progress isn’t as valuable as claimed.

                                1. 2

                                  The creators of the data that was used to train it.

                                  As I explained, those people are absolutely impossible to identify. Moreover, I gave you examples of how this is doubly impossible. Not only can you not do it for a fixed dataset, you certainly can’t do it for images that your robot collects on the go. Even if a human looks at every example they cannot determine the copyright status of a random image on your wall at home. This just isn’t possible.

                                  What you are proposing is completely equivalent to saying that there should be no ML research anymore.

                                  And where does that money come from?

                                  The research funding.

                                  There is no money to do this.

                                  Again, this is equivalent to saying there is no more ML research.

                                  We have no registry of copyrights. So it’s impossible to determine who we have to pay. Or what amounts.

                                  And yet, YouTube somehow manages.

                                  Because YouTube has people sign up with their data and identify themselves. And even then, YouTube has copyright strikes, etc. Who can possibly manage this in the AI/ML research community? Who can accept the legal risk of mistakes and lawsuits? No one.

                                  There are many other reasons why we cannot use this kind of approach for ML. Even if we had a registry where you donated your data for use with the model, the resulting models would be hopelessly biased towards the kinds of data people like to contribute. It would make them basically useless for any applications.

                                  If ML/AI doesn’t produce enough benefit to support paying people for training data, perhaps ML/AI progress isn’t as valuable as claimed.

                                  This is a very myopic view of how research works. The vast majority of research is useless and provides no value to anyone. That research must happen, but cannot under this regime. Out of that soup emerges some work that provides more value, which eventually may find applications. The researchers who use the training data and make the progress happen don’t capture the value; it’s the final end users, the people who start companies doing new things, who create the value. They also cannot pay until far down the road even if they wanted to.

                                  Changing the law so that models cannot look at copyrighted material is literally the end of ML and AI research. There is no way around it. The legal liability is immense and impossible to overcome (it’s not even an issue of money, it’s just not possible). And the costs would be so high on the very people who cannot shoulder them as to make progress end.

                                  1. 2

                                    Because YouTube has people sign up with their data and identify themselves. And even then, YouTube has copyright strikes, etc. Who can possibly manage this in the AI/ML research community? Who can accept the legal risk of mistakes and lawsuits? No one.

                                    Risk is balanced against reward. Why do you think there’s no business model to be had around managing copyright for AI and ML? Is the reward so small that nobody would step up to make money off this?

                                    If AI researchers were required to follow copyright, I’d expect Shutterstocks for training data to spring up like crazy, and sell licenses to training data, taking on the copyright management and payment disbursement, both mitigating risk for researchers and allowing the people who produce training data to get paid for their work.

                                    Because YouTube has people sign up with their data and identify themselves. And even then, YouTube has copyright strikes, etc.

                                    Taking down infringing content is exactly the point of managing copyright.

                                    1. 1

                                      Risk is balanced against reward. Why do you think there’s no business model to be had around managing copyright for AI and ML? Is the reward so small that nobody would step up to make money off this?

                                      Because the people who take the risk under this scenario, scientists doing the research, have no money. And the rewards are basically zero.

                                      Datasets are not static. We need to collect new datasets constantly for many different problem domains. It’s not like you collect 1 dataset and call it quits. There are tens of thousands of datasets out there for all sorts of things, and we need far more than we have today.

                                      If AI researchers were required to follow copyright, I’d expect Shutterstocks for training data to spring up like crazy, and sell licenses to training data, taking on the copyright management and payment disbursement, both mitigating risk for researchers and allowing the people who produce training data to get paid for their work.

                                      This would be the end of AI and ML.

                                      Shutterstock doesn’t know what we need in training data. Not all training data is the same. Datasets aren’t designed by random people. They’re designed by scientists who work very hard for many years to understand what kinds of datasets are valuable for what kinds of problems, in which conditions, and for which models. This is not something that works as an assembly-line process. And whatever datasets Shutterstock makes will be hopelessly biased by their collection procedure, rendering them basically worthless.

                                      And set that all aside.

                                      Think of the Roomba scenario. Models are not “trained” once and then run in production until the end of time. Models need to be updated on the fly from new training data gathered while they operate. We could never do that under these conditions.

                                      1. 3

                                        Because the people who take the risk under this scenario, scientists doing the research, have no money.

                                        If nobody gets enough benefit from AI research to pay the people producing the data, perhaps it shouldn’t happen.

                                2. 1

                                  If I as a scientist release a paper publishing a model, who would I pay?

                                  Well, if you’re a scientist publishing a paper you’re probably poor and losing money already, so I think this can easily fall under fair use rules?

                                  1. 1

                                    poor is a bad adjective, I apologize. Not super rich is more accurate.

                                    1. 1

                                      Fair use is totally unrelated to how much you can afford to pay. Even whether you lose money or not has only a limited impact on fair use.

                                      1. 1

                                        Well, maybe it should be. I know that in this conversation we tend to talk about copyright as it is, but it’s not written in stone.

                                3. 1

                                  That’s a good argument - that supporting the idea that AI should not be trained on copyrighted materials is actively harmful.

                              2. 7

                                If the idea that it’s unethical for models to learn from copyrighted images takes hold it will essentially halt human progress.

                                Could you please expound on how human progress will halt if models had to respect copyright? That seems like a fairly hyperbolic statement. For example, for the models and datasets needed to train autonomous driving, is there a reason that the companies behind them can’t pay for their own training data or license it?

                                In addition, I see a difference between “you cannot train an AI on copyrighted works when the intent is to generate another image that may appear similar” and “you can train an AI on copyrighted works for [good purpose X, such as learning the paintings in the Louvre to give an audio description when a blind person walks through].”

                                1. 3

                                  If the idea that it’s unethical for models to learn from copyrighted images takes hold it will essentially halt human progress.

                                  Could you please expound on how human progress will halt if models had to respect copyright? That seems like a fairly hyperbolic statement

                                  It’s not hyperbolic. I mean it very literally. And I speak as an ML researcher.

                                  There is no progress in AI and ML if we say that models cannot look at copyrighted materials. And there is no more progress on industrial applications of ML either.

                                  Virtually all current progress has come from computer vision and natural language processing research. Without large datasets this research could never have happened. These are also the most active and fruitful areas now. Every major advance would have been impossible if we had to live with the condition that models cannot look at copyrighted materials. From the earliest advances of the modern era, the CNNs that rely on ImageNet, to the most recent, the Transformers that rely on massive text and sometimes image corpora from the web.

                                  Neither researchers at universities nor corporations could ever afford to pay for such datasets. But it wouldn’t matter if they could.

                                  In the real world, models are not “trained” and then you’re done. No external dataset fully covers any real application. There are datasets to jump-start you towards an application, but then you need to constantly evaluate and fine-tune the model; often a cascade of models that use the processing of earlier models. Autonomous car companies need to ingest their own data and train on it; no amount of external data will ever help.

                                  Moreover, the risk that your model is in violation without you knowing it because it uses billions of images or texts would be immense. It’s hard to see universities bearing that risk.

                                  To make any progress we would have to wind the clock back to the 90s. Throw away all of modern ML. Go back to the era of datasets with just a few hand-curated examples in them. Very little progress was made in ML at that point.

                                  1. 6

                                    Virtually all current progress has come from computer vision and natural language processing research. Without large datasets this research could never have happened

                                    Some of those datasets explicitly grant rights to this kind of use. For example, a lot of the training data for translation models comes from the transcripts of the EU Parliament, which are translated into all member states’ languages by professional translators. The Spanish government also funded the creation of a fairly large dataset for training translation systems. The need for large data sets does not necessarily imply the need to build those large data sets by harvesting the creative output of others without their consent.

                                    Fair use (or fair dealing) has always been a tricky and subjective concept in copyright law and is constantly reevaluated, but when a computer system can reproduce something exactly, it generally isn’t covered. There’s then a much more complex question of whether it counts as a derived work.

                                    1. 1

                                      Thanks for the clear explanation. As not an ML researcher, I now better understand the line you draw between potentially copyrighted images and advances in fields that would not be obviously related to images.

                                      Autonomous car companies need to ingest their own data and train on it, no amount of external data will ever help.

                                      Isn’t this a good example of a non-copyrighted dataset or a dataset that the owner would hold the copyright to and be able to use without concern? The car manufacturer is using cameras in the car itself and, I assume, telemetry from the drive to better learn what was going on. You might take a picture of something copyrighted, like an Amazon logo, but this would never have a chance of “reproducing” that copyrighted image. The model output is a car that doesn’t drive into the side of an Amazon delivery van, not a car that turns into an Amazon delivery van.

                                      1. 1

                                        Isn’t this a good example of a non-copyrighted dataset or a dataset that the owner would hold the copyright to and be able to use without concern? The car manufacturer is using cameras in the car itself and, I assume, telemetry from the drive to better learn what was going on. You might take a picture of something copyrighted, like an Amazon logo, but this would never have a chance of “reproducing” that copyrighted image. The model output is a car that doesn’t drive into the side of an Amazon delivery van, not a car that turns into an Amazon delivery van.

                                        A model that you train to take images and decide what a car should do next can easily be turned into a model that generates images. All you need to do is take off the last parts that predict what the car should do, and use the intermediate results in an image generation scheme; say, with diffusion models. DALL-E 2’s internals for example were never intended for image generation; they had a model called CLIP and wrapped it in a diffusion model.

                                        Really any model (ok, pretty much any model, there are rare edge cases) could easily be used to produce images. Of course, quality will vary, but there is nothing special about image-producing and image-non-producing models.
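
                                        As a rough sketch of what I mean (the tiny frozen network and the target vector below are illustrative stand-ins, not any real perception stack): freeze the recognition model, strip off its task-specific head, and optimize pixels until the model’s features match some target. That already gives you a crude image generator built out of a non-generative model.

                                        ```python
                                        # Feature inversion: generate an image from a frozen recognition model
                                        # by optimizing pixels so its features match a target.
                                        import torch
                                        import torch.nn as nn

                                        # Stand-in for a perception model with its task head removed.
                                        encoder = nn.Sequential(
                                            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                            nn.Linear(16, 8),
                                        )
                                        for p in encoder.parameters():
                                            p.requires_grad = False

                                        target = torch.randn(1, 8)   # e.g. the embedding of a prompt or reference image
                                        image = torch.randn(1, 3, 64, 64, requires_grad=True)   # start from noise
                                        optimizer = torch.optim.Adam([image], lr=0.05)

                                        for step in range(200):
                                            optimizer.zero_grad()
                                            loss = nn.functional.mse_loss(encoder(image), target)
                                            loss.backward()
                                            optimizer.step()

                                        # `image` is now whatever picture best pushes the frozen encoder
                                        # toward the target features: a crude generator from a non-generative model.
                                        ```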

                                        So even if you own the camera, and you own the car, and you take your own pictures on a public road, if showing a model an image with copyrighted material affects the copyright status of that model, you’ve ruled out autonomous cars, Roombas, etc.

                                        1. 2

                                          OK, so on the way to developing the tech to eliminate the jobs of truck drivers and cabbies, we got tech to eliminate the jobs of illustrators and copywriters. Got it.

                                          edit brainfart

                                      2. 1

                                        It’s not hyperbolic. I mean it very literally. And I speak as an ML researcher.

                                        There is no progress in AI and ML if we say that models cannot look at copyrighted materials. And there is no more progress on industrial applications of ML either.

                                        It is, IMO, hyperbolic at best to insist that human progress depends on progress in AI and ML. Indeed, their effect on humanity so far has been largely negative¹. Immense amounts of human progress could be made simply by making intelligent and conscious use of existing technologies, rather than leaving it up to the god of the market.

                                    2. 1

                                      I don’t disagree with any of the broad points you made, and even the ones I kinda do, it’s a meh disagreement, not worth the finger stress.

                                      However. I see a pretty clear distinction between generative and non-generative models. Maybe it’s because I don’t know shit about AI, but it seems to me that this is a fairly binary distinction, and that the ethical concerns are mainly with generative models.

                                    Quick aside: I guess you could raise the point that if your model to detect whatever, trained on my copyrighted photo of a whatever, is making you millions, maybe I should have a cut. But that is not the context in which moral concerns have been raised, not here, not anywhere else I remember seeing lately.

                                      Back to the point I actually wanted to make: what are the actual, palpable, life saving/improving, use cases of generative models? Cool, I can make stuff for my DnD game, I can prototype games, I can do weird art stuff, yeah, nice, none of that is curing cancer. I value art, but if the full extent of the value this kind of model can bring is artistic, maybe it’s not worth throwing copyright away for the sake of it? And I don’t even like copyright that much.

                                      1. 1

                                        However. I see a pretty clear distinction between generative and non-generative models. Maybe it’s because I don’t know shit about AI, but it seems to me that this is a fairly binary distinction, and that the ethical concerns are mainly with generative models.

                                      There is no such distinction. I can use (almost) any model to generate images. Of course quality will vary.

                                        I understand why you feel like this distinction exists, because we often talk like it does, since we tune models for specific tasks. So people will say, X is used for generation and Y for recognition. But.. that’s shorthand. There is nothing special about a generative model these days (there used to be years ago).

                                        1. 2

                                          Hmm, interesting, I stand corrected.

                                    3. 18

                                      There will be many people who will decide that the AI models trained on copyrighted images are incompatible with their values.

                                    Copyright law is incompatible with my values. It is ironic, on the same platform where I heard the first complaints about the ethics of Copilot, to now be hearing the exact opposite ethical position. Having an AI trained on public domain/commons/libre work allows big corporations to use all that work without respecting licensing conditions, including claiming ownership of the resulting work. Having an AI trained on copyrighted works allows everyone with access to that AI to use the derivative works without having to worry about license restrictions. What the article is proposing is not veganism, it is cannibalism.

                                    When I discuss the ethics of copyright law with people it seems the only real argument in favour of the current legal framework is the conflation of the means to survive with intellectual property restrictions. ‘People need to eat and pay rent and for some people copyright is the only opportunity they have to earn money which they need to survive’. I wonder why a society with enough food lets people go hungry in the first place. To me it is as absurd as saying ‘my cat would starve to death if I didn’t let it hunt endangered birds’.

                                      1. 6

                                        I don’t think that anyone is objecting to ML training of public domain works (at least not that I’ve seen). Most of the objections are for using works where the copyright owner cannot realistically enforce their copyright, both due to this being a case without precedent, and due to them usually being a lot smaller than the organization that trained the model. I don’t see these models being trained on works of large, copyright driven corporations (e.g. Disney), usually they are rather trained on works of random individuals that put their works on the internet to be freely accessible, but not necessarily freely used. If it was really fine to use copyrighted works to train your model, why is nobody taking that opportunity to use copyrighted works of overprotective corporations?

                                        Also, the fact that the works used sometimes are under licenses that commonly are described as permissive muddies the waters, especially since often the trivial requirements of those permissive licenses aren’t complied with. Permissive isn’t public domain, but some people tend to mix them up, whether deliberately or not.

                                        1. 2

                                          I’ve been digging through the Stable Diffusion training data and there is a vast amount of Disney stuff in there - they made no attempt to filter out images sourced from notoriously litigious copyright holders at all.

                                          1. 1

                                          There is a very important distinction between paintings of Disney characters, and actual official Disney art. From my quick search on https://rom1504.github.io/clip-retrieval none of them are from official sources, based simply on reverse-image searching them.

                                            1. 2

                                              I just searched for “thanos” there and got what looked to me like screencaps and posters from movies featuring that character.

                                      2. 11
                                        1. A big corporation crawled the Internet and copied everything without observing licenses
                                        2. Compiled all the material into a big database
                                        3. Lets users type text prompts to query the database
                                      4. Presents an output that is assembled from fragments of the copyrighted works, while claiming ownership of it and imposing Terms of Service on it

                                        Am I describing an AI image generator or a search engine?

                                        1. 9

                                          As a vegan, the “I can’t resist imagining what I could do with that capability” line is especially amusing since it indicates a willingness to pursue personal entertainment even if it means doing something you perceive as harmful, just like eating meat haha

                                          The potential harm caused by these kinds of AI is definitely worth research and discussion, but if the discussion settles on “yes it’s harmful”, then– just like being a regular vegan– being an AI vegan ought to be the logical answer. No need for hand-wringing over waiting for a “good enough” alternative when we can just not use it.

                                          1. 1

                                            I don’t expect that this will come down to anything as simple as “yes, it’s harmful” though.

                                            The positive impacts of this technology are only just starting to be explored. What does human society look like if everyone gets access to their own concept artist? What if everyone can now afford to commission custom art?

                                            Ethics is easy if the answer is “this is purely bad” - it’s a lot harder when there are positive and negative effects to be compared to each other.

                                            1. 1

                                              Adding “Richard Stallman” to the beginning of prompts can get you some very interesting and cursed results. For example: ukiyo-e Richard Stallman drinking Starbucks, Richard Stallman acid trip in a forest, or even impressionist painting of Richard Stallman at Starbucks.

                                            2. 7

                                            Every artist on this planet is influenced by artwork created by other artists. Some may straight up copy work done by others, but artists with integrity don’t do that. Instead, they might find specific color combinations, lines, scenery, details and so on very interesting. So what do they do? They copy these details into their own work and create something new. This is progress, it’s literally evolution. What’s bad is companies and individuals stealing work and publishing under their own name. But that is not what is going on here. The ML algorithm is creating something new, being influenced by previous work, just as we do.

                                              If people created a law that prevents researchers from building ML models based on previous work without consent, it would be a devastating blow to progress in AI and put us back decades. Generative art could help us imagine what things we hadn’t thought of, in all types of fields.

                                              1. 5

                                                I find this to be a very odd use of the word “vegan” and comparison to vegans. But since that’s where we are, here are some things you may want to check out if you’re concerned with the welfare of animals:

                                                https://www.dominionmovement.com/

                                                https://www.cowspiracy.com/

                                                https://veganoutreach.org/

                                                https://mercyforanimals.org/

                                                https://farmusa.org/

                                                https://www.reddit.com/r/vegan/

                                                1. 3

                                                  What do you find odd about it?

                                                  So far I’ve had a few people say that the analogy works really well for them. I’m very interested to hear reasons it doesn’t hold up.

                                                  1. 2

                                                    It’s taking a specific form of a more general idea and using it as the general form. That is, you are talking about veganism insofar as it’s a boycott of/abstention from animal products (though that is not all it is) and applying that to the idea of AI art, rather than directly talking about boycotting/abstaining from AI art.

                                                    In other words, “vegan” has a specific definition and you can’t just throw any adjective in front of it.

                                                    1. 3

                                                    I tried to clarify in my writing that this was a mental model analogy that I’ve been thinking about.

                                                      Are you responding just to my headline here, or do you think the analogy as described doesn’t work?

                                                      The key reason I like veganism as an analogy is that I personally continue to eat meat and feel guilty about it - which looks like it may be the way I personally end up using these AI models.

                                                    To put it another way: I am someone who cares deeply about animal welfare but continues to eat meat. I am also someone who respects authors’ moral rights to their work while continuing to use AI models that have been trained against that work without their permission.

                                                      Furthermore: the fact that veganism is a deep area with many different aspects makes me like it even more as an analogy for objections to generative AI models - because those too have many factors and I expect that opponents to them will come from a wide array of perspectives covering a wide array of reasons.

                                                      1. 4

                                                      Mostly the term “AI vegan”, since that is after all in the heading and stands out more than what it represents: your own personal ethical conflicts. And as someone who sees animal agriculture as pretty fucking evil, I also find the comparison to be trivializing.

                                                        You could alternatively frame it in the opposite way, and I think it would be a more direct question and would rely less on your own personal position on some other ethical topic: “Ethics: will you be an AI hypocrite?”

                                                        1. 1

                                                          This itself is an interesting ethical conundrum!

                                                          Comparing these two headlines:

                                                          • Ethics: will you be an AI hypocrite?
                                                          • Ethics: will you be an AI vegan?

                                                        The first avoids the risk of offending vegans. But I find it a lot less compelling as a heading - I don’t think it would stick in people’s minds or lead to extensive conversations in the same way as “vegan”.

                                                          So I’m making an ethical choice here that I think the harm caused by using “vegan” is outweighed by the benefit in terms of starting conversations and expressing my position.

                                                          Basically the same ethical question as a clickbait headline attached to a valuable piece of journalism.

                                                          1. 4

                                                            I am also vegan and I don’t necessarily disagree with your use of the term. It’s a reasonable analogy, and illustrates your point well enough.

                                                          But it does irk me when you consider the magnitude of each decision. At the end of the day, this AI technology is not powered by the mass imprisonment, torture, and slaughter of artists who post their work online.

                                                            1. 2

                                                              Take a look at how music or other popular media is produced at scale. If the USA’s techniques are not close enough to imprisonment and torture, then consider South Korea’s techniques instead. It is not unreasonable to hope that ML-driven art generation can prevent artists from starving merely by changing the economic landscape of art production.

                                                              1. 2

                                                                It’s interesting how it can be seen as offensive from both directions:

                                                                • Artists: you’re comparing us to animals raised for meat now?
                                                                • Animals: you’re saying killing and eating us is equivalent to stealing someone’s picture?
                                                                1. 2

                                                                  I think folks are comparing the scale of harm, not animals to artists, as per my comment here: https://lobste.rs/s/kxqvam/image_generation_ethics_will_you_be_ai#c_4fk0qx

                                                          2. 3

                                                            I also feel like something is a bit off with the analogy to veganism, and am trying to put my finger on it.

                                                            It seems like one central commonality is the idea of consent, but the scale of consent feels really different. One end is artists not consenting for their works to be included in model training. This is bad as it may result in a loss of their livelihoods, losing competitive advantages, not reaping the full value of their work, etc. On the other end (veganism), we’re talking about billions of sentient land animals born, tortured, and slaughtered per year for our pleasure without their consent. This is really really bad, worse than exploiting artists, but maybe that’s just me making a value judgement, or maybe the difference in scale of consent doesn’t actually matter for the analogy.

                                                            Another central commonality is the idea of convenience and functionality for the end user. There are a lot of potential other analogies here, many more comparable to the goings on of the digital world: shopping on amazon, duckduckgo over google, email over facebook events/messenger, etc.

                                                            And then there’s all the negative connotations people have with the word veganism, mostly undeserved - preachy, holier-than-thou, social burden, etc. These may distract from the article’s point, although I’m happy to see the comments here discussing the model training issue at hand and not just veganism!

                                                            Either way, it’s an interesting discussion, and if you want to dip your toe into reducing your consumption of animal products while you improve your digital impact, godspeed :)

                                                            1. 1

                                                              I do worry a bit that artists could take justified offense at being compared to animals that are killed for their meat!

                                                              For me the value in the analogy isn’t copyright-abuse compared to animal-abuse, it’s eating-meat-anyway compared to using-generative-AI-anyway - it’s an analogy for people making their own personal ethical decisions.

                                                              1. 1

                                                                To be totally clear, I’m not comparing artists to the animals, but pointing out the difference in my perceived scale between the harm done to artists by using their work without consent to the harm done to animals by using them without consent.

                                                                it’s an analogy for people making their own personal ethical decisions

                                                                That makes sense! I think it certainly works well enough to get the point across.

                                                    2. 3

                                                      I’m for a flag or a special license that would allow me to exclude anything I create from inclusion into ML training sets.

                                                      I believe the proliferation of content creation via ML is a net negative for society. Even if I don’t support a ban, I do not wish to personally contribute to it.

                                                      1. 3

                                                        Sounds like a good case for a specific rider to any license you apply to works to specifically forbid such use.

                                                        1. 4

                                                        True. At least my photography is already “all rights reserved”. But the pro-ML crowd seems to believe it’s fine to snarf because it’s just an infinitesimal input into something whose output is automatically covered by fair use anyway.

                                                          Ultimately it’s not a legal issue, it’s a moral one.

                                                          1. 2

                                                            Well, I think it’s both.

                                                      2. 2

                                                      Yeah I’ve been thinking about this a bit. Machine learning could be such a great thing for art, illustration, and other creative industries, but the mass-scale appropriation without attribution that the current tools employ, along with the fact that they are only accessible via for-profit companies, makes it seem very exploitative. Fair use is a fundamental part of the creative process, but I do think that the context is different and that tech companies should shoulder a greater responsibility when it comes to being open about attribution vs. human creatives.

                                                        As an artist or any other creative, when you are inspired by other creative works you at least have a general idea of where you are drawing inspiration from, even if it is unconscious. If people ask you where your ideas come from, you can tell them. Even if you wish to keep those details to yourself (which is entirely valid), others can glean them through other means, and the scale is usually still limited.

                                                        The interaction when generating images via a machine learning model is much different. It’s great that these models give people the ability to try out creative ideas quickly, without needing to spend so much time developing a sense of style and taste or artistic skill, but that’s only part of the artistic process. You might have a cool picture, but you have no idea where to find more works like it, or what motivations the original artists had… and that’s such an important part of art and creativity that I wish we could provide more access to. It would be great if these tools would also provide references to their source material to help address this. I can think of a bunch of reasons why they don’t, but I still think they should.
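                                                        For what it’s worth, a rough version of that kind of attribution is technically plausible: embed the generated image and look up its nearest neighbours among the training images. A toy sketch of the lookup step, assuming you already have an embedding for each training image (e.g. from CLIP) plus its URL; this illustrates the idea and is not how any of the current tools actually work:

                                                        ```python
                                                        # Sketch: "where might this have come from?" via nearest-neighbour search over
                                                        # training-image embeddings. Purely illustrative; similarity is not provenance.
                                                        import numpy as np

                                                        def nearest_training_images(query_emb, train_embs, train_urls, k=5):
                                                            """Return the k training images whose embeddings are closest to the query."""
                                                            q = query_emb / np.linalg.norm(query_emb)
                                                            t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
                                                            sims = t @ q                       # cosine similarity against every image
                                                            top = np.argsort(-sims)[:k]
                                                            return [(train_urls[i], float(sims[i])) for i in top]
                                                        ```

                                                        Even something this crude, surfaced next to each generated image, would at least give people a starting point for finding the artists whose work is in the neighbourhood.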

                                                        1. 3

                                                          I guess.

                                                          But the feeling among the original creators who provide raw material for the generator mills is that this is a bad deal.

                                                          See for example Simon Stålenhag - https://twitter.com/simonstalenhag/status/1559796122083811328?s=20

                                                          Anyway, I think AI art, just like NFTs, is a technology that just amplifies all the shit I hate about being an artist in this feudal capitalist dystopia, where every promising new tool always ends up in the hands of the least imaginative, most exploitative and unscrupulous people.

                                                          1. 3

                                                            Yeah, my response is actually quite watered down from what I originally wrote! I was trying to be more positive and find possibilities for compromise, but I’m definitely concerned about capitalists trashing the commons, and I do worry about the artists and creatives who will be caught up in this. I know of many who find it a violation (a bit like programmers with Copilot), and I can understand where they are coming from.

                                                            It wouldn’t be the first time that machine learning people have:

                                                            • barged into a domain they don’t understand
                                                            • achieved surprising initial success by replicating some hollow facsimile of generations of work
                                                            • convinced others that the old ways are no longer needed
                                                            • secured enormous amounts of funding
                                                            • put the current practitioners out of work
                                                            • only to find later that those practitioners had a huge amount to offer once the low-hanging fruit was exhausted

                                                            Meanwhile the original people have retired or found other work, the next generation hasn’t been taught, funding has been slashed, and a bunch of work has to go into counteracting the misunderstandings people might have about the limitations of machine learning.

                                                            1. 2

                                                              I’d hardly say this is universal. My partner is an artist and has peers who are artists and they feel differently. It’s dangerous to extrapolate general opinion from anecdata, let alone Twitter.

                                                          2. 1

                                                            I think in the medium term, this will result in more but different jobs for artists. When I worked at The Atlantic circa 2014, we were still in the transition to all posts having an image associated with them. When we coded up a river of stories, we had to account for the fact that many stories, particularly older ones, had no image. Nowadays, no major site publishes a story without some sort of image associated with it. OTOH, most lone-coder blogs that we read on Lobsters don’t have art yet. AI is going to make it practical to have more and fancier images in stories for people up and down the economic ladder…

                                                            But the images straight out of the AI today just aren’t good enough to use professionally. Yes, they are very impressive, but basically every story about AI art includes images with weird eyes and weird fingers. Maybe that will fix itself once the neural nets get larger, but I sort of think it won’t, just because even demos of bigger nets still have that problem. Human artists avoid drawing feet because feet are hard to draw! It seems like eyes are the feet of AIs. Anyway, I think there will be jobs for people to a) think up and tweak good prompts and b) do a finishing pass on AI art to make it not weird. But obviously, if your current skillset is putting paint on a canvas or otherwise analog, those new jobs won’t help you much, so there will be a lot of churn and people hurt.

                                                            1. 2

                                                              One of the things I’m finding most interesting about https://www.reddit.com/r/StableDiffusion/ is the posts (and sometimes video demos) from artists who are embracing this new tooling and using it to accelerate their productivity and create new things.

                                                              A couple of examples:

                                                              1. 2

                                                                That reminds me of an old post by Clive Thompson which said that the best chess play came from humans working together with an AI. An AI could beat any human player, but not a human player who had access to an AI. I don’t know if that’s still true, but it’s an encouraging thought.

                                                            2. 1

                                                              It’s clearly a big change that is going to create a lot of new ethical and legal concerns. But is it a bigger change than when Photoshop became widely available? Than when the internet made it easy to share copyrighted material? Than when photography was invented? Society will adapt and norms will adjust.