1. 92
    1. 17

      Court listener docket for those who want to follow along.

      I’m not a lawyer, but personally I’m skeptical: I don’t think anything that github did is likely to have required a license in the first place. To the extent that copilot produces verbatim copies, it seems to do so only of tiny samples of code that have been replicated numerous times by humans before. I expect the court will find that to be fair use/de minimis copying and not actionable. Without the initial copyright infringement occurring, I don’t think many of the other claims survive: they either require copyright infringement as a precursor (e.g. the DMCA), or they require the conduct to be unlawful.

      I’m less sure what to think about the personal information claims.

      Regardless, I’m pretty happy that this suit is happening. Getting clarity as to the law relating to AI models sooner rather than later will be good for everyone, and both sides here should have deep enough pockets to do a good job at arguing their side, so the decisions come out saying what they should say.

      1. 4

        Getting clarity as to the law relating to AI models sooner rather than later will be good for everyone

        It’s pretty clear already today; this litigation is rather a publicity stunt; the neural net itself is neither a copy nor a derived work of the training data; what the neural net spits out is computer generated in the first place and thus not protected by copyright law, unless the result is sufficiently similar to something human generated, which itself is sufficiently creative; a few code snippets will hardly suffice for this (and even if they do it is very likely fair use according to current jurisprudence); but this must be judged on a case-by-case basis, not in a class action suit. I also can’t understand the outrage of many developers; on the one hand people seem to take it for granted that others provide them code or services for free on a grand scale (e.g. Github hosting and additional features heavily used by the open source community); but at the slightest suspicion that they should give something away, all hell breaks loose.

        1. 13

          the neural net itself is neither a copy nor a derived work of the training data; what the neural net spits out is computer generated in the first place and thus not protected by copyright law,

          I don’t believe that this is settled precedent yet. In particular, it is clear that a neural network can memorise some of the input data. The fact that it’s a neural network doesn’t really matter - if it were a rule-based system with a database of code snippets that it could combine to produce output then it would be equally legal or illegal and that’s up to a court to decide.

          unless the result is sufficiently similar to something human generated, which itself is sufficiently creative

          That’s the crux of the matter. It is established that Copilot can generate exact copies of code. It is not yet established whether these snippets are sufficiently creative to merit copyright. This is a very tricky issue because it does not depend exactly on length. A two-line snippet might be copyrightable if it is doing something sufficiently original and a court agrees that it is a creative work. In that case, you may still be allowed to quote it but then you may have attribution requirements, depending on the jurisdiction. It is more likely that a long fragment is both considered a creative work and not covered by fair use, but some long things can be considered non-copyrightable (e.g. if they are mechanical implementations of a published specification).

          1. 1

            Well, we will see what comes out (likely not much).

            it is clear that a neural network can memorise some of the input data

            That’s not correct; the DNN doesn’t just make or memorize a copy; it might be able to reproduce parts of the training set, though this is not the actual purpose, but rather a coincidental and unwanted side effect (which occurs less than 1% of the time according to Github officials, as far as I remember). Also note that it is not comparable to a simple database, not even a compressed or encrypted one, since there is no technical means to restore the original works used to train a DNN; it’s rather like a hash sum; the abstraction and transformation done by the DNN training algorithm is substantial; the original works are unrecognizable and unrecoverable; the DNN is thus no derivative work; any other outcome of the trial would be a big surprise.

            then it would be equally legal or illegal and that’s up to a court to decide

            Storing copyrighted work is generally no violation of copyright law (in some countries it might be illegal if the copyrighted works were not legally acquired). This is established legal practice; we don’t have to wait for a decision.

            This is a very tricky issue because it does not depend exactly on length

            Not that tricky; there is a well established legal practice in this regard with various precedents; if the DNN were to repeatedly produce code sufficiently similar to existing code, the matter would have to be clarified in the individual case anyway, whereby the burden of proof of authorship as well as similarity and copyright infringement would lie with the individual plaintiff; and the defendant in this case would not be Github, but the developers using the code in question.

            1. 12

              It could be like lossy compression. If you make a shitty JPEG copy of a copyrighted artwork, the new bytes look nothing like the original, and you can’t even restore the original, but it may still be an infringement when it’s a close-enough copy.

              You could also look at this from a higher level:

              code goes in -> black box -> same code comes out
              

              The complex implementation details of the black box may be irrelevant legally. If you put a copyrighted work in a paper shredder and then glue the shreds back together in the original order, even by chance, the court may not care how you did it, only that you have ended up making a copy of the original.

              1. 2

                code goes in -> black box -> same code comes out

                That’s essentially the concept of all electronic media today. If you take a picture of Mona Lisa, the camera, the memory card and the JPEG format in use are a blackbox for the majority of users; even though they are able to view or even publish the picture displaying Mona Lisa with little effort.

                This also nicely demonstrates the present situation. Neither the manufacturer of the camera, nor the inventor of the JPEG format, nor the photographer making and keeping the picture is liable for copyright infringement. But if the photographer wants to publish the picture, permission from the copyright holder may be necessary; this depends on what you can see in the picture, not on how the picture was taken or stored, or the slight quality loss of the format.

                In the present case, the DNN is conceptually comparable to the photographer and the storage format; but the DNN doesn’t store a copy nor a “picture” of the original, but a statistical abstraction of certain features of millions of originals. So the DNN doesn’t simply transport or “publish” original content, but it is able to synthesize new content based on the feature abstractions.

                It’s a similar process as when you write code, remembering concepts you have learned over the years (I am not talking about the widespread method here where developers simply copy existing code from the internet). If by chance something comes out of the DNN that resembles an existing work, the user still has the responsibility that copyright imposes on him, and the copyright holder still has all the possibilities that copyright grants him; but this is not Github’s responsibility.

                1. 4

                  JPEG compression transforms pixels into a completely different domain which does not visually resemble the original image at all (what Grant Sanderson calls the “Fourier world”); the only reason why this works is because we have a series of master theorems which establish the bidirectional connection between our visual world and its Fourier-transformed counterpart. But this is analogous to the situation with neural networks; they operate in a completely different perceptual domain than us, and we rely on theorems which state that such networks are faithful approximants of the desired functions.

                  If I take a JPEG-compressed image of a copyrighted work and alter its coefficients in a bidirectional transformation, producing an image which visually resembles the original work to a human observer, then I may have infringed. Similarly, if I take a neural-network-encoded pile of copyrighted source code and approximate its coefficients in a bidirectional transformation, producing source code which syntactically resembles the original work to a human observer, then I may have infringed. It doesn’t matter whether I tuned the coefficients by hand or used an automated tool to compute their new values; what matters is that the prior values were computed by summarizing copyrighted works.
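
                  A minimal sketch of that idea, under loud assumptions: real JPEG uses a two-dimensional 8×8 DCT with per-frequency quantization tables and entropy coding, whereas this toy uses a one-dimensional orthonormal DCT-II over eight made-up sample values and a single made-up quantization step. The names `dct`, `idct` and `lossy_roundtrip` are illustrative only. It shows that the stored coefficients look nothing like the input, yet decoding them yields something close to it:

                  ```python
                  import math

                  def dct(samples):
                      """Orthonormal DCT-II: map samples into the frequency domain."""
                      n = len(samples)
                      out = []
                      for k in range(n):
                          total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                                      for i, x in enumerate(samples))
                          scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
                          out.append(scale * total)
                      return out

                  def idct(coeffs):
                      """Inverse of dct(): map frequency coefficients back to samples."""
                      n = len(coeffs)
                      out = []
                      for i in range(n):
                          total = 0.0
                          for k, c in enumerate(coeffs):
                              scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
                              total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
                          out.append(total)
                      return out

                  def lossy_roundtrip(samples, step):
                      """Quantize the coefficients (the lossy part), then decode them again."""
                      quantized = [round(c / step) for c in dct(samples)]
                      restored = idct([q * step for q in quantized])
                      return quantized, restored

                  # Hypothetical row of eight pixel intensities (assumption, not real image data).
                  row = [52, 55, 61, 66, 70, 61, 64, 73]
                  quantized, restored = lossy_roundtrip(row, step=10)
                  max_error = max(abs(a - b) for a, b in zip(row, restored))
                  ```

                  The `quantized` list is what a file would store; only after `idct` does it resemble `row` again, and `max_error` stays bounded even though the exact values are gone.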

                  1. 2

                    But this is analogous to the situation with neural networks; they operate in a completely different perceptual domain than us, and we rely on theorems which state that such networks are faithful approximants of the desired functions.

                    That’s not the way Copilot or e.g. GPT-3 work. You can indeed approximate compression functions with DNNs, but that’s not what is done here. Anyway, even if they only implemented an indexing and search algorithm on the repositories, without any DNNs, this would not be copyright infringement, even if search results showed snippets the same way they do today. There are already precedents for this.

                    1. 2

                      the problem is not so much whether github is infringing but whether as a user of copilot you may unwittingly infringe someone’s copyright without having any way to know whether it happened - google is not infringing by serving up search results, but the fact that you found something on google doesn’t grant you any rights to use or republish that content in your own work

                      if copilot were just a search engine, again, github would be in the clear, but you would still need to check the license to see if you can use it. all that changes by making it a language model is that you can’t easily check so you never know if its output is safe to use in your own projects.

                      1. 1

                        I recommend reading the filed complaint. And it is not Github’s duty to enforce law.

                        1. 3

                          a key part of the complaint is the stripping of license information which they are responsible for preserving, a problem they would not have if they’d simply built a search engine

                          1. 1

                            They’re not stripping license information; they synthesize snippets, a tiny fraction of which might resemble existing code (which is very likely for any sufficiently small snippet and thus barely avoidable). But let’s see what comes out; the litigation is now filed.

                            1. 3

                              If the whole file is “this snippet and a copyright header”, the term “snippet” is misleading.

                              1. 3

                                What is really interesting to me about this whole Copilot situation is how much the zeitgeist has completely flipped.

                                I remember years and years ago people proposed all sorts of weird multi-part scrambling schemes that would take an input and produce a bunch of seemingly-random data blocks, none of which could reproduce the input or a subset of it, but if you had all of them you could recombine them in a way that got back the exact input. And people literally thought this was an end run around copyright, since you could have, say, a P2P system where each peer distributes only a subset of the blocks needed to reconstitute a popular song or movie, and thus none of them were distributing a “copy” of it because none of those individual blocks could reconstitute it on their own – the fact that at the end you had a perfect copy didn’t matter, it was claimed, because only the intermediate distribution format mattered for copyright law.

                                And things like this would get lots of attention and hype on tech forums and be cheered on as proof of the incoherency of even the concept of copyright.

                                Now GitHub has invented something not too far off conceptually, and the tech community is screaming for it to be destroyed and arguing about how it’s the output that matters, not the intermediate format.
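
                                For anyone who never saw one of those schemes, here is a rough sketch of the simplest variant I can think of, XOR splitting. This is an assumption about the general shape of such schemes, not a reconstruction of any particular system, and `split`/`combine` are made-up names. Every block except the last is pure randomness and the last is the input XORed with all of them, so any incomplete set of blocks is statistically independent of the input, but XORing all of them together returns it exactly:

                                ```python
                                import secrets

                                def xor_bytes(left: bytes, right: bytes) -> bytes:
                                    """XOR two equal-length byte strings."""
                                    return bytes(a ^ b for a, b in zip(left, right))

                                def split(data: bytes, parts: int) -> list[bytes]:
                                    """Split data into `parts` blocks, each the same length as data."""
                                    if parts < 2:
                                        raise ValueError("need at least two blocks")
                                    # parts - 1 blocks of fresh randomness...
                                    blocks = [secrets.token_bytes(len(data)) for _ in range(parts - 1)]
                                    # ...and one final block that cancels them all out against the input.
                                    last = data
                                    for block in blocks:
                                        last = xor_bytes(last, block)
                                    return blocks + [last]

                                def combine(blocks: list[bytes]) -> bytes:
                                    """XOR every block together to recover the original input."""
                                    result = bytes(len(blocks[0]))  # all-zero start value
                                    for block in blocks:
                                        result = xor_bytes(result, block)
                                    return result

                                song = b"pretend these bytes are a popular song"
                                blocks = split(song, 3)
                                recovered = combine(blocks)
                                ```

                                No single peer holds anything but noise, and yet `recovered` is byte-for-byte the input, which is exactly why "look only at the intermediate format" was such a convenient argument back then.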

                                1. 3

                                  Now GitHub has invented something that they seem to think is actually a magic copyright remover, and the tech community is screaming for it to be destroyed.

                                  I think that nobody wants to achieve any “destruction” here. All I want is that my copyright remains intact.

                                2. 2

                                  The best possible outcome is that the defense wins, thus striking a serious blow against the legal fiction of intellectual property. Yeah, I have no love for copilot. It is essentially strip-mining the commons. And Microsoft Github is another tentacle of the surveillance capitalism octopus. In this specific case, I’m rooting for the megacorp. Though I have to admit, a teensy cut of that class action suit would sure help me out right now.

                              2. 2

                                The litigation includes no such examples, which is a pretty strong signal to me that no such examples exist because it would seem to be the exact sort of sample that gives the best (still small IMHO) chance of winning.

                2. 3

                  In this case I’d say Tensorflow (or whatever NN library they use) is the algorithm provider not responsible for its usage, but Microsoft is the user feeding copyrighted data into it.

                3. 1

                  a partially trained DNN is kind of like a zip file with a bunch of files already in it - adding another one is going to take up less of the capacity if it’s similar to what’s already there, the trick is that information is shared - generative models are kind of like a lossy compressor whose compression artifacts take the form of making the input more generic, more like the training set (“faceapp yourself but don’t actually apply any filters” type distortion), and the degree of distortion is simply a factor of the model capacity

                  training a high capacity model on a small dataset inevitably memorises things verbatim, because the training task for these models is reconstruction, that they appear to be doing something else is mostly a factor of capacity limits and sometimes intentionally distortion-inducing sampling methods

                  and you can observe different degrees of distortion even in text models like copilot - depending on how common a code snippet it is reproducing is and your settings, it may reproduce existing snippets nearly exactly but with different variable names or commenting style, which shows that it has an internal representation that doesn’t necessarily need to store the “stylistic” details, but is still obviously close enough to be license infringement

                  when given a context that isn’t especially close to any single training sample it appears to be more “creative” as it’s having to mix together information gleaned from multiple samples and rely more on the surrounding context, but the big problem with copilot is you can never really know when it’s being “creative” and when it’s just coughing up a training sample almost exactly so it’s always going to be kind of legally awkward to use

                  the real annoying part is that language model based code completers are still really useful when trained on much less code, and a lot of the code that copilot was trained on isn’t just encumbered by licenses that don’t allow unattributed copying, but is also poor quality. There is conceptually a more useful tool you could build with the same methods by being more selective about its training data, but copilot feels like GitHub and OpenAI trying to retroactively recoup the cost of training a huge model more than an intentionally designed product.
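
                  The memorisation point above can be shown with a toy, assuming you accept a character n-gram model as a stand-in for the neural network and the context length `order` as a stand-in for capacity (both are simplifications, and `train`/`generate` are hypothetical helpers, nothing to do with how copilot is actually built). With a long context and a tiny corpus, every context has exactly one continuation, so greedy generation can only replay the training text; with a short context the model has to blend statistics and drifts away from it:

                  ```python
                  from collections import Counter, defaultdict

                  def train(text, order):
                      """Count which character follows each context of `order` characters."""
                      model = defaultdict(Counter)
                      for i in range(len(text) - order):
                          model[text[i:i + order]][text[i + order]] += 1
                      return model

                  def generate(model, seed, max_len):
                      """Greedy decoding: always append the most frequent next character."""
                      out = seed
                      order = len(seed)
                      while len(out) < max_len:
                          followers = model.get(out[-order:])
                          if not followers:
                              break  # context never seen during training
                          out += followers.most_common(1)[0][0]
                      return out

                  # Tiny made-up training corpus (assumption for illustration only).
                  corpus = "def add(a, b):\n    return a + b\n"

                  # "High capacity": 8 characters of context on a 32-character corpus.
                  high = generate(train(corpus, 8), corpus[:8], len(corpus))

                  # "Low capacity": 2 characters of context on the same corpus.
                  low = generate(train(corpus, 2), corpus[:2], len(corpus))

                  print(high == corpus)
                  print(low == corpus)
                  ```

                  Nothing in the `order=8` model is a literal copy of the file, it is just a table of counts, but the only thing it can emit from that prompt is the training sample.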

                  1. 3

                    a partially trained DNN is kind of like a zip file

                    No. ZIP is a lossless, deterministic compression algorithm, in no way comparable to what the present DNN or its training algorithms do.

                    Ultimately, the degree of similarity and the degree of creativity of the snippet will be decisive. Unfortunately, however, the value of such snippets is greatly exaggerated in the present discussion. It is undisputed that in copyright law source code (unfortunately) automatically counts as a work, and this (even more unfortunately) also applies to parts of it. However, this is a perversion of the concept of a work. Because the probability that any snippet is present in any number of other source codes in a very similar way is close to 100%. Industry and open source developers are already suffering from the perverted use of patent law; now they are to be bothered also by the perverted use of copyright law. Judging whether or not a snippet meets the creativity requirements is usually arbitrary. Fortunately, problems with this kind of misuse of copyright can be circumvented with its own means relatively easily by simply rewriting the snippet.

                    1. 1

                      The point of the analogy was it containing multiple items and sharing information, read “mpeg file” if you’re hung up on the lossy vs lossless distinction

                      1. 1

                        if you’re hung up on the lossy vs lossless distinction

                        Thanks. I have a formal education in both information technology and law.

                        1. [Comment removed by author]

            2. 2

              copilot only violates copyright 1 percent (or slightly less) of the time - trust us

              Microsoft and you.

      2. 2

        I don’t think anything that github did is likely to have required a license in the first place.

        I don’t think GitHub has any right to other people’s work, unless granted by a license.

        1. 8

          They have legal ownership of the copy that is in their possession, given that they acquired it lawfully (which they did). The same way you own a book.

          You can do lots of things to code without a license. Read it, lend it to your friends, sell it to the used code store, stick it on a shelf so your guests can admire how big a library you have, execute it, etc. They don’t have copyright, but they absolutely have normal property rights.

          1. 1

            I think ownership might be the thing that’s at least debatable here. I don’t think GitHub owns the code it hosts. Similar to a web hoster not owning all the photos it hosts and your ISP not owning everything it caches or that passes through its network.

            Or, for a more IT-flavored comparison: if I am a code reviewer, some consultant or something, and code is given to me to inspect, that doesn’t mean I own it simply because I legally have the data on my hard drive. If said code was some service and I’d just run it, the actual owner would likely be very unhappy.

            I agree this is about copyright and not license. The question is whether what they do is some kind of fair use or anything you are allowed to do under copyright law.

            I’d argue it’s not, because it doesn’t create a benefit for society, like most fair use does for example.

            If it turns out it is, what would happen to, let’s say, anything that re-compresses an image, maybe lossily, as part of a service? They (likely) do that in this case even with the explicit authorization of the copyright owner. They run it through some algorithm and get something new out of it that kind of resembles the original, but not really, and certainly not in terms of bytes. Does that make them the owners?

            Or what if someone simply wrote some “AI” that let’s say mostly strips comments, reorganizes code, maybe even just works with some sort of AST. Would it make the output owned by whoever runs it?

            Does that mean one could make an “AI” that disassembles binaries, maybe makes some redundant changes and outputs new modified binaries? Would that work?

            What if it was more involved and you actually train a NN, and just teach it the bytes of some software or even a movie. You have a prompt where you can enter “The bytes on C:\videos\plan-9.mp4 are video files of Plan 9 from Outer Space. Remember this!”. It does, but not just by copying, but by adding it into its (language) model. Then since it’s your language model you share it on the web. Someone else may download it and say “Hey there. I need the bytes for Plan 9 from Outer Space in C:\warez\plan9.mp4, please store them there for me”. Who holds the copyright on what the AI creates through its language model? It might even have learned to skip redundant license statements of software, strip FBI warnings from videos and who knows what.

            What if the AI does more? What if it even can “watch” and “learn” the movie, potentially scale it up to 4k monitors, output to any format, and knows how to change it just enough so any AIs looking for copyright infringements can’t differentiate it anymore? What if it can learn to even change movies, just enough so that copyright lawyers consider it a new work of art?

            Where do you draw the line? Where does what’s allowed under copyright law end?

            I really don’t have the answer, but I think with copyright law a huge mess was created in the first place, because laws work best when they are something people can agree on at large, and they change or come down when a large number of people change their opinion (homosexuality, slavery, women voting, witchcraft, etc.). I don’t think there ever was much agreement about copyright, and if it were applied to the letter and copyright holders really sued everyone who crossed its lines, the majority of people would have voted to abandon at least large parts of it.

            Besides that, the line is blurry: being inspired by something, learning from it, or even learning it by heart (think reciting a poem) are all simply some form of copying with some translation. There already are huge existing debates on fair use, see sampling, mixing, etc., and laws that nobody feels comfortable enforcing, like those against singing copyrighted songs at parties or in other private settings in some countries.

            I’d say all of this is at least something that’s not so clear in law, so I am sure there is potential for far-reaching effects whatever the conclusion will be.

        2. 5

          Their ToS do grant them some rights, but their Copilot actively violates their own ToS:

          https://docs.github.com/en/site-policy/github-terms/github-terms-of-service#4-license-grant-to-us

          1. 4

            I don’t see at all how Copilot violates their own terms. Could you give an actual explanation, with detailed specific claims?

        3. 3

          It seems that in the above “right” is being used to mean “moral right”, while many here are using “right” to mean “legal right”. Confusing those two things might be the source of some of the misunderstanding that I’m seeing here.

      3. 1

        I don’t think anything that github did is likely to have required a license in the first place.

        It depends on the particular licenses that are violated, I guess.

        1. 9

          The argument is something like:

          1. Copyright law says that there’s a list of things you can’t do without getting permission from the copyright holder.
          2. For open-source/Free Software, the copyright holder does grant permission to do some of those things, in some ways, via a license.
          3. But if the thing you are doing is not one of the list of things that requires the copyright holder’s permission, then the license terms are irrelevant, because you are not dependent on the license for your permission to do those things.

          There is also the other option that if you have access to a piece of software under multiple potential license grants, one of which is more permissive than the other, you can choose the more permissive one without having to observe the less permissive one. I’ve pointed out in past threads about Copilot that I would not be surprised at all if the license grant embedded in GitHub’s terms of service turns out to be more than sufficient to allow everything Copilot does, for example.

        2. 7

          No, if they don’t need a license I don’t think the particular licenses matter at all. Licenses grant permission to do things that were otherwise illegal under law with some conditions attached. If you didn’t do anything that was otherwise illegal, licenses don’t do anything.

          If they did need a license, they’re obviously in trouble with pretty much any license, because they didn’t comply with pretty much any license (other than the CC0 and wtfpl style ones).

          1. 7

            Some licenses grant you permission to reproduce some/all of the work provided you meet the conditions. Microsoft did not meet the conditions in those cases, yet they reproduced the works anyways.

            How is violating copyright law not illegal? Or are you insinuating that copyright law doesn’t cover source code?

            How is this any different than taking text from books in the library and trying to pass it off as your own work, while not paying dues to the copyright holders (or whatever conditions they have, if any)?

            1. 8

              How is violating copyright law not illegal? Or are you insinuating that copyright law doesn’t cover source code?

              I’m saying I don’t think they violated copyright law. I’m not saying it doesn’t cover source code, but that I don’t think it covers this kind of use of copyrighted material of any kind.

              How is this any different than taking text from books in the library and trying to pass it off as your own work, while not paying dues to the copyright holders (or whatever conditions they have, if any)?

              I don’t think the distinction between text from text books and code is relevant if that’s what you’re asking. If you trained the same kind of model on an equally large collection of equally diverse textbooks and released it as a book writing aid I think you would have exactly the same legal status. (Edit: I should say “and served requests to the model as a book writing aid”, releasing the model vs serving requests is potentially legally relevant, with the latter being slightly more likely to be problematic IMO).

              I don’t think it’s fair to describe what’s happening here as “taking text from books in the library and trying to pass it off as your own work” though. There are many more steps happening here, and they’re achieving a much broader goal than rote copying pieces of text. And sure, sometimes the process manages to create a copy of a small piece of someone else’s copyrighted work that has been copied many times already into other places, but that’s de minimis copying and not copyright infringement.

              1. 7

                It might be worth noting for instance, that Google won Google v. Oracle despite copying 11,500 lines of code. In part because that wasn’t a substantial portion of the work. I’d expect a similar analysis here.

                The samples that were duplicated that they use to justify the lawsuit are things like part of an exercise from a textbook, it’s not a substantial portion of the book.

              2. 0

                And I think they are violating copyright law, but I’m not a lawyer and you probably aren’t either. I hope this goes to trial so we get to hear some random judge’s opinion.

                1. 4

                  Your whole argument rests on “they didn’t follow the license!”

                  If their whole argument is “we didn’t need that license to do what we did”, then your argument is not really relevant. That’s what people are trying to get you to understand – the license terms may literally have no relevance whatsoever.

                  1. 1

                    What’s the point in any license if it’s not relevant here? My point is that argument, that the license is not relevant, is, uhh, not relevant.

                    1. 6

                      A license grants you rights to do things that would not otherwise be permitted by copyright law. You do not require a license to do things that are covered by Fair Use / Fair Dealings (delete as appropriate for your jurisdiction) or by other explicit statute law. For example, in the USA you do not require an explicit license to record a cover of a song because compulsory licensing is enshrined in statute law and so you just have to pay a fixed amount for every copy that you sell.

                      The argument (disclaimer: I work for MS but have no connection to this project) is that this kind of use is covered by explicit law. I don’t know precisely what the laws in question are; there are a few things on building machine learning systems and databases that may apply, but it will be up to the court to decide.

                      Whether they win or not, I think they’ve achieved their goal. We (MS) have spent a huge amount of time and money building a reputation as a good open source citizen. I’m able to hire great people on the basis of that and the expectation that they will be paid to contribute to the F/OSS ecosystem. Being on the other side of a lawsuit from the SFLC does a lot of damage to that reputation and, in some ways, winning a lawsuit against the SFLC would do more damage than losing.

                      1. 2

                        Still, I’m glad it goes to court and I even hope MS wins. Copyright has limits and that’s important.

                        We’ve had this before. Publishers wanting to block used-book sales, for example.

                    2. 4

                      A license is just that: a license to do something with a copyrighted work. It can’t take away rights that were already granted by copyright law, such as fair use.

                    3. 4

                      Ordinary users of GitHub receive code under the license chosen by the person who posted the code.

                      GitHub has a choice between receiving code under that license, or under the license granted in GitHub’s terms of service.

                      GitHub can simply choose the more permissive of the two, in which case the more restrictive of the two is in fact irrelevant.

                      Think of it like any other dual-licensing scheme. Suppose I write a piece of software, we’ll call it foolib. And I offer it under a choice of BSD or AGPL. If you choose to receive foolib from me under the BSD offer, you will be able to do things with foolib that the AGPL would not have allowed. And you will be able to do that because the AGPL is not the license under which you received foolib and so is not the license which governs your use of it. No amount of yelling “that’s an AGPL violation!” would be relevant there.

                      Similarly, even if I only offer it under AGPL, you could still do certain things with it – such as fair use – without having to follow the AGPL’s terms. And again no amount of yelling “but that’s an AGPL violation!” would matter, because there are things copyright law still lets you do without needing to obtain or follow the terms of a license.

                      The point being made here is simply that saying “But that’s a license violation!” over and over is not relevant, because the original argument is that GitHub either has access under an alternative, more permissive license, or is doing things that do not require a license in the first place. In the former case, the only license terms which matter are the more permissive ones; in the latter case, no license terms matter.

    2. 7

      Plaintiffs estimate that statutory damages for Defendants’ direct violations of DMCA Section 1202 alone will exceed $9,000,000,000.

      Whoops!

      1. 18

        I’m loving the irony of legislation as bad as the DMCA being used against those who lobbied for it in the first place, and continued to support it in subsequent years. See e.g.:

        Microsoft Corporation[‘s] … Initial Comments in response to the Copyright Office’s Section 1201 Study

        The hypocrisy here is stomach-turning.

      2. 2

        Even if the plaintiffs absolutely win every single aspect of the case, they’re not going to get a judgment for that amount. Every lawsuit starts out setting a ludicrously high demand for damages, because generally that number can only be pushed down as the case goes on, not up.

        Or more simply: it’s like those news stories that say someone is potentially facing some ludicrous number of years in prison in a criminal case, when that number comes from literally maxing out the severity and sentencing range of all the charges. The prosecutor knows that’s not the sentence that would happen even if the case went perfectly.

        1. 1

          If they win on all their claims I don’t think $9B seems out of the question as an award. It’s high, but not ridiculously so.

          I take issue with how that number was calculated, but in a less lopsided manner than I would for most damage claims. They’re multiplying a statutory minimum by an estimate of the number of infringements. They estimate that each user (going off of a June 2022 claim by GitHub) infringed an average of 3 times… which doesn’t seem unreasonable. The justification for that estimate is based on a “1 infringing output * it counts 3 times”, which I doubt, but it turning out that each user had >= 3 infringing outputs doesn’t seem out of the question to me. I suspect the statutory minimum award really is a minimum award, if the case goes to trial, too.

          That’s also before considering awards for their other claims.
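
          For anyone who wants to play with the numbers, the multiplication described above can be sketched roughly like this. The per-violation minimum and the user count are my assumptions for illustration (they are not stated in this comment), and `estimated_damages` is a made-up helper name; only the "3 times per user" figure comes from the estimate discussed above.

          ```python
          # Hypothetical inputs: the minimum and the user count are assumptions,
          # chosen only to illustrate the "minimum * estimated infringements" method.
          STATUTORY_MINIMUM_USD = 2_500   # assumed per-violation statutory minimum
          ESTIMATED_USERS = 1_200_000     # assumed user count (June 2022 claim)
          VIOLATIONS_PER_USER = 3         # the "3 times" average discussed above


          def estimated_damages(users: int, violations_per_user: int, minimum_usd: int) -> int:
              """Multiply the estimated number of violations by the per-violation minimum."""
              return users * violations_per_user * minimum_usd


          total = estimated_damages(ESTIMATED_USERS, VIOLATIONS_PER_USER, STATUTORY_MINIMUM_USD)
          print(f"${total:,}")
          ```

          Changing any one input scales the total linearly, which is why the per-user estimate matters so much to the headline figure.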

      3. [Comment removed by author]

    3. 7

      Hosting one’s own code on Microsoft GitHub is a choice, but man is it objectionable when communities require you to use the platform in order to participate in the community (i.e. packages can only be hosted on GitHub, your community identity is tied to a GitHub ID, etc.).

      If GitHub wins the litigation then your community is tied to terms of service for Microsoft GitHub and community members cannot opt-out.

      1. 3

        If GitHub wins the litigation then your community is tied to terms of service for Microsoft GitHub and community members cannot opt-out.

        Perhaps if MS wins on the basis of not having to abide by licenses, people will start pulling out of GitHub more than they already are. Because who knows what other things they are going to do to your code when they realize licenses don’t apply?

        1. 2

          I’d recommend projects start now rather than wait. Unfortunately, a lot of the communities I participate in, or have participated in, are pretty locked into GitHub.

          1. 1

            I have a one-way mirror from my Fossil server to GitHub. If I pull the plug, nothing will happen.

            1. 1

              It will become stale. You’ll also create new projects in the future.

      2. 2

        If GitHub wins on some kind of fair-use/transformative argument, then it won’t matter where you host – if it’s publicly visible it will be fair game for training data.

        And people who care about software freedom should want that, because it would effectively be a way to remove proprietary licensing from any piece of software you have a copy of.

        1. 3

          Couldn’t the same be applied to other things? I mean, text-to-image works for pulling out some existing work if you want to. Why not do that with music or executables rather than code? Like Copilot, it could take care of stripping any license texts.

          1. 2

            Well, we’re probably going to find out for things like “AI” image and text generators.

            But yeah, if it’s ruled that copyright cannot prevent using material as a training set, then anything you can get your hands on is fair game.

    4. 6

      I don’t understand how the tech community, notoriously suspicious of copyright, have all of a sudden decided that this is a battle worth fighting. Copilot and similar tools are crazy useful and getting better all the time, and now people want to stop all progress on these tools so that nobody rips off their leet left pad javascript code they wrote in 8th grade and dumped onto github.

      If your code is valuable and you want to protect it from anyone looking at it or using it, then don’t release under an open license. If you do release under an open license, then other people and programs are going to look at your code, and possibly make use of the ideas within.

      1. 18

        If you do release under an open license, then other people and programs are going to look at your code, and possibly make use of the ideas within.

        Even most permissive open-source licenses have attribution requirements. I don’t mind if other people use my code - indeed, I want them to - but I do mind if they claim that they wrote it. If they read it and are inspired to do something similar, that’s fine. If they read it and copy it verbatim but give me credit, that’s totally fine too. If they read it, copy it verbatim, and slap their own copyright and license on it, that’s not okay. If they have a machine process it and inject verbatim copies into their own work, that’s absolutely no different from my perspective to the previous case.

      2. 14

        When I release code under a free-as-in-freedom license, it’s because I do not want corporations to take my work, make it closed, and charge people for my free work. There are various power imbalances, vendor lock-ins, and anti-competitive practices that make it possible even if the original remains available.

        It would be a different thing if copyright didn’t exist, and I could take Microsoft’s software and use it however I like. But copyright does exist, and this tech creates a very one-sided situation in which Microsoft can take and launder my code, but they’ll still sue me into oblivion if I touch their copyrighted software.

        Despite all of that, I’m not sure that declaring copilot illegal is the best outcome, because such a ruling could have negative side effects: it could centralize machine learning in corporations that can buy the data, and it could be used as a precedent to limit fair use in other situations.

      3. 6

        “under an open license” - Just because it’s released under an open license doesn’t mean people can do whatever they want with it. Copyleft licenses were made for this sort of situation, to take advantage of the copyright system to ensure code released under an open license stays free.

      4. 5

        This is the monkey’s paw outcome of basing the entire approach to freedom in software on US copyright law. It eventually becomes the end in itself, rather than a means, and we find ourselves, non-lawyers, parsing careless legalese as if it were meaningful.

      5. 3

        If your code is valuable and you want to protect it from anyone looking at it or using it, then don’t release under an open license. If you do release under an open license, then other people and programs are going to look at your code, and possibly make use of the ideas within.

        I want people to use my code, but don’t want them using it to create proprietary software, which is why I use the GPL.

      6. 2

        I think it’s still worthwhile to go to court there, because there are also implications if what is happening is legal. It could potentially be applied to closed source or otherwise copyrighted material. So you are allowed to train an AI, understand it and replicate it in some form, whatever form that might be.

        In either case, I think it’s good if this leads to a legal decision, because whatever route it takes it’s good to know whether you might be sued for something and what the likely outcome will be.

    5. 1

      Wasn’t this posted a couple weeks back?

      1. 5

        This is the next step.