1. 86
  1. 24

    Microsoft’s position on accessing source code in violation of license terms is clear: “any use of it – including to study how it is built – is illegal”. See https://www.eweek.com/it-management/curious-programmers-face-legal-tangles-with-leaked-windows-code/

    1. 5

      People have posted proprietary, leaked source code to GitHub before. What happens if Copilot emits (leaked) Windows source code, for example?

      1. 3

        It’s removed from the training set, probably. Or blocked at output.

        1. 12

          It would be ridiculously hypocritical to exclude Windows source that MS owns copyright to from the training data while including copyleft code that they do not.

          I’m generous enough to assume that they included all of the internal Microsoft code in the corpus as part of training Copilot. Otherwise they’re either intentionally worsening the product they’re selling or openly violating others’ software licenses. I think Microsoft is better than that.

          1. 25

            I think Microsoft is better than that.

            Seriously? Seriously? Microsoft’s empire was built on screwing over others, from the early days when they sabotaged IBM’s OS/2 efforts with dirty tricks to sell Windows, to killing off Netscape by bundling Internet Explorer, to their smear campaign against Open Source, to taking the wind out of the AppGet folks’ sails with WinGet more recently. There are many more examples of dirty politics to kill competitors.

            1. 5

              I think you missed the giant implicit <sarcasm> tag.

    2. 22

      There’s a false dilemma in the heart of the argument. Specifically:

      Thus, those who wish to use open-source software have a choice. They must either:

      1. comply with the obligations imposed by the license, or

      2. use the code subject to a license exception, e.g., fair use under copyright law.

      But there is a third option, which is for GitHub to obtain access to the code under a license which allows GitHub to use it in Copilot. And GitHub’s Terms of Service include a license grant in Section D.4:

      You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

      That “as necessary to provide the Service” thing is worryingly flexible – a similar wording in the US Constitution is infamously called the “Elastic Clause” because of how many things it can be stretched to cover. And the explicit “analyze it on our servers” feels like it easily covers use to train an ML model.

      And I would not be surprised at all if GitHub were simply to point to their terms of service, and accompanying license grant, and say that gives them all the permission they need to train and run Copilot. The only tricky bit would be someone putting open-source code onto GitHub when they lack the legal right to make such a license grant to GitHub, but then GitHub will point to Section P of their terms, and things will go badly for the person who put the code on GitHub, rather than going badly for GitHub.

      1. 10

        But the license does not extend to other parties. IANAL, but I guess one outcome may be that Copilot itself is legal, since users give GitHub this license, but that users of Copilot are in violation, since the license does not extend to them. I wonder if GitHub facilitates copyright infringement in that case.

        1. 6

          If you put code on GitHub, and have the appropriate rights to make the license grant involved in GitHub’s terms of service, then GitHub has the right from that license to distribute to others. Whatever open-source license you put on the code is not necessarily involved in the process at all.

          Also, your thought here is based on the assumption that your original choice of open-source license can even persist through the training of an ML model that eventually spits out something similar-looking, which I don’t think is clearly established (and may not even be clearly established if the model occasionally spits out verbatim copies of some pieces of code). There’s dangerous legal ground here since Copilot veers close to a lot of things the industry has assumed should be legal, like reverse engineering, clean-room reimplementations, and so on.

          1. 3

            I’d think it was the other way around? The GitHub terms of service do not invoke any rights to redistribute, so the only rights to redistribute that GitHub has are the ones from the license the user chose.

            The question is then, is Copilot redistributing? Some of the examples I’ve seen show very, very similar code, so I’d say that at least in some cases, yes, it is.

            1. 3

              Then if you care about your license, you should remove your projects from GitHub right now, and urge the projects you contribute to to do the same, before someone gets to decide the interpretation for you.

              1. 2

                The GitHub TOS license grant includes the right to “share … with other users”. That gives them redistribution to users of GitHub’s products and services. Which, if you go with the theory that Copilot’s output is a derivative work of the training data, puts GitHub in the clear.

                I’m not convinced, though, that a court would or necessarily should automatically see Copilot’s output as a derivative work of the training data, which was what the second half of the above comment was about.

                1. 1

                  Do you know or have a link to the specific wording? I ask because I vaguely remember a kerfuffle about this a while ago, and the consensus seemed to be that GitHub only asked for rights to display or share the code in the context of the service, which was what kind of calmed people down. Although, damn, now that I think about it, Copilot could easily fit “the context of the service” so, yeah.

                  1. 3

                    My top-level comment already quoted the full thing. And in a reply pointed to what the scope of “the Service” is defined as. You can also go read GitHub’s TOS for yourself.

          2. 5

            Not a lawyer, but that section is pursuant to “the Service”, that being the “GitHub” product. GitHub Copilot is not the aforementioned service but a product developed by GitHub, Inc.

            1. 15

              From the definitions at the top:

              The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.

            2. 5

              The problem is that GitHub allows me to upload code that I can legally distribute but don’t have the right to give them rights outside those of the license. So, e.g., code from the Linux kernel is definitely on GitHub in several forms - but did the copyright holders agree to those terms? The GPL is fine with uploading + distributing the code via GitHub but does not convey the rights that Microsoft is claiming.

              1. 6

                I mentioned that, and pointed out the section in GitHub’s terms that covers it. But I’ll point to it again for emphasis here.

                You agree to indemnify us, defend us, and hold us harmless from and against any and all claims, liabilities, and expenses, including attorneys’ fees, arising out of your use of the Website and the Service, including but not limited to your violation of this Agreement

                So basically, if you put code on GitHub and it turns out you did not possess the legal right to grant the license to GitHub that GitHub’s terms require, those terms also say that you are the one who gets to be responsible.

                1. 4

                  I wonder if that even holds up in court, considering that you need only an e-mail address to sign up. This is not a proper contract, and they don’t even know who their users/customers are, so they have no recourse.

                  Put in another way, someone can sue GitHub for infringement and they’d probably have to pay up. According to the above, GitHub has the right to sue the user who caused the infringement for damages, but if there’s no way to reach that user, they’re SoL.

                  1. 2

                    It probably would hold up for some cases, such as if you uploaded a pile of stolen proprietary code and the owner sued GitHub. For Copilot, if it’s deemed to be illegal use, I don’t think it would be politically feasible to try to enforce this clause because it would need to be applied to the majority of GitHub’s customers and would kill the product entirely. That would be a multi-billion-dollar cost, not counting the long-term cost of leaving that market. If it ever got close to a situation where that needed to be considered as an option, I’d expect to see Kevin Scott leaving the company quite rapidly.

                    Disclaimer: I work at MS but have no connection to the GitHub or Copilot teams.

                    1. 2

                      I wonder if that even holds up in court, considering that you need only an e-mail address to sign up. This is not a proper contract, and they don’t even know who their users/customers are, so they have no recourse.

                      You wonder if what holds up in court? An indemnify-and-hold-harmless clause? Because that’s not some sort of brand-new invention GitHub just recently came up with that’s never been tested. It’s boilerplate. And they have the email address and the IP address used for the account, so they have other entities to send subpoenas to to track down a real-world identity if they feel motivated to do so.

                      Can you still put together contrived scenarios where someone goes “good luck, I’m behind seven proxies” and actually has enough skill to constantly stay one step ahead of attempts to reveal their identity? Sure. But keep in mind that we’re talking about unauthorized code uploads here, which probably are going to be handled via DMCA notice – since GitHub follows the DMCA process, they have safe harbor for user uploads – rather than a multi-vigintillion-dollar lawsuit that somehow wipes Microsoft off the planet.

                      I’m also not sure why you think it’s “not a proper contract”. Can you cite specific reasons for that?

                      1. 1

                        email address and the IP address used for the account, so they have other entities to send subpoenas to to track down a real-world identity if they feel motivated to do so.

                        True, but that requires cooperation from the e-mail provider and/or ISP and might lead to nothing.

                        DMCA notice

                        Removal of the offending original repository isn’t going to undo the usage of that same copyrighted code in other people’s software, especially when introduced by Copilot, which I’d say makes GitHub/MS liable for the further spreading of that code.

                        I’m also not sure why you think it’s “not a proper contract”. Can you cite specific reasons for that?

                        It’s a very one-sided “agreement”, and everybody (including judges) knows that the vast majority of people just click through these without actually reading them.

                        1. 1

                          True, but that requires cooperation from the e-mail provider and/or ISP and might lead to nothing.

                          A properly-issued subpoena compels cooperation. I don’t see much reason why an email provider or ISP would fight one.

                          Removal of the offending original repository isn’t going to undo the usage of that same copyrighted code in other people’s software, especially when introduced by Copilot, which I’d say makes GitHub/MS liable for the further spreading of that code.

                          I wouldn’t be surprised if GitHub intends to re-train Copilot on an ongoing basis, and thus has some way to take certain code out of the training corpus. And while statutory damages for copyright infringement are a bit steep for individuals, even in the worst possible outcome of the worst possible case I can imagine, it doesn’t get anywhere near threatening the ongoing existence of GitHub and/or Microsoft. And again that presumes a plaintiff who absolutely refuses to accept any other outcome, which is dangerous – refusing a reasonable settlement offer from the defendant can end up being financially ruinous for the plaintiff (see this recent infamous example in a copyright case).

                          It’s a very one-sided “agreement”, and everybody (including judges) knows that the vast majority of people just click through these without actually reading them.

                          To the extent that a free-tier GitHub account is “one-sided”, it’s in favor of the account holder, not GitHub, because the account holder gets free access to a bunch of resources and services from GitHub, without paying money for them. If you want to argue that courts have ruled all website terms of service are completely unenforceable, please provide specific citations to the relevant cases, preferably under US law (which governs GitHub’s terms).

                          1. 1

                            A properly-issued subpoena compels cooperation. I don’t see much reason why an email provider or ISP would fight one.

                            More like ignore it, if they’re in certain countries there’s nothing you can do about it. And some e-mail providers like Protonmail or ISPs like Freedom in the Netherlands have a reputation to uphold of protecting their customers.

                            I wouldn’t be surprised if GitHub intends to re-train Copilot on an ongoing basis, and thus has some way to take certain code out of the training corpus.

                            Of course they would, but I meant code that was emitted by Copilot and ended up in other code bases before the case.

                            “one-sided”

                            It is one-sided, in that it’s not a contract resulting from a negotiation between account holder and GitHub. GitHub can change their ToS on a whim and the only thing you can do is decide to shut down your account if you disagree (assuming you even read it).

                            free access to a bunch of resources and services from GitHub, without paying money for them

                            Let’s not pretend GitHub is a charity, that’s disingenuous.

                            all website terms of service are completely unenforceable

                            Well, that’s not the case. But that’s also making a much stronger statement than I did; I was considering what a judge would think when a user would get sued for damages over something like this (unless the user was acting deliberately with malicious intent, perhaps).

                            preferably under US law (which governs GitHub’s terms).

                            That’s the kind of US-centric attitude that I not only dislike but is plain incorrect. GitHub is available world-wide and is subject to local laws in the countries where it operates. It has been proven again and again that US companies have to abide by the GDPR, for example, even though it’s not a US law. You cannot simply reject the fact that local law applies through the ToS, even though companies do try.

                            1. 2

                              And some e-mail providers like Protonmail or ISPs like Freedom in the Netherlands have a reputation to uphold of protecting their customers.

                              I don’t think they’re going to “protect” the way you’re thinking they will, but it’s also turning into a diversion from the main point.

                              Well, that’s not the case. But that’s also making a much stronger statement than I did; I was considering what a judge would think when a user would get sued for damages over something like this (unless the user was acting deliberately with malicious intent, perhaps).

                              And yet those circumstances don’t impact the enforceability of the terms of service. If they’re enforceable they’re enforceable. If they’re not, they’re not. There is no “enforceable, but only against people we think deserve to have enforcement”. Selective enforcement is something the law is supposed to go to great lengths to avoid!

                              That’s the kind of US-centric attitude that I not only dislike but is plain incorrect.

                              It is a trivially verifiable fact that section R.1 of the GitHub terms of service says:

                              Except to the extent applicable law provides otherwise, this Agreement between you and GitHub and any access to or use of the Website or the Service are governed by the federal laws of the United States of America and the laws of the State of California, without regard to conflict of law provisions. You and GitHub agree to submit to the exclusive jurisdiction and venue of the courts located in the City and County of San Francisco, California.

                              So when I ask you to provide citations from US case law because US law is what governs GitHub’s terms of service, it is not “plain incorrect”. In fact it is the most correct thing I could ask you to do in the circumstance. Your personal dislike of it – and also the fact that you seem not to understand what it means for a particular jurisdiction’s laws to “govern” an agreement – is neither here nor there, and the rest of your paragraph about this was an extended non sequitur.

                              It is very likely that GitHub’s terms of service are, in fact, enforceable. I’m not sure how to convince you of that, but if you’re simply never going to accept that, then we don’t really have anything else to say to each other.

                  2. 2

                    I imagine it might come down to the people who uploaded the code to GitHub failing to reconcile the distribution terms of the GPL with the terms of service for using GitHub.

                2. 6

                  I feel legally, this has potential to be “won”, but ultimately I feel it’s futile. If anything this may push toward code being non-copyrightable (and maybe not just code), which would be pretty funny. It’ll be interesting in any direction. I believe the best solution is to adjust our social behaviors as necessary (want a strict license? don’t share the code at all.)

                  1. 18

                    If anything this may push toward code being non-copyrightable (and maybe not just code)

                    Code being not copyrightable was considered a goal of the FSF early on; they built the GPL on top of copyright specifically in order to subvert copyright, not because they liked it.

                    1. 3

                      Absolutely they didn’t like it… It makes me wonder why FSF isn’t leaning into it too!

                    2. 3

                      I mean, I want to release my code as GPLv3, which is not possible under your suggested social model of ‘either release it and accept copyright laundering or don’t share it at all’.

                      1. 4

                        I think the thing people are getting at is that if there were some method established to legally “launder away” licenses from software, would the GPL even be necessary at that point? In theory the point of the GPL is to prevent someone taking GPL’d code and putting it, or a derivative work of it, under more restrictive terms that prevent you from inspecting, modifying, redistributing, etc., but if you can just regain those freedoms by running the software through a magic license-laundering box, does it even matter?

                        1. 4

                          That only works if the source code is publicly available.

                        2. 3

                          Of course it’s not possible because GPL means nothing in such a social context :)

                        3. 3

                          If anything this may push toward code being non-copyrightable (and maybe not just code), which would be pretty funny.

                          Funny and karma. GitHub is owned by Microsoft and the Windows source code has been leaked several times.

                          1. 1

                            Is there any proof that this leaked code has been included in the Copilot training set?

                            1. 5

                              Shouldn’t be hard to prove, just try to use Copilot and use some unique identifiers from the Windows source and have it autocomplete that. If it works, it’s included, otherwise we can be reasonably confident they blacklisted their own software from it. If so, that might actually be good fodder for the plaintiff.

                              1. 3

                                […] we can be reasonably confident they blacklisted their own software from it. If so, that might actually be good fodder for the plaintiff

                                Or GitHub only included code with an explicit open source license in the Copilot training set.

                                1. 4

                                  That assumes all repositories are correctly-labeled… It wouldn’t surprise me if there are copies of the Windows code out there marked as “public domain” or some such ;)

                              2. 3

                                I meant that if this pushes code to being non-copyrightable (I don’t believe it will, the copyright lobby is big and powerful), then this would apply equally to Microsoft’s leaked Windows code.

                            2. 3

                              I feel legally, this has potential to be “won”

                              Very unlikely; the resulting fragments are likely not covered by copyright law, and using open source code to train a neural net and then using that neural net for commercial purposes does not seem to violate any law either. I would think it more likely that users who add the generated code fragments to their code could be held liable for copyright or patent infringement, but that is not the service provider’s liability. Nevertheless, the case can serve well as a promotional event for a few lawyers.

                              1. 2

                                Yes, because individual action alone always solves large systemic problems.

                              2. 5

                                Fooling around with text-to-image made me think about this topic as well. It’s really hard to draw lines with copyright. I think it’s a lot clearer for code and Copilot, which in my opinion is kind of similar to a file sharing application that once existed (can’t find it right now), which argued that it doesn’t transfer bytes, but rather did something along the lines of just transmitting instructions, which would then build the binary. The argument was that it wasn’t really copying.

                                If someone here knows what I’m talking about, I’m curious whether any ruling or conclusion came out of that.

                                I wonder if that could be re-created in a way that just matches the requirements, if Copilot is considered to be okay.

                                1. 4

                                  I think the copyright arguments are weaker when it comes to this kind of code (assuming Copilot only uses “open source” input). The code creator has explicitly and actively chosen a license that allows others to use the code. It’s much less clear to me that copyrighted images are ok to be used as input to tools like DALL-E etc.

                                  Anyway, the way things work under an adversarial legal system in the US is that someone tells the court they have been wronged and the court decides. So the project under discussion is a part of that process.

                                  1. 6

                                    The code creator has explicitly and actively chosen a license that allows others to use the code

                                    …under the license terms, which Copilot does not respect. And there is proprietary code on GitHub.

                                    1. 1

                                      The license issue is what’s under debate. From the linked article:

                                      If Microsoft and OpenAI chose to use these repos subject to their respective open-source licenses, Microsoft and OpenAI would’ve needed to publish a lot of attributions, because this is a minimal requirement of pretty much every open-source license. Yet no attributions are apparent.

                                      Therefore, Microsoft and OpenAI must be relying on a fair-use argument. In fact we know this is so, because former GitHub CEO Nat Friedman claimed during the Copilot technical preview that “training [machine-learning] systems on public data is fair use”.

                                      (my emphasis)

                                      Note that I am personally critical of the ethics, if not the legality, of using copyrighted images as fodder for machine learning, and I am in principle critical of Copilot on the same grounds. But anyone who releases code under an open source license does it knowing that it can be used without remuneration, unlike many, many image creators. This muddies the waters.

                                      1. 6

                                        In that case, it’s not a matter of “choosing the license”, since fair use applies to any license, including none.

                                        1. 1

                                          That’s correct, and I hadn’t thought of that!

                                    2. 1

                                      First of all, copyright isn’t licenses. I think you mean licenses. The author usually clearly states either “You have to put my name onto it” (attribution) or “You have to use the same license for distribution”. I have so far not seen anything in Copilot that makes sure of this.

                                      1. 1

                                        As far as I know, no open source licenses have been challenged in (US) court, so any claims would probably be that Copilot violates the creators’ copyright (which they implicitly “give up” by using a specific open source license). But IANAL.

                                        1. 2

                                          As far as I know, no open source licenses have been challenged in (US) court

                                          The GPL has been upheld in (German) court by Harald Welte of gpl-violations.org. See also this incomplete list of GPL lawsuits

                                          1. 1

                                            Oh, I was talking about what they clearly state, not what courts found.

                                            However what makes you think that choosing an open source license makes one give up copyright? Would you say the same thing when you accept EULAs or License Agreements on Closed Source Software? Why would it be different for Open Source Licenses?

                                            Most licenses start with a copyright notice, like “Copyright (c) 2009 The Go Authors. All rights reserved.” (BSD License).

                                            There are also copyleft licenses, like the GPL, that in my understanding are also only able to do what they do because of copyright law. And that one has been challenged.

                                            https://en.wikipedia.org/wiki/GNU_General_Public_License#Legal_status

                                            It would be different for source code in the Public Domain, which you might not even be able to get your work into (while you are alive) outside the US, though, because the law might say there always needs to be a copyright holder.

                                            1. 1

                                              However what makes you think that choosing an open source license makes one give up copyright?

                                              I don’t, which is why I put the term in quotes.

                                              I’m familiar with the interplay between FLOSS licenses and copyright and the subtleties in that interplay.

                                              But from a layman’s perspective, a FLOSS licenser “waives” copyright when it comes to claiming the rights that this generally grants (again in quotes!), for the work that is licensed. They don’t claim exclusive access or payment or anything like that.

                                      2. 3

                                        Here is a prototype: https://github.com/xigoi/xopilot

                                        1. 1

                                          Would be interesting to have something that doesn’t go for code but for binaries.

                                          1. 1

                                            It works for binaries too.

                                            1. 1

                                              Whoops, looked at the wrong project. But love it. Nice one! :)

                                      3. 4

                                        I want a clause that when code is used as a part of machine learning algorithms, its dataset needs to be licensed in a GPL or CC BY-SA-compatible manner (not sure if a dataset is code or creative work). I like the idea of this open source project data being used to assist developers, but I don’t like that it’s packaged up in a proprietary manner, especially when sold by a publicly-traded corporation.

                                        1. 5

                                          I have already seen some Free Software folks I respect regretting the fact that Copilot seems to be driving people toward more, and more extreme, copyright maximalism. I think that’s the path you’re on here and I think if you consider yourself a Free Software advocate you might want to stop and think long and hard about whether that’s the path you want to be on.

                                          I’m also not at all convinced that current US copyright law would let you license something in a way that affects the eventual licensing of the output of some ML model that happened to include your stuff in its training set. Even if GitHub never tries to make the fair-use argument for Copilot, your approach gets closer to being able to impose license restrictions that persist through a fair use than I’m comfortable with.

                                          1. 1

                                            I think this recent comment from Drew DeVault is a strong defense of using copyright as a practical means of protecting free software in the current world, even when one’s ideal world would not have copyright.

                                            1. 1

                                              The basic argument form here is that if it’s possible to “launder away” license terms from a piece of software through a mechanism like training a machine-learning model, then the “we need copyleft to prevent exploitation of our Free Software” position starts to fall apart. If someone can launder away your copyleft license and incorporate your code against your will into their proprietary software, it would also be possible to launder away their proprietary license and produce Free Software (in the sense that you have the freedoms the FSF cares about, not in the sense of “has an FSF copyleft license attached”) out the other end.

                                              So that comment doesn’t really refute or rebut – it still operates from an assumption that license terms stick to the code when it goes into the ML model, while the whole point is that if license terms don’t stick through the ML model, then you have the Holy Grail of software freedom, because you no longer have to carefully protect against people imposing restrictive terms on software (since you can just launder those terms away).

                                          2. 2

                                            Licenses can’t create new restrictions where none already exist.

                                            So, either Copilot is lawful, not copyright infringement, and the terms of your license don’t matter one whit.

                                            Or Copilot is illegal copyright infringement, and this new license you’re imagining would only be useful if enough people applied it to their code that someone training a new model was interested in including it in the training data (the entirety of which would have to be licensed in a manner in which they could comply with the terms of the license).

                                            Realistically, I don’t see someone managing to scrounge together enough code with licenses like that unless they figure out how to do it by complying with existing open source licenses, at which point it seems very unlikely they’d be interested in including your code with a bespoke license.

                                          3. 4

                                            I was with this right up until this line:

                                            Arguably, Microsoft is creating a new walled garden that will inhibit programmers from discovering traditional open-source communities. Or at the very least, remove any incentive to do so.

                                            In my experience, people mostly engage with open source communities because they A) need a library, or B) want to contribute back. They’re not copying individual lines of open source code, but using libraries, and I think that wouldn’t really change with Copilot in play.

                                            1. 2

                                              If Microsoft had a reputation for being pro-openness, this could get by. But we all know how aggressive they can be when it’s them who want to stop innovation.

                                              1. 2

                                                The argument seems to be something like “if ebooks aren’t stopped, there won’t be any physical books.” Nothing in law really obligates the preservation of any particular way of creating software. If that were a principle, car manufacturers could have shut down electric cars.