CoPilot definitely can, and has been demonstrated to, produce nontrivial chunks of GPL’d code verbatim with the license header removed.
This feels like a buyer-beware situation. If you use Copilot and it generates nontrivial code, maybe realize that’s too good to be true and Google for the source.
However, there should be no restriction on using Open Source software for training. Actually, what Microsoft does here is the scalable version of someone learning from Open Source code and then starting a consulting business.
It is not the same. Computers do not “learn” in the way that humans do; they do not possess semantic understanding. Computers do not have the capacity for creativity that humans do. This argument only serves to further confuse people about what ML actually is and what it isn’t.
But the argument is about whether looking at a bunch of source code is copyright infringement, not the nature of consciousness or something. I can’t think of how training a neural network to, say, identify possible authorship, using copyrighted images as a training set, is any different from me studying a bunch of books of copyrighted images to gain enough expertise to do the same.
No, the problem with this is not the “looking”, they already do that for their search engine and nobody cared.
The problem is the thoughtless regurgitation of other people’s code, without attribution, and without regard for whether it constitutes fair use or not. Since authorship has been scrubbed from the model, there is no way to determine for yourself if you are in violation, and MS absolves themselves of responsibility. It’s a time bomb.
We have to. All this code doesn’t fit into our brains. We can’t even turn it off; you can’t read a piece of code without imagining how it will continue.
Attention models will just embed whole files if that’s the best way to do it. They don’t care about scope.
Primarily, as I have experienced it, GitHub CoPilot is used, as the name suggests, as a co-pilot for inconvenient glue code or well-known algorithms. It’s basically an AI automation of copying code from StackOverflow. So I still can’t see what the big deal is here.
I’m also not a lawyer, but if we look at the popular MIT License, for instance, it says:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
The key word in the MIT license snippet that you’ve quoted is ‘substantial’. This matters because not everything in a codebase is subject to copyright. For example, if you copied

for (int i=0 ; i<100 ; i++) {

from some MIT-licensed source file, the attribution requirement wouldn’t hold, because that line is not sufficiently original to qualify for copyright (copyright covers original works, which is quite a subjective thing, but some things are obviously on the not-qualified side of the line). I am not a lawyer, and so I can’t comment on whether the things generated by Copilot meet this bar (from what I’ve read, being a lawyer would still not let me confidently make such a claim; there have been a number of expensive court cases trying to make this determination over shortish snippets of code).
If you type in some well-known class like “FizzBuzz” then it may paste an entire class grabbed from a specific piece of code somewhere. I have seen videos of people testing CoPilot where it pastes enough to fill a page. Given that, I think violating an MIT license is very likely to happen to many people.
People say that Copilot is the same as a human learning from reading GPL code or whatever, but humans don’t produce long verbatim transcriptions of the fast inverse square root. It’s not the same. Microsoft should not have included GPL code in Copilot. It’s a PITA that they did, because it just muddies the water and turns what should be a useful tool into a potential legal time bomb. They played fast and loose, and now everyone else is in legal limbo until it gets resolved in the courts.
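For context on just how distinctive that snippet is: the fast inverse square root from the Quake III Arena source (released under the GPLv2) looks roughly like this. This is paraphrased from memory, with the original pointer casts swapped for memcpy, but the structure and the famous magic constant are the real thing:

```c
#include <stdint.h>
#include <string.h>

// Fast inverse square root, as popularized by the Quake III Arena source.
// Approximates 1/sqrt(number) without calling sqrt().
float q_rsqrt(float number) {
    const float threehalfs = 1.5f;
    float x2 = number * 0.5f;
    float y = number;

    uint32_t i;
    memcpy(&i, &y, sizeof i);            // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);           // the famous magic-constant trick
    memcpy(&y, &i, sizeof y);
    y = y * (threehalfs - (x2 * y * y)); // one Newton-Raphson refinement step
    return y;
}
```

Nobody independently “relearns” this byte for byte, magic constant and all; if a tool emits it verbatim, the tool copied it.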
GPL isn’t special here. Unless the code is CC0 or WTFPL or whatever, reproduction of substantial parts with no credit is a violation. Basically all OSS code is under a license that has restrictions on reuse.
Speaking as someone who has released a load of MIT-licensed code, there are two things that I want to get out of releasing it:
If it’s useful to other folks, it’s easy for them to contribute things back (saving me work). Attribution matters here so that they know where to send bug fixes.
It helps build my reputation, which makes it easier for me to get people to pay me for other work (including adding features to the MIT-licensed projects).
It’s not clear that the warranty disclaimer is even necessary anymore. It was based on a 1980s view of copyright and contract law. These days, there’s very little expectation that software comes with an implied warranty (even less if you’re not paying anything for it), and so the disclaimer of warranty is probably (though not definitely) superfluous. If someone did sue you over bugs in code that you gave away for free, it’s likely that the court would require you to give them a full refund and make them pay their own legal fees. The warranty disclaimer bit is a CYA in case this interpretation of the current legal climate is wrong (or in case it changes in the future). In contrast, the attribution is core to the reasons that I publish the code in the first place.
I’m getting tired of posts that seem to assume Microsoft/GitHub have not considered the possibility of being sued for license violations, or proposing “one weird trick” type stuff to try to create contradictions or whatever.
Either somebody sues them and it gets hashed out in court, or nobody sues them and it’s de facto legal as a result. There. That’s the legal analysis of Copilot.
See the “copilot is stupid” post for a deeper exploration of this problem.
But the quote in your comment only deals with “looking”.
Not “looking”, training. Training what? An ML model. For what purpose? Producing code.
Now prove that humans possess semantic understanding.
Proof presupposes it.
Isn’t this part definitely missing?
Or at the very least requires reproduction of the license by the user.
Violating the MIT license strikes me as less worrying because you’re only distributing the license to make sure that the lack of warranty is known, but yes, that’s also a concern.
Maybe they could make a reverse Copilot that looks at a snippet, figures out where it is substantially from, and adds a license comment. 🙃
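Joking aside, the core of such a tool is just code-similarity search, the same idea behind plagiarism detectors like Moss. A toy sketch of the matching step; the specifics here (character k-grams over whitespace-stripped text, a djb2 hash, a naive pairwise comparison) are my own illustrative choices, not how any real system necessarily works:

```c
#include <stddef.h>

#define K 8          /* k-gram length; arbitrary illustrative choice */
#define MAX_LEN 4096

/* Hash every K-character window of the whitespace-stripped input.
   Returns the number of hashes written to out. */
static size_t fingerprint(const char *src, unsigned long *out) {
    char norm[MAX_LEN];
    size_t n = 0;
    for (const char *p = src; *p && n < MAX_LEN - 1; p++)
        if (*p != ' ' && *p != '\t' && *p != '\n')
            norm[n++] = *p;  /* drop whitespace so reformatting can't hide a copy */

    size_t count = 0;
    for (size_t i = 0; i + K <= n; i++) {
        unsigned long h = 5381;                      /* djb2 hash */
        for (size_t j = 0; j < K; j++)
            h = h * 33 + (unsigned char)norm[i + j];
        out[count++] = h;
    }
    return count;
}

/* Fraction of the snippet's k-gram hashes that also appear in the candidate
   source. Near 1.0 suggests the snippet is substantially from that source;
   near 0.0 suggests it is unrelated. */
double overlap(const char *snippet, const char *source) {
    unsigned long a[MAX_LEN], b[MAX_LEN];
    size_t na = fingerprint(snippet, a);
    size_t nb = fingerprint(source, b);
    if (na == 0) return 0.0;

    size_t hits = 0;
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j]) { hits++; break; }
    return (double)hits / (double)na;
}
```

A real tool would index fingerprints of the whole training corpus and report the best-matching file plus its license, but the matching math is no deeper than this.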
Primarily to give credit / attribution, I would say
I use MIT so that when I change jobs, I can still use all the code that I’m writing at my current job.
There’s a difference between learning to code from everyone’s code and using that infused knowledge, and copying byte-for-byte someone else’s code.
I’m curious what learning people think they are doing when they type in “TwoSum” and it pastes the answer from Leet Code. Based on what I have seen from some online coding academies, this may actually be someone’s idea of coding education.
Maybe reactions to umicte should be merged?
Well, there is legal theory and legal practice.
In theory, Microsoft could be called to account for this.
In practice, an employee of some small company maintains open source for work and uses Copilot. Said employee submits to some BSD-licensed codebase a recreation of a routine that is not under a BSD license. The original author of that code sues the employee and the employer for infringement and statutory damages, in the author’s venue of choice. No mention is made of Microsoft or Copilot.
Have a nice day.