1. 42
  1.  

  2. 50

    This seems to contradict some statements from the article (assuming it is true) in that it reproduces non-trivial GPL code and relicenses it on the fly.

    1. 17

      I think what Julia is missing is that this is something the big corps can get away with, but individual developers will get sued into the ground.

      1. 7

        “Copilot relicenses the code on the fly” is a bit of an iffy characterisation. Armin gets the inverse square root, then goes back to the top of the file and says “fill out this thing for a license!”. This isn’t some fully-automated process; it is being purposefully guided.

        I don’t think that Armin was implying that this tool should have properly guessed the license here. There’s already enough to be said about Copilot; there’s no need for silly demos like this to be taken as a legal argument.

        1. 7

          The licensing portion of that demo is basically irrelevant. If Copilot lifts a function verbatim from one codebase and puts it into another, then the licenses of those two codebases are what matter.

        2. 5

          Corollary: even if you put in some sort of wrapper/check to avoid outputting GPL-licensed snippets like this, or try to add attributions or whatever, the network contains a copy of the GPL code (encoded as weights and so on) so you have to deal with whether that copying/derivation is legitimate.

          It doesn’t necessarily make it go away if the network sometimes produces output way different from any input (my copies are still copies even if I also do original work), or if copies are often inexact (I can’t get away with distributing that photo editor just ‘cause I edited its splash screen to say “Photoslop”). The law as it stands is that humans can add a lot and still infringe (see the “My Sweet Lord” lawsuit), so it seems weird to give a well-funded company’s computer program a pass. In general I think IP laws should be weaker, but I don’t think they should be selectively weaker for companies skirting open-source licenses!

          Seems like the robust solution is using code you actually have permission to use, either because the licenses are explicitly compatible with hoovering up source into a commercial product, or copyright owners relicensed the code for this use, e.g. if MS used their own code or got separate permission from companies or open source authors to train on their stuff. (Some risk of your model getting ‘infected’ by someone else’s bad license/authorship labeling though.) Or (I think?) you could even have a GPL-licensed model that produces GPL-licensed code from GPL-licensed source.

        3. 16

          The short code snippets that Copilot reproduces from training data are unlikely to reach the threshold of originality.

          The thing is that the core logic of a tricky problem can very well be very little code. Take quicksort, for instance. It is a super-clever algorithm, yet not much code. Luckily quicksort is not under some license that hinders its use in any setting, but it could very well be. Just because it is only 10 lines does not mean it is not an invention that is copyrightable. Code is very different from written language in that regard.
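
          To make the “only 10 lines” point concrete, here is roughly all the code a textbook quicksort needs (a throwaway Python sketch, not lifted from any particular project):

            def quicksort(xs):
                # Pick a pivot, partition into smaller/equal/larger, recurse.
                if len(xs) <= 1:
                    return xs
                pivot = xs[len(xs) // 2]
                smaller = [x for x in xs if x < pivot]
                equal = [x for x in xs if x == pivot]
                larger = [x for x in xs if x > pivot]
                return quicksort(smaller) + equal + quicksort(larger)

          All the cleverness is in the idea; the expression of it is tiny.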

          1. 16

            Yeah the title is ignoring this important bit. The claim is precisely that Github Copilot can suggest code that DOES exceed the threshold of originality.

            Based on machine learning systems I’ve used, it seems very plausible. There is no guarantee that individual pieces of the training data don’t get output verbatim.

            And in fact I’d say that’s likely, not just plausible. Several years ago, I worked on a paper called Deep Learning with Differential Privacy that addresses leakage along the path training data -> model -> inferences (it seems to have nearly 2000 citations now). If such leakage were impossible, there would be no reason to do such research.

            1. 2

              That was still a concern as of about two years ago (and presumably still is), because it was one of the topics I was contemplating for my bachelor’s thesis. The hypothesis presented to me was that there is (allegedly) a smooth trade-off: the better your model, the more training data it leaks, and vice versa. Unfortunately, I chose a different topic, so I can’t go into detail here.

              1. 1

                There is no guarantee that individual pieces of the training data don’t get output verbatim.

                The funny thing is that humans do that too. They read something and unknowingly reproduce the same thing in writing as their own, without bad intent. I think there is a name for that effect, but I can’t find it at the moment.

                1. 5

                  I’d say it depends on the scale. Sure in theory it’s possible for a human to type out 1000 lines of code that are identical to what they saw elsewhere, without realizing it, but vanishingly unlikely. They might do that for 10 lines, but not 1000 lines, and honestly not even 100.

                  On the other hand it’s pretty easy to imagine a machine learning system doing that. Computers are fundamentally different than humans in that they can remember stuff exactly … It’s actually harder to make them remember stuff approximately, which is what humans do :)

              2. 8

                Quicksort is an algorithm, and isn’t covered under copyright in the first place. A specific implementation might be, but there’s a very limited number of ways you can implement quicksort and usually there’s just one “obvious” way, and that’s usually not covered under copyright either – and neither should it IMO. It would be severely debilitating as it would mean that the first person to come up with a piece of code to make a HTTP request or other trivial stuff would hold copyright over that. Open source code as we know it today would be nigh impossible.

                1. 6

                  So that means the fast inverse square root example from the Quake engine source code can be copied by anyone without adhering to the GPL? It is just an algorithm after all. If that is truly the case, then the GPL is completely useless, since I can start copy & pasting GPL code without any repercussions, because it is “just an algorithm”.

                  1. 9

                    If your friend looks at the Quake example and describes it to you without actually telling you the code – by describing the algorithm – and you write it in a different language, you are definitely safe.

                    If your friend copies the Quake engine code into a chat message and sends it to you, and you copy it into your codebase and change the variable names to match what you were doing, you are very probably in the wrong.

                    Somewhere in between those two it gets fuzzy.

                    1. 11

                      It looks like my friend copilot is willing to paste the quake version of it directly into my editor verbatim, comments included, without telling me its provenance. If a human did that, it would be problematic.

                    2. 4

                      In the Quake example it copied the (fairly minor) comments too, which is perhaps a bit iffy but a minor detail. There is just one way to express this algorithm: if anyone were to implement this by just having the algorithm described to them but without actually having seen the code then the code would be pretty much identical.

                      I’m not sure you’re appreciating the implications if it worked any differently. Patents on this sort of thing are already plenty controversial. Copyright here would mean writing any software would carry a huge potential for copyright suits and trolls; there’s loads of “only one obvious implementation” code like this, both low-level and high-level. What you’re arguing for would be much, much worse than software patents.

                      People seem awfully focused on this Copilot thing at the moment. The Open Source/Free Software movement has spent decades fighting against expansion of copyright and other IP on these kind of things. The main beneficiaries of such an expansion wouldn’t be authors of free software but Microsoft, copyright trolls, and other corporations.

                      1. 2

                        Semi-related, Carmack sorta regrets using the GPL license

                        https://twitter.com/ID_AA_Carmack/status/1412271091393994754

                        1. 5

                          Semi-related, Carmack sorta regrets using the GPL license

                          …in this specific instance.

                          Quote from Carmack in that thread: “I’m still supportive of lots of GPL work, but I don’t think the restrictions helped in this particular case.”

                          I’m not implying you meant to imply otherwise, I’m just adding some context so it’s clearer.

                          1. 1

                            That is fine; I am not the biggest GPL fan myself, yet if I use GPL code/software, I try to adhere to its license.

                      2. 5

                        quicksort is not under some license that hinders its use in any setting, but it could very well be

                        Please do not confuse patent law and copyright law. By referring to an algorithm you seem to be alluding to patent law. And yet you mixed the terms “invention” and “copyrightable”. Please explain what you mean further, because as far as I know this is nonsense. As far as I know, “programs for computers” can’t be regarded as an invention and therefore patented under the European legal system; this is definitely the case in Poland. This is a separate concept from source code copyright.

                        Maybe your comment is somehow relevant in the USA but I am suspicious.

                      3. 10

                        Maybe someone could „train AI“ on Microsoft’s binaries and replicate their software… I am sure that they would not be OK with that and would attack you with their lawyers. Today it is more difficult, because a lot of their software runs as a service in the so-called cloud (SaaS), but there are still significant portions of client-side code that could be replicated this way.

                        1. 7

                          I’ve thought about this with the idea of a clean room implementation robot. Code in -> spec out; throw the spec over the wall; spec in -> code out.

                          1. 6

                            My biggest dream associated with this is drivers and hardware support. My kingdom for every driver’s clean room spec!

                        2. 22

                          I’m honestly appalled that such an ignorant article has been written by a former EU MEP. This article completely ignores the fact that the creation of Copilot’s model itself is a copyright infringement. You give Github a license to store and distribute your code from public repositories. You do not give Github permission to use it or to create derivative works. And as Copilot’s model is created from various public code, it is a derivative of that code. Some may try to argue that training machine learning models is ‘fair use’, yet I doubt you can claim that something which can regurgitate the entire meaningful portion of a file (an example taken from Github’s own public dataset of exact generated code collisions) is not a derivative work.

                          1. 13

                            In many jurisdictions, as noted in the article, the “right to read is the right to mine” - that is the point. There is already an automatic exemption from copyright law for the purposes of computational analysis, and GitHub don’t need to get that permission from you, as long as they have the legal right to read the code (i.e. they didn’t obtain it illegally).

                            This appears to be the case in the EU and Britain - https://www.gov.uk/guidance/exceptions-to-copyright - I’m not sure about the US.

                            Something is not a derivative work in copyright law simply due to having a work as an “input” - you cannot simply argue “it is derived from” therefore “it is a derivative work”, because copyright law, not English language, defines what a “derivative work” is.

                            For example, Markov chain analysis done on SICP is not infringing.
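
                            (By “Markov chain analysis” I mean nothing fancier than building a word-level transition table and sampling from it. A toy Python sketch, where text would stand in for the book:)

                              import random
                              from collections import defaultdict

                              def build_chain(text):
                                  # Record which words have been observed to follow each word.
                                  words = text.split()
                                  chain = defaultdict(list)
                                  for a, b in zip(words, words[1:]):
                                      chain[a].append(b)
                                  return chain

                              def generate(chain, start, n=20):
                                  # Random walk over the transition table, driven only by the
                                  # source's word statistics.
                                  out = [start]
                                  for _ in range(n):
                                      followers = chain.get(out[-1])
                                      if not followers:
                                          break
                                      out.append(random.choice(followers))
                                  return " ".join(out)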

                            Obviously, there are limits to this argument. If Copilot regurgitates a significant portion verbatim, e.g. 200 LOC, is that a derivative? If it is 1,000 lines where not one line matches, but it is essentially the same with just variables renamed, is that a derivative work? etc. I think the problem is that existing law doesn’t properly anticipate the kind of machine learning we are talking about here.

                            1. 3

                              Dunno how it is in other countries, but in Lithuania I cannot find any exception that would allow my works to be used without my agreement and that fits what Github has done. The closest one could be citation, but they do not comply with the requirement of mentioning my name and the work from which the citation is taken.

                              I gave them the license to reproduce, not to use or modify - these are two entirely different things. If they weren’t, then Github would have the ability to use all AGPL’d code hosted on it without any problems, and that’s obviously wrong.

                              There is no separate “mining” clause. That is not a term in copyright. Notice how research is quite explicitly “non-commercial” - and I very much doubt that what Github is doing with Copilot is non-commercial in nature.

                              The fact that similar works were done previously doesn’t mean that they were legal. They might have been ignored by the copyright owners, but this one quite obviously isn’t.

                              1. 8

                                There is no separate “mining” clause. That is not a term in copyright. Notice how research is quite explicitly “non-commercial” - and I very much doubt that what Github is doing with Copilot is non-commercial in nature.

                                Ms. Reda is referring to a copyright reform adopted at the EU level in 2019. This reform entailed the DSM directive 2019/790, which is more commonly known for the regulations regarding upload filters. This directive contains a text and data mining copyright limitation in Art. 3 ff. The reason why you don’t see this limitation in Lithuanian law (yet) is probably because Lithuania has not yet transposed the DSM directive into its national law. This should probably follow soon, since Art. 29 mandates transposition into national law by June 7th, 2021. Germany has not yet completed the transposition either.

                                That is, “text and data mining” now is a term in copyright. It is even legally defined on the EU level in Art. 2 Nr. 2 DSM directive.

                                That being said, the text and data mining exception in Art. 3 ff. DSM directive does not – at first glance, I have only taken a cursory look – allow commercial use of the technique, but only permits research.

                                1. 1

                                  Oh, huh, here it’s called an education and research exception and has been in law for way longer than that directive, and it doesn’t mention anything remotely translatable as mining. It didn’t even cross my mind that she could have been referring to that. I see that she pushed for that exception to be available for everyone, not only research and cultural heritage, but it is careless of her to mix up what she wants the law to be, and what the law is.

                                  Just as a preventative answer: no, Art. 4 of the DSM directive does not allow Github to do what it does either, as it applies to works whose use “has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online”, and Github was free to get the content in an appropriate manner for machine learning. It is using the content for machine learning that infringes the code owners’ copyright.

                                2. 5

                                  I gave them the license to reproduce, not to use or modify - these are two entirely different things. If they weren’t, then Github would have the ability to use all AGPL’d code hosted on it without any problems, and that’s obviously wrong.

                                  An important point is also that the copyright owner is often a different person than the one who signed a contract with GitHub and uploaded the code there (git commit vs. git push). The uploader might agree to whatever terms and conditions, but the copyright owner’s rights must not be disrupted in any way.

                                  1. 3

                                    Nobody is required to accept terms of a software license. If they don’t agree to the license terms, then they don’t get additional rights granted in the license, but it doesn’t take away rights granted by the copyright law by default.

                                    Even if you licensed your code under “I forbid you from even looking at this!!!”, I can still look at it, and copy portions of it, parody it, create transformative works, use it for educational purposes, etc., as permitted by copyright law exceptions (details vary from country to country, but the gist is the same).

                                3. 10

                                  Ms. Reda is a member of the Pirate Party, which is primarily focused on the intersection of tech and copyright. She has a lot of experience working on copyright-related legislation, including proposals specifically about text mining. She’s been a voice of reason when the link tax and upload filters were proposed. She’s probably the copyright expert in the EU parliament.

                                  So be careful when you call her ignorant and mistaken about basics of copyright. She may have drafted the laws you’re trying to explain to her.

                                  1. 16

                                    It is precisely because of her credentials that I am so appalled. I cannot in good conscience find this statement anything other than ignorant.

                                    The directive about text mining very explicitly specifies “only for “research institutions” and “for the purposes of scientific research”.” Github and its Copilot don’t fall into that classification at all.

                                    1. 3

                                      Indeed.

                                      Even though my opinion of Copilot is near-instant revulsion, the basic idea is that information and code is being used to train a machine learning system.

                                      This is analogous to a human reviewing and reading code, and learning how to do so from lots of examples. And someone going through higher ed school isn’t “owned” by the copyright owners of the books and code they read and review.

                                      If Copilot is violating, so are humans who read. And that… that’s a very disturbing and disgusting precedent that I hope we don’t set.

                                      1. 6

                                        Copilot doesn’t infringe, but GitHub does, when they distribute Copilot’s output. Analogously to humans, humans who read do not infringe, but they do when they distribute.

                                        1. 1

                                          Why is it not the human that distributes Copilot’s output?

                                          1. 1

                                            Because Copilot first had to deliver the code to the human. Across the Internet.

                                        2. 4

                                          I don’t think that’s right. A human who learns doesn’t just parrot out pre-memorized code, and if they do they’re infringing on the copyright in that code.

                                          1. 2

                                            The real question, which I think people are missing, is: is learning itself a derivative work?

                                            How that learning happens can be either with a human or with a machine learning algorithm. And given the squishiness of, and lack of insight into, human brains, a human can claim they insightfully invented something even if it was derived. The ML we’re seeing here is doing a rudimentary version of what a human would do.

                                            If Copilot is ‘violating’, then humans can also be ‘violating’. And I believe that is a dangerous path, laying IP based claims on humans because they read something.

                                            And as I said upthread, as much as I have a kneejerk that Copilot is bad, I don’t see how it could be infringing without also doing the same to humans.

                                            And as an underlying idea: copyright itself is a busted concept. It worked for the time before mechanical and electrical duplication took hold at near-zero cost. Now? Not so much.

                                            1. 3

                                              I don’t agree with you that humans and Copilot are learning somewhat the same.

                                              The human may learn by rote memorization, but more likely, they are learning patterns and the why behind those patterns. Copilot also learns patterns, but there is no why in its “brain.” It is completely rote memorization of patterns.

                                              The fact that humans learn the why is what makes us different and not infringing, while Copilot infringes.

                                              1. 2

                                                Computers learn syntax, humans learn syntax and semantics.

                                                1. 1

                                                  Perfect way of putting it. Thank you.

                                              2. 3

                                                No, I don’t think that’s the real question. Copying is treated as an objective question (and I’m willing to be corrected by experts in copyright law), i.e. similarity or its lack determines copying regardless of intent to copy, unless the creation was independent.

                                                But even if we address ourselves to that question, I don’t think machine learning is qualitatively similar to human learning. Shoving a bunch of data together into a numerical model to perform sequence prediction doesn’t equate to human invention, it’s a stochastic copying tool.

                                            2. 3

                                              It seems like it could be used to shirk the effort required for a clean room implementation. What if I trained the model on one and only one piece of code I didn’t like the license of, and then used the model to regurgitate it, can I then just stick my own license on it and claim it’s not derivative?

                                            3. 2

                                              Ms. Reda is a member of the Pirate Party

                                              She left the Pirate Party years ago, after having installed a potential MEP “successor” who was unknown to almost everyone in the party; she subsequently published a video urging people not to vote Pirate because of him, as he was allegedly a sex offender (a claim proven untrue months later).

                                              1. 0

                                                Why exactly do you think someone from the ‘pirate party’ would respect any sort of copyright? That sounds like they might be pretty biased against copyright…

                                                1. 3

                                                  Despite a cheeky name, it’s a serious party. Check out their programme. Even if the party is biased against copyright monopolies, DRM, frivolous patents, etc. they still need expertise in how things work currently in order to effectively oppose them.

                                              2. 4

                                                Have you read the article?

                                                She addresses these concerns directly. You might not agree with her, but you claimed she “ignores” this.

                                                1. 1

                                                  And as Copilot’s model is created from various public code, it is a derivative of that code.

                                                  Depends on the legal system. I don’t know what happens if I am based in Europe but the guys doing this are in the USA. It probably just means that they can do whatever they want. The article makes a ton of claims about various legal aspects of all of this, but as far as I know Julia is not actually a lawyer, so I think we can ignore this article.

                                                  In Poland maybe this could be considered a “derivative work” but then work which was “inspired” by the original is not covered (so maybe the output of the network is inspired?) and then you have a separate section about databases so maybe this is a database in some weird way of understanding it? If you are not a lawyer I doubt you can properly analyse this. The article tries to analyse the legal aspect and a moral aspect at the same time while those are completely different things.

                                                2. 7

                                                  Machine-generated code is not a derivative work

                                                  What? You can’t make blanket statements about any or all machine algorithms, that’s completely the wrong way to think about the problem. If I write a program that reads code and then prints it out again… I can claim the code is machine generated and thus I don’t have to worry about copyright infringement?

                                                  $ cat sourcecode_to_major_company_product.c > non_derivative_work.c
                                                  

                                                  There is a lot of grey area that you could use to make good arguments about authorship and derivatives for the output of machine algorithms, but instead the article’s take on the problem is just as absolutist and polarised as my cat example above.

                                                  If a “machine algorithm” creates what look like new works: then you could argue they’re not derivatives. You could argue that the new works are the output of a mind that learned themes & ideas whilst reading the other works, rather than just regurgitating pieces back in a way that would be marked as plagiarism if done by a human.

                                                  If a “machine algorithm” creates what looks like a scrapbook of prior works: it’s a derivative just as much as if a person made the same using cut and paste. How could you prove any different? It takes many “machine algorithms” to achieve the cut and paste in your text editor and operating system anyway.

                                                  In real life: any machine algorithm, no matter how good, is going to output a combination of “original” work and “plagiarised” work in varying mixtures. You then need a human to judge these individual outputs, just like we judge individual essays as plagiarised or not. You cannot make blanket assumptions like “all work by algorithm X is not plagiarism” just as you can’t make blanket assumptions that “all essays by John Smith are not plagiarism”.

                                                  1. 7

                                                    A big question is ‘how much of the machine is mechanical, and how much of it is just the programmer’s tool?’

                                                    Does a drawing not become copyrightable because I used a tool called a pencil?

                                                    Does a drawing not become copyrightable if I hand you a mechanical duplication of it?

                                                    What if that mechanical duplication machine is AI powered?

                                                    Does a drawing not become copyrightable because I used a lossy compression tool?

                                                    What if that lossy compression tool is AI powered?

                                                    I think it wouldn’t stand up in court, just like that previous situation with the person offering music that “wasn’t” the Beatles, but sounded exactly like the Beatles to a human.

                                                    1. 5

                                                      If it were not possible to prohibit the use and modification of software code by means of copyright, then there would be no need for licences that prevent developers from making use of those prohibition rights (of course, free software licenses would still fulfil the important function of contractually requiring the publication of modified source code). That is why it is so absurd when copyleft enthusiasts argue for an extension of copyright.

                                                      This is a deeply disappointing read. I hope it’s a sign of ignorance rather than anti-copyleft propaganda, because it’s functionally identical to that. It misses the fundamental point of copyleft over the copycenter sensibility, namely the explicitly political stance that the ability to use and modify software is not worth much without possession of the code in the first place.

                                                      This is the scenario copyleft is intended to prevent:

                                                      1. I write code and give it away.
                                                      2. Some other party takes the code, makes important changes to it (e.g. to allow the code to work with hardware created by the same entity), then distributes the code in build artefact form only.
                                                      3. Users of this party’s version of the code cannot (effectively) examine or modify the code, even though I intended for users of my code to be able to.

                                                      As the original author of the code my only recourse against this is to attach terms to the code which allow users to compel the middle party to hand over the source to them – with modifications.

                                                      If copyright simply didn’t exist then neither would that recourse.

                                                      The point of copyleft is not my own ability to use and modify someone else’s software, as Ms. Reda appears to assume. The point of copyleft is my users’ ability to use and modify my software. The point of copyleft is to allow me to ensure that as its author, software I want to give away to users actually is.

                                                      That is why it is, beg your pardon, anything but “absurd” from a copyleft perspective to want a reasonable level of copyright protection.

                                                      1. 4

                                                        No it’s fine Your Honour, the internet told me so.

                                                        1. 4

                                                        Assuming it’s not creating a derivative work, does this mean you can’t modify the generated algorithm? Does it need to be specially annotated in your codebase with “Generated by GitHub Copilot”? Can you fix bugs or other issues in what’s generated, or does that make it into a “derivative work” since it’s no longer only machine-generated code?

                                                          I have trouble seeing Github Copilot as anything other than an antipattern. It generates code of questionable origin, copyright, and quality without thought or testing of that code. This makes it a legacy code and technical debt generator.

                                                          1. 3

                                                            The article says that it’s in the public domain, but who knows if that would keep SCO from suing you…

                                                            The legality of something is different than the safety of something.

                                                            1. 3

                                                              Actually, this is a really interesting point which I haven’t seen brought up enough. You might be able to lay out a very nice, elaborate argument for why using Copilot-generated code isn’t copyright infringement. But until that gets tested in court, you don’t know, and you really don’t want to be the person/company responsible for testing that legal theory.

                                                          2. 4

                                                          Right now we are talking about a specific case focused on Github Copilot. It may or may not infringe copyright in its current form. Current legislation was not designed for this case. In my opinion a great debate needs to happen, and new legislation that directly addresses the issue needs to come out of it.

                                                            Following her assessment I could scrape all The Guardian articles (no paywall) and then (re-)write articles (behind a paywall) with the help of an AI assistant. At first it may only finish sentences for me, but later it might write whole articles. How much is too much?

                                                            1. 4

                                                            I find discussions about what is or isn’t strictly legal to be somewhat boring. I mean, of course it matters, but it’s also not very interesting as such. What should be the right thing is much more interesting.

                                                              Copyleft and GPL also doesn’t strike me as especially relevant; the question is whether a license should apply at all. The restrictions this license may or may not have is a secondary matter.


                                                              As a thought experiment, imagine a savant colleague who has read thousands of the top GitHub repos. Let’s call this person Spock. Now, Spock doesn’t rote memorize all the code character-by-character, instead he just learns about how various things are done by reading these millions and millions of lines of code.

                                                            I walk up to Spock and ask him, “Spock, do you know how people usually make HTTP requests in Python?” Based on his experience from reading all the code, Spock answers, “Sure, I think people usually do it like this, but I’ve also seen some people do it like that”.
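
                                                            (For concreteness, the two answers I imagine Spock giving might look something like this; just the two common idioms, the first assuming the third-party requests library is available:)

                                                              # "People usually do it like this" - third-party requests library:
                                                              import requests

                                                              resp = requests.get("https://example.com/api")
                                                              print(resp.status_code, resp.text[:100])

                                                              # "...but I've also seen some people do it like that" - stdlib only:
                                                              from urllib.request import urlopen

                                                              with urlopen("https://example.com/api") as resp:
                                                                  print(resp.status, resp.read()[:100])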

                                                              Should this be copyright infringement? I think most people would say it wouldn’t be; at least, that’s what I would say. Would it be any different if Spock started Spock’s Code Academy™ where he answers questions from the public for a fee? I don’t see why it should.

                                                              I’m not entirely sure how this case is fundamentally different, other than that our Vulcan savant (well, half-Vulcan) is replaced with a machine learning algorithm.

                                                            An ML algorithm can do much more than even the most savant of humans; but should this factor in? Why should it? I can’t really see a good reason why. I don’t think “I created this in my spare time and GitHub stands to make a lot of money from it” is a very strong argument; we could apply the same argument to Fair Use or pretty much any exception that the copyright system allows. Besides, GitHub is offering a free hosting service to a large chunk of the free software community – it’s not like you’re getting nothing in return.

                                                              And what would this mean for ML in general? Having ML fall under copyright would also restrict all sorts of other usages. It’s a double-edged sword. Expanding copyright in the past has often been done with the best of intentions, and look where we ended up. I can already see DMCA notices and lawsuits coming over things that were never anticipated. Extensions of copyright rarely – if ever – benefit individuals like you or me, but corporations like Microsoft.

                                                              1. 4

                                                              Interesting point of view, but then where does this end? I can very easily data mine one of the many internal leaks of Microsoft code. If copyright does not apply at all, then the provenance of such code doesn’t really matter.

                                                              But if I were to do such a thing, I doubt it would end well.

                                                                1. 4

                                                                  Fundamentally, the difference here is that we probably assume the savant is applying some degree of judgment and understanding in the process - Spock has some idea of what an HTTP request is, why someone might make it, and why various libraries might make them. Natural language models don’t work that way - they’re pure correlation, no causation. They can recognize that, given a set of surrounding strings, the string “requests” is likely to be what shows up next. This is actually incredibly useful, properly used, and I don’t want to diminish the technical wizardry on display here. That’s cool as hell. And, total honesty, it’s a huge part of what a programmer typically does in a day; most code is not going to be channeling the unbridled creativity of the Muses into your IDE, it’s going to be assembling problems you’ve already solved into a form that does what you need now.
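
                                                                  (As a crude illustration of what “pure correlation” means here: count which token most often follows a given token, and always suggest that. A toy sketch, nothing like the real model:)

                                                                    from collections import Counter, defaultdict

                                                                    def train(tokens):
                                                                        # Count which token follows each token in the training stream.
                                                                        follows = defaultdict(Counter)
                                                                        for a, b in zip(tokens, tokens[1:]):
                                                                            follows[a][b] += 1
                                                                        return follows

                                                                    def suggest(follows, context):
                                                                        # Suggest the most frequent follower; there is no notion of *why*.
                                                                        c = follows.get(context)
                                                                        return c.most_common(1)[0][0] if c else None

                                                                    model = train("import requests ; resp = requests . get ( url )".split())
                                                                    print(suggest(model, "import"))  # -> requests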

                                                                  The problem is copyright explicitly deals in this vague, wishy-washy concept. Copyright does not protect ideas - that’s why it can’t protect algorithms and why patents need to be a separate branch of IP law. Copyright protects expressions of ideas. But it’s also not purely literal - one song sampling another can infringe even though the audio of the new song is not identical to the original.

                                                                  Now that said, I think the process of analogy is a good one. I don’t think the presence of ML fundamentally changes anything. The “creator” is whatever person or organization put the algorithm in place, just as it usually is the person or organization that commissioned a work in whatever other way. Allowing evasion of legal principles by saying “a machine did it, not a human, so nobody’s responsible” is far more dangerous than allowing copyright to apply to machine-produced text.

                                                                  1. 4

                                                                    I don’t think Copilot and Spock are analogous. A human or vulcan who learns doesn’t just parrot out pre-memorized code, and if they do they’re infringing on the copyright in that code.

                                                                    Copilot seems like a roundabout way of laundering copying through a non-linear model.

                                                                    1. 2

                                                                      A human or vulcan who learns doesn’t just parrot out pre-memorized code,

                                                                      I certainly do, at least some of the time, as there is a lot of boilerplate and repetition in everything. Deciding whether what Copilot does is qualitatively different from what I do at those times basically requires answering the question ‘what is human intelligence?’. However, that doesn’t matter, because:

                                                                      and if they do they’re infringing on the copyright in that code.

                                                                      This is right when it crosses the threshold for copying something original and creating a derivative work. Most of the time neither you, nor copilot, are doing so. But ‘some of the time’ matters.

                                                                    2. 1

                                                                      Expanding copyright? The current state is (simply said): everything is prohibited except things explicitly allowed to you by the license or by the law. So, does it really mean a copyright expansion? I do not think so.

                                                                      If we focus on the ideas and intention behind the copyleft licenses, it means: anyone can benefit from this free software provided that he preserves the free nature of the work. If someone takes free software created by others and feeds it into a blackbox and then gets pieces of derived code from it and inserts them into his software under incompatible license and without acknowledging the original authors, it should be a copyright infringement and it is against the idea of free software. A somewhat different case is non-copyleft licenses that explicitly allow you to do whatever you want – however, usually you still have to acknowledge the authors.

                                                                      1. 2

                                                                        If someone takes free software created by others and feeds it into a blackbox and then gets pieces of derived code from it and inserts them into his software under incompatible license and without acknowledging the original authors

                                                                        I do this all the time. Everyone does. I’ve probably read over a hundred thousand lines of copyleft code in my life, and the “black box” that is my brain generated all sorts of derived code from this. That was the point of my thought experiment.

                                                                        it is against the idea of free software

                                                                        Maybe, but just because someone had an idea doesn’t mean you can impose that idea on everyone else.

                                                                        Expanding copyright? The current state is (simply said): everything is prohibited except things explicitly allowed to you by the license or by the law. So, does it really mean a copyright expansion? I do not think so.

                                                                        It’s the exact reverse: only if it has been established (by law enacted through parliament, or case law through the courts) does copyright apply. It doesn’t somehow “automatically” apply to everything in the universe unless specified otherwise. Exact details on all of this differ per jurisdiction, but generally speaking ideas are not subject to copyright; creative works are (and then only if they meet some threshold of originality).

                                                                        There is some grey area between the boundary of “algorithm” (which is an idea) and “algorithm implementation” (which is a creative work), which is why we had the entire Google vs. Oracle lawsuit for example, but the notion that anything anyone wrote is subject to copyright as some absolute is really not how it works. Look up the history of copyright in software: there was a time when people thought copyright didn’t apply to software at all. Courts had to rule on this and/or laws had to change for this.

                                                                        1. 2

                                                                          my thought experiment

                                                                          At least, there is a difference: you are a citizen and a free man, while AI (or other software/tool) is just a thing owned and controlled by someone else. Humans are explicitly allowed to study the source code (e.g. in GNU GPL) and it is one of intended purposes when someone publishes software under such license. But they are not allowed to compile/transpile/convert/obfuscate/etc. the source code (using any tool/blackbox/etc.) to another work under an incompatible license and without acknowledging the original author.

                                                                          1. 3

                                                                            You keep harping on about the GPL, but the question is whether copyright (and therefore a license – any license) applies at all.

                                                                    3. 3

                                                                      Honest question: say you use Copilot, and end up distributing 50 substantial lines of code verbatim that you don’t hold any copyright for, in violation of a license. Why does it matter that you used Copilot? Why would it matter if you used Tabnine, VSCode, Emacs, or vim? Isn’t the publication and/or distribution the thing that matters? Why would a jury care how you got there?

                                                                      I would not want to try to explain why my use of a special new AI text generator somehow changed the very real fact that I was distributing someone else’s copyrighted code, verbatim, without license or permission.

                                                                      1. 3

                                                                        It sounds like “I used Copilot” is going to be a handy get-out-of-jail-free card for future license infringers.