Threads for GBrayUT

  1. 39

    If this is trained on copyleft code under CDDL, CC-BY-SA, GPL, etc., then presumably the code it outputs would be a derived work, as would the code you use it in?

    Most code is licensed to impose some restrictions on use & distribution, from retaining copyright messages to patent indemnification. I wonder how that was overcome here.

    Worst case it’s a really interesting machine learning copyright case study.

    1. 10

      Would it be reasonable to say that anybody who unknowingly writes code that is similar to copyleft code that they’ve at some point read is producing a derived work? I realize it’s not exactly the same scenario, but presuming the AI consumes and transforms the original work in some fashion and doesn’t just copy it then it seems that it wouldn’t constitute a derived work by the same measure.

      1. 10

        The FSF told us that, when working on Octave, we should not read Matlab source code at all (much of which is available for inspection), and should not use Matlab at all either. They in effect told us to do clean-room reverse engineering. Someone else could run code on Matlab and tell us what Matlab did, and then we could try to replicate it in Octave, but it had to be a different person, and they had to just tell us what happened. Using Matlab documentation to implement Octave code was also considered safe.

        Yes, copyright cases have been lost over people being told that their derivative work is very similar and was produced with knowledge of the copyrighted work. I’m thinking about musical riffs in particular. Overtly stating you’re reading copyrighted work to produce derivative work seems to put GitHub in weird legal waters, but I assume they have lots of lawyers who told them otherwise, so this is probably going to be okay, or they’re ready to fight off the small-time free software authors who try to assert their copyleft.

        1. 7

          IANAL, but there are definitely questions here. There’s a history of questions around clean-room design, when you need it, what it gets you, etc.

          1. 5

            I was chatting with someone on Twitter about this. If clean room design is “demonstrably uncontaminated by any knowledge of the proprietary techniques” then Copilot being a black box seems to not fit that definition. What comes out of it has no clear provenance, since it could be synthetic or a straight copy from various sources. The Copilot FAQ on Protecting Originality already acknowledges that it occasionally (“0.1% of the time”) copies code directly from the copyrighted corpus.

          2. 6

            But in this case the author hasn’t written the code. It wasn’t a creative process that accidentally ended up looking like another work. They’ve taken a copy supplied by Copilot.

            Copyright laws are centered around creative work, and were written with humans in mind. I don’t think “AI” would count as a creative thinker for the purpose of the law. One could argue that existing AIs are just obfuscated databases with a fancy query language.

            1. 1

              The original comment was in regards to considering the code being output as a derived work. My comment wasn’t about the consumer of the AI output, it was about whether or not the AI output itself would constitute a derived work. I was making the comparison between the AI output and some unknowingly written copyleft similar work.

            2. 4

              I have been warned to keep GPL code off slides because employers have policies about this.

              1. 2

                Wouldn’t GPL code on slides mean that your slides are now subject to the GPL?

                Assuming GPLv2, how does clause 2c apply to “PowerPoint code”? PowerPoint reads “commands” (space bar) interactively when you “run” (open) it. Are you required “to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License”?

                I hate to say it, but the employer’s policy, while paranoid, is probably sensible (especially if you work for them).

            3. 6

              On HN, the author of the announcement blog post claims that “jurisprudence” (whatever that means) and the common understanding in the ML community is that “ML training is fair use”. I have my doubts about that, basically along the lines of what you wrote below about clean-room design and reverse engineering. But IANAL, and I suspect it will need some lawyering action to sort this out one way or another. Personally, I also suspect that might be one of the reasons they kept it as a limited preview for now.

              1. 11

                Copying an idea from someone else: the litmus test is easy. Will MS train the model on their Win32 and Azure repos? If not, why?

                1. 10

                  Similarly, if someone else were to train a model on leaked MS code, how quickly would the cease-and-desist letters start arriving?

                2. 4

                  Hmm, that sounds very… dubious to me.

                  Like, what exactly is “machine learning training”? Is building an n-gram model fair use? What if I train my n-gram model on too little data so that it’ll recreate something which has a relatively high level of similarity to the copyrighted training set? Many statistical machine learning models can be viewed as a lossy encoding of the source text.

                  It seems plausible to me that the “copilot” will auto-generate blocks of source code that are very similar to existing code. Proving that the AI “came up with it on its own” (whatever that even means when we’re talking about a bunch of matrices transforming input into output) and didn’t “copy it” from something copyrighted seems extremely legally difficult.

                  If machine learning training is “fair use”, and the output generated by the model is owned by the maker of the model, then overfitted ML models become automatic copyright stripping machines. I wouldn’t mind, but that sounds very weird.
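
                  To make the n-gram point above concrete, here is a minimal sketch of a word-level n-gram model trained on far too little data. It says nothing about how Copilot actually works; the function names (`train_ngram`, `generate`) and the one-line training snippet are made up for illustration.

```python
import random
from collections import defaultdict


def train_ngram(text, n):
    """Map each (n-1)-token context to the tokens that followed it in the text."""
    tokens = text.split()
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model


def generate(model, seed, max_tokens, rng):
    """Extend the seed by sampling a follower of the last len(seed) tokens."""
    out = list(seed)
    k = len(seed)  # must equal n - 1 for the model being used
    while len(out) < max_tokens:
        followers = model.get(tuple(out[-k:]))
        if not followers:
            break  # context never seen during training
        out.append(rng.choice(followers))
    return " ".join(out)


# Hypothetical stand-in for a tiny "copyrighted" training set.
TRAINING_TEXT = "int add ( int a , int b ) { return a + b ; }"

model = train_ngram(TRAINING_TEXT, n=3)
seed = TRAINING_TEXT.split()[:2]
regenerated = generate(model, seed, max_tokens=100, rng=random.Random(0))

# True when the "generated" text is a verbatim copy of the training text.
print(regenerated == TRAINING_TEXT)
```

                  On this input every two-token context has exactly one recorded follower, so the sampling step has no choice to make and the model simply replays its training text. That is the degenerate "lossy encoding" case where the encoding is not lossy at all; with more data and more varied contexts the output would drift further from any single source.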

                3. 4

                  Even if the code used to train wasn’t copyleft, the attribution/license inclusion requirement still stands. And I am quite sure that GitHub doesn’t include all the licenses from all the code repositories they have used to train the model (they say billions of lines, so that’s at least 10k licenses, good luck with that, and their compatibility).

                  1. 4

                    CNPLv6 has a clause explicitly forbidding this.

                    CNPLv6 - 1b “…In addition, where the Work is designed to output a neural network the output of the neural network will be considered an Adaptation for the purpose of this license.”

                    Are there examples of other licenses, besides mine, which explicitly forbid laundering the work of the commons into something proprietary using machine learning?