As unsexy as talk about AI or licensing may be, I find it really important that an “AI thing” shouldn’t be called “open” unless the input data is also open and the end product can be reproduced.
So they’d feel bad about releasing the data, but don’t feel bad about releasing a derivative work?
There is a lot of open-washing in the AI space. First of all, I totally agree with @freddyb’s comment that the only open-source AI should be reproducible model weights with open-licensed input.
But as of today the industry publishes “open models” that are just code, with no weights. Imagine I create a piece of software where the entire logic is defined in the configuration, and I release only the code, with no documentation and no default configuration, and then go around claiming “hey, I open-sourced my application.” You would say, “Dude, you’re a liar. Your application is unusable without the configuration that defines what it does.”
Right now we have startups publishing code that is basically 5 lines of Keras or PyTorch and writing a press release about their new “open-source” model. And people in the data science and AI community just applaud… What the hell?!?
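To make the analogy concrete, here is a minimal sketch of what such a release often amounts to, assuming nothing beyond stock PyTorch (the architecture, layer sizes, and weights-file name are made up for illustration, not taken from any particular project): a few lines that define a model, while everything that makes it useful lives in a weights file that is never published.

```python
# Hypothetical "open-source model" release: architecture only, no weights.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """A made-up model; the published code is rarely more than this."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()
# Without this file (the "configuration" in the analogy above), the model
# is just randomly initialised and effectively useless:
# model.load_state_dict(torch.load("weights.pt"))  # weights.pt is never released
```

Running the class as published gives you random outputs: the handful of lines of code are real, but the model is not.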
Confusing times…
What we really need is some sort of massive collection of royalty-free content to train on (I believe something like this exists, probably in several incarnations by now, but I forget the name), and models trained exclusively on that, not just whatever random stuff various crawlers regurgitated into them. Not only would that be open and legal, it would also clear up any moral concerns around using content generated by such a model. I just wish everyone complaining about copyrighted material in models would actually contribute to such efforts, instead of just being the ones holding the torches on their way to the textile factory.
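As a rough illustration of what “trained exclusively on that” would mean in practice, here is a toy filtering step in plain Python; the licence identifiers and record fields are placeholders, not a reference to any real dataset or pipeline.

```python
# Toy corpus filter: only documents with an explicit royalty-free licence
# make it into the training set; crawler finds with unknown licences are dropped.
ALLOWED_LICENSES = {"CC0-1.0", "public-domain"}  # placeholder allow-list

def filter_training_corpus(documents):
    """Yield only documents whose licence is explicitly permissive."""
    for doc in documents:
        if doc.get("license") in ALLOWED_LICENSES:
            yield doc
        # Missing or unrecognised licences are excluded by default.

corpus = [
    {"id": 1, "license": "CC0-1.0", "text": "…"},
    {"id": 2, "license": None, "text": "…"},  # scraped, provenance unknown
]
print([d["id"] for d in filter_training_corpus(corpus)])  # -> [1]
```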
And I’m never, ever going to buy the “but open models are dangerous” argument when the only people making it are the ones making billions of dollars off their secret-sauce models. Ideally these people should be facing prosecution, not standing on their high-rise soapbox.
[Comment removed by moderator pushcx: Pruning off the start of a political thread. ]
That comment isn’t taking some other context around the field of AI into account. A lot of people and companies are considering deliberately marking generated content as such; I’ve even seen someone compare it to a “farm raised” label. I think in the future, human-made art will be seen as a premium, not a deprecated practice. It certainly would be nice to have these models so that human-made art isn’t the only route to content creation, since that’s quite limiting for people without any money or talent (who want to make genuine content, not scams, anyway). I for one can’t see a future where anyone enjoys reading AI-generated written works on a mass scale, no matter how well they’re generated. Maybe on a small, local basis, as a summary, but certainly not an entire book or article, outside of gimmicks like “I Am Code”.
I think in the future, human-made art will be seen as a premium
Indeed! Without a steady stream of original, human-generated inputs, LLM outputs will soon stagnate around circa-2018 culture, even without the model collapse induced by desperately feeding them “synthetic data” in an AI analog of BSE (mad cow disease). So it’s in the LLM industry’s best interest for content providers to clearly label their work as not contaminated by AI!
Of course, the industry is not going to pay for it. That VC money isn’t going to repay itself.
[Comment removed by moderator pushcx: Pruning off the start of a political thread. ]