It’s always easier to make a mess than to clean up after one. For those of you playing with LLMs and posting the resulting sludge online, take a long, hard look at yourself and reflect on the damage you’ve already done to the web ecosystem. Doubly so if you’re building the models, or actively helping people to use them in more places.
To quote Dr. Seuss,
Unless someone like you cares a whole awful lot, nothing is going to get better. It’s not.
I don’t think there’s any reason to think that using an LLM to generate text and then posting it online is harmful in a way people should care about (or be legally/socially compelled to care about). I don’t publish text on the internet based on what makes things nice for people doing corpus linguistics on English text.
All the LLM-generated garbage drowns out the good stuff. I care about the fact that I can’t search for anything without 90% of the results being worthless nonsense spewed out by LLMs.
I don’t think that “All” is fair. People generating garbage to game SEO are doing harm, but there is surely plenty of AI generated content that is neutral at worst.
There is no useful AI generated content on the internet. Even the best AI generated content competes with real content.
Okay, well, that’s a pretty extreme assertion and I think it’s at least obviously implausible.
There are people who find AI to be useful, and to produce useful output (this is a fact, I am one of them)
Some of that content is likely to end up on the internet
Ergo there is probably useful AI content on the internet.
Unless our definition of “useful” is radically divergent I just don’t see how what you’ve said could be plausible.
LLMs produce harmful garbage that is at best a lossy imitation of something they were trained upon.
There are naive users who can’t tell the difference between LLM slop and real information, there are grifters and hustlers who are happy to burn the world down if they can extract a profit, there are “useful idiots” who choose not to consider the externalities of their actions, and there are plenty of folks who fall into multiple such categories at once. The fact that you can personally extract benefit from applying this technology to problems does not change the fact that it is a massive and obvious net-negative for society and the environment. Lots of people found leaded gasoline useful, too!
You say that “LLMs produce harmful garbage” but also “it is a massive and obvious net-negative”.
The latter is a far weaker assertion than the former. If all you want to say is that you think that, overall, the technology causes more harm than it’s worth, that’s radically different from saying that it is useless or strictly harmful garbage.
It doesn’t follow (if you intended this to be a formal syllogism). The people who find AI to produce useful output could be uniformly mistaken.
If you didn’t mean it formally, consider ~lonjil may not have either — there may be extremely small amounts of useful AI generated content on the Internet, little enough to make it, as ~Internet_Janitor says, a massive and obvious net negative.
I could restructure it to address that, but I’m using words like “plausible” for a reason.
If people believe that LLMs provide value then LLMs plausibly provide value
People believe that LLMs provide value
LLM content exists on the internet
It is plausible that some of the content on the internet is the content that people find valuable
This isn’t intended to be a formal syllogism, strictly; it’s just a way of structuring my statements for clarity.
But just because I’m speaking somewhat formally doesn’t mean that I assume they are. They are making an extremely strong assertion, and they were not ambiguous. If they want to say “there’s little useful content from LLMs” or “the majority of LLM content on the internet is bad”, that’s fine, but I don’t consider that to be “close enough” to just hand-wave the original claim away as harmless hyperbole. I might disagree with those statements, but I doubt I’d disagree with them to the extent that I’d consider them obviously wrong.
I dislike these extreme, obviously objectionable statements quite a lot. I think they’re a bad way to discuss interesting, complex topics. That’s why I responded in a way that I had hoped would be clear. A response of “that doesn’t follow” is a good response.
If people believe that LLMs provide value then LLMs plausibly provide value
Hm. Here you seem to assert that
(1) “people believe $claim” entails “$claim is plausible”.
I guess I can see that as one definition of “plausible”, but, when lonjil asserted “There is no useful AI generated content on the internet”, you called their assertion “obviously implausible”. How does that reconcile with (1)?
I think it appears that way because I’ve structured my argument in a way that’s much less ambiguous. If you were to structure their argument so that it’s similarly explicit, I suspect you’d find that the two aren’t symmetric in this regard - that is, they would have to add a premise like “I do not believe that there is useful LLM information, therefore it is plausible that there is no useful LLM information”, and you’d have to weigh the arguments against one another by evaluating those premises, pinning down what “useful” means, and so on.
Regardless, someone saying “something is useless” is a far stronger assertion than “something may be useful”.
I reject this argument, on the basis that just because something is useful to you doesn’t mean that it will be useful to anyone else. For example, let’s say someone uses ChatGPT instead of Google because Google never returns useful results anymore. If ChatGPT then gave a useful response to this person, would it be useful to put that ChatGPT output onto the public web? I would say no. The original information is already on the web, and putting more LLM content on the web just makes it harder to find.
It should all be labeled.
It could be a start, but essentially just an evil bit
I don’t follow.
It should be labeled, at the very least, so that LLMs don’t consume their own excrement thinking it was human-generated. The result would be the textual equivalent of high-pitched feedback from a microphone too close to a speaker.
Okay, I think that’s fine to say but it doesn’t really change things - the people who are causing these problems are not going to label their outputs, but that doesn’t mean that every output from an LLM is bad.
I’m skeptical that anybody on this site is creating vast quantities of slop to game SEO, which is what is actually causing the problem in TFA, versus posting interesting things they did with new technology, which is cool and they shouldn’t feel bad about.
Many of the people who wrote the software used by other(?) people in the SEO industry to generate vast quantities of slop thought what they were doing was just playing with “interesting new technology”, which doesn’t make them any less responsible for the outcomes. Designing bombs for the intellectual thrill isn’t free of ethical implications just because someone else is responsible for choosing where and when to drop them.
Reminds me of the recruiter trying to get me to work for a weapons manufacturer (sorry, “battlefield intelligence”) by telling me that my work would be good because it would help make the weapons more accurate and only kill the people they were aimed at more often and lessen collateral damage.
which doesn’t make them any less responsible for the outcomes
That’s a really strong stance to take, and most people probably would and should reject it intuitively. Intent is almost universally understood to be a critical component of moral responsibility.
Nazi, Schmazi, says Wernher von Braun!
I’m not totally sure what point you’re trying to make. I think you’re equating my statement that intent factors into moral judgments with the (commonly rejected) argument that Nazis were “just following orders” and are therefore innocent? Or perhaps with the “Wernher von Braun” character’s uncaring attitude towards where bombs land?
I don’t think it’s contentious at all to say that intent factors into moral judgments, but it would certainly be contentious to say that Nazis acted without intent.
It’s more about ~Internet_Janitor’s phrase:
Designing bombs for the intellectual thrill isn’t free of ethical implications just because someone else is responsible for choosing where and when to drop them.
Which I take to be more or less a reference to Wernher von Braun (who was a real person, not just a character). I’m not arguing that intent is irrelevant to moral culpability, but that intent only takes you so far, as the example of Wernher von Braun illustrates. Sometimes you aim for the stars, but nevertheless hit London.
Thanks, that clears things up re: Wernher von Braun.
There’s no question that intent is a mere factor. I just reject this:
That doesn’t sound quite right to me. A drunk driver’s intent may be simply to get from A to B, but most people would probably hold them responsible for the consequences of their recklessness.
Of course. Intent is just one factor. Similarly, if I got into my car and intentionally tried to run people over to kill them, people would take that into consideration.
I was under the impression you were claiming it was necessary. If you merely think it’s relevant, then yes, I agree. But I think what matters is the intent to act, not the intent to cause any particular outcome; if you work for FooCorp deliberately because you want them to pay you¹, as opposed to being tricked into working for them or something, IMO you are morally responsible for anything they do that you could reasonably² have predicted.
1. and you don’t consider yourself to be actively defrauding them
2. I realise “reasonably” is extremely woolly, but it’s hard to be quantitative about morality.
The grandposter said “any less responsible”, and you translated that to “any less morally responsible”, which I think is a misread.
If I push a button thinking I will get a coffee, when it actually sets fourteen people and a dog on fire, I may not be morally responsible, but I definitely am responsible–without my action, the subsequent suffering would not have happened.
I guess so. I’m not sure the word “responsible” is ever used that way. I think if you said “I’m responsible for that”, a ton of people would say “no, it’s not your fault”. But sure, they may mean it in the sense that you are still causally prior to the outcome.
Another illustrative hypothetical:
You back out of your garage. You do not check your rear view. You kill a small dog with your car.
Regardless of your frame of mind, you are responsible for that act. Not just causally, but also morally and legally. If frame of mind had no impact, then “murder” and “manslaughter” would not be separate crimes*. On the other hand, if “had no ill intent” was moral clearance to do anything, then “manslaughter” wouldn’t be a crime at all.
Your moral framework encourages willful ignorance as a way to remain morally in the clear. It does not accurately reflect most moral frameworks in our world.
[*] In the US, both are (roughly) “responsible for death”. “Murder” is (again, roughly) for intentional death, while “manslaughter” is for unintentional but still culpable death. Local laws vary, but most jurisdictions have that distinction.
I think you’re misunderstanding me, but I don’t think it really matters, so long as we understand two things:
That I take “responsible” to mean “morally responsible” - you can say that’s not the case, in which case I have simply misunderstood the poster, but based on the context I believe they were speaking morally.
That intent factors into moral judgments.
As for the various cases where intent factors in weakly, strongly, or perhaps not at all - they don’t really change those points.
In an effort to understand what you’re saying, then: are you suggesting that somebody who is “just playing with new interesting technology” is “manslaughter” responsible, but not “murder” responsible, for the destructive results?
I think that whatever they are, they are not “just as” responsible.
For example, I think that AI may at some point be abused in a way that I don’t like. I also pay for ChatGPT, and perhaps in the future my dollar contribution to the company will somehow lead to that “badness” occurring. I do not believe that I am as morally responsible for that as whoever leverages the technology with direct intent to commit some evil deed.
Concretely, I am not convinced at all that AI is radically more “bad” than it is “good”. I am definitely convinced that it provides multiple “goods” and that it has concrete use cases that are neutral or good. It absolutely has some terrible use cases, there’s undoubtedly harm that can be done (I think deepfakes are quite a scary thing). I also think that, for those reasons and others, being a user of AI does not make someone “just as” responsible as someone who explicitly leverages AI to do something terrible.
This is scary.
We don’t have reliable information on language usage by humans for most centuries, and arguably not even for the 20th and 21st centuries, since the Internet is obviously not necessarily representative of all human language use.
Sure, but we have an NLP researcher basically saying
we can’t genuinely discern generative AI content well enough to distinguish it from human produced content
We simply haven’t encountered that before. It’s not the “oh no computers are alive” kind of scary. It’s kind of hard to even put a metaphor to it. I guess it’s something like being an uninformed patient at a hospital staffed with very confident but extremely stupid doctors:
I’m sorry sir but we had to amputate your leg
Wait. But I thought I had a…cold?
Delving quickly, I noticed it was moving to your legs. We had to operate immediately.
Um well, ok great. Glad to live another day.
A loose analogy might be if the medical community started spraying antibiotics of last resort on every available surface (you’ll have to imagine there was some sort of tenuous potential for short-term profit and career advancement here), and then feigned surprise when untreatable antibiotic-resistant bacteria spread like wildfire. LLM output is very difficult to distinguish from semantically meaningful text with the statistical approaches that have traditionally been applied to spam, so while it is very cheap and easy to create this form of information pollution, we don’t have scalable techniques for eliminating it from a corpus. It is a grim sliver of comedy that the over-enthusiasm for this machine learning technique will assuredly stymie many future efforts at more sophisticated approaches, and leave the open web as useless for machines as it is becoming for humans.
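To make “the statistical approaches that have traditionally been applied to spam” concrete, here is a minimal sketch of the classic idea - a tiny naive Bayes filter over word counts. The training sentences and the smoothing constant below are invented for illustration, not any particular filter’s implementation; the point is only that the signal is surface word statistics.

    from collections import Counter
    import math

    def train(docs):
        """Count word occurrences across the documents of one class."""
        counts = Counter()
        for doc in docs:
            counts.update(doc.lower().split())
        return counts

    def log_likelihood(text, counts, vocab_size, alpha=1.0):
        """Sum of add-alpha smoothed log-probabilities of each word under one class."""
        total = sum(counts.values())
        score = 0.0
        for word in text.lower().split():
            p = (counts[word] + alpha) / (total + alpha * vocab_size)
            score += math.log(p)
        return score

    # Toy labelled data; real filters are trained on large corpora.
    ham = ["meeting notes attached for review", "the build failed on ci again"]
    spam = ["win free money now click here", "free free free winner click now"]

    ham_counts, spam_counts = train(ham), train(spam)
    vocab = set(ham_counts) | set(spam_counts)

    def classify(text):
        h = log_likelihood(text, ham_counts, len(vocab))
        s = log_likelihood(text, spam_counts, len(vocab))
        return "spam" if s > h else "ham"

    print(classify("click here to win free money"))    # spam: skewed word frequencies give it away
    print(classify("the meeting notes are attached"))  # ham: ordinary word frequencies

Classic spam betrays itself through skewed surface statistics like these. LLM output is sampled from a model of ordinary word statistics, so there is no comparable tell for this family of filters to latch onto, which is part of why it pollutes a corpus so cheaply.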
“Delving”. I see what you did there.
deeply tragic. I think everyone who was paying attention to linguistics research or to corpora saw this day coming, but that doesn’t make it any less sad now that it has.
Those are good reasons.
I’ve seen one or two stories posted to lobsters that I suspected were written by an LLM.
Apart from humans, are there any good tools to inspect content and make a judgement as to whether it was generated or not?
Should content generated in whole or part by LLMs be banned from places like lobsters?
the short answer to this is no. (edit to add: that is, no, there are no automated tools for it)
to reliably detect LLM output would require having more training data for the detector than the company that made the LLM had. it’s more profitable to sell LLMs than LLM detectors.
the fact that Google is drowning in spam search results (in my subjective opinion based solely on public information) should be a strong indicator that nobody on the defense side has a clear advantage.
I’m not entirely convinced that “LLM detectors” based on existing techniques could be made reliable even with vast quantities of training data. If you had one, you could use it to adversarially train an LLM to smooth out whatever statistical/syntactic wrinkles the detector was keying upon. If anything could identify the semantic flaws in LLM output, you’d have something resembling the mythical “strong AI” that would make LLMs obsolete, and nobody credibly knows how to do that (not for lack of snake oil sales pitches for it, of course!)
The situation is also muddled by the fact that SEO sites can directly make money for Google, so they have perverse incentives at play across the organization to do a poor job filtering out spam, even if they had the technology. They don’t even bother to give users tools to filter spam domains out of their personal results!
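To build intuition for that adversarial dynamic without touching a real model, here is a deliberately tiny simulation: the “generator” and the “detector” are reduced to a single made-up style statistic, and the numbers are arbitrary. It is not how any actual LLM or detector is trained; it only shows that once the generator matches whatever statistic the detector keys on, detection falls back to a coin flip.

    import random

    random.seed(0)

    # Toy stand-ins: "human" text has some style statistic centred on 0.0;
    # the "LLM" starts out with a telltale offset a detector could key on.
    def human_sample():
        return random.gauss(0.0, 1.0)

    def llm_sample(offset):
        return random.gauss(offset, 1.0)

    def fit_detector(humans, fakes):
        # A crude detector: threshold halfway between the two group means.
        return (sum(humans) / len(humans) + sum(fakes) / len(fakes)) / 2

    def detection_rate(fakes, threshold):
        return sum(f > threshold for f in fakes) / len(fakes)

    offset = 2.0
    for rnd in range(5):
        humans = [human_sample() for _ in range(1000)]
        fakes = [llm_sample(offset) for _ in range(1000)]
        threshold = fit_detector(humans, fakes)
        caught = detection_rate(fakes, threshold)
        print(f"round {rnd}: offset {offset:.2f}, detector catches {caught:.0%} of fakes")
        # "Adversarial training": shrink whatever wrinkle the detector exploited.
        offset *= 0.4

Real generators and detectors live in vastly higher-dimensional spaces, but the shape of the race is the same: every published detector doubles as a training signal for evading it.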
Should bullshit be banned?
In places valued for the quality of their non-bullshit, at least ones which are run and frequented by people who care about that, bullshit should be curated out.
If you check the moderation logs, you’ll see @pushcx regularly deletes articles for being LLM slop and bans sites that repeatedly post them.
In a few years, places like this which actively check whether submissions are sufficiently coherent to be authored by humans will be even juicier targets for LLM ingestion operations. That is, if there are any submissions left that are authored by humans.
If the content has value I see no reason to care if it’s human or LLM generated with regards to it being submitted to lobsters.
Is 2021 identified as the year when LLMs started polluting content? Is there a month we see things decline? IIRC, the general public didn’t get access to LLMs until late 2022, so I’m a little confused on 2021 being mentioned.
ChatGPT was released in 2022, but GPT-2 and GPT-3 were available before then. Random people weren’t using them, but SEO spammers definitely were. 2021 sounds about right for when it started to become a larger problem.
Sic transit gloria mundi, dude.
How do various AI players collect word frequency data? My naive thought is that they have teams whose purpose is specifically to clean data from various sources (in particular, NOT the web, but books) regardless of copyright (as they are immune to copyright as confirmed in US courts)?
So the author’s hope that OpenAI and Google come to regret what they have brought on seems to be…unlikely to pan out?
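For what it’s worth, the counting part of word frequency data is mechanically simple; the hard, expensive part is choosing and cleaning the sources. A rough sketch of the shape of such a pipeline - the directory name, the line filter, and the minimum-count cutoff are all invented for illustration, not anyone’s actual process:

    from collections import Counter
    from pathlib import Path
    import re

    WORD = re.compile(r"[a-z']+")

    def word_frequencies(corpus_dir):
        """Estimate word frequencies from all .txt files under a curated corpus directory."""
        counts = Counter()
        for path in Path(corpus_dir).rglob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
            # Crude cleaning: drop lines that look like markup or link boilerplate.
            lines = [ln for ln in text.splitlines() if "<" not in ln and "http" not in ln]
            counts.update(WORD.findall(" ".join(lines)))
        total = sum(counts.values())
        # Keep only words seen often enough to give a stable estimate.
        return {w: c / total for w, c in counts.items() if c >= 5}

    if __name__ == "__main__":
        # "books/" is a hypothetical directory of cleaned book text.
        for word, freq in sorted(word_frequencies("books/").items(), key=lambda kv: -kv[1])[:20]:
            print(f"{word}\t{freq:.6f}")

The interesting question raised above is not this step, but where the text comes from and whether anyone can still certify that it was written by humans.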
Are you suggesting they are not using web content? Because that seems a very strange assertion given that Google seems keen to replace its web search with an LLM chatbot.
If they are using web content, and they are also publishing to the web, and part of their success criteria is that their models generate content that is mathematically indistinguishable from human-generated content (based on their models’ own calculations), then of course they’ll come to regret it. If indistinguishability is the goal, and training on LLM output slowly degrades the models, then their products will inevitably get worse. Regret seems the obvious consequence.
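The degradation loop is easy to show in miniature. The sketch below uses a made-up Zipf-like vocabulary and has nothing to do with any real training pipeline: estimate word frequencies from a corpus, generate the next “corpus” by sampling from that estimate, and repeat. Any word that fails to appear in one generation is gone from all later ones, so the long tail erodes round after round.

    import random

    random.seed(1)

    # A Zipf-ish "human" distribution over a toy 1,000-word vocabulary.
    words = [f"w{i}" for i in range(1000)]
    weights = [1.0 / (rank + 1) for rank in range(1000)]

    def sample_corpus(words, weights, n=20000):
        """Generate a corpus by sampling words from the current model."""
        return random.choices(words, weights=weights, k=n)

    def refit(corpus):
        """The next generation's 'model': word counts estimated from the corpus."""
        counts = {}
        for w in corpus:
            counts[w] = counts.get(w, 0) + 1
        survivors = list(counts)
        return survivors, [counts[w] for w in survivors]

    for generation in range(8):
        print(f"generation {generation}: {len(words)} distinct words survive")
        corpus = sample_corpus(words, weights)
        words, weights = refit(corpus)  # trained only on the previous generation's output

The permanence comes from the unsmoothed zero counts; real pipelines are fuzzier, but this is the microphone-feedback worry from earlier in the thread in miniature.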
I am suggesting that they are not as dependent on web content as other people might be, due to having the resources to curate a wide variety of input types, and the understanding to not drink their own kool-aid.
While “AI” has definitely made web content much less informative, pre-“AI” web wasn’t that much better once it came to be dominated by marketing and advertising.
(“AI” isn’t a dissatisfying tool solely because of its training material.)
From four years ago:
https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main/resources/lexicons/en.txt
History:
https://gitlab.com/DaveJarvis/KeenWrite/-/commits/main/src/main/resources/lexicons/en.txt