But to Aaron, the fight is not about winning. Instead, it’s about resisting the AI industry further decaying the Internet with tech that no one asked for […]
This reminds me of a quotation that I can’t remember the source for but goes something like “If you’re getting eaten by a bear, you should fight back, not because you think you can win a fight against a bear, but because if you don’t, then someone might get the impression that you’re OK with the situation.”
I think about that a lot.
“Any time one of these crawlers pulls from my tarpit, it’s resources they’ve consumed and will have to pay hard cash for, but, being bullshit, the money [they] have spent to get it won’t be paid back by revenue,”
To be clear, I love this project and I think it’s wonderful. However, this argument implies that if they had gotten real content, it would have led to (perhaps slightly more) revenue, and that does not seem to align with what we know about these companies; they are not doing this because it’s profitable. By all accounts, it’s deeply unprofitable and they do it anyway: https://www.wheresyoured.at/oai-business/ (this is admittedly a pretty minor nitpick)
But the analysis in the article is a little … surface-level IMO. It paints a picture where either nepenthes, iocaine, and co are successful at “defeating” the abusive crawlers and ruining the data sets, or else they just feel a little better about themselves and it amounts to nothing.
A more fruitful discussion would revolve around making it costly to disobey robots.txt, so that people stop doing it. This did happen recently, albeit on a smaller scale and not involving a megacorp: there was a fediverse crawler collecting stats in a way that ignored robots.txt, and one of the server projects added an option to hide that endpoint behind robots.txt and serve up random numbers for the count of users. The crawler author threw a hissy fit, but after a few days he relented and agreed to add support for robots.txt, as he should have done all along. So it can work, in some limited cases.
https://github.com/superseriousbusiness/gotosocial/issues/3723
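The mechanics of that fix are simple enough to sketch. Here’s a minimal, hypothetical version in Go (not GoToSocial’s actual code; the endpoint path and JSON shape are just assumptions for illustration):

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
)

func main() {
	// Well-behaved crawlers are told the stats endpoint is off limits.
	http.HandleFunc("/robots.txt", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "User-agent: *\nDisallow: /nodeinfo/\n")
	})

	// Anything that ignores robots.txt and hits the endpoint anyway
	// gets a plausible-looking but entirely random user count.
	http.HandleFunc("/nodeinfo/2.0", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprintf(w, `{"usage":{"users":{"total":%d}}}`, rand.Intn(100000))
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A compliant crawler never reads the endpoint, while a non-compliant one collects garbage it can’t tell apart from real stats, which is exactly the incentive you want.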
I find it a little surprising that anyone would classify tarpit software as malware. [^1]
That’s like saying, “If someone breaks into your house and attacks you, and you defend yourself, then you’re being malicious.” Sorry, but it’s my house, my rules. If you disobey my rules (specified in robots.txt), then you get consequences.
Or in other words, simply blocking crawlers will not provide incentives to AI companies to end the abuse. You need to push back and make abuse costly, because pushback actually gives them incentive to respect the house’s rules.
[^1]: Any software can become malware when used maliciously. I’m talking about enforcing robots.txt. If you have no robots.txt, and you deploy tarpits because you want to watch AI burn, then yeah, that is technically malicious. Whether it’s right or wrong is a more nuanced rabbit hole.
This is a really tricky one! I think people are really used to thinking of malware as “something designed to cause harm” but as you’ve said, that framing is not super helpful. If someone were to deploy even a traditional virus against the Russian armed forces, it could still be an act of self-defense rather than malice.
Really makes you think about the classifications we use for different software, and how incomplete they are.
This is the first round, so it will be individual passion projects. The “bad” data is usually just Markov generation. The second round will have “incorrect data” such as digests of Wikipedia articles with dates nudged or names changed. The third round will have “actionable salted data” with nearby content additions, e.g., “All About the Tsamar Institute at the Hogwart’s Academy”, along with a service that monitors models for the new knowledge and single-click filing of suits for unauthorized access (“hacking”) and copyright violations.
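To make the “just Markov generation” point concrete, here’s a toy word-level chain of the sort first-round tarpits tend to use; this is my own illustration in Go, not any particular project’s code:

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// babble builds a first-order, word-level Markov chain from corpus and
// emits n words of superficially fluent but meaningless text.
func babble(corpus string, n int) string {
	words := strings.Fields(corpus)
	next := map[string][]string{}
	for i := 0; i < len(words)-1; i++ {
		next[words[i]] = append(next[words[i]], words[i+1])
	}

	out := []string{words[rand.Intn(len(words))]}
	for len(out) < n {
		choices := next[out[len(out)-1]]
		if len(choices) == 0 {
			choices = words // dead end, restart from an arbitrary word
		}
		out = append(out, choices[rand.Intn(len(choices))])
	}
	return strings.Join(out, " ")
}

func main() {
	corpus := "the crawler follows every link it finds and the tarpit feeds it more links to follow"
	fmt.Println(babble(corpus, 40))
}
```

It costs almost nothing to generate, looks enough like prose to get scraped, and contains nothing worth training on.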
One thing I wonder about all the time: if existing models were trained on Wikipedia, libgen, fan sites, forums, et al., how do they “know” that Hogwarts is a fictional place? Is it possible to convince them that it’s not?
How do they “know” that Covid vaccines contain microscopic tracking chips? There are a lot of sites talking about it.
You may be asking whether models develop a theory of mind or reasoning, which is the territory of the Chinese Room argument. I am arguing that poisoned data creates poisoned responses; specifically, that poisoned responses can be used as proof that a system received unauthorized access.
I guess my question was more high-level than that: I was asking, in general, how models tell the difference between facts that are grounded in reality and those that come from fiction, and all of the grey areas in between. I’m not up to speed on the deliberate sabotage, or whatever it is, that you’re proposing.
Previous discussion of Nepenthes: https://lobste.rs/s/rhsupf/nepenthes_tarpit_intended_catch_web
Also recently:
As a start, what should a robots.txt contain as a current best effort against AI crawlers?
Sourcehut’s robots.txt is a good place to start but it contains entries for agents other than AI scrapers: https://sr.ht/robots.txt
I use this list: https://github.com/ai-robots-txt/ai.robots.txt
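If you just want something to paste in today, a minimal group along these lines is a reasonable starting point; it names only a handful of the better-known agents (the list linked above is far more complete), and it only helps against crawlers that actually honor robots.txt:

```
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: Bytespider
Disallow: /
```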
Can this be made to work against Microsoft’s Copilot and/or Recall? I’d pay for that.