1. 35

  2. 6

    Perhaps a useful perspective is to think of spam against your site as a war of attrition. If your per-query costs are equal to or lower than your enemy's, then they won't achieve much and you can sit back in your deck chair on the castle. If every action they take is cheap for them but expensive for you, then you're pretty much skewered.

    The “obvious” way around this is to make your site as cheap per query as possible. Whilst that’s dead easy for my boring blog, I suspect it’s a completely stupid suggestion for Marginalia :)

    The only other option is to make their queries more costly:

    • Using a 3rd party spam filtering service (like the article details)
    • Requiring registration and login before submitting queries
    • Requiring human-solved captchas
    • Requiring “proof of work” (not in a cryptocurrency sense) captchas solved client-side.

    The last item might (might?) be a possibility. Some annoying JavaScript algorithm that takes several seconds of processing to solve a puzzle the server already (cheaply) knows the answer to. If the correct answer isn’t submitted with the query, the query is blocked. Sadly this also batters accessibility.
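    To sketch what I mean (a hashcash-style toy in Python, with made-up parameters — the real thing would run as JavaScript in the browser): the server issues a random challenge it can verify with a single hash, while the client has to burn CPU brute-forcing a nonce.

    ```python
    import hashlib
    import itertools
    import os

    DIFFICULTY_BITS = 20  # ~1M hashes on average per query; tune until bots feel it

    def make_challenge() -> str:
        """Server side: issuing a random challenge is cheap."""
        return os.urandom(16).hex()

    def solve(challenge: str, bits: int = DIFFICULTY_BITS) -> int:
        """Client side: brute-force a nonce whose hash has `bits` leading zero bits."""
        target = 1 << (256 - bits)
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    def verify(challenge: str, nonce: int, bits: int = DIFFICULTY_BITS) -> bool:
        """Server side: checking the answer costs one hash, so you stay cheap."""
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - bits))
    ```

    The asymmetry is the whole point: the attacker pays thousands-to-millions of hashes per query, you pay one.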

    On my blog I implement a combo of a single-word captcha and a blank-field (honeypot) captcha. I’m only happy with it because I have a global limit if things go askew, which provides a nice human-scale backup. That’s also not really feasible for your site :/
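    For illustration, that combo boils down to something like this (Python sketch; the field names and the expected word are invented):

    ```python
    # Honeypot + single-word captcha check, as described above.
    # "website" is a field hidden with CSS: humans leave it blank, naive bots fill it in.
    # "captcha" is the visible single-word question; "orange" is a made-up answer.
    EXPECTED_WORD = "orange"

    def looks_human(form: dict) -> bool:
        if form.get("website", ""):        # honeypot tripped
            return False
        answer = form.get("captcha", "").strip().lower()
        return answer == EXPECTED_WORD     # forgiving about case and whitespace
    ```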

    User accounts and logins are a sucky solution. I’ve read before that the users who sign up are often the worst abusers :| and handling user registrations is a PITA for everyone, including you.

    Maybe even something simpler and stupider, like getting them to send a predefined 100 KiB block of text in a <textarea></textarea> with every query (which still costs you bandwidth, but perhaps deflects enough/all of the bots to come out with a net win). Keep upping its size until you see an impact. It would be a minor accessibility cost for users on limited bandwidth or capped connections, and a zero accessibility cost for people using non-standard browsers (screen readers, etc).
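    Server-side, the check for that padding trick would be trivial (Python sketch; the field name and block content are made up):

    ```python
    # The predefined block the form must echo back with every query.
    PADDING = "x" * (100 * 1024)   # 100 KiB; keep upping it until bots feel it

    def accept_query(form: dict) -> bool:
        # A string comparison is cheap for the server; uploading 100 KiB is the
        # attacker's cost. Reject anything that didn't pay the bandwidth toll.
        return form.get("padding") == PADDING
    ```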

    1. 5

      upwards of 15 queries per second from bots

      The search engine is hosted on residential broadband, it’s hosted on a souped up PC.

      I feel like that mostly sums up the particular problem for the author. I understand that a search engine probably requires significantly more resources to serve requests than other kinds of sites or web apps, but yeah.

      Bots are definitely a problem for some of the other situations mentioned, like web forums and comment sections. That being said, you can usually fix a lot of that using simple methods like captchas or putting any kind of login system in front of it.

      On the other hand, consider the search engine itself. It wouldn’t be possible without a bot crawling and scraping huge numbers of websites repeatedly. Yeah, I know there are “good bots” and “bad bots”, but the point is that bots are a feature rather than a bug of the internet.

      I really don’t believe that bots are “absolutely crippling the Internet ecosystem” or that “They’re a major part in killing off web forums, and a significant wet blanket on any sort of fun internet creativity or experimentation”.

      1. 1

        I feel like that mostly sums up the particular problem for the author. I understand that a search engine probably requires significantly more resources to serve requests than other kinds of sites or web apps, but yeah.

        On the other hand, I suspect there are enough SEO/adspam bots out there that Jevons Paradox applies. Run on a faster server, get more bot queries.

      2. 3

        “Botspam Apocalypse” would make a great name for a band.

        1. 2

          Would have been a good name for this lot …

          https://www.youtube.com/watch?v=9gMX_hR-RoM

        2. 2

          I doubt anyone is coding bots specifically for Marginalia. Captcha: “write ‘hello’ in this text field”. Need something harder? “Draw an X using this 3x3 grid of checkboxes”.
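          Verifying that grid captcha would be a handful of lines (Python sketch; numbering the cells 0–8 row by row is my own assumption):

          ```python
          # An "X" on a 3x3 grid is both diagonals: cells 0, 2, 4 (centre), 6, 8.
          X_CELLS = {0, 2, 4, 6, 8}

          def is_x(checked_cells: set) -> bool:
              # Exactly the diagonal cells must be ticked, nothing more, nothing less.
              return checked_cells == X_CELLS
          ```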