1. 1

    meh, this is really a cat and mouse game. just test it like:

    if (navigator.webdriver || navigator.hasOwnProperty('webdriver')) {
      console.log('chrome headless here');
    }
    

    And there goes the article until the author can find a way to bypass this now…

    1. 6

      The point of the article is sort of that it’s a cat and mouse game. The person doing the web browsing is inherently at the advantage here because they can figure out what the tests are and get around them. Making the tests more complicated just makes things worse for your own users, it doesn’t really accomplish much else.

      // Patch hasOwnProperty so own-property checks for 'webdriver' report false.
      const oldHasOwnProperty = navigator.hasOwnProperty;
      navigator.hasOwnProperty = (property) => (
        property === 'webdriver'
          ? false
          : oldHasOwnProperty.call(navigator, property)
      );
      // Shadow the property itself so direct reads also come back false.
      Object.defineProperty(navigator, 'webdriver', {
        get: () => false,
      });
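
      With both of those patches applied, the check from the parent comment, navigator.webdriver || navigator.hasOwnProperty('webdriver'), evaluates to false.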
      
      1. 1

        Yet there are other checks that will surely work for a given window of time, like testing for specific WebGL rendering that headless Chrome cannot perform, or targeting a specific set of bugs that affect only headless Chrome.

        https://bugs.chromium.org/p/chromium/issues/detail?id=617551
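
        A minimal sketch of that kind of check, assuming headless Chrome falls back to a software renderer and that the WEBGL_debug_renderer_info extension is available (the renderer strings matched below are assumptions that vary by version and platform):

        // Sketch only: flag contexts whose unmasked renderer looks like a
        // software rasterizer, which older headless Chrome builds reported.
        function looksLikeSoftwareWebGL() {
          const canvas = document.createElement('canvas');
          const gl = canvas.getContext('webgl') || canvas.getContext('experimental-webgl');
          if (!gl) {
            return true; // no WebGL context at all is suspicious by itself
          }
          const debugInfo = gl.getExtension('WEBGL_debug_renderer_info');
          if (!debugInfo) {
            return false; // extension unavailable, so no verdict
          }
          const renderer = gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);
          return /swiftshader|llvmpipe/i.test(renderer);
        }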

        1. 1

          Well, eventually you just force people to run Chrome with remote debugging or Firefox with Marionette in a separate X session, mask the couple of vars that report remote debugging, and then you have to actively annoy your users to go any further.

          I scrape using Firefox (not even headless) with Marionette; I also browse with Firefox with Marionette because Marionette makes it easy to create hotkeys for strange commands.
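
          As a rough sketch, driving a non-headless Firefox over Marionette can look like the following, assuming geckodriver on the PATH and the selenium-webdriver npm package (the package choice is just one option, not necessarily what anyone here uses):

          const { Builder } = require('selenium-webdriver');

          (async () => {
            // No headless option is set, so this drives a full, visible Firefox window.
            const driver = await new Builder().forBrowser('firefox').build();
            try {
              await driver.get('https://example.com/');
              console.log(await driver.getTitle());
            } finally {
              await driver.quit();
            }
          })();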

          1. 1

            Even if there were no way to bypass that, don’t you think that you’ve sort of already lost in some sense once you’re wasting your users’ system resources to do rendering checks in the background just so that you can restrict what software people can choose to use when accessing your site?

            1. 3

              If a headless browser is required to scrape the data (rather than just requesting web pages and parsing the HTML), then the website is already perverse enough. No one would be any more surprised if it also ran a WebGL-based proof of work before rendering its most expensive thief-proof news articles from a blob of Malbolge bytecode, using WebGL and logic based on GPU cache timing.

              1. 1

                You’re paying a price, certainly. But depending on your circumstances, the benefits might be worth the cost.

        1. 3

          Great article!

          Weirdly, when I go to that link (or even reload the page) I end up 3/4 of the way down the page instead of at the top. Browser bug or something to do with the page scripting?

          1. 1

            This appears to be caused by the embedded iframes stealing focus. I tried some workarounds, but they unfortunately don’t seem to resolve the issue. If anybody knows a better fix for this, then I would love to hear it!
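
            For reference, the usual sort of workaround looks something like the sketch below (it may well be among the ones already tried, and it only helps if the jump happens before the load event fires):

            // Sketch: restore the scroll position after everything (including the
            // embedded iframes) has loaded, unless the URL asked for an anchor.
            window.addEventListener('load', () => {
              if (!window.location.hash) {
                window.scrollTo(0, 0);
              }
            });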

          1. 2

            Oh man, so many cool tools!

            BTW, you have a typo in the link to the powerline-shell page.

            And the x-macro link too!

            1. 1

              Thanks, that should be fixed now!

            1. 3

              A very Pareto optimal post :)

              I would question whether the approach taken is suitable for finding “good” blog posts. Hacker News gets gamed by plenty of people. There’s also a cult of multiple personalities that seems to take hold, with certain people’s blog posts getting submitted regularly because they’re guaranteed to be upvoted. Being first to submit a Gabriel Weinberg or Daring Fireball link guarantees votes for the poster, regardless of quality.

              Still, the Pareto approach is really well explained here, and it’s a shining example of the difference between an HN-optimal post and a good one, IMHO :)

              1. 3

                Thanks! I completely agree about the cults of personality. That was my motivation for the second list of posts where I restricted the maximum number of distinct submitters for a blog. It significantly limited the number of candidate blogs, but it did effectively eliminate the blogs that people race to submit.
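
                A sketch of that restriction, assuming submission records shaped like { blog, submitter } (the field names are made up for illustration):

                // Keep only blogs whose articles were submitted by at most
                // maxDistinctSubmitters different users.
                function blogsWithFewSubmitters(submissions, maxDistinctSubmitters) {
                  const submittersByBlog = new Map();
                  for (const { blog, submitter } of submissions) {
                    if (!submittersByBlog.has(blog)) {
                      submittersByBlog.set(blog, new Set());
                    }
                    submittersByBlog.get(blog).add(submitter);
                  }
                  return [...submittersByBlog]
                    .filter(([, submitters]) => submitters.size <= maxDistinctSubmitters)
                    .map(([blog]) => blog);
                }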

              1. 3

                How did you handle duplicates in this analysis? Many posts get submitted multiple times and it’s not clear if you counted those as different or combined them.

                1. 3

                  I didn’t do any sort of deduplication, so articles that were submitted multiple times were considered distinct in the analysis. I think that makes sense for the mean and median scores, but perhaps the duplicates should have been subtracted from the total article count for each blog. I just did a quick pass over the data and it looks like 92.4% of the URLs submitted to Hacker News are unique. When limiting that to the submissions that were identified as blog articles, the fraction is only slightly higher at 93.5%. I don’t think that should make that much of a difference, but you still raise a good point!
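
                  The kind of calculation being described is roughly the following sketch, assuming submission records with url and score fields (the field names are assumptions, not the actual dataset schema):

                  // Fraction of submissions (optionally above a score threshold)
                  // whose URLs are unique within that set.
                  function uniqueUrlFraction(submissions, minScore = 0) {
                    const eligible = submissions.filter((s) => s.score >= minScore);
                    if (eligible.length === 0) {
                      return 0;
                    }
                    const uniqueUrls = new Set(eligible.map((s) => s.url));
                    return uniqueUrls.size / eligible.length;
                  }

                  Passing minScore values of 5 or 10 gives the kind of filtered percentages discussed below.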

                  1. 2

                    I would expect a really high percentage of URLs to be unique because of spam (and articles that are so low-quality as to be almost indistinguishable from spam). I don’t mean to make work for you, but what does it look like if you only consider articles that get at least 5 upvotes? Or reach the front page?

                    1. 3

                      97.1% of blog submissions that get at least 5 upvotes are unique and that rises to 98.2% for submissions that get at least 10 (which is the number that I like to use as an approximation for making the front page). This is probably partially caused by even really great articles having a good chance of never getting upvoted though (in addition to spammy submissions being removed).