1. 6
  1.  

  2. 3

    Here’s the relevant JavaScript. It makes an XMLHttpRequest, which is then picked up by an AWK script scanning the web server logs:

    ...
    function reportOrNot()
    {
        if(hadActivity) {
            activityCount++;
            if(Math.random() < reportingProbability) {
                const url = '/articles/report.json?scrollPerc=' + Math.round(scrollPerc) + '&count=' + activityCount;
                const oReq = new XMLHttpRequest();
                oReq.open("GET", url);
                oReq.setRequestHeader("Cache-Control", "no-cache, no-store, max-age=0");

                // fallbacks for IE and older browsers:
                oReq.setRequestHeader("Expires", "Tue, 01 Jan 1980 01:00:00 GMT");
                oReq.setRequestHeader("Pragma", "no-cache");

                oReq.send();
            }
            hadActivity=false;
        }
    }
    ...
    

    The entire JavaScript code is barely larger than the snippet above, but I didn’t want to flood the comments section.
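
    For readers who would rather not write AWK, here is a minimal sketch of the log-reading side in JavaScript. It is not the script mentioned above: `parseReport` is a hypothetical helper, and it assumes you have already pulled the request path out of each log line.

    ```javascript
    // Hypothetical helper (not the AWK script from the post): pull the two
    // values that the snippet puts in the query string out of a request path.
    function parseReport(requestPath) {
        // The base URL is a placeholder; only the path and query string matter.
        const url = new URL(requestPath, "http://localhost");
        if (url.pathname !== "/articles/report.json") {
            return null; // not a report request
        }
        const scrollPerc = Number.parseInt(url.searchParams.get("scrollPerc"), 10);
        const count = Number.parseInt(url.searchParams.get("count"), 10);
        if (Number.isNaN(scrollPerc) || Number.isNaN(count)) {
            return null; // missing or malformed parameter
        }
        return { scrollPerc, count };
    }
    ```

    Calling `parseReport("/articles/report.json?scrollPerc=42&count=3")` gives an object with `scrollPerc` 42 and `count` 3; any other path gives `null`.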

    This is so simple and yet so powerful! I wish I had thought of this.

    1. 2

      I wonder if it would be acceptable under the GDPR to include something like a 16-bit cryptographically random number in a cookie in combination with this. (Note: as I understand it, GDPR does not require consent for cookies; it requires consent for tracking, irrespective of the underlying technology.)

      This would let you spot return visitors, with some error margin. (For long articles, it might be interesting to know whether people save the link and come back later or just leave. I’m somewhere in the middle and typically leave the browser tab lying around for a few days until I have some time.) If you have more than about 32K visitors to the site, you’ve got a high probability of collisions in the identifiers, so there’s some anonymity. If that’s not allowed, what’s the threshold? Would an 8-bit identifier be sufficient? Is an 8-bit identifier plus an IP address regarded as tracking?
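
      To put rough numbers on the collision question, here is a minimal sketch, assuming every visitor draws an identifier uniformly at random and independently. The function names are made up for illustration. It separates two quantities that are easy to conflate: the chance that any two visitors share an identifier (the birthday problem), and the share of visitors whose identifier is also held by someone else.

      ```javascript
      // Probability that at least two of `visitors` share an identifier,
      // assuming uniform, independent draws from 2^bits values.
      function collisionProbability(bits, visitors) {
          const ids = 2 ** bits;
          if (visitors > ids) {
              return 1; // pigeonhole: a collision is certain
          }
          // Sum logs of (1 - i/ids) to avoid underflow in the product.
          let logNoCollision = 0;
          for (let i = 1; i < visitors; i++) {
              logNoCollision += Math.log1p(-i / ids);
          }
          return 1 - Math.exp(logNoCollision);
      }

      // Probability that a given visitor's identifier is also held by
      // at least one of the other visitors.
      function sharedFraction(bits, visitors) {
          const ids = 2 ** bits;
          return 1 - Math.pow(1 - 1 / ids, visitors - 1);
      }
      ```

      Under these assumptions, a 16-bit identifier passes a 50% chance of at least one collision at a few hundred visitors, while `sharedFraction(16, 32768)` comes out at a bit under 40%. Which of the two matters for anonymity is a legal question this sketch does not answer.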

      1. 2

        I found some useful information related to this problem in “Introduction to the hash function as a personal data pseudonymisation technique” by the European Data Protection Supervisor.

        1. 1

          Thanks. That’s a great read, but it doesn’t quite address the use case I was suggesting. The identifier that I’m considering (hypothetically: I don’t really do anything in this space, so this isn’t something I’m actually going to implement) is a random number that’s chosen with a uniform distribution and with no checks for collisions. If I picked an 8-bit identifier and I had a thousand visitors to my site, then I’d expect about four of them to have the identifier 42, about four to have the identifier 255, and so on. Unlike a hash function, there’s no way of going from the user to the identifier: their browser just picks a random number and stores it in a cookie. If they clear their cookies, they’ll get a different random number (and I’ll see a few more unique visitors than I expect). If I get more collisions on a particular number, I’ll see fewer unique users than are really there. This doesn’t matter hugely if what I’m trying to track is whether people bookmark and come back to long articles, because I’m only going to get a rough approximation anyway.
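
          As a rough check of that arithmetic, here is a minimal sketch, again assuming uniform and independent draws. Both function names are hypothetical.

          ```javascript
          // Expected number of visitors holding any one particular identifier.
          function expectedVisitorsPerId(bits, visitors) {
              return visitors / 2 ** bits;
          }

          // Expected number of distinct identifiers seen across all visitors,
          // i.e. how many "unique visitors" a naive count would report.
          function expectedDistinctIds(bits, visitors) {
              const ids = 2 ** bits;
              return ids * (1 - Math.pow(1 - 1 / ids, visitors));
          }

          console.log(expectedVisitorsPerId(8, 1000)); // 3.90625
          console.log(expectedDistinctIds(8, 1000));
          ```

          With 8 bits and a thousand visitors, nearly all 256 identifiers are in use, so a count of distinct identifiers says little about how many people visited. The identifier is only useful for noticing that the same value came back.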

          It gets interesting when you add in IP addresses in logs because, although the short random ID is not a unique tracking token, I probably have very few (on the order of 1) visitors per IP address, so it may be sufficient to differentiate between people in a household, which might make it PII.

          1. 1

            Oh cool! Thanks for expanding. Sounds interesting.