www.goatcounter.com by @arp242 might be an alternative, too.
One thing I like about Goat Counter is its lightweight UI that loads without a noticeable delay. I may go for Plausible for some sites for the features Goat Counter doesn’t have, but that means trading a spinner-free UI for those features.
Also, not using Google Analytics (or anything) on your site is easy… when other people are involved, it becomes much harder, especially if your project is not a website/service, and you want to offload that work to someone else.
Fair point.
Staying lean and lightweight has been a goal of mine with Plausible. Of course, tradeoffs have been made. As the UI grew more complex, I started using React, but now I’m planning to move to Preact to save on bundle size.
The spinners are there because the actual stats engine is quite naive at the moment. I don’t pre-aggregate any of these graphs, so each one is computed from the raw events and takes time roughly linear in the amount of traffic. That worked fine for a while, but this approach is starting to become a problem.
The next things on my list are adding annual plans and then re-writing the stats engine. The goal is to fetch these graphs in roughly constant time: under 500ms at worst, though a good target would be 200ms.
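Roughly, the idea behind pre-aggregation is to maintain a per-day rollup as events come in, so rendering a graph touches one row per day instead of every raw pageview. A minimal sketch of that idea (hypothetical table and column names, not Plausible’s actual schema or stack):

```python
# Minimal pre-aggregation sketch: hypothetical schema, not Plausible's actual one.
import sqlite3
from datetime import date

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pageviews (site TEXT, day TEXT)")  # raw events
db.execute("CREATE TABLE daily_stats (site TEXT, day TEXT, views INTEGER, "
           "PRIMARY KEY (site, day))")                      # one row per site per day

def record_pageview(site: str, day: date) -> None:
    """Store the raw event and bump the pre-aggregated daily counter."""
    db.execute("INSERT INTO pageviews VALUES (?, ?)", (site, day.isoformat()))
    db.execute(
        "INSERT INTO daily_stats VALUES (?, ?, 1) "
        "ON CONFLICT (site, day) DO UPDATE SET views = views + 1",
        (site, day.isoformat()),
    )

def graph_data(site: str, start: date, end: date):
    """Read the graph from the rollup: work grows with days shown, not with traffic."""
    return db.execute(
        "SELECT day, views FROM daily_stats "
        "WHERE site = ? AND day BETWEEN ? AND ? ORDER BY day",
        (site, start.isoformat(), end.isoformat()),
    ).fetchall()

record_pageview("example.com", date(2020, 5, 4))
record_pageview("example.com", date(2020, 5, 4))
print(graph_data("example.com", date(2020, 5, 1), date(2020, 5, 31)))
# [('2020-05-04', 2)]
```

The same shape of fix extends to rollups per referrer, page, country, and so on.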
No promises, but you can expect less spinning in the near-term future :)
Looks similar to Clicky.
https://simpleanalytics.com/ is another alternative in the same space (no cookies, GDPR-compliant, more privacy oriented, “we don’t track you”).
yeah that’s another option!
“47 Percent Of Consumers Are Blocking Ads”. Wow, pretty good stuff!
PS: https://docs.plausible.io/ has Google Analytics running - WTF?!
We’re using Gitbook for docs and it looks like they include the GA script even when you don’t have it configured. That’s terrible, thanks for pointing it out.
I guess it shows how pervasive GA really is.
I’ve considered self-hosting our docs previously. This is a good reason to stop using Gitbook and make the move. Cheers.
If I got even one domain to remove GA, it’s a happy day for me – cheers!
PS: I’m happy to see that your analytics is not blocked in uBlock Origin.
Hosting your docs on GitHub Pages is pretty easy. Just need to learn your way around Jekyll.
Wow, that’s great news!
To celebrate, here’s a list of uBlock Origin-compatible filters to block all the annoyances on Forbes:
The approach that plausible.io takes seems naive:
[…]
To enhance the visitor privacy, we don’t actually store the raw visitor IP address in our database or logs. We run it through a one-way hash function to scramble the raw IP addresses and make them impossible to recover.
[…]
hash(website_domain + ip_address + user_agent)
To call that “impossible to recover” quite strange. Assuming they use a fast hash, an attacker could compute billions of combinations per second, i.e. all possible IPv4 addresses combined with the 100 most popular user agents in a couple of minutes.
A random cookie with a daily expiry date seems to be safer than that to me.
You are technically correct. IP addresses protected with this method could be recovered via an offline brute force cracking attack. I don’t think this is an important vulnerability though.
From a technical perspective, sensitive information like passwords is routinely stored the same way. There are well-established ways to do this securely, including strong hash functions, salt & pepper, securing the underlying storage, rate limiting, etc. These techniques can be applied here if additional security is needed. Plausible’s Data Policy page, which you quoted, does not provide sufficient detail to assess the quality of their hash implementation.
From a business perspective, even weak hashing significantly reduces the value of the data. It makes it much harder to join Plausible’s data with other analytics datasets on IP address. You would have to calculate an equivalent hash on the target dataset or carry out the attack you described. Hashing sends a signal that Plausible doesn’t plan to do this and makes it harder for them to change their minds about this later.
Finally, I think it’s strictly better to have weakly hashed IP addresses than to have IP addresses in the clear. It’s a step you can take to protect the privacy of your users while still improving your site and finding problems.
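To make the “salt & pepper” point concrete, here is a minimal sketch of a peppered (keyed) visitor hash. This is purely illustrative; the quoted data policy doesn’t say whether Plausible does anything like it:

```python
# Illustrative only: a keyed ("peppered") hash of the visitor identifier.
# The pepper lives outside the analytics database (env var, secret store, ...),
# so a leaked table of hashes can't be brute-forced offline without it.
# Rotating the pepper (e.g. daily) also limits how long any ID stays linkable.
import hmac
import hashlib

SECRET_PEPPER = b"load-me-from-a-secret-store-and-rotate-me"  # hypothetical value

def visitor_id(website_domain: str, ip_address: str, user_agent: str) -> str:
    message = f"{website_domain}|{ip_address}|{user_agent}".encode()
    return hmac.new(SECRET_PEPPER, message, hashlib.sha256).hexdigest()

print(visitor_id("example.com", "203.0.113.7", "Mozilla/5.0 (X11; Linux x86_64)"))
```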
Sadly, there’s no hope for anonymizing IP addresses (same for phone numbers). For IPv4 the search space is just too small. Even properly hashed passwords are crackable if the passwords are weak, and an IP address is a weak password.
Password hashing uses slow hashes, but for request logging that would be a staggering amount of CPU. And if you speed it up by keeping an IP->hash cache, then you’re creating a deanonymizing device.
website_domain is known from the database, and user_agent presumably too (and if it’s not, there are only ~1k common User-Agents, and ~25k in total being used). The number of IPv4 addresses is limited to ~4 billion (and you can reduce this number by excluding AWS netblocks and such).
In 2017, hardware rented from AWS could do 59,971 million hashes per second, i.e. roughly 60 billion hashes per second. Even on (much) slower hardware it’s feasible to simply crack the entire database.
So hashing the IP address together with this predictable data results in so few permutations that you might as well not do it at all. It’s most certainly not “impossible to recover”, as claimed.
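To put numbers on that, a quick back-of-the-envelope calculation using the figures above (and assuming a fast, unsalted hash):

```python
# Back-of-the-envelope: size of the search space and time to exhaust it,
# using the figures quoted in this thread.
ipv4_addresses = 2**32        # ~4.3 billion possible IPv4 addresses
common_user_agents = 1_000    # "~1k common User-Agents"
all_user_agents = 25_000      # "~25k in total being used"
hash_rate = 59_971e6          # hashes/second on 2017 AWS hardware

for user_agents in (common_user_agents, all_user_agents):
    candidates = ipv4_addresses * user_agents
    minutes = candidates / hash_rate / 60
    print(f"{user_agents:>6} UAs: {candidates:.2e} candidates, ~{minutes:.0f} min to exhaust")

#   1000 UAs: 4.29e+12 candidates, ~1 min to exhaust
#  25000 UAs: 1.07e+14 candidates, ~30 min to exhaust
```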
Regardless of whether or not this is a bad thing, it’s probably not considered anonymisation under GDPR.
Random cookies with a short expiry are simpler and much better for privacy.
So why design something more complicated and weaker, while claiming their approach is better and making strong but wrong claims like “impossible to recover”?
I agree, but this method avoids the ePrivacy directive which requires you to annoy your users with cookie notices.
Oh man, I’d read up on the GDPR but hadn’t yet on that directive. You seem to be right… facepalm
Yeah, it’s unfortunate. I looked at this just last week; to quote myself:
I considered generating the ID on the client side as a session cookie or localStorage, but this is tricky due to the ePrivacy directive, which requires that “users are provided with clear and precise information in accordance with Directive 95/46/EC about the purposes of cookies” and should be offered the “right to refuse”, making exceptions only for data that is “strictly necessary in order to provide a [..] service explicitly requested by the subscriber or user”.
Ironically, using a cookie would not only make things simpler but also more privacy friendly, as there would be no salt stored on the server, and the user has more control. It is what it is 🤷♂️
The (potentially) good news is that the ePrivacy regulation is slated to replace the ePrivacy directive, but I’m not entirely sure what the current status of that is; the most recent update I could find says that “It’s unlikely that the regulation will be passed until the second half of 2019 and could be delayed even further in 2020”, but without any details, and I can’t really find anything on the EU website either (such a cluttered website). I’m guessing that the pandemic also isn’t helping.
I’m also not sure if it will really help here; I haven’t looked at the details, but from a quick glance several months ago it seems to propose a kind of legally enforceable Do-Not-Set-Cookie which would prevent setting cookies. A focus on storing data on the client seems wrong anyway, since there are many ways to identify people without it; but again, I haven’t looked at all the details (going through all of the ePrivacy directive and GDPR was enough legalese for me 😅)
Instead of Google Analytics, I am using Google Search Console. Does this have any downsides?
I am able to get traffic stats for visits coming from Google, links from other websites (found by Google), reports about mobile user-experience issues, speed reports, etc. And all that’s required is adding a user ID in an HTML tag, no JavaScript.
That is what I use myself to check how I’m ranking on Google’s search results. Accurate data directly from Google but without me needing to install any Google scripts on my site.
For years we’ve had a number of very advanced solutions like GA and Matomo, as well as very simple solutions like log parsing and whatnot; but what’s been missing for a while is a “middle ground”. Here’s what I wrote a while ago:
I tried a number of existing solutions, and found that they are either very complex and designed for advanced users, or far too simplistic. In addition, almost all hosted solutions are priced for business users (≥$10/month), making them too expensive for personal/hobby use.
What seems to be lacking is a “middle ground” that offers useful statistics to answer business questions, without becoming a specialized marketing tool requiring in-depth training to use effectively.
To quote one of the Microsoft Word designers: “most people only use 5% of Word, but everyone uses a different 5%”. I rather like that quote, and it gives a different perspective on “software bloat”; I think it applies to a lot of “bloated” software.
Tools like Plausible (and various others that have been released in the last year or so) only have maybe 25-50% of the features and may not offer what everyone needs, but it turns out that’s actually just fine! Writing software to solve every use case isn’t always needed.
Exactly! And thanks for sharing!
I guess comments are a good place to comment on some points, so:
Yes, Google is large. But most of us would have liked to be the founders of a company that grew large. Google was small and loved once. But, like the story of David & Goliath, which ends with David becoming king and quite evil, Google became the next Microsoft, a.k.a. a geek scapegoat.
Google’s infrastructure is really good at delivering any bloated script very, very fast.
Very much yes.
Not really, it’s not that hard to configure it to be GDPR compliant.
It doesn’t use cookies or IPs if you configure it right.
Funny thing, CloudFlare offers free analytics at the HTTP request level and the difference in numbers is staggering. I’ve seen 1000x more unique users detected. Unfortunately it’s hard to say how many of them are bots…
FYI, Plausible does not have a free tier.