1. 16
  1.  

  2. 10

    In order for a system to understand sessions, a user must be identifiable (in some way)

    Minor nitpick: I wouldn’t necessarily say that a user needs to be identifiable, but rather that you need to have some way to identify a “browsing session”. I think there’s a bit of a subtle difference between the two: you don’t really care about tracking users/people and who is doing what, but rather about “what is being clicked on from where”.

    Also, one thing you need to be very careful with in SQLite is concurrent writes; I’d recommend connecting with journal_mode=wal, which should fix most of these issues (this is probably not a big issue with a low number of pageviews, but it becomes much more of an issue at dozens of pageviews/second).
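
    A minimal sketch of what that can look like in Go (this assumes the mattn/go-sqlite3 driver; the PRAGMA itself works with any SQLite driver):

      // Open SQLite with write-ahead logging so readers don't block on a
      // writer. Sketch only; assumes the mattn/go-sqlite3 driver, but the
      // PRAGMA itself works with any driver.
      package main

      import (
          "database/sql"
          "log"

          _ "github.com/mattn/go-sqlite3"
      )

      func main() {
          db, err := sql.Open("sqlite3", "analytics.db")
          if err != nil {
              log.Fatal(err)
          }
          defer db.Close()

          // WAL is persistent for the database file, so this only needs to
          // run once, but it's harmless to set on every start.
          if _, err := db.Exec("PRAGMA journal_mode=WAL;"); err != nil {
              log.Fatal(err)
          }
          // A busy timeout helps when a write does hit a locked database.
          if _, err := db.Exec("PRAGMA busy_timeout=5000;"); err != nil {
              log.Fatal(err)
          }
      }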

    1. 3

      Minor nitpick: I wouldn’t necessarily say that a user needs to be identifiable, but rather that you need to have some way to identify a “browsing session”. I think there’s a bit of a subtle difference between the two: you don’t really care about tracking users/people and who is doing what, but rather about “what is being clicked on from where”.

      Thank you, this is a good point. I will update that.

      journal_mode=wal

      Very useful advice! I wasn’t familiar.

      1. 1

        journal_mode=wal

        Very useful advice! I wasn’t familiar.

        Yeah, I learned this the hard way as well 😅 In spite of being just a simple parameter it took me quite some time to figure out the root cause and solution. You can design things so that concurrent writes never occur, but you need to be really careful and it’s easy to make a “mistake” and write to a database concurrently (which is what happened to me).

    2. 3

      creates a server-side hash by combining the website’s ID, a User-Agent, an IP address, and a rotating salt

      For anyone who is interested in the privacy implications of using hashed IPs, UAs, and such, this is a comprehensive and real-world read on the topic: https://edps.europa.eu/data-protection/our-work/publications/papers/introduction-hash-function-personal-data_en

      Ironically, one might argue that assigning a user a random identifier for session tracking (yes, this requires user consent, but that is also a good thing from a privacy perspective) is still the best option when it comes to protecting user privacy.
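
      For reference, the fingerprint being discussed is roughly something like this (a sketch based on the quoted description, not GoatCounter’s actual code; names and inputs are illustrative):

        // Sketch of the quoted fingerprinting scheme; not GoatCounter's
        // actual code, and the names are illustrative.
        package main

        import (
            "crypto/sha256"
            "fmt"
        )

        // sessionHash derives an identifier from data the server sees on
        // every request. The rotating salt bounds how long the hash stays
        // linkable, but the inputs are low-entropy, which is exactly the
        // concern the EDPS paper goes into.
        func sessionHash(siteID, userAgent, ip, salt string) string {
            data := siteID + "|" + userAgent + "|" + ip + "|" + salt
            sum := sha256.Sum256([]byte(data))
            return fmt.Sprintf("%x", sum[:])
        }

        func main() {
            fmt.Println(sessionHash("site-1", "Mozilla/5.0", "203.0.113.7", "salt-42"))
        }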

      1. 1

        An important aspect that isn’t mentioned is that none of this gets stored to the database; it’s just an in-memory map of hash → random UUID (and the UUID is what gets stored to the database, and the hash is only kept in memory for 8 hours at the most).

        The only exception is that it gets stored to disk on shutdown, to be re-read and deleted when the app starts again. I might actually remove that; I’m kinda on the fence about it, but persisting this information between restarts seemed a decent enough trade-off. In a way, the whole salted hash thing is kind of superfluous (I mean, it’s not like nginx keeps this information in memory in some sort of hash, never mind access logs), and perhaps a bit of a distraction, since people seem to focus on that more than on “it’s never stored to disk”, which is probably the more important bit (this is perhaps also my failure in not emphasizing that enough).

        So long-term identification is essentially a non-issue as far as I can see.
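
        Roughly, the idea is something like this (a simplified sketch to illustrate the above, not the actual code):

          // Simplified sketch: the salted hash only lives in an in-memory
          // map pointing at a random UUID, and only the UUID is ever
          // written to the database. Not the real implementation.
          package main

          import (
              "fmt"
              "sync"
              "time"

              "github.com/google/uuid"
          )

          type session struct {
              id      string    // random UUID; this is what the DB stores
              created time.Time // used to expire the in-memory entry
          }

          type sessions struct {
              mu sync.Mutex
              m  map[string]session // key: salted hash, never stored in the DB
          }

          // get returns the UUID for a hash, creating a new session when
          // the hash is unknown or older than 8 hours.
          func (s *sessions) get(hash string) string {
              s.mu.Lock()
              defer s.mu.Unlock()
              if sess, ok := s.m[hash]; ok && time.Since(sess.created) < 8*time.Hour {
                  return sess.id
              }
              sess := session{id: uuid.NewString(), created: time.Now()}
              s.m[hash] = sess
              return sess.id
          }

          func main() {
              s := sessions{m: make(map[string]session)}
              fmt.Println(s.get("some-salted-hash"))
          }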

        1. 1

          I mean, it’s not like nginx keeps this information in memory in some sort of hash, never mind access logs

          I mean, it’s not like it’s required to use nginx when you expose a web service to the internet.

          In addition to that I have to say it’s an odd argument to defer the definition of your privacy standards to the weakest link in your stack. Especially when labeling yourself “privacy-friendly” I’d expect something more ambitious.


          Sidenote: the OP is not about you or your product, so implementation details of your service might be off topic, especially if you don’t even give any additional context. Just as a hint.

          1. 1

            In addition to that I have to say it’s an odd argument to defer the definition of your privacy standards to the weakest link in your stack. Especially when labeling yourself “privacy-friendly” I’d expect something more ambitious.

            I’m not entirely sure what you mean by this, to be honest, but my comment wasn’t intended to give a comprehensive overview of all my thoughts about this, just to give some extra information that it’s not really about hashing as such.

            Sidenote: the OP is not about you or your product, so implementation details of your service might be off topic, especially if you don’t even give any additional context. Just as a hint.

            Huh? The part you quoted talked about it specifically and by name. I find it a bit strange to make a comment about it yourself first and then give this sort of reply when I offer some additional information and context. Anyone could have done that; it just so happens I’m posting here already.

        2. 1

          Concur. Hashing sensitive attributes has some subtle implications you need to think through.

        3. 2

          I don’t understand tracking pixels — why are people so opposed to tracking via server side logs? “Nginx access log analyzer as a service” seems like something that should exist.

          1. 4

            This question often comes up; I don’t think anyone is “opposed” to using server logs, but there are a few issues:

            The first issue is that a lot of people just don’t have access to the server logs; I use Netlify and it doesn’t have server logs. The same goes for GitHub Pages and a lot of comparable services. Especially for simple(-ish) static sites, running your own nginx or whatnot is kind of overkill.

            Even when you do have server logs, reliably parsing them is kinda tricky; it’s certainly not impossible and is done well by e.g. goaccess (and support for GoatCounter should be there soon™, including as “log analyzer as a service”), but it’s certainly a lot more complex than just adding a tracking pixel.

            Some information isn’t possible to record in access logs. I’m mostly thinking about screen size (it’s pretty useful to know how many people are using mobile, for example), some attributes to identify bots (window._phantom, for example; quite a few bots are filtered out with this), and the page <title> (which is just useful for UX, although you can also fetch this from the source). It’s also harder to get some details right, like using the canonical path (if any). Doing any sort of “event tracking” is also rather hard, never mind doing anything with SPA frameworks. (There’s a rough sketch of what a pixel request can carry below.)

            It’s also not really more “privacy friendly”; access logs tend to record more PII and are persisted longer.
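
            To make that a bit more concrete, here’s a rough sketch of the kind of endpoint a pixel or JS snippet can report to. The parameter names are made up (it’s not GoatCounter’s actual API), but it shows the sort of data an access log never sees:

              // Sketch of a counting endpoint; parameter names are made up
              // and this is not GoatCounter's actual API.
              package main

              import (
                  "log"
                  "net/http"
              )

              func count(w http.ResponseWriter, r *http.Request) {
                  q := r.URL.Query()
                  log.Printf("path=%q title=%q screen=%q canonical=%q bot=%q",
                      q.Get("p"), // page path as the JS saw it
                      q.Get("t"), // document.title
                      q.Get("s"), // screen size, e.g. "390,844"
                      q.Get("c"), // canonical URL, if the page declares one
                      q.Get("b")) // client-side bot hint, e.g. window._phantom
                  // A real endpoint would store this and reply with a 1×1 GIF;
                  // a 204 keeps the sketch short.
                  w.WriteHeader(http.StatusNoContent)
              }

              func main() {
                  http.HandleFunc("/count", count)
                  log.Fatal(http.ListenAndServe(":8080", nil))
              }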

            CC: @ianloic

            1. 1

              I’m working on this: https://adi.tilde.institute/cbl/.

            2. 1

              Whatever happened to just analyzing server logs? We did that in the ’90s and it was fine. If you’re using 3rd-party hosting or CDNs, you’re already giving someone else access to all your users’ activity, so you might as well use an off-the-shelf, sophisticated analytics tool.