1. 15
  1.  

      1. 3

        Also https://matomo.org/ and https://usefathom.com/ are popular alternatives

      2. 3

        A little off-topic, but it is really fun seeing analytics for sites you visit. It’s like seeing how many votes or comments a post gets on here/reddit/etc. I wish more sites allowed you to see their analytics! I know that’s not feasible or necessary, but it would be fun.

        1. 1

          Glad you like it! It's also much more convenient than logging into a separate domain.

        2. 3

          Does anyone still parse access logs? Seems like a good option for a small site with limited aims. It adds no page weight, can’t be blocked, and doesn’t even require much back-end infrastructure.

          1. 2

            Yes, but there are a couple of niggles that I may (or may not) end up building something to try and ‘solve’:

            (a) all the log analysers pretty much assume you have just one web server.

            (b) they’re either quite fully featured but look like they were designed in the 90s, or they look nice but have some glaring gaps in functionality.

            I’ve toyed with the idea (only to the point of some PoC stuff so far) of a “simpler” analyser that would work for the use-cases I’ve seen: really simplistic ‘parsing’ of the log entries (probably just shell for the initial version), then relying on Redis’ increment functionality to bump up counters for the various metrics, using a few date-related keys.
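
            To make that concrete, the first cut could be little more than something like this (a sketch only: the log path and key names are made up, and the field positions assume the common/combined log format):

            # one pass over an access log, bumping date-keyed counters in Redis
            awk '{ day = substr($4, 2, 11)          # e.g. 10/Oct/2023, from the timestamp field
                   print "INCR hits:" day           # one counter per day
                   print "INCR hits:" day ":" $7    # one counter per day per request path
                 }' /var/log/nginx/access.log | redis-cli > /dev/null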

            1. 1

              Let me know if you ever get around to building such a thing. I would be happy to test it. All I really want is a graph of visitors over time broken down by page. I had been using Google Analytics, which was overkill and I was feeling guilty about supplying traffic data to Google. Now I just run less on the access file occasionally, which is nearly enough for the traffic volume (can I call less an MVP for web traffic analysis?)
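
              (For what it’s worth, the obvious step up from less - still with zero infrastructure - would be something along these lines, assuming the common/combined log format and treating unique IPs as a stand-in for visitors:)

              # rough "visitors over time": unique client IPs per day ($1 = IP, $4 = timestamp)
              awk '{ print substr($4, 2, 11), $1 }' access.log | sort -u | cut -d' ' -f1 | uniq -c
              # ...and hits per page for one day (the date is just an example)
              grep '10/Oct/2023' access.log | awk '{ print $7 }' | sort | uniq -c | sort -rn | head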

              1. 3

                Thanks for the offer. I’ll be sure to post something here if I get something working.

                You may want to also look at GoAccess (https://goaccess.io) - it does static log analysis and might well be enough for what you need.

                The issue for us has been (a) it’s a PITA to make it work across multiple servers, and (b) it has no built-in ability to filter to a given date range. On the CLI it’s possible (although not necessarily simple) to just filter the input for an ad-hoc run, but from the ‘HTML’ frontend (i.e. what business people want to use) it’s just not possible.
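
                (For reference, the ad-hoc CLI filtering I mean is roughly this kind of thing - the month is just an example, and --log-format has to match whatever your logs actually are:)

                # keep only the May 2021 entries, then hand them to goaccess on stdin
                awk '$4 ~ /May\/2021/' /var/log/nginx/access.log |
                  goaccess --log-format=COMBINED -o report.html -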

                1. 2

                  To gather logs from multiple web servers, I am using:

                  # run against each web server in turn: dump its (possibly compressed) logs,
                  # drop the feed hits, and pipe everything into a single goaccess report
                  for h in web01 web02 web03 web04; do
                    ssh "$h" zcat -f /var/log/nginx/vincent.bernat.ch.log\* | grep -Fv atom.xml
                  done | goaccess --output=goaccess.html ...
                  
                  1. 1

                    Thanks for the suggestion. I’ll be sure to check it out.

                    1. 1

                      I love goaccess and use it all the time. I try to keep things on one server, but I have used it with multi-server setups.

                      Could you be specific about what is a PITA when handling multi-server setups? How is it any more complicated (or simpler) than any other tool? You always need to aggregate the data, whatever solution you use. What’s specific about goaccess?

                      1. 1

                        So the problem is that we want the analytics frontend to be served from multiple servers too, and want it to work in real-time HTML mode.

                        As much as analytics isn’t really business-critical, the goal here is that nothing we control for prod is a SPOF.

                        So the kind-of-working setup now relies on rsyslog to take varnishncsa access logs, send them to / receive them from peer syslog servers, and also write them to disk locally, where goaccess consumes them. This isn’t what I’d call a robust setup.

                        The plan in my head/on some notes/kind of in a PoC is to have the storage layer (Redis is my idea for now; it might end up being something else, or adaptable to a couple of options) be the point of aggregation. Each log-producing service (in our case Varnish, but in another case it might be Nginx or Apache or whatever can produce access logs) would have something running locally that does really basic handling of each log entry and then just increments a series of counters in the shared storage layer, based on the metrics from the entry.
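
                        In sketch form (hand-wavy - the log path, Redis host, and key names here are placeholders), each box would run something like:

                        # follow the local access log and bump shared, date-keyed counters in a central Redis
                        tail -F /var/log/varnish/access.log | while read -r line; do
                          path=$(echo "$line" | awk '{ print $7 }')   # request path in combined-style formats
                          redis-cli -h redis.internal INCR "hits:$(date +%F)" > /dev/null
                          redis-cli -h redis.internal INCR "hits:$(date +%F):$path" > /dev/null
                        done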

                        1. 1

                          Rsyslog, fluentd, or just watch the logs with tail or what have you and append to a remote server via socket.

                          I don’t really see the use case of serving the UI from several servers. They are behind a proxy anyways.

                          Personally I would just read the files over SSH like @vbernat suggests.

                          1. 1

                            They are not behind a single proxy, that’s the point.

                            Copying files via ssh means you lose any capability for real-time logs too.

                          2. 1

                            GoatCounter supports log parsing as well; the way it works is a bit different from e.g. goaccess: you still have your main goatcounter instance running as usual, and you run goatcounter import [..], which parses the logfiles and uses the API to send them to goatcounter. The upshot is that it should solve problems like this, and it’s generally a bit more flexible.

                            (disclaimer: I am the GoatCounter author, not trying to advertise it or anything, just seems like a useful thing to mention here)

                            1. 1

                              That’s interesting, thanks.

                              1. 1

                                Hey, I don’t want to turn this into a GoatCounter FAQ, but there’s no way to have the computed metrics be shared somehow, is there (i.e. so the analytics aren’t reliant on a single machine being up to record/view)?

                                1. 1

                                  I would solve that by running two instances with the same PostgreSQL database.

                                  Other than that, you can send the data to two instances, and you can export/import as CSV. But in general there isn’t really a failover solution built in. I think using the same (possibly redundant) PostgreSQL database should work fairly well, but it’s not a setup I’ve tried, so there may be some issues I’m not thinking of at the moment (if there are, I expect them to be solvable without too many problems).

                                  1. 1

                                    The shared DB solution sounds most like what I had in mind, thanks - I wasn’t even aware it supports Postgres. I guess it’s a deliberate decision to leave the self-hosting info on GH and have the main site be more about the hosted version?

                                    1. 1

                                      I guess it’s a deliberate decision to leave the self-hosting info on GH and have the main site be more about the hosted version?

                                      Yeah, that’s pretty much the general idea; a lot of people looking for the self-hosted option aren’t really interested in details of the SaaS stuff, and vice versa. Maybe I should make that a bit clearer actually 🤔

                  2. 1

                    Analytics projects are fun because they have a very reachable MVP.

                    For JavaScript analytics on a technical blog, there’s precedent to assume 60% are using an ad-blocker (I’m not sure whether they would filter yours).

                    My estimate is that I got ~2.5x more unique, non-robot visitors than reported by analytics (~38k versus ~13k), meaning that roughly 60% of users are using an adblocker. Read to the end to see how I got these numbers. source

                    1. 1

                      Good point - now that you mention it I remember reading that in your original blog post.

                      Ad blockers don’t disable all JavaScript, and my custom snippet that calls a cloud function wouldn’t be on any block lists. Therefore I don’t see why it would be blocked by default. What do you think? And how many robots are out there - what are they all doing?!

                      Do you know what the non-JavaScript analytics options are? Sneaky urls inside img tags?

                      1. 1

                        Analysis of server logs.

                        1. 1

                          Ad blockers don’t disable all JavaScript

                          As long as the script or endpoint isn’t called analytics or something, and provided nobody adds it to a long block list somewhere.

                          Do you know what the non-JavaScript analytics options are? Sneaky urls inside img tags?

                          (Assuming this is for a static website)

                          Tiny tracking images normally get dinged as well. There are ways of catching the initial visit, e.g. by loading CSS from an analytics endpoint, but there are usually caching problems (same with img antics).