1. 12

  2. 4

    This article is full of great advice. Thank you for sharing!

    I get to be happy about an article about logging, for a change \o/

    1. 3

      I first read this article by the pool in Jamaica and then proceeded to reread it several times. Eventually it made its way to a semi-permanent tab for a period of months because I love it so much.

    2. 2

      Logging is a user interface for developers. Like any UX, some care has to be taken to design it properly so that it becomes useful.

      1. 2

        At work, we log everything via syslog(), which is fed into Splunk, which means we can search in pretty much real time across the entire system. Each message we log via syslog() gets a unique tag, for example:

        T0011: Can not open configuration file %s: %s
        T0180: Significant clock skew detected (%llu uS)
        

        The tags (T0011 and T0180 here) are just allocated sequentially as needed (the T component has 246 defined log messages). Each component has a separate prefix (they are not restricted to a single letter; I wrote a component with the prefix NIEnnnn, for example). This makes it easy to search for a particular log message or for all messages from a particular component.
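
        A minimal sketch of how such a tag scheme could be kept in one place, assuming a small per-component catalog. The `TagCatalog` class and its method names are hypothetical, not what we actually run, and the format string uses Python's `%` conventions rather than C's:

        ```python
        class TagCatalog:
            """Hypothetical per-component registry of tagged log messages."""

            def __init__(self, prefix, width=4):
                self.prefix = prefix      # component prefix, e.g. "T" or "NIE"
                self.width = width        # number of digits in the tag
                self.templates = {}       # tag -> message template

            def define(self, template):
                """Allocate the next sequential tag for a new message template."""
                tag = f"{self.prefix}{len(self.templates) + 1:0{self.width}d}"
                self.templates[tag] = template
                return tag

            def render(self, tag, *args):
                """Build the final log line, tag first, so it is easy to search for."""
                return f"{tag}: {self.templates[tag] % args}"


        catalog = TagCatalog("T")
        CONFIG_OPEN_FAILED = catalog.define("Can not open configuration file %s: %s")
        line = catalog.render(CONFIG_OPEN_FAILED, "/etc/app.conf", "No such file or directory")
        ```

        The resulting `line` is what would be handed to syslog. Because the tag is a fixed string at the start of the message, a search for the tag finds every occurrence regardless of the arguments.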

        The other thing we do is log key performance indicators (KPIs). Received a request? Log it. Experienced an error? Log it. I took the idea of statsd from Etsy, which makes it easy to log such events. For example, when a component I use experiences a DNS timeout, I run:

        stat.incr("nie.dns.timeout")
        

        which sends a message to statsd to increment the counter “nie.dns.timeout”. At regular intervals, statsd outputs all the information via syslog() (which feeds into Splunk). It was simple enough to write my own version of statsd (the Etsy one does more than we wanted). It’s easy to add new KPI entries (if statsd receives a name it doesn’t have, it starts a counter for it), and I’ll add metrics just because.
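
        A rough sketch of the counter side of such a daemon, assuming the `name:value|c` counter message used by Etsy's StatsD on the wire. The `CounterStore` class, `encode_incr` helper, and the `name=count` flush format are made up for illustration, and sample rates and other metric types are left out:

        ```python
        from collections import defaultdict


        def encode_incr(name):
            """Client side: build the datagram that increments a counter by one."""
            return f"{name}:1|c".encode("ascii")


        class CounterStore:
            """Hypothetical in-memory counter table for a tiny statsd-like daemon."""

            def __init__(self):
                # Unknown names start at zero, so new KPIs need no registration.
                self.counts = defaultdict(int)

            def handle(self, datagram):
                """Apply one 'name:value|c' message; malformed input is ignored."""
                try:
                    name, rest = datagram.decode("ascii").strip().split(":", 1)
                    value, kind = rest.split("|", 1)
                    if kind == "c" and name:
                        self.counts[name] += int(value)
                except (ValueError, UnicodeDecodeError):
                    pass

            def flush(self):
                """Return one line per counter and reset for the next interval."""
                lines = [f"{name}={count}" for name, count in sorted(self.counts.items())]
                self.counts.clear()
                return lines
        ```

        In a real daemon, `handle` would be called for each datagram read from a UDP socket, and a timer would pass the lines from `flush` to syslog at each interval.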

        Given all that, we have debugged some pretty hairy situations, like bad network routing, memory leaks, certain crashes, and the utter garbage requests we can’t parse from our customers (who I would have expected to know better).

        1. 1

          > The other thing we do is log key performance indicators (KPI).

          You might want to have a look at Prometheus with Grafana for that.

          1. 1

            I’ve looked at Prometheus (another team at work uses it), and the thing I didn’t like about it was that it polls for data instead of receiving data. Yes, there is a way to push data into Prometheus, but from what I’ve read, using it is strongly discouraged in favor of letting Prometheus poll for the data. In my case, this means embedding a web server into several components that don’t use HTTP (or even TCP, for that matter; they all use UDP) for network transfers.
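
            For a sense of what that embedding involves, here is a minimal sketch using only the Python standard library. The `COUNTS` table, the port, and the function names are hypothetical. It serves counters in the Prometheus text exposition format on every path, and rewrites dots to underscores because Prometheus metric names cannot contain dots:

            ```python
            import threading
            from http.server import BaseHTTPRequestHandler, HTTPServer

            COUNTS = {"nie.dns.timeout": 0}  # hypothetical counters kept by the component


            def render_metrics(counts):
                """Render counters as Prometheus text exposition lines."""
                lines = []
                for name, value in sorted(counts.items()):
                    metric = name.replace(".", "_")  # dots are not valid in metric names
                    lines.append(f"# TYPE {metric} counter")
                    lines.append(f"{metric} {value}")
                return "\n".join(lines) + "\n"


            class MetricsHandler(BaseHTTPRequestHandler):
                def do_GET(self):
                    # Answer every GET with the current counter values.
                    body = render_metrics(COUNTS).encode("utf-8")
                    self.send_response(200)
                    self.send_header("Content-Type", "text/plain; version=0.0.4")
                    self.send_header("Content-Length", str(len(body)))
                    self.end_headers()
                    self.wfile.write(body)


            def start_metrics_server(port=8000):
                """Run the endpoint on a background thread next to the real workload."""
                server = HTTPServer(("127.0.0.1", port), MetricsHandler)
                threading.Thread(target=server.serve_forever, daemon=True).start()
                return server
            ```

            Every UDP-only component would need something like this, plus a TCP port, just so it can be polled.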