1. 29
  1.  

  2. 21

    The article never mentions what is, in my humble opinion, the most important part of good logging practice: structured logging. Without it you end up with weird regexes or other hacks trying to parse your log messages.

    1. 4

      As a sibling post notes, if you use structured logging you’re mostly throwing away the idea that the entries must be easily parsable by a human. If that’s the case, and we’ll need a custom method of displaying the structured logs in a human friendly way, I believe we should forego plain text altogether and gain the benefits of logging directly to binary.

      1. 5

        You can do human readable structured logging if you use key="value" formats inside text messages. Some people still prefer json, but there is a middle ground.
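
        A minimal sketch of that middle ground, using nothing but Python’s stdlib logging (the event and field names here are made up):

```python
import logging

def logfmt(event: str, **fields) -> str:
    """Render an event plus flat key="value" pairs on one line."""
    pairs = " ".join(f'{k}="{v}"' for k, v in fields.items())
    return f"{event} {pairs}" if pairs else event

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("billing")

# Emits: INFO charge failed user_id="42" amount="19.99" currency="EUR"
log.info(logfmt("charge failed", user_id=42, amount="19.99", currency="EUR"))
```

        The line stays greppable and human-scannable, but a consumer can still split it into key/value pairs mechanically.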

        1. 2

          If you need just key=value, that’s not really structured in my opinion.

          1. 4

            Why not?

            1. 2

              Because the amount of information added by this format would be infinitesimal over a line-based logger with manual tokenization. The reason you’d want a structured logger is to attach proper context to a message. Unless you’re working with simple cases, the structure that would offer such context is more than one level deep.

              1. 3

                Hmm, definitely not.

                Structured logging is about decorating log events with just enough of a schema to make them machine parseable, so that searching, aggregating, filtering, etc. can be more than a crapshoot. Deeply nested events significantly increase the complexity of that schema, and therefore the requirements of the consumer.

                By default, structured logs should be flat key/value pairs. It gets you the benefits of richer parseability, without giving up the ability to grep.

      2. 2

        Excellent point. That’s become such second nature to me by now, that I forgot to even mention it!

        1. 2

          I’m surprised it wasn’t mentioned, but the larger advantage of passing a logger around to constructors is the ability to then have nested named loggers, such as

          Battery.ChargingStatus.FileReader: Failed to open file { file: "/tmp/battery charge", error: ... }
          Battery.ChargingStatus: Failed to access status logs, skipping report
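
          In Python’s stdlib logging, for example, that nesting falls out of dotted logger names; a child inherits level and handlers from its ancestors (a sketch, reusing the names from the example above):

```python
import logging

logging.basicConfig(level=logging.WARNING, format="%(name)s: %(message)s")

# Dotted names form a hierarchy: "Battery.ChargingStatus.FileReader" is a
# child of "Battery.ChargingStatus", which is a child of "Battery".
reader_log = logging.getLogger("Battery.ChargingStatus.FileReader")
status_log = logging.getLogger("Battery.ChargingStatus")

reader_log.error('Failed to open file file="/tmp/battery charge"')
status_log.warning("Failed to access status logs, skipping report")

# And you can quiet (or verbosify) a whole subtree in one place:
logging.getLogger("Battery").setLevel(logging.ERROR)
```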
          
          1. 1

            On top of that, a structured logger, if implemented properly, can often be faster and can be operated at granular levels (as the other comments pointed out, sometimes you want to turn on some logs on the fly at some locations, not all logs at all locations).

            1. 1

              I love structured logging, with one caveat: the raw messages emitted (let’s assume JSON) are harder for me to scan when tailing directly (which I usually only do locally, as we have better log querying tools in the cloud), in contrast to a semi-structured simple key-value format. Do you all use a different format than JSON? Or a tool that transforms structured logs into something more friendly to humans, e.g. with log levels displayed in appropriate colors, or JSON syntax characters de-emphasized, for local tailing?
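
              For what it’s worth, a tiny filter script usually covers the local-tailing case; here’s a rough Python sketch (the level/msg field names are assumptions about the log schema) that turns JSON lines into colored key=value output:

```python
import json

COLORS = {"error": "\033[31m", "warn": "\033[33m", "info": "\033[32m"}
RESET = "\033[0m"

def humanize(line: str) -> str:
    """Turn one JSON log line into a colored, scannable one-liner."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return line.rstrip()  # pass non-JSON lines through untouched
    if not isinstance(event, dict):
        return line.rstrip()
    level = str(event.pop("level", "info"))
    msg = event.pop("msg", "")
    rest = " ".join(f"{k}={v}" for k, v in event.items())
    color = COLORS.get(level, "")
    return f"{color}{level:<5}{RESET} {msg} {rest}".rstrip()

demo = humanize('{"level":"error","msg":"open failed","file":"/tmp/x"}')
```

              You’d wrap humanize() in a loop over sys.stdin and pipe tail -f through it.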

              1. 5

                At Joyent, we used the Bunyan format. Each line in the file was a separate JSON object with standard properties, some mandatory and some optional, and freeform additional properties. We shipped a tool, bunyan, that was capable of acting as a filter that would render different human readable views of the JSON. For example, you would often run something like:

                tail -F $(svcs -L manatee) | bunyan -o short
                

                It also had some rudimentary filtering options. It also had a relatively novel mode that would, instead of reading from a file or standard input, use DTrace probes for different log levels to allow you to dynamically listen for DEBUG and TRACE events even when those were not ordinarily present in the log files. The DTrace mode could target a particular process, or even all processes on the system that emitted Bunyan logs.

                1. 1

                  Hi, what were the required fields? Was it just a unique request ID? Thanks for sharing about bunyan. Even though it’s been out for a while I was unaware of it.

                2. 5

                  Do you all use a different format than JSON? Or a tool that transforms structured logs into something more friendly to humans, e.g. with log levels displayed in appropriate colors, or JSON syntax characters de-emphasized, for local tailing?

                  We use JSON and the only tools I use are grep and jq. And although I am pretty much still a novice with these two, I found that with the power of shell piping I can do almost anything I want. Sometimes I reach for the Kibana web interface, get seriously confused and then go back to the command line to figure out how to do it there.

                  I wrote a simple tutorial for the process, just a couple of weeks ago.

                  1. 1

                    If you rely on external tools to be able to make sense of your logs, why not go all the way, gain the speed and size benefits that binary logs would bring, and write your own log pager? I feel like the systemd folks had the right idea even when everyone was making fun of them.

                    1. 3

                      I don’t think the average employer would be happy subsidizing an employee writing a log pager instead of implementing something that would bring a tangible result to the business. The potential money saved by using binary logs probably doesn’t outweigh the new subscriptions/increased profits from churning out more features.

                      1. 1

                        To me that sounds like an excuse. The world is not made up of only software that is beholden to the all mighty shareholder.

                        1. 1

                          I mean, yes, if you’re developing something in your personal time, go bananas on what you implement.

                          But I also know my manager would look at me funny and ask why I’m not just shoving everything into CloudWatch/<cloud logging service>

                      2. 2

                        I’m sure most problems with systemd journals are fixable, but they’ve left a very bad taste in my mouth for two main reasons: if stuff gets deleted from under them they apparently never recover (my services continue to say something like “journal was rotated” until I restart them), and inspecting journals is incredibly slow. I’m talking orders of magnitude slower than plain log files. This is at its worst (I often have time to make a cup of tea) when piping the output into grep or, as journalctl already does by default, less, which means every byte has to be formatted by journalctl and copied only to be skipped over by its recipient. But it’s still pretty bad (I have time to complain on IRC about the wait) when giving journalctl filters that reduce the final output down to a few thousand lines, which makes me suspect that there are other, less fundamental issues.

                        I should note that I’m using spinning disks and the logs I’m talking about are tens to hundreds of GB over a few months. I feel like that situation’s not abnormal.

                        1. 1

                          If you rely on external tools to be able to make sense of your logs, why not go all the way, gain the speed and size benefits that binary logs would bring, and write your own log pager?

                          It’s hard to imagine a case at work where I could justify writing my own log pager.
                          Here are some of the reasons I would avoid doing so:

                          • Logs are an incidental detail to the application.
                          • Logs are well understood; I can apply a logging library without issues.
                          • My application isn’t a beautiful and unique snowflake. I should use the same logging mechanisms and libraries as our other applications unless I can justify doing something different.
                          • JSON is boring, has a specification, substantial library support, tooling, etc.
                          • Specifying, documenting, and testing a custom format is a lot of work.
                          • Engineering time is limited; I try to focus my efforts on tasks that only I can complete.
                          1. 2

                            Logs are an incidental detail to the application.

                            I think this is trivially disproved by observing that if the logs stop working for your service, that is (hopefully!) a page-able event.

                            Logs are a cross-cutting concern, but as essential as any other piece of operational telemetry.

                            1. 1

                              Logs are a cross-cutting concern, but as essential as any other piece of operational telemetry.

                              I rely heavily on logging for the services I support but the applications I wrote for work have only error reporting. They are used by a small audience and problems are rare; I might get a crash report every 18 months or so.

                              1. 1

                                Ah, yeah, I presume the context here is services.

                        2. 1

                          Agreed. jq is a really nice tool. It made the decision to transition to using JSON for logging very easy.

                        3. 3

                          Don’t use JSON, use logfmt.

                          1. 1

                            Yes! Logfmt is the good stuff. But it’s only semi-structured. Why not use JSON and a tool to transform to logfmt (with nested data elided probably) when needing to scan as a human?

                            1. 1

                              Logfmt is fully structured; it just doesn’t support nesting, and that’s a feature! Structured logs should be flat.
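
                              That flatness is what keeps logfmt friendly to both machines and grep; a few lines of Python can round-trip it (a minimal sketch that ignores escaped quotes):

```python
import re

def parse_logfmt(line: str) -> dict:
    """Parse flat key=value / key="value" pairs; nothing to recurse into."""
    pairs = re.findall(r'(\w+)=(?:"([^"]*)"|(\S+))', line)
    return {key: quoted or bare for key, quoted, bare in pairs}

line = 'level=error msg="open failed" file="/tmp/x" attempts=3'
parsed = parse_logfmt(line)
# parsed == {"level": "error", "msg": "open failed",
#            "file": "/tmp/x", "attempts": "3"}
```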

                      3. 15

                        Very nice rephrasing (with clear definitions and usage boundaries) of Dave Cheney’s three-level system:

                        Debug: Things only developers care about while developing.
                        Info: Things users/administrators of your product care about.
                        Error: Things users care about.

                        These three levels may not be the best or they may be too broad, but having a small number of options, each associated with a clear meaning removes a lot of guesswork and bikeshedding.

                        Orthogonally, I’d love to see more logging frameworks adopt KDE’s approach of kDebug areas (now QLoggingCategory). It allows you to decide at runtime that, for example, you want debug-level info for a specific library, but otherwise only error-level statements for the rest of the application.
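
                        The same idea is expressible in most mature frameworks; in Python’s stdlib logging, for instance, per-logger levels give you category-scoped verbosity at runtime (sketch; the category names are invented):

```python
import logging

logging.basicConfig(format="%(name)s %(levelname)s %(message)s")

# Error-only by default for the whole application...
logging.getLogger().setLevel(logging.ERROR)
# ...but full debug output for the one category we're chasing a bug in.
net_log = logging.getLogger("mylib.network")
net_log.setLevel(logging.DEBUG)
storage_log = logging.getLogger("mylib.storage")

net_log.debug("retrying connection")   # emitted
storage_log.debug("cache miss")        # suppressed: inherits root's ERROR
```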

                        1. 9

                          This goes strongly against my experience, maybe (again) because I don’t work on server code.

                          Having only one log level for “stuff admins care about” implies that logging is free — that you should always be logging everything that might be useful for maintenance or troubleshooting. That assumption breaks down when your target system does not have 32 cores, super-fast I/O, or unlimited disk space.

                          • Logging can consume significant CPU (printf functions are surprisingly expensive) and cause noticeable slowdowns in mobile or embedded apps.
                          • Logs take up space, which may be limited, especially if log rotation is missing or insufficient. (This is exactly why Tesla is having to recall car console displays right now — the firmware logged too much crap and it caused the flash storage to wear out too early.)

                          My support team has made it very clear to us developers that they need configurable logging to respond to customer issues. When a customer is having weird problems, they need to be able to have the customer turn on detailed logging in the problem area, so support or devs can pore over them.

                          But normally you don’t want the firehose; neither do you want no logging or only error logging. Frequently a problem will first manifest in normal usage. And it might be difficult to reproduce. So you want some logs all the time — not just errors, because those tell you nothing the end user couldn’t already tell you; you want to know about anything unusual or unexpected that happened beforehand. Voilà: warnings.

                          1. 5

                            Having only one log level for “stuff admins care about” implies that logging is free — that you should always be logging everything that might be useful for maintenance or troubleshooting.

                            […]

                            My support team has made it very clear to us developers that they need configurable logging to respond to customer issues. When a customer is having weird problems, they need to be able to have the customer turn on detailed logging in the problem area, so support or devs can pore over them.

                            This is why debugging zones are more important than debug levels and are, in fact, fundamental. Also fundamental is the ability to change which zones are logged, and at what level.

                            What I would suggest is:

                            • Three levels: DEBUG, INFO, ERROR. Always “compiled in”.
                            • One zone for each subsystem (say, library, service or application module).
                            • Normally, for all zones: only ERROR statements are processed (= evaluated + sent + stored). INFO are also processed but dropped after a few minutes/hours/days. DEBUG are turned into NOPs.
                            • When debug-time comes: Increase decay time for INFO statements and start processing DEBUG statements, but only for the zones that you care about (and progressively).
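
                            The “DEBUG statements become NOPs” part hinges on not paying evaluation costs for suppressed statements. A Python sketch of the usual guard pattern (the zone name is invented):

```python
import logging

log = logging.getLogger("zone.billing")
log.setLevel(logging.ERROR)

expensive_calls = 0

def expensive_dump() -> str:
    """Stand-in for a costly state serialization."""
    global expensive_calls
    expensive_calls += 1
    return "...huge state dump..."

# Anti-pattern: the dump would be computed even though DEBUG is off:
#   log.debug("state: %s" % expensive_dump())

# Guarded + lazy %-formatting: while the zone sits at ERROR, neither the
# dump nor the string rendering ever runs, so the statement is a near-NOP.
if log.isEnabledFor(logging.DEBUG):
    log.debug("state: %s", expensive_dump())
```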

                            The aforementioned QLoggingCategory in Qt allows you to do all this, and so do other mature frameworks like Log4j.

                            You want to know about anything unusual or unexpected that happened beforehand. Voila, warnings.

                            If you stretch the definition of ERROR to mean “Anything that is unusual, not well handled and thus may lead to user-visible errors”, then those “warnings” would fit that level very well. :)

                            1. 2

                              INFO are also processed but dropped after a few minutes/hours/days.

                              What do you mean by this? They’re kept, but in a high-rotation store, or that they’re kept like error logs, but after a warmup period the app stops logging them?

                              1. 1

                                INFO are also processed but dropped after a few minutes/hours/days.

                                What do you mean by this? They’re kept, but in a high-rotation store, or that they’re kept like error logs, but after a warmup period the app stops logging them?

                                It depends on the capabilities of the logging framework you are working with, as well as other characteristics of your infrastructure (and infrastructural budget).

                                A couple of options:

                                • Send and rotate: Statements are evaluated (≈ formatted) by the client, then sent to the server, and the server rotates them quickly (simple client-side logging code, high network costs, low server storage costs).
                                • Send if requested: Statements are evaluated by the client and locally stored for a few minutes/hours, ready to be sent to the server if/when needed (complex client-side logging code, low network costs, low server storage costs).
                                • Grab when needed: Statements are evaluated by the client and locally stored for a few minutes/hours, ready to be grabbed by the server if/when needed (simple client-side logging code, requires access to the client machine, low network costs, low server storage costs).
                                • Send if frequent: Statements are evaluated by the client and locally stored in a cache. If a statement is logged more than x times per hour, it is sent to the server and stored there permanently (semi-complex client-side logging code, low network costs, lowish server storage costs).
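
                                The “send if requested” option, for example, is little more than a bounded in-memory buffer on the client side; a Python sketch (capacity and decay policy are assumptions):

```python
from collections import deque

class BufferedInfoLog:
    """Keep recent INFO lines locally; ship them only if/when requested."""

    def __init__(self, capacity: int = 1000):
        self.buffer = deque(maxlen=capacity)  # oldest entries decay automatically

    def info(self, line: str) -> None:
        self.buffer.append(line)  # cheap local append, no network traffic

    def flush_to_server(self, send) -> int:
        """Invoked when the server asks for the backlog; returns lines shipped."""
        count = len(self.buffer)
        for line in self.buffer:
            send(line)
        self.buffer.clear()
        return count

log = BufferedInfoLog(capacity=3)
for i in range(5):
    log.info(f"request {i} handled")
# Only the 3 most recent lines remain buffered; the older two have decayed.
```
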
                        2. 10

                          The solution is to use an object instance for your logger, the same as you should with a database, or other external dependencies, then use dependency injection to provide the logger to the code that generates logs.

                          I know that’s the proper way to do things, but I hate having a logger instance passed around to every single script and object. That’s what I used to do, and I went back to a global instance, which can be initialised differently for unit tests or for the app. Much simpler, and good enough for my use. It makes the code a bit less modular, since you have to make sure this global logger has been initialised somewhere, but it’s so much better than that extra dependency everywhere.

                          In fact, I wonder if an object that is passed absolutely everywhere via dependency injection doesn’t in the end count as a global object anyway, and could be made one to simplify the code.

                          1. 7

                            I don’t think there can ever be a single “right” way to do this. The key is to know the caveats. If you’re willing to accept the extra cognitive load that comes with handling a global safely, as a trade-off against the extra overhead of passing objects around, I think that’s fine.

                            I prefer the more explicit approach (especially on large projects with many developers, many of whom will NOT be careful about global state), but of course YMMV.

                            1. 2

                              The way I like to look at it is that logging is an added behavior that we can separate from business logic. With this in mind, we can decorate the pieces of biz logic with logging at the entry point of the application. This way we can properly test biz logic without knowing anything about logging, and test logging without knowing anything about the biz logic. We can even inject faulty implementations to test logging when something goes wrong. And we can also decorate the dependencies of our dependencies… so turtles all the way down. An approach like this will inevitably push your codebase to be extremely modular. Too modular, to the point it can make people uncomfortable. I see this modularity as a good thing, but not a lot of people share this point of view… In any case, this is how I usually approach it. So +1 for dependency injection :-)

                              Here’s a pseudo-code example:

                              // ---
                              
                              interface Emailer {
                                Send(EmailMessage)
                              }
                              
                              // ---
                              
                              interface Logger {
                                Trace(string)
                              }
                              
                              // ---
                              
                              class StdoutLogger implements Logger {
                                public Trace(message: string) {
                                  print(os.Stdout, message)
                                }
                              }
                              
                              // ---
                              
                              class SendGridEmailer implements Emailer {
                                constructor(client: SendGridClient) {…}
                              
                                public Send(e: EmailMessage) {
                                  client.SendEmail(client.NewEmail(…))
                                }
                              }
                              
                              // ---
                              
                              class VerboseEmailer implements Emailer {
                                constructor(logger: Logger, emailer: Emailer) {…}
                              
                                public Send(e: EmailMessage) {
                                  logger.Trace("sending email")
                                  emailer.Send(e)
                                  logger.Trace("email sent")
                                }
                              }
                              
                              // ---
                              
                              main() {
                                new VerboseEmailer(
                                  new StdoutLogger(),
                                  new SendGridEmailer(…)
                                ).Send(new EmailMessage(…))
                              }
                              
                            2. 2

                              So don’t give it to everything. That means not everything can log. Which is great! Logging should be a considered and deliberate action, like any other important thing in your application.

                              1. 1

                                Sometimes there are good reasons for having a static variable in a program. This is one of them.

                                1. 1

                                  Eh, nah. Few things are true process singletons, and loggers ain’t one of them.

                                2. 1

                                  In fact, I wonder if an object that is passed absolutely everywhere via dependency injection doesn’t in the end count as a global object anyway, and could be made one to simplify the code.

                                  Keeping dependencies explicit ensures that each component is testable in isolation from others, and establishes a clear rule — the possible effects of a component are constrained by its inputs, i.e. no side effects — which helps enormously in the understanding (and therefore maintenance) of programs.

                                3. 5

                                  Since switching to Honeycomb and emitting a single event per request per service (with every useful and interesting piece of data attached to it), I really feel like I’m going back to the dark ages whenever I have to use logging (and it’s even worse if the logs aren’t structured).

                                  Looking for things that didn’t happen in a pile of debug logs is a nightmare compared to filtering for events where a property is false.

                                  1. 1

                                    Yep - honeycomb is great (particularly if you’re using microservices - IME it takes a bit more work to tune it right for a single app).

                                  2. 3

                                    Log output should be included in your unit tests (it is one of the effects of your function, right?).

                                    I disagree with this. That seems like it will get very close to testing the implementation quickly, and generally seems too far on the painful side of the testing spectrum.

                                    (I’m sure there’s exceptions where it does make sense because you have some kind of contract with log consumers, e.g. if you have alerting tied to specific log messages, though in that particular case I’d probably prefer metrics.)

                                    The one thing I’ve found useful is selected log analysis in integration tests. Collect service logs during integration tests, and have a post-test check that has some kind of modifiable whitelist/blacklist. I.e., by default an error level log causes a test failure, but then you might allow certain specific errors. It can also be good trade-off to wait for certain info log lines to appear when trying to set up the correct pre-conditions.
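
                                    A minimal Python sketch of that post-test check (the allowlist entries are hypothetical):

```python
import re

# Error lines we explicitly tolerate (entries here are hypothetical examples).
ALLOWED_ERRORS = [
    re.compile(r"connection reset by peer"),  # known-flaky upstream, ticketed
]

def unexpected_errors(lines):
    """Return error-level lines not covered by the allowlist."""
    failures = []
    for line in lines:
        if "ERROR" not in line.split():
            continue
        if any(pattern.search(line) for pattern in ALLOWED_ERRORS):
            continue
        failures.append(line)
    return failures

collected = [
    "INFO server started",
    "ERROR connection reset by peer",
    "ERROR failed to persist order",
]
# Only the persist failure should fail the test run.
```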

                                    1. 1

                                      This was probably poorly phrased. IMO, you should test that the expected logging occurs, when expected. Not the format of the logs (the logger implementations I use for testing inevitably log a very different format than the loggers used in production).

                                      Do you still disagree with this approach?

                                      1. 1

                                        Generally still disagree, yes. Outside particular circumstances, I don’t see logging as relevant to the purpose of the code, hence wouldn’t want to tie that part of the behaviour down by tests. Just like I wouldn’t want to test any number of other tangential side-effects. Don’t have particularly good analogies, but say I know some function will open a certain number of sockets; that’s observable behaviour, but I don’t want to ensure it doesn’t change.

                                      2. 1

                                        I agree, log output should not be included in unit tests, although like you, I can see a tiny bit of value in this. At $JOB, all our logs have a unique id associated with them [1], and they look like “BAR0025: host=db.example.com failure='Network dropped connection because of reset'”. A unit test of logging messages could check that it has a unique tag (or it hasn’t changed).

                                        [1] Why? To make it easier to search for particular log messages. They don’t have to be full UUIDs, just something that is unique enough for the organization.
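
                                        Under that scheme the unit test is mostly a uniqueness check over the declared tags; a Python sketch (tag format borrowed from the example above):

```python
import re

# In a real codebase these would be harvested from the log call sites.
LOG_TAGS = ["BAR0025", "BAR0026", "NET0001"]

TAG_PATTERN = re.compile(r"^[A-Z]{3}\d{4}$")

def validate_tags(tags):
    """Fail if any tag is malformed or reused."""
    assert len(set(tags)) == len(tags), "duplicate log tag"
    for tag in tags:
        assert TAG_PATTERN.match(tag), f"malformed log tag: {tag}"

validate_tags(LOG_TAGS)
```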

                                      3. 1

                                        @jhall you list several things NOT to use debug logs for, but what would you actually put in debug logs? I have rarely found debug logs useful.

                                        1. 3

                                          Yes. My advice to folks who are just starting out with good practices for reliability is to make metrics, not logs, and definitely not log-based metrics. I have heard that comments are like apologies to the next programmer; similarly, logs are apologies to the next operator.

                                          1. 1

                                            comments are like apologies to the next programmer

                                            I love this! Any idea where it originated?

                                            1. 1

                                              I believe it originated with @washort and came out of discussions that we had had about how comments and docstrings should work in Monte.

                                          2. 3

                                            I like them for explaining why a decision was made by business logic. Say, there’s a reasonably complicated set of rules, which results in a pass/fail decision. If you put some debug logging in each fork of the decision tree, it’ll be hidden normally, but you can change the level for that package/log zone/whatever when you’re getting a wrong answer and it just might give you enough information to fix it quickly without having to step through and see where it goes wrong.
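
                                            A Python sketch of that pattern (the eligibility rules here are invented):

```python
import logging

log = logging.getLogger("pricing.discount")

def discount_allowed(order_total: float, is_member: bool, region: str) -> bool:
    """Each branch records *why* it decided; visible only at DEBUG level."""
    if region == "embargoed":
        log.debug("denied: region %s is embargoed", region)
        return False
    if is_member:
        log.debug("allowed: member account")
        return True
    if order_total >= 100.0:
        log.debug("allowed: order total %.2f over threshold", order_total)
        return True
    log.debug("denied: non-member below threshold (total %.2f)", order_total)
    return False
```

                                            Normally the branch logs are invisible; flip the pricing.discount zone to DEBUG and a wrong answer explains itself.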

                                            1. 2

                                              I actually pretty strongly dislike debug logs. I don’t think I ever use them. In my experience, they’re mostly used by inexperienced developers who aren’t yet up to speed on writing properly modular code and proper tests.

                                              But I haven’t yet spent enough time looking into it to come out with a blanket statement that I think they’re an anti-pattern… although that is the direction I’m leaning.

                                              1. 5

                                                I think that’s very dismissive. Some people (like me) use languages that don’t have a very good debugger, and write complicated logic. What should I use to find what’s going on in my code, if not printf-debugging?

                                                1. 2

                                                  I think the very assumption that only developers need debug logs is wrong. Users may want to see the program’s decision flow and internal data when they encounter an unusual problem, or when the program is highly configurable and complex. Not all users are familiar with debuggers, and not everyone can easily build from source, so debug logs can be quite a time-saver for them. They can also attach those logs to forum posts/bug reports.

                                                  1. 1

                                                    @c-cube It’s not dismissive. I was expressly explaining my experience.

                                                    But expounding on my experience, debuggers are usually overrated. I think a good test suite can replace a debugger in the vast majority of cases. A good test suite also replaces the need for debug logs in most cases.

                                                    @peter I’ve thought about this a bit more, and there are clearly some times when debug logs are appropriate. I use debug logs in third-party services all the time (think: apache, nginx, etc).

                                                    But this is a different sense of “debug” than what I’m railing against. This is helping the user of an application debug things, not a developer of the application.

                                                    Perhaps “dev debug” logs make sense in some context, too. I’m happy to see this has sparked an enlightening discussion…

                                                    1. 2

                                                      I think a good test suite can replace a debugger in the vast majority of cases.

                                                      Arguments like this read to me like “don’t ship bugs and you won’t need a debugger,” (1) which is fine but I think it assumes a certain relationship between the debugging human and the program that isn’t actually true in the “majority of cases”, at least in my experience.

                                                      When you need the debugger you often don’t have the test suite, or it is not comprehensive enough to cover the problem, and it is often not low-friction enough to debug by writing test cases; this happens a lot (in the “vast majority of cases”) because the person dealing with the bug didn’t get to write the test suite or the program.

                                                      When I get to start from scratch “debug via tests” certainly can work quite well. But mostly I don’t.

                                                      1: I know this is a slightly unfair characterization, it is more like, use your test suite to avoid bugs, and if one is discovered write more tests to explore/describe the bug in the pursuit of fixing it.

                                                      1. 1

                                                        I mean it more in the sense of “A good test suite tells you almost exactly where a failure is occurring, so you don’t need to step through a debugger to get the same information.”

                                                        Of course all bets are off for legacy codebases with little or no test coverage…

                                                        1. 2

                                                          I mean it more in the sense of “A good test suite tells you almost exactly where a failure is occurring, so you don’t need to step through a debugger to get the same information.”

                                                          This still sounds like “the bug is prevented before shipping” rather than fixed/debugged to me.

                                                          Maybe we are talking about different bugs. If the bug only exists before publishing/is detected during development I don’t really even think of those as “bugs”, just “I’m not done yet,” and I would agree a debugger is not going to be what I reach for first.

                                                          I’m having difficulty imagining a passing test suite that didn’t catch it before release “giving enough information” after the bug was detected through other means, but maybe that’s a failure of imagination on my part.

                                                          1. 1

                                                            Thanks for your thoughts. It’s clear I need to ruminate on this some more.

                                                            I still think a good test suite helps a lot in discovering new bugs (I’ve experienced this), but clearly my thoughts on the matter (and thus my verbalization) are not complete… Maybe I’ll write about this some day.

                                                            Thanks again.

                                                  2. 2

                                                    It’s a tool, just like anything else, and it can be useful when unit tests or attaching a debugger aren’t. I just recently used some debugging statements to isolate a bug in one of our services. Why did the unit tests miss it? Because, for a variety of reasons, there are no unit tests (written in C++ years ago with no clearly defined “units” to test [1]). A well-placed debug log showing how a particular piece of data was being parsed let me easily spot a few hundred broken examples.

                                                    It also doesn’t help that our ops guys don’t fully understand syslog. Last year they freaked out when I left a syslog debug message in a service I wrote, and asked me how to disable debug-level messages. I asked them why they didn’t filter out debug-level messages with the syslog configuration. They didn’t know what I was talking about. Sigh.

                                                    [1] There are numerous functions only called from one spot with a ton of assumptions about how it’s called and they’ll all be inlined anyway by the compiler, right? Legacy code man, legacy code.