This advice feels a bit superficial, and in some places contrary to my own experience, especially this one:
“Even though logging too much is sometimes a problem, it is far more common to not log enough.”
Over-logging is in my experience far more pernicious, and can easily make the log pipeline the bottleneck of a production system. A better way to get actionable advice about logging is probably to ask the engineers responsible for building and maintaining those pipelines at large-enough organizations. I think that advice looks more like:
1. Log as little as possible — prefer metrics for performance data
2. Don’t go level-crazy — debug (normally off), info, and error are probably enough
3. For request processing systems, log one wide event per request, and include a correlation ID (a sketch follows below)
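To make that third point concrete, here is a minimal sketch, not from the article, assuming Go’s log/slog; the /checkout route, handleCheckout, and the field names are hypothetical placeholders:

```go
// One wide structured event per request, carrying a correlation ID.
package main

import (
	"log/slog"
	"net/http"
	"os"
	"strconv"
	"time"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	http.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// Reuse an upstream correlation ID if one is present, otherwise mint one.
		corrID := r.Header.Get("X-Correlation-ID")
		if corrID == "" {
			corrID = strconv.FormatInt(time.Now().UnixNano(), 36) // stand-in for a real ID generator
		}

		status := handleCheckout(w, r) // hypothetical business logic

		// The single wide event for this request, emitted once at the end.
		logger.Info("request",
			"correlation_id", corrID,
			"method", r.Method,
			"path", r.URL.Path,
			"status", status,
			"duration_ms", time.Since(start).Milliseconds(),
		)
	})

	http.ListenAndServe(":8080", nil)
}

func handleCheckout(w http.ResponseWriter, _ *http.Request) int {
	w.WriteHeader(http.StatusOK)
	return http.StatusOK
}
```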
Another kind of meta point is that logging is often as critical to the operation of a service as the actual business logic, and so deserves to be tested in the same ways, which I almost never see done.
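As a rough illustration of what that could look like (again assuming Go’s log/slog; logCheckout and the field names are hypothetical), a test can capture the log output in a buffer and assert on it like any other behaviour:

```go
package checkout

import (
	"bytes"
	"encoding/json"
	"log/slog"
	"testing"
)

func logCheckout(logger *slog.Logger, corrID string, status int) {
	logger.Info("request", "correlation_id", corrID, "status", status)
}

func TestCheckoutLogsCorrelationID(t *testing.T) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))

	logCheckout(logger, "abc-123", 200)

	var event map[string]any
	if err := json.Unmarshal(buf.Bytes(), &event); err != nil {
		t.Fatalf("log line is not valid JSON: %v", err)
	}
	if event["correlation_id"] != "abc-123" {
		t.Errorf("wide event is missing correlation_id: %v", event)
	}
}
```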
Any article with advice not restricted to a specific area is going to be, in some ways, superficial, and in other ways, wrong. ;)
Systems with a continuous stream of events definitely should consider how much they log. However, there are also systems with rare events of high importance.
For example, IPsec, the bane of network admins. When a tunnel is established, there’s absolutely nothing to log. The hard part is IKE, the protocol for negotiating connection options and keys. The only points where anything can go wrong are the initial negotiation and rekeying, but at those points, there are thousands of ways it can go wrong. Worse yet, since the negotiation dialog is encrypted early on, traffic dumps are useless.
Many implementations criminally under-log it, which makes debugging a nightmare. If they offered an option to dump complete IKE packet contents, that would simplify things quite a bit.
Same with high availability setups. Failover events don’t happen often, but when they do, you need to know exactly what happened. There’s no other way you can get that information after the event.
OpenBSD’s IPsec implementation saves us all by logging the proposal it actually received from the other end, saving hours and hours of phone conversations spent negotiating the session, until you find the one tiny parameter that was left at its default and went wrong.
Some implementations dump some of the packet contents (but only the outline, not the actual content: i.e., not the useful part of the packets).
Other implementations, indeed, only vomit “computer says no” at you.
logfmt looks good! I am considering adding some logfmt() calls now… I wonder when NDJSON would become necessary, unless I need to dump the program’s data structures once in a while…
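For what it’s worth, a tiny sketch of that trade-off (assuming Go’s log/slog rather than the commenter’s own code): the same structured call can come out as logfmt-style key=value text or as NDJSON just by swapping the handler:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	logfmtish := slog.New(slog.NewTextHandler(os.Stdout, nil))
	ndjson := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// logfmt-style: time=... level=INFO msg=parsed records=42 source=input.csv
	logfmtish.Info("parsed", "records", 42, "source", "input.csv")

	// NDJSON: {"time":"...","level":"INFO","msg":"parsed","records":42,"source":"input.csv"}
	ndjson.Info("parsed", "records", 42, "source", "input.csv")
}
```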
I am only using a DEBUG=1 environment variable, in line with #1, “Log as little as possible” (in volume).
But I would also say “log as much as possible” in another sense: every single intermediate error encountered is precious information that helps with debugging the program, the configuration, and the data. The log volume would nevertheless stay low, as a production system will typically encounter few errors.
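A minimal sketch of that combination (assuming Go’s log/slog; only the DEBUG variable name comes from the comment above): quiet by default, every intermediate error still logged, debug output switched on via the environment:

```go
package main

import (
	"log/slog"
	"os"
)

func newLogger() *slog.Logger {
	level := slog.LevelInfo
	if os.Getenv("DEBUG") == "1" {
		level = slog.LevelDebug
	}
	return slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: level}))
}

func main() {
	log := newLogger()
	log.Debug("raw record", "line", 17)                       // only visible with DEBUG=1
	log.Error("bad value, skipping record", "field", "date")  // always logged, even when non-fatal
}
```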
I like the idea of treating logs as first-class citizens that must themselves be debugged. One might imagine using assert() instead of log(), and when assert() is too much (an error in the data that should not interrupt the program), issuing an error log message instead. But if nothing reads and detects these messages, the programmer’s intent to signal “something is wrong” is lost! Corrupted data went in, a suspicious event came out…
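One possible shape for that assert-or-log idea, as a hedged sketch (the Check helper and strict flag are hypothetical names, not from any real library): in strict dev/test mode a failed check crashes loudly, in production it degrades to an error log so the signal still lands somewhere that is read.

```go
package sanity

import "log/slog"

var strict = false // e.g. flipped to true in tests or dev builds

// Check behaves like an assert in strict mode and like an error log otherwise.
func Check(ok bool, msg string, args ...any) {
	if ok {
		return
	}
	if strict {
		panic(msg)
	}
	slog.Error(msg, args...)
}
```

A caller might then write something like Check(rec.Date != "", "record missing date", "id", rec.ID); the point is that the failed check is either impossible to ignore or at least ends up in a channel somebody actually monitors.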
I think that the warning level is useful as well. From the standard seven syslog levels, I think only the separation between info and notice is something weird. Otherwise I see the reason for sometimes having levels higher than error (like Knightmare).
Most places where there is too much logging don’t have any metrics at all. I’d rather take too many logs and no metrics over too few logs and no metrics.
This sounds good, until your log pipeline becomes the bottleneck of your production environment.
Well, there is an even better approach: structured logging.
How does structured logging address the issues brought up by the article? I can see it helping with grepable messages, but it’s not clear how using structured logging in and of itself addresses the other concerns.
It makes “Add dynamic information” and “grep-able messages” pretty straightforward, as it is “natural” to add dynamic information, and with structured data grepping is much easier. It also solves “No screaming”: as with the previous points, you rarely need to scream at all.
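As a small illustration of the “natural to add dynamic information” point (a sketch assuming Go’s log/slog, not anything from the article): request-scoped fields are attached once, every later line carries them, and that is exactly what makes grepping or jq-filtering easy:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	base := slog.New(slog.NewTextHandler(os.Stdout, nil))

	// Dynamic information is added once, not re-formatted into every message.
	reqLog := base.With("correlation_id", "abc-123", "order_id", 9001)

	reqLog.Info("payment authorized", "amount_cents", 4200)
	reqLog.Info("order shipped", "carrier", "DHL")
	// Both lines come out as key=value pairs including order_id=9001,
	// so `grep 'order_id=9001'` finds the whole request in one go.
}
```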