1. 33

  2. 66

    We used to log things into a single file. It was fun! [..] Do you happen to have several servers? Good luck finding any required information in all of these files and ssh connections.

    We’ve had syslog since the 80s. It doesn’t require 3 Docker containers, a database, and a management tool to manage the containers, so guess it’s not hip enough 🤷‍♂️

    1. 12

      I love the contrast between “logging infrastructure is hard” and “use Sentry” … which is … ummm… easy to start, hard to scale. In both cases, though, you can use SaaS providers and deal only with getting the data out of your app and into their systems.

      1. 1

        and deal only with getting the data out of your app and into their systems.

        That’s not as easy in the world of GDPR. You have to be able to purge data from SaaS tools when there is leakage. Unless you have virtually unworkable processes in place, developers don’t check what they are logging and end up leaking PII, which doesn’t get noticed until someone else is troubleshooting a problem.

        Clearing the PII from the SaaS provider can be challenging in many cases, and longer-term the provider also becomes a potential target for attackers. At minimum, keeping a short retention window is good. Overall, it’s just another thing to think about, but many people don’t get time to think about this topic until it’s too late.

      2. 6

        Syslog usually means your application needs to make a dgram Unix socket connection, which is kind of exotic and not available everywhere. Your application also needs a switch for it, or someone may try to run it in the foreground and wonder why everything’s quiet. Collecting logs from several servers is also finicky configuration and kind of unreliable in my experience.

        Just print to stderr and let the service manager handle it.

        1. 16

          I posted my comment late last night, and it is probably a bit more snarky and abrupt than it should have been. To expand a little:

          I don’t think syslog is perfect, or suitable for every situation, or that other solutions don’t have advantages over syslog; the gripe I have is that the situation is presented as “oh, look at the old way, how antiquated! Here’s a new super-complex solution for you to use” whereas in fact, people discovered exactly the same problems and invented solutions for it (such as syslog) before most of us were involved in computing, or even born. Turns out people doing computing in the 80s weren’t complete blubbering idiots.

          This is part of a general pattern of annoyance I’ve seen recently where “old” solutions are misrepresented as being much more simplistic than they actually are – usually out of ignorance, not malice – before declaring we should all be using this really complex “modern” solution. Sometimes that’s the case, but a lot of the time it’s not, or it’s at least more nuanced. For a lot – though obviously not all – of cases, “old” solutions like logging to files, syslog, or stderr are just fine.

          1. 2

            Sorry, mine was equally snarky. And I missed the underlying point, which I do agree with. :-)

          2. 8

            kind of exotic and not available everywhere

            What universe are we living in where making a Unix socket connection is “kind of exotic”? Oh yeah, a world where Unix won, fair and square. There’s nothing exotic or strange about Unix sockets. They’re supported on every platform. Any platform that doesn’t support them is a bad platform.

            1. 4

              There’s a lot exotic and strange about Unix sockets.

              When someone says ‘Unix socket’, they usually mean SOCK_STREAM, and that is often well supported. But syslog uses SOCK_DGRAM, which for example Node.js doesn’t support out of the box. You can also have SOCK_SEQPACKET Unix sockets, apparently, but I’ve never seen that in practice.

              But people do all sorts of arguably exotic things with Unix sockets, like passing file descriptors or using the remote UID for authentication.
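To illustrate the SOCK_DGRAM point, here is a small sketch in Python, where datagram Unix sockets are available in the standard library on Unix platforms. A connected socket pair stands in for the application and the syslog daemon; the `myapp` tag is a made-up example:

```python
import socket

# A connected pair of Unix datagram sockets stands in for app <-> syslog daemon.
app_side, daemon_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

# "<14>" is the syslog priority: facility user (1) * 8 + severity info (6).
app_side.send(b"<14>myapp: first message")
app_side.send(b"<14>myapp: second message")

# Each recv() returns exactly one datagram, so no framing is needed.
first = daemon_side.recv(4096)
second = daemon_side.recv(4096)

app_side.close()
daemon_side.close()
```

The difference from SOCK_STREAM is that message boundaries are preserved: the two sends arrive as two separate messages rather than one byte stream.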

              1. 6

                I don’t think any behaviour standardised by POSIX is “exotic”.

                1. 6

                  There’s a lot of baroque and antiquated stuff in POSIX if you read the spec. A lot of it is a tortured attempt to avoid standardizing things like tar (see pax).

                  1. 1

                    Antiquated and exotic are not the same thing.

                    1. 4

                      Sometimes it’s actually both. Ever seen POSIX AIO?

                      1. 1

                        POSIX AIO is not exotic. Nothing standardised by POSIX is exotic, almost by definition. It’s the set of standardised interfaces common to all operating systems. It’s the portable operating system standard.

                        POSIX AIO might be bad but exotic it is not.

                      2. 1

                        I guess it depends on what you mean by “exotic”, but there are certainly some parts of POSIX that almost no one uses, for example the sccs VCS, batch functionality (q* commands), messaging tools (write, talk, etc.), iso646.h (macros such as and for &&), and probably some more.

                        These parts are rarely seen, and many people don’t even know about them, thus “exotic” sounds like a pretty apt description.

                  2. 1

                    Most modern syslog daemons support TCP sockets as well. Additionally, you can always send messages via a “regular” (non-Unix) UDP socket bound to the loopback address. So I do not think that is much of a problem. It can be a little more problematic in the case of the systemd journal, but in such cases you can always fall back to just logging to stdout and binding that to the journal via StandardOutput=journal.
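The loopback UDP route can be sketched with Python's `logging.handlers.SysLogHandler`. In this sketch a plain UDP socket on an ephemeral port plays the role of the syslog daemon, which is an assumption made so the example runs without a real daemon; the logger name is arbitrary:

```python
import logging
import logging.handlers
import socket

# Stand-in for a syslog daemon: a UDP socket on loopback with an ephemeral port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
port = receiver.getsockname()[1]

handler = logging.handlers.SysLogHandler(
    address=("127.0.0.1", port), socktype=socket.SOCK_DGRAM
)
logger = logging.getLogger("loopback_demo")
logger.propagate = False
logger.addHandler(handler)
logger.warning("disk almost full")

# The handler prefixes the message with a syslog priority such as "<12>".
datagram = receiver.recv(4096)

handler.close()
receiver.close()
```

Against a real daemon the address would be whatever host and port it listens on; passing `socktype=socket.SOCK_STREAM` switches the same handler to TCP.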

                    1. 1

                      I agree. I don’t know if you already used the C API for passing file descriptors over sockets with recvmsg because it’s rather… very special. Even the best possible man page about this would be rather very confusing and hard to understand, and since it’s a lot about allocating buffers of the correct size I’m very concerned about security. It’s probably the weirdest C API I’ve ever seen.
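For comparison with the C API, Python 3.9+ wraps the same sendmsg/recvmsg ancillary-data mechanism in `socket.send_fds` and `socket.recv_fds`, which hide the buffer sizing. A minimal sketch, assuming a Unix platform:

```python
import os
import socket

# Requires Python 3.9+ on a Unix platform.
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
read_end, write_end = os.pipe()

# Pass the pipe's read end across the socket as SCM_RIGHTS ancillary data.
socket.send_fds(sender, [b"fd follows"], [read_end])
payload, fds, _flags, _addr = socket.recv_fds(receiver, 1024, 1)

# The received descriptor is a new number referring to the same pipe.
os.write(write_end, b"hello through a passed fd")
data = os.read(fds[0], 1024)

for fd in (read_end, write_end, fds[0]):
    os.close(fd)
sender.close()
receiver.close()
```

In C the caller has to size and walk the control-message buffer by hand, which is where most of the confusion and the security worry come from.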

              2. 13

                Um; so IIUC, the gist of the argument is rather: “Use Sentry instead of plain text logs”, not “Do not log”, right? As that still seems like a log-like solution, just on steroids probably. There’s some server, so it still requires pushing some stuff into it, no? In the same places where I’d normally log, I guess?

                I mean, this Sentry thing looks like an interesting thing based on a quick glance, I think I might be interested in reading more about it; though I also wonder if the added complexity over text logs might introduce more failure modes; I guess I might be interested in using both the Sentry thing, and also dumping the same info to a text log anyway?

                So, I’m not really coming away from the article particularly convinced to “not log”…

                edit: I mean, faced with a tricky bug, I need some way to try and piece together what happened… even (or especially) if there was no exception, but the behavior of the system was still different than expected…

                1. 5

                  The example of “a job for Sentry” in this article doesn’t ring true for me. Sentry is a fantastic tool for catching and addressing unexpected issues. Codebases I’ve worked on that use Sentry as a logging tool, rather than an alerting system that should be empty & quiet, tend to accumulate a lot of “steady state” Sentry alerts that everyone ignores, which gradually just turns into everyone filtering out anything from Sentry at all.

                  The example reads to me like an expected (network requests will definitely fail at some point) but unhandled code path.

                  1. 2

                    The “right way” to do this is something like the following:

                    task = ...
                    try:
                        result = task.something_that_might_fail()
                    except HttpError as exc:
                        if exc.status_code == 500:
                            statsd.inc('some_service.failure_count')  # SLA checks
                            task.set_state('generic_error')
                            raise RetryTask()
                        elif exc.status_code == 400:
                            # user misconfigured service integration
                            task.set_state('configuration_error')
                            # no logging
                        else:
                            # unhandled status code
                            raise
                    except ValidationError as exc:
                        # user input is not the right shape
                        task.set_state('validation_error', exc=exc)  # task parses out error dict

                    # implicitly raise on other stuff

                    and hook up sentry or w/e. Basically you do need to handle case-by-case stuff, and determine if stuff needs “logging” (I agree that Sentry is logging), and if not just handle stuff gracefully. That way everything in Sentry is “thing that requires attention” and stuff that doesn’t require attention doesn’t generate dev-side alerts

                2. 9

                  This could maybe use a bit of a tl;dr at the top that says “this mainly applies to web applications, if you’re writing a CLI app or PyQt GUI or a game then YMMV”. I mean I write a lot of Python that sits behind an airgap and can’t send a damn thing to Sentry.

                  Also in like a dozen years of doing this I’ve never, not once, seen anyone do logger.log(ex), since that’ll throw a TypeError. Inside an exception handler, no doubt … man, that’ll be fun to track down.
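The TypeError is easy to reproduce. In this sketch the logger name and the messages are placeholders; `logger.exception` is the standard-library call meant for use inside an exception handler:

```python
import logging

logger = logging.getLogger("typeerror_demo")

try:
    raise ValueError("boom")
except ValueError as ex:
    try:
        logger.log(ex)  # wrong: log() needs a level and a message
    except TypeError as type_error:
        misuse_error = type_error
    # Inside a handler, logger.exception() logs at ERROR with the traceback.
    logger.exception("task failed")
```

Without the inner `try`, the TypeError would replace the original ValueError as the thing that propagates, which is what makes this mistake unpleasant to track down.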

                  1. 9

                    Some developer who obviously never maintained a production system advises people to

                    • spend money on a Webservice for some other type of logging
                    • “Just don’t write bugs”
                    • Don’t care about code which fails because it is not necessary (why is this code in the code base then?)

                    Disclaimer: I’m a sysadmin maintaining tons of applications in different languages from software projects which are black boxes to me. I’m happy for every single line of log output, and it’s my responsibility to make maximum use of the log lines given to me by friendly developers.

                    1. 3

                      As a developer who does maintain a production system, there is an alternative to logging which has fully replaced it (in my usage, at least).

                      I use honeycomb, but anything built on the same basic concept would be comparable.

                      However, this only really works well when A) developers are the ones investigating outages in the systems they write, and B) you can get a new diagnostic line into production in under 15 minutes.

                      1. 1

                        You basically put in words perfectly what I wanted to say but wasn’t able to express properly - also from my perspective as a developer, who not so rarely needs to debug some tricky production issue.

                      2. 5

                        This was essentially the conclusion I came to during and after my logging project, too. Or, maybe not “do not log” outright, but at least: think very carefully about what you’re logging, minimize it as much as possible, and prefer updating metrics in most cases — especially in the path of anything that happens more than once per second.

                        1. 5

                          I have a half baked essay floating in my head that I should write down. But largely I agree, but with a different spin on things.

                          What is needed are structured events, which can then be “mined”. Not plaintext.

                          My latest iteration of that idea is https://gitlab.com/pnathan/pmetrics .
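One possible shape for such structured events, as a sketch: one JSON object per line, with named fields instead of a formatted sentence. This is not the pmetrics API; `emit_event` and the field names are hypothetical:

```python
import json
import sys
import time


def emit_event(name: str, stream=sys.stdout, **fields) -> str:
    """Write one event as a single JSON line and return that line."""
    event = {"event": name, "ts": time.time(), **fields}
    line = json.dumps(event, sort_keys=True)
    stream.write(line + "\n")
    return line


line = emit_event("order_placed", order_id=1234, amount_cents=4999, currency="EUR")
```

Because every field is addressable, the events can be filtered and grouped later without regular expressions over free text.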

                          1. 2

                            The way I see it, these are two different types of windows into the system. On the one hand, there are the plaintext logs with one entry for everything happening which a human can look at when diagnosing an issue.

                            On the other hand, there are continuous aggregated metrics which help operators in their quest for mechanical sympathy and capacity planning and whatnot. These are not individual events, but things like queue flows (both units and volume), buffer levels, error rates, latencies, and so on. These can be reported in aggregatable formats (e.g. as plain counters or histograms) and are a very scalable way to tell a lot about how your system is doing without having to extrapolate from individual events.

                            This is a good article on the subject: https://blog.colinbreck.com/observations-on-observability/
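To make "aggregatable formats" concrete, here is a small fixed-bucket histogram sketch. The class and the bucket bounds are illustrative assumptions; the point is that histograms with identical buckets can be summed across servers without shipping individual events:

```python
from bisect import bisect_left
from typing import List, Sequence


class Histogram:
    """Fixed-bucket histogram: counts observations per upper bound."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches everything above the largest bound.
        self.counts: List[int] = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value.
        self.counts[bisect_left(self.upper_bounds, value)] += 1

    def merge(self, other: "Histogram") -> "Histogram":
        """Combine two histograms with identical buckets, e.g. from two servers."""
        if self.upper_bounds != other.upper_bounds:
            raise ValueError("bucket bounds differ")
        merged = Histogram(self.upper_bounds)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged


server_a = Histogram([0.1, 0.5, 1.0])
server_b = Histogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.3, 0.3):
    server_a.observe(latency)
for latency in (0.7, 2.5):
    server_b.observe(latency)
total = server_a.merge(server_b)
```

The memory cost stays constant no matter how many requests are observed, which is what makes this style of reporting scale.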

                            1. 1

                              I’m familiar with those kinds of aggregated metrics. They have their purpose. The observations on observability article is okay - I see where the author is gesturing, but I don’t think he’s laying a good trail to go there.

                              My experience is those single-variable timeseries metrics begin to fail catastrophically as a mental model both horizontally and vertically as you look to understand the system in more nuanced fashions. Being able to compute your continuous aggregates from logically structured events - and I do not mean logs, as those are very very noisy - is where I believe we need to go.
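A rough sketch of computing an aggregate from structured events rather than maintaining a single-variable time series. The event fields and the `error_rate_by` helper are invented for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable

events = [
    {"event": "request", "route": "/checkout", "status": 200, "ms": 120},
    {"event": "request", "route": "/checkout", "status": 500, "ms": 900},
    {"event": "request", "route": "/search", "status": 200, "ms": 40},
    {"event": "request", "route": "/checkout", "status": 200, "ms": 180},
]


def error_rate_by(events: Iterable[dict], key: str) -> Dict[str, float]:
    """Group request events by any field and compute the share of 5xx responses."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for e in events:
        totals[e[key]] += 1
        if e["status"] >= 500:
            errors[e[key]] += 1
    return {k: errors[k] / totals[k] for k in totals}


rates = error_rate_by(events, "route")
```

The grouping field is chosen at query time, so the same events can later be sliced by a dimension nobody thought to pre-aggregate.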

                          2. 4

                            To me, the only alternative to adding instrumentation (i.e. logging, tracing, debug registers, test points, depending on what you work with) is making no mistakes or knowing all the emergent behaviors of your system. Keep in mind that your system can be driven in ways you never envisioned.

                            As this is impossible, I clearly prefer to have some instrumentation in the system.

                            One part of the article that does ring a bell, though, is “fighting log levels”, as it’s called. I believe that it’s basically intractable to define a useful set of levels like e.g. CRITICAL, ERROR, EXCEPTION, FATAL, WARNING etc. For one, every system and organization has a different view on what these mean and how system behavior should be assigned to these categories. Often, there is no consensus.

                            Secondly, and more importantly, I think that it’s typically impossible to assign a level at the point of logging. For example, consider a function that opens a file, reads the contents, closes the file and returns the content and an error code. Now imagine that you decide that opening a nonexistent file is an error and this should be logged in the quite severe ERROR category. Imagine now that you want to use that function to read some files if they exist - if they don’t exist, no biggie, you continue. You can implement the behavior because the error code the function returns is enough for this. However, you will now have your log sprinkled with ERRORs that are, in reality, normal and expected operation.
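One way around the file-reading example, sketched under the assumption that the low-level function can stay silent and leave severity to its callers. All function names and paths here are hypothetical:

```python
import logging
import os
import tempfile
from typing import Optional, Tuple

logger = logging.getLogger("levels_demo")


def read_file(path: str) -> Tuple[Optional[str], Optional[OSError]]:
    """Return (content, error) and log nothing: the caller knows the severity."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read(), None
    except OSError as exc:
        return None, exc


def load_required_config(path: str) -> str:
    content, err = read_file(path)
    if err is not None:
        # Here a missing file really is an error.
        logger.error("cannot read required config %s: %s", path, err)
        raise err
    return content


def load_optional_overrides(path: str) -> str:
    content, err = read_file(path)
    if err is not None:
        # Here a missing file is normal operation.
        logger.debug("no overrides at %s, using defaults", path)
        return ""
    return content


with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "overrides.conf")
    overrides = load_optional_overrides(missing)
```

The same failure is logged at ERROR by one caller and at DEBUG by the other, which the reader function could not have decided on its own.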

                            1. 3

                              So, this article advocates for trying to avoid overlogging, which makes sense.

                              But, other than analyzing logs and trying to get low-signal ones out of the system over time, is there a more principled approach to get a sense of which logs will be useful?

                              For me, it’s been a case of logging all exceptions, and then adding special cases as I discover which exceptions happen when a network endpoint is temporarily missing or other similar errors.
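The "log everything, then special-case" approach might look roughly like this. The `TRANSIENT_ERRORS` tuple and the `run_logged` wrapper are assumptions made for the sketch:

```python
import logging

logger = logging.getLogger("exceptions_demo")

# Exceptions treated as transient; this tuple grows as new cases are discovered.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def run_logged(func, *args, **kwargs):
    """Call func; log transient failures quietly and everything else loudly."""
    try:
        return func(*args, **kwargs)
    except TRANSIENT_ERRORS as exc:
        logger.warning("transient failure in %s: %r", func.__name__, exc)
        return None
    except Exception:
        logger.exception("unexpected failure in %s", func.__name__)
        raise


def flaky_fetch():
    raise ConnectionRefusedError("endpoint temporarily missing")


result = run_logged(flaky_fetch)
```

Anything not yet on the transient list still gets a full traceback and is re-raised, so unknown failures are never swallowed.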

                              Whereas for unit testing, the idea of virtualizing your dependencies on the outside world first, and using that to help build your tests, has been a helpful one.

                              1. 3

                                And eliminate all possible exceptions that can happen here with mypy

                                I’m fully on board with this in principle, but mypy already introduces so many operational issues that I have a hard time believing it can be sold to dynamic typing enthusiasts who tend to be of the opinion that type errors are too tedious. For example, it turns out that mypy has a terrible time finding packages in its environment, and the ones that it finds must be built according to a particular pep (many aren’t), and the error messages are terrible and the (purportedly) relevant docs are unhelpful. Further, there are general usability issues with mypy, such as the lack of support for callables with kwargs, recursive datatypes (e.g., linked lists or other tree/graph-like structures such as JSON, etc) and the various circular dependency issues you run into.

                                If my experience is remotely typical, I don’t think mypy is mature enough yet.

                                1. 3

                                  If you decide….

                                  • The exception will never happen or…
                                  • it doesn’t matter if it does…
                                  • Or you don’t know what to do if it does…

                                  …and you just catch and drop the exception on the floor without so much as logging…..

                                  ….I will be very very rude to you.

                                  The hours of my life I have wasted on truly obscure bugs where assholes have done this is amazing.

                                  Hint: If it doesn’t matter whether an operation fails….

                                  …. it’s a hint that nobody actually cares about the operation, so why are you bothering to do it?

                                  So delete the code and make everyone happier.

                                  If someone starts screaming, then clearly somebody does care, so you better make it work XOR explain to them why it isn’t.

                                  1. 3

                                    The first question is: can we make bad states for do_something_complex() unreachable?

                                    No

                                    The second question I ask: is it important that do_something_complex() did fail?

                                    Yes

                                    The third question is: can we instead apply business monitoring to make sure our app works?

                                    No


                                    Do log

                                    1. 2

                                      Well I’m convinced.

                                      I’m sure I’ll run into edge-cases where I’d need logging but you’re right in saying logs without context are a mess. I’ve been writing TypeScript for the past few months and it’s radically transformed my code quality.

                                      Sentry sounds great, I’m just queasy about having an active third-party in my codebase.

                                      1. 3

                                        Sentry can be self-hosted and is open source, if that helps. I haven’t used it myself though, so I’m not vouching for it in any way. I’ve just looked at the marketing page in the past, and it looks promising.

                                        1. 1

                                          Oh wow, I had no idea! HMM

                                      2. 2

                                        This doesn’t sound like much of a good idea to me. In my experience, general text-based logging in reasonable places combined with a quality system for aggregating and searching those logs makes for a wonderfully flexible tool for monitoring the status and performance of various processes and diagnosing weird, unexpected errors, particularly those related to services interacting in unexpected ways. I am highly skeptical that it’s a better practice overall to replace that with a number of more specialized services monitoring things that you expect to need monitoring. Such systems may behave well when your systems behave exactly the way you intended and there’s nothing unexpected to figure out. They tend to fall flat when you’re trying to figure out what the heck is going on when your system is doing something that doesn’t seem to make any sense.

                                        1. 1

                                          I somewhat disagree with this article - logging rocks!

                                          Here at oxydata.io we run many complex crawlers and related dynamic programs (like parsers, dumpers etc). The easiest and most efficient way to deal with it was simple logs with Slack alerts. “Crawler X collected 50 warnings in the last 10 minutes: kube pod AAABBBCC @maintainer” turned out to be the easiest and most efficient approach here, with almost zero setup - if you have some monitoring system it can probably do it for you already.
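A sketch of how such a warning-count alert could be wired up with a custom `logging.Handler`. The class name, the threshold, and the `alert` callback are assumptions; a real setup would post to Slack where this example appends to a list:

```python
import logging
import time
from collections import deque
from typing import Callable, Deque


class WarningRateHandler(logging.Handler):
    """Call alert() once a logger emits `threshold` warnings within `window` seconds."""

    def __init__(self, threshold: int, window: float, alert: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        super().__init__(level=logging.WARNING)
        self.threshold = threshold
        self.window = window
        self.alert = alert
        self.clock = clock
        self.timestamps: Deque[float] = deque()

    def emit(self, record: logging.LogRecord) -> None:
        now = self.clock()
        self.timestamps.append(now)
        # Drop warnings that have fallen out of the sliding window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.threshold:
            self.alert(f"{record.name} collected {len(self.timestamps)} warnings "
                       f"in the last {self.window:.0f} seconds")
            self.timestamps.clear()  # start counting again after alerting


alerts = []
crawler_log = logging.getLogger("crawler_x")
crawler_log.propagate = False
crawler_log.addHandler(WarningRateHandler(threshold=3, window=600, alert=alerts.append))
for _ in range(3):
    crawler_log.warning("page returned unexpected markup")
```

The crawler code itself only calls `warning()`; the counting and alerting live entirely in the handler.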

                                          What I do agree with is my interpretation of #2 - logging code is ugly and gets in the way. I’ve been trying all sorts of approaches to minimizing logging code in the actual program with Python decorators, abstractions and all sorts of hacks, and yet everything is still ugly.
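For reference, the decorator approach usually looks something like the sketch below. The `logged` decorator and `parse_price` function are made-up examples, and this only covers entry, exit, and failure; anything that needs to log mid-function still has to do so inline:

```python
import functools
import logging

logger = logging.getLogger("decorator_demo")


def logged(func):
    """Log entry, exit, and failure of func so its body stays free of log calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("calling %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("%s failed", func.__name__)
            raise
        logger.debug("%s returned %r", func.__name__, result)
        return result
    return wrapper


@logged
def parse_price(text: str) -> float:
    return float(text.strip().lstrip("$"))


price = parse_price(" $19.99 ")
```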