1. 21

I was curious how my fellow 🦞s manage the logs generated by their services, be it at home or at work.

It looks like “Observability” is the cool new trend nowadays, and it focuses heavily on structured logging and dynamic sampling, with many tools built to aid that. So I’m wondering what tools my fellow crustaceans use, whether they’re built in-house or a vendor product, and how that has been going.

I was reading a blog post on Observability and found a few interesting quotes:

When a request enters a service, we initialize an empty Honeycomb blob and pre-populate it with everything we know or can infer about the request: parameters passed in, environment, language internals, container/host stats, etc. While the request is executing, you can stuff in any further information that might be valuable: user ID, shopping cart ID — anything that might help you find and identify this request in the future, stuff it all in the Honeycomb blob. When the request is ready to exit or error, we ship it off to Honeycomb in one arbitrarily wide structured event, typically 300-400 dimensions per event for a mature instrumented service.

and

Dynamic sampling isn’t the dumb sampling that most people think of when they hear the term. It means consciously retaining ALL the events that we know to be high-signal (e.g. 50x errors), sampling heavily the common and frequent events we know to be low-signal (e.g. health checks to /healthz), sampling moderately for everything in between (e.g. keep lots of HTTP 200 requests to /payment or /admin, few of HTTP 200 requests to /status or /). Rich dynamic sampling is what lets you have your cake and eat it too.
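
To make those two quotes concrete, here’s a minimal sketch in plain Python of what a service might do. This is not Honeycomb’s SDK; the field names, routes, and sample rates are all invented for illustration:

    import random
    import time

    def handle_request(path, params, user_id):
        # One "arbitrarily wide" event per request: start with what we know up front,
        # stuff in anything useful discovered along the way, ship it once at the end.
        event = {
            "timestamp": time.time(),
            "service": "shop",          # hypothetical service name
            "path": path,
            "params": params,
            "user_id": user_id,
        }
        start = time.monotonic()
        try:
            # ... real handler logic would run here and could raise ...
            event["cart_id"] = "abc123"   # hypothetical value found while executing
            event["status"] = 200
        except Exception as exc:
            event["status"] = 500
            event["error"] = repr(exc)
        finally:
            event["duration_ms"] = (time.monotonic() - start) * 1000
            maybe_send(event)

    def sample_rate(event):
        """Dynamic sampling: 1 means keep everything, N means keep roughly 1 in N."""
        if event["status"] >= 500:
            return 1        # high-signal: keep ALL server errors
        if event["path"] == "/healthz":
            return 1000     # low-signal: health checks barely matter
        if event["path"] in ("/payment", "/admin"):
            return 5        # keep lots of interesting 200s
        return 100          # everything in between

    def maybe_send(event):
        rate = sample_rate(event)
        if random.randrange(rate) == 0:
            event["sample_rate"] = rate   # stored so later queries can re-weight
            print(event)                  # stand-in for shipping to the event store

    handle_request("/payment", {"item": "book"}, user_id="u42")

With /payment sampled at 1-in-5, this example prints its event roughly one run in five.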

  2. 5

    Last job: kibana - hated it. For some reason I’m incompatible with the syntax. I often tried to just narrow it down to minutes/hours and then still used grep and jq on the structured logging json files.

    Current job: Nothing except ssh and grep - not ideal, but it still beats kibana for a low number of hosts. Also not my or even our choice; customer’s boxes.

    TLDR: No idea what’s a good solution, really. I’d probably give graylog a spin and then maybe write my own thing. Only half-joking.

    But to elaborate on what I see as your actual question: Logging isn’t enough. Honeycomb sounds like a mashup of logging and metrics. I liked Prometheus when I used it, but there was no good way to mix it in with the logs.

    Also, the “typically 300-400 dimensions per event for a mature instrumented service” is not to be taken lightly. Seeing who posted the linked blog post, and that I seem to disagree with most of their angry tweets… this probably is a solution for a specific type of problem that I usually don’t find myself having.

    1. 4

      Honeycomb sounds like a mashup of logging and metrics.

      As a (very happy) Honeycomb user (on a simple rails monolith): It’s really not.

      I’ve used Splunk and NewRelic extensively; they are not the same kind of thing. The best summary of the difference I can offer is this:

      • Splunk and NewRelic are designed to answer a large-but-finite set of known questions.
      • Honeycomb is designed to answer new questions.

      IME, the ‘fixed set of questions’ approach will get you a pretty long way, but it often leads to outages getting misdiagnosed (extending time-to-fix) because you have to build a theory based on the answers you’re able to get.

      • Honeycomb lets you formulate a question to confirm your theory during the outage
      • Splunk lets you add a new indexing strategy so you could answer that question during the next outage, and
      • NewRelic tells you to be happy with the answers you’ve already got.

      Also, from a cost perspective the orders of magnitude don’t really line up. If you already pay for Splunk, the cost of Honeycomb is a rounding error.

      1. 1

        If I had written “a mashup of logging and metrics and more”, would you have agreed with that more? I could only guess, as I have never tried it.

        I haven’t used splunk - for me it’s “that process that runs on the managed machines we deploy to and that sometimes hogs CPU without providing any benefit because we don’t have access to the output” :P And no, don’t ask about this weird setup.

        1. 3

          Still not quite. “Nested span” really is a separate type of thing.

          The failure of logs is that they are too large to store cheaply or query efficiently.

          You can extract metrics from logs (solving the ‘too large’ issue), but then you can only answer questions your metrics suit. If you have new questions you have to modify your metrics gathering and wait for data to come in, which is not an option during an outage.
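
          To illustrate the “you can only answer questions your metrics suit” point, here’s a toy sketch (plain Python, hypothetical log lines) of reducing structured logs to a metric:

              import json
              from collections import Counter

              # A few hypothetical structured log lines.
              log_lines = [
                  '{"path": "/payment", "status": 200, "user_id": "u1", "duration_ms": 35}',
                  '{"path": "/payment", "status": 500, "user_id": "u2", "duration_ms": 900}',
                  '{"path": "/healthz", "status": 200, "user_id": null, "duration_ms": 1}',
              ]

              # Reduce the logs to one metric: request count per (path, status).
              requests_total = Counter(
                  (rec["path"], rec["status"]) for rec in map(json.loads, log_lines)
              )
              print(requests_total)
              # Dimensions not baked into the metric (user_id, duration, ...) are gone,
              # so a new question like "error rate per user" means changing this code
              # and waiting for fresh data to accumulate.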

          Nested spans are (AFAIK) the most space-efficient and query-efficient structure to store the data you would usually put into logs, but they require you to modify your application to provide them (which is a nontrivial cost). In practice, nested spans typically also combine logs from multiple sources (eg load balancer, app server and postgres logs can all go into the same nested span).

          The core observations IMO:

          • It is possible (with clever tricks) to pack all useful info from a log line into a key/value map without increasing the storage required. This is the ‘arbitrarily wide’ term.
          • Most events are part of one parent event (eg: one DB query is part of responding to one http request). This is the ‘nested’ term.
          • All events which can be logged have a start and end time, even if the duration is nearly 0. This is the ‘span’ term.

          Taken together, this gives you the term ‘arbitrarily-wide nested spans’. Several optimizations are possible from there - eg the start/end times of a child span tend to be very small offsets from the parent event, whereas storing them as logs requires a full-precision timestamp on every line.
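
          As a rough sketch of what such a structure could look like (the field names here are invented, not Honeycomb’s wire format):

              from dataclasses import dataclass, field

              @dataclass
              class Span:
                  name: str
                  start_offset_ms: float   # offset from the parent span's start, not an absolute timestamp
                  duration_ms: float
                  fields: dict = field(default_factory=dict)    # the arbitrarily wide key/value map
                  children: list = field(default_factory=list)  # nesting: child events live inside the parent

              # One HTTP request as the top-level span (in a real format the root would
              # also carry one full-precision timestamp); the DB query nests inside it.
              root = Span("GET /payment", 0.0, 183.0, fields={"status": 200, "user_id": "u42"})
              root.children.append(Span("postgres: SELECT carts", 12.5, 40.2, fields={"rows": 3}))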

          When your data is stored in this sort of nested-span structure, it’s both smaller than the original logs (thanks to removing redundancy in related lines) and more complete (because it can join logs from multiple sources).

          You can further shrink your data by sampling each top-level span and storing the sample rate against the ones you keep. For instance, my production setup only stores 0.5% of 200 OK requests that completed in under 400ms. Because the events which are kept store the sample rate, all the derived metrics/graphs can weight that event 200x higher. This means that I have more storage available for detailed info about slow/failed requests.
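
          The re-weighting is simple arithmetic; a sketch with made-up events:

              # Each kept event carries the rate it was sampled at:
              kept_events = [
                  {"status": 200, "sample_rate": 200},  # 1 in 200 of the fast 200s was stored
                  {"status": 500, "sample_rate": 1},    # every error was stored
                  {"status": 500, "sample_rate": 1},
              ]

              # Derived counts weight each stored event by its sample rate, so the 0.5%
              # of fast 200s that were kept still count as 200 requests each.
              estimated_total = sum(e["sample_rate"] for e in kept_events)
              estimated_errors = sum(e["sample_rate"] for e in kept_events if e["status"] >= 500)
              print(estimated_total, estimated_errors)  # 202 2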

          This is all sensible stuff you could build yourself, but Honeycomb already got it right and they’re dirt cheap (their free plan is ample for my work, which is a pretty damn big site). The annual cost for their paid plan is less than a day of my time costs.

          I haven’t used splunk - for me it’s “that process that runs on the managed machines we deploy to and that sometimes hogs CPU without providing any benefit because we don’t have access to the output” :P And no, don’t ask about this weird setup.

          This is par for the course at most places dysfunctional enough to pay for splunk (at a past employer it cost more than a small dev team, and it took me months to get access to it).

        2. 1

          Could you give some concrete examples of the difference between Splunk and Honeycomb? Specifically, what do you mean exactly by “known questions” vs “new questions”?

          1. 1

            When you configure a splunk install, you set up indexing policies which determine which queries can be efficiently answered.

            Most of the time, when I wanted to find something out in splunk, I wrote a query and got a result in a second or two. These are ‘known questions’, and splunk uses its index to come up with answers quickly. For instance, if I wanted to see a breakdown of HTTP status code by request route per hour, our splunk install had zero trouble answering instantly.

            If you want to answer a question which isn’t handled by those indexing policies (a ‘new question’), Splunk has to read the log files in full to answer it. This results in splunk taking a minute or two (dependent on your hardware and load) to give you a results page.

            For instance, I recently wanted to compare browser share stats between google analytics (blocked by trackers) and honeycomb (records bots as well as real users). So, I did a breakdown by browser + version + IP address, then manually excluded IP addresses with abnormally high request rates and repeated with just browser+version. Took me a minute or so to put the final query together. If I’d tried that with splunk, I’d have had a visit from the operations team asking me how long I planned on generating that much load on the splunk server.
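
            In code terms that “new question” is just an ad-hoc group-by over raw events; roughly like this sketch against a hypothetical event list (not Honeycomb’s actual query interface):

                from collections import Counter

                # Hypothetical raw request events (these would come from the event store).
                events = (
                    [{"browser": "Firefox", "version": "70", "ip": "10.0.0.1"}] * 40
                    + [{"browser": "Chrome", "version": "78", "ip": "10.0.0.2"}] * 60
                    + [{"browser": "Chrome", "version": "78", "ip": "10.0.0.99"}] * 500  # a bot, say
                )

                # Breakdown by browser + version + IP to spot abnormally busy addresses...
                per_ip = Counter((e["browser"], e["version"], e["ip"]) for e in events)
                noisy_ips = {ip for (_, _, ip), count in per_ip.items() if count > 200}  # arbitrary cut-off

                # ...then repeat the breakdown by browser + version, excluding those IPs.
                share = Counter((e["browser"], e["version"]) for e in events if e["ip"] not in noisy_ips)
                print(share.most_common())  # [(('Chrome', '78'), 60), (('Firefox', '70'), 40)]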

        3. 1

          Observability - and the service Honeycomb provides - is about being able to observe or introspect your system, to ask unknown questions without having to recompile and redeploy.

          Metrics and logs can be a part of observability, but metrics, for example, usually only give you a hint of what’s going on in a given sample period and are based on known questions. Similarly for logs: you can add as many log lines as you want, but they might not help you answer specific questions at runtime - and you have the overhead of having to process and store all those logs. Inevitably people downsample logs and metrics for long-term storage, which means you lose data.

          If you have an observable system you can, at runtime, ask questions about what the system is doing at a given point. This is generally done with some kind of distributed tracing system. In tracing, while processing a request, each component of a system (including external dependencies) starts a span connected to a trace. The sum of all spans and the metadata attached to them lets you see more detail about what’s going on, such as durations for a certain type of request. This would be too expensive in a timeseries system.
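
          A bare-bones illustration of the span idea (not a real tracing library; real systems also propagate the trace and span IDs across service boundaries):

              import time
              import uuid
              from contextlib import contextmanager

              trace_id = uuid.uuid4().hex   # one trace per request
              spans = []                    # collected spans for this trace

              @contextmanager
              def span(name, parent_id=None, **metadata):
                  span_id = uuid.uuid4().hex
                  start = time.monotonic()
                  try:
                      yield span_id
                  finally:
                      spans.append({
                          "trace_id": trace_id, "span_id": span_id, "parent_id": parent_id,
                          "name": name, "duration_ms": (time.monotonic() - start) * 1000,
                          **metadata,
                      })

              # Each component of the request wraps its work in a span:
              with span("GET /checkout", http_status=200) as request_span:
                  with span("postgres.query", parent_id=request_span, rows=3):
                      time.sleep(0.01)   # stand-in for the actual database call

              print(spans)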

          Going back to the original question, ELK still works fine. For small setups like a home environment it’s relatively straightforward. Structured logs (e.g. JSON formatted) do make ingestion into Elasticsearch easier; they’re not required, but they are recommended.

          1. 1

            I have the nagging feeling you’re disagreeing with me as if I had tried to define observability wrongly, but imho I’m not even trying to.

            If it’s about the “mashup” sentence, that was simply my impression from the linked article, as I don’t know Honeycomb.

        4. 4

          It looks like “Observability” is the cool new trend nowadays

          It’s not clear from this post what your goals are and what your current setup is, because you’re conflating observability and centralized logging, but those are independent topics. Related, yes, but independent.

          What is it you wish to observe, and to what ends?

          Observability is a somewhat vague term, but I assume that you mean it by the common definition of observing the current (and past) states of a system for the purposes of operating that system safely. That is: observability generally means operational observability and not business observability (or legal/financial/compliance observability). Operational observability generally entails some sort of alerting system.

          Most people would not put auditing for legal or compliance reasons under the umbrella of observability, even though auditing is literally the process of observing what has happened. Centralized logging is essential for dealing with issues of legal compliance, and it’s fantastic for doing so. If you are trying to solve some sort of legal or compliance problem the rest of my post will not help you. If you are, however, concerned with operating live systems, the rest will apply.

          I’ve run ELK in the past and don’t bother with it now; I just use dsh to ssh into my nodes, tail their logs directly, and pipe them into grep. Using journalctl -f is the same thing; sometimes I do that too. I log to file, logrotate my logs, and throw them out after a week, because I don’t have a legal requirement to keep them and I’m never going to look at them. I don’t personally believe in the efficacy of logs for the purposes of operational observability.

          To clarify where I’m coming from and my experience with the ELK stack: I spent a few months working on ELK infrastructure full time at Etsy; it was my responsibility to productionize the log shipping from every node in the fleet (Elastic’s “beats” did not exist yet; the predecessor to filebeat was lumberjack, and it was not production-ready). Not only did I find a large number of problems with how the ELK stack shipped logs, I also found a large number of problems with how Splunk ships logs (do you know what happens when one process continues to write log lines after another process has deleted the file handle?). We were able to scale this stack, and when I was there we were shipping many billions of log lines every day. I would guess that less than one one-hundredth of one percent of all of that data was ever read in some way. My conclusion is that centralized logging is primarily a concept that is pushed by people who are trying to sell things. If what you really want is some sort of graph or chart or alerting system, you probably don’t want to get there by parsing logs.

          My conclusion after all this time was that centralized logging is ill-suited to operational observability inasmuch as centralized logging is concerned with completeness; the metric of success for centralized logging systems is to not lose log data. But observability in the general ops parlance is not concerned with the completeness of records so much as it is concerned with the timeliness of signals. Alerting systems are latency-bounded systems. Case in point: downsampling is the process of throwing away data to make your pipeline more efficient in terms of something (disk space, network utilization, CPU utilization, or end-to-end latency, take your pick).

          Centralized logging systems nearly always involve queuing systems or backpressure; they’re nearly always designed around the idea that you want to get all of the data eventually, not a current view of your system right now. The time-to-alert for your alerts driven by centralized logging is in every case going to be the end-to-end latency of the entire system. Especially if your alert is defined as “some system is in some state for some amount of time”, such as “for 3 minutes this endpoint is not responding”, which is how a lot of production alerts are driven in order to avoid false positives. When some portion of your centralized logging infrastructure slows down, so does your logging pipeline and so does your alerting system. If your logging system is delayed by five minutes and you want to alert on something being in some state for five minutes you’re looking at ten minutes from the time an event begins to the time an alert is fired. Is that ok?

          I strongly prefer, for the purposes of operational observability, to use a time series database and to focus on metrics instead of logs. I use InfluxDB and Grafana for this purpose and my experience has been positive. I install Telegraf on every node and have my local processes talk to telegraf, which performs aggregations at the source and sends them to an influx database, which I view through grafana and drive alerts through pagerduty. This is working well for me.
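
          For what it’s worth, the “local processes talk to telegraf” part can be as simple as writing InfluxDB line protocol to a local socket. A sketch, assuming Telegraf is configured with a socket_listener input on UDP port 8094 (the measurement and tag names here are made up):

              import socket
              import time

              # InfluxDB line protocol: measurement,tag=value field=value timestamp_in_ns
              line = "app_requests,service=checkout,status=200 duration_ms=12.3 %d" % time.time_ns()

              # Fire-and-forget to the local Telegraf agent, which aggregates and
              # forwards to InfluxDB on its own schedule.
              sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
              sock.sendto(line.encode(), ("127.0.0.1", 8094))
              sock.close()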

          If we want to be exceptionally pedantic: technically any append-only, ordered stream of events is a log of some form, and technically writing to a time series database is itself a form of logging. I assume what you mean is: “I have textual application logs, I want to know how my system is doing both now and historically”, and my post is written through that lens.

          1. 2

            After reading several people talk about their lukewarm to negative experiences with Splunk, I wanted to chime in. A different team manages the Splunk infrastructure at my job, so I can’t speak to that beyond this: the Splunk search head and indexer run atop OpenStack on an R630; they are not the only VMs running. For the most part, I just want to speak to my experience making heavy use of it as an analyst.

            We have several indexes, the smallest of which has hundreds of thousands of events, and the largest of which has hundreds of millions of events. If memory serves, at least one has over a billion events. Most contain network data, but some have system events or web server logs. Only the most complex queries, on the largest data sets, ever take significant amounts of time. Most queries begin returning results in a few seconds. If administered according to best practices, and given enough resources, Splunk does a phenomenal job of allowing analysts to carve through and draw conclusions from massive data sets.

            Based on the experiences some of you described, my guess is that your organizations are not following best practices, are not giving Splunk enough resources, or both. Is Splunk administration expertise hard to come by? Is an R630 really that hard to justify? I’m curious to hear back.

            1. 2

              Last job: Amazon’s Kinesis pipelines and Spark Streaming.

              1. 2

                Datadog Log Collection. It’s painless to set up and can do fancy things such as dumping older logs into S3 for long-term storage (which is a requirement in the industry I work in). Honestly, after using Kibana and Splunk in previous jobs: Splunk might have more powerful features, but it’s stupid expensive and hard to maintain. Datadog Logs hits the right balance of cheap, easy, and usable.

                1. 2

                  we generate a lot of logs. we were using sumologic but a few events with runaway ingest and subsequent plan extortion led me to investigate options. we’d already moved to prometheus and grafana for monitoring and alerting, so using loki for log aggregation and search seemed feasible. loki is still only semi-stable but they’re iterating on it rapidly, so we were able to shut off sumo and transition entirely to loki. for now at least. so far it’s been good.

                  1. 2

                    I’ve been using the logging that we added to our Datadog account. Budget restrictions limited the logs we’re ingesting (service, OS, and application logs plus CloudWatch). That restriction also leads to a short retention time. It works fine, and it’s nice to have one place where logging, APM, metrics, and events are captured. I just wish we had the budget to log everything and have a slightly longer retention period. But as mentioned, you can re-ingest logs that have been archived to S3.

                    Before Datadog we used Sumo Logic, which worked fine; the only reason we switched was budget. At smaller places I’ve used both Papertrail and Loggly (before Solarwinds bought either of them). Splunk was the solution I was happiest with, but you need the budget for it.

                    I’ve come to the opinion that if you only need basic logging for troubleshooting and the size of your environment is under 100 things, a managed service like Sumo Logic, Papertrail, Loggly, or Datadog will work fine. But it all depends on the volume of logs being created, which of them need indexing, and how long you need to retain them. Anything larger benefits from something like Splunk.

                    Also, for things you may need to log for security or compliance (CCPA/GDPR), you may want to use something else. You might consider storing them long term in something like S3 with Object Lock and lifecycle management set up. That gives you the ability to reduce the chance the logs are changed, and to delete them once you’re no longer required to keep them.

                    Making the log level something that can be changed at runtime, so you can increase logging as needed when troubleshooting, is another useful approach.
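
                    One way to do that with Python’s stdlib logging, as a sketch (the SIGUSR1 choice is arbitrary and Unix-only):

                        import logging
                        import signal

                        logging.basicConfig(level=logging.INFO)
                        logger = logging.getLogger("app")

                        def toggle_debug(signum, frame):
                            # Flip between INFO and DEBUG at runtime, no restart needed.
                            verbose = logger.getEffectiveLevel() <= logging.DEBUG
                            logger.setLevel(logging.INFO if verbose else logging.DEBUG)
                            logger.warning("log level is now %s",
                                           logging.getLevelName(logger.getEffectiveLevel()))

                        signal.signal(signal.SIGUSR1, toggle_debug)   # kill -USR1 <pid> to toggle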

                    1. 1

                      We use https://www.scalyr.com/ (for free since we work for “good causes”). It was selected Before My Time, but it just works for what I need.

                      1. 1

                        ELK stack running on AWS with a separate index per service. Kibana has some bad parts but overall it’s pretty usable for me. We have a team that owns it, but it’s been running without much intervention for the past ~2 years. At my previous job we used Graylog and found the experience to be awful.

                        1. 1

                          Syslog!

                          1. 1

                            nomad[0] logs and some shell scripts to pull logs from nomad for multiple instances of a job… plus grep (well, ripgrep, but whatever)

                            I set up graylog[1] and we used it for a while, and we even still occasionally use it, but it’s generally more hassle than it’s worth. Whenever it breaks, I don’t expect anyone here will care enough to have me fix it.

                            for metrics, we use prometheus[2]; we are happy with it.

                            0: https://www.nomadproject.io/

                            1: https://www.graylog.org/

                            2: https://prometheus.io/

                            1. 1

                              I’m a big fan of structured JSON in Splunk. You can configure it to generate indexes for every key and substructure in the JSON, which makes for some insanely fast queries. The querying language has a learning curve, but I think if you’re familiar with data pipelines or shell pipelines, it’s pretty approachable. There are definitely some weird bugs with it, and the visualization side is a bit limited. I often have some simple Python scripts that query Splunk for data and then render some more advanced visualization or do some tricky math on things.
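
                              For anyone wondering what emitting structured JSON looks like in practice, here’s a minimal stdlib-only sketch (field names are arbitrary; the Splunk ingestion side is out of scope):

                                  import json
                                  import logging

                                  class JsonFormatter(logging.Formatter):
                                      def format(self, record):
                                          # Every key here becomes a field the log store can index.
                                          return json.dumps({
                                              "ts": self.formatTime(record),
                                              "level": record.levelname,
                                              "logger": record.name,
                                              "message": record.getMessage(),
                                              **getattr(record, "extra_fields", {}),
                                          })

                                  handler = logging.StreamHandler()
                                  handler.setFormatter(JsonFormatter())
                                  log = logging.getLogger("payments")
                                  log.addHandler(handler)
                                  log.setLevel(logging.INFO)

                                  log.info("charge complete",
                                           extra={"extra_fields": {"user_id": "u42", "amount_cents": 1999}})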

                              I think the dashboard is a bit limited in that it’s hard to have complex data pipelines like multiple queries that combine, or a single shared query that gets visualized or refined in multiple ways. It’s not impossible but it’s not easy.

                              The API is very friendly so it’s easy to emit your logs to Splunk if you want to do it yourself or your language or platform doesn’t natively support it. It’s also really easy to consume data or just export the raw rows.

                              In the end, I think it’s my best experience with logging and queries I’ve found so far.

                              1. 1

                                I’ve used Loggly for the last 5 years or so and find it much better to use than self-hosted Kibana.

                                1. 1

                                  Mozilla’s MozDef is a pretty good starting point. It’s an ELK stack on a stick plus some log inspection logic.
                                  https://github.com/mozilla/MozDef

                                  1. 1

                                    Does anybody have experience using Riemann (http://riemann.io/) in production?

                                    1. 1

                                      I use nanomsg. All of the languages I use support nanomsg: Go, C, Python, etc. I have a simple server written in C that listens on a nanomsg socket and logs everything it receives to a file. It even works for CGI scripts (which typically can’t do centralized logging).

                                      1. 1

                                        I once tried to get systemd’s remote logging set up, but I couldn’t manage it. If someone has gotten it working, please let me know.

                                        1. 1

                                          At $job, we use papertrail. I find it works pretty well, and am fairly happy with it.

                                          1. 1

                                            We (@trivago) have several ELK clusters (3 small ones, one in each GCP region, and 1 “big” cluster in one of our on-premise datacenters). We ingest between 5TB and 10TB of logs daily with different retention policies. We use Kafka as a central buffer for all logging pipelines, and most of our logs are protobuf encoded.

                                            We have an internal cloud (self-hosted) that runs Nomad (from HashiCorp), and on those nodes we use Filebeat (also from Elastic) to ship the logs of the jobs to our ELK cluster. We recently wrote an autodiscover module that enriches the log events with information from the Nomad job/group/allocation. A nice bonus is that now we can use autodiscover hints (https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover-hints.html) to configure how the logs of a particular job should be parsed.