I was curious how my fellow 🦞s manage logs generated by their services, be it at home or at work.
It looks like "Observability" is the cool new trend nowadays, and it focuses heavily on structured logging and dynamic sampling, with many tools built to aid that. So I'm wondering: what tools do my fellow crustaceans use, are they built in-house or a vendor product, and how has that been going?
I was reading a blog post on Observability and found a few interesting quotes:
When a request enters a service, we initialize an empty Honeycomb blob and pre-populate it with everything we know or can infer about the request: parameters passed in, environment, language internals, container/host stats, etc. While the request is executing, you can stuff in any further information that might be valuable: user ID, shopping cart ID - anything that might help you find and identify this request in the future, stuff it all in the Honeycomb blob. When the request is ready to exit or error, we ship it off to Honeycomb in one arbitrarily wide structured event, typically 300-400 dimensions per event for a mature instrumented service.
and
Dynamic sampling isn't the dumb sampling that most people think of when they hear the term. It means consciously retaining ALL the events that we know to be high-signal (e.g. 50x errors), sampling heavily the common and frequent events we know to be low-signal (e.g. health checks to /healthz), sampling moderately for everything in between (e.g. keep lots of HTTP 200 requests to /payment or /admin, few of HTTP 200 requests to /status or /). Rich dynamic sampling is what lets you have your cake and eat it too.
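For a concrete picture of what the first quote describes, here is a rough sketch of the "one wide event per request" pattern. The `request`, `handler`, and `send_event` names are hypothetical stand-ins for whatever framework and telemetry client you actually use, not Honeycomb's real SDK:

```python
import time

def handle_request(request, handler, send_event):
    """Wrap a request handler so it emits one wide structured event per request.

    `request`, `handler`, and `send_event` are hypothetical stand-ins for your
    actual framework and telemetry client.
    """
    # Pre-populate the event with everything known up front.
    event = {
        "timestamp": time.time(),
        "path": request.path,
        "params": dict(request.params),
        "host": request.host,
        # ...plus environment, runtime internals, container/host stats, etc.
    }
    try:
        response = handler(request, event)  # handler stuffs in user_id, cart_id, ...
        event["status"] = response.status
        return response
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        send_event(event)  # ship the single, arbitrarily wide structured event
```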
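And a toy sketch of the dynamic sampling decision in the second quote might look like this (the routes and rates are made up for illustration):

```python
import random

def sample_rate_for(status, path):
    """Return N meaning 'keep 1 in N events of this kind'."""
    if status >= 500:
        return 1          # keep every error
    if path == "/healthz":
        return 1000       # health checks are low-signal
    if path.startswith(("/payment", "/admin")):
        return 5          # keep lots of these even when they succeed
    return 100            # everything else: sample moderately

def maybe_keep(event):
    rate = sample_rate_for(event["status"], event["path"])
    if random.random() < 1.0 / rate:
        event["sample_rate"] = rate  # stored so aggregates can re-weight later
        return event
    return None              # dropped
```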
Last job: Kibana - hated it. For some reason I'm incompatible with the syntax. I often tried to just narrow it down to minutes/hours and then still used grep and jq on the structured JSON log files.
Current job: nothing except ssh and grep - not ideal, but it still beats Kibana for a low number of hosts. Also not my or even our choice; they're the customer's boxes.
TL;DR: No idea what a good solution is, really. I'd probably give Graylog a spin and then maybe write my own thing. Only half-joking.
But to elaborate on what I see as your actual question: logging isn't enough. Honeycomb sounds like a mashup of logging and metrics. I liked Prometheus when I used it, but there was no good way to mix it in with the logs.
Also, the "typically 300-400 dimensions per event for a mature instrumented service" part is not to be taken lightly. Seeing who posted the linked blog post, and that I seem to disagree with most of their angry tweets… this is probably a solution for a specific type of problem that I usually don't find myself having.
As a (very happy) Honeycomb user (on a simple Rails monolith): it's really not.
I've used Splunk and New Relic extensively; they are not the same kind of thing. The best summary of the difference I can offer is this:
IME, the "fixed set of questions" approach will get you a pretty long way, but it often leads to outages getting misdiagnosed (extending time-to-fix), because you have to build a theory based on the answers you're able to get.
Also, from a cost perspective the orders of magnitude don't really line up. If you already pay for Splunk, the cost of Honeycomb is a rounding error.
If I had written "a mashup of logging and metrics and more", would you have agreed with that? I could only guess, as I have never tried it.
I haven't used Splunk - for me it's "that process that runs on the managed machines we deploy to and that sometimes hogs CPU without providing any benefit because we don't have access to the output" :P And no, don't ask about this weird setup.
Still not quite. "Nested span" really is a separate type of thing.
The failure of logs is that they are too large to store cheaply or query efficiently.
You can extract metrics from logs (solving the "too large" issue), but then you can only answer the questions your metrics were designed for. If you have new questions, you have to modify your metrics gathering and wait for data to come in, which is not an option during an outage.
Nested spans are (AFAIK) the most space-efficient and query-efficient structure to store the data you would usually put into logs, but they require you to modify your application to provide them (which is a nontrivial cost). In practice, nested spans typically also combine logs from multiple sources (e.g. load balancer, app server, and Postgres logs can all go into the same nested span).
The core observations IMO:
Taken together, this gives you the term "arbitrarily wide nested spans". Several optimizations are possible from there - e.g. the start/end times of a child span tend to be very small offsets from the parent event, whereas storing them as logs requires a full-precision timestamp on every line.
When your data is stored in this sort of nested-span structure, it's both smaller than the original logs (thanks to removing redundancy in related lines) and more complete (because it can join logs from multiple sources).
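To make that concrete, here is a rough sketch of what one stored top-level span might look like (field names invented for illustration); note how the children carry small offsets rather than full timestamps:

```python
# One top-level span for a request, with children from several sources
# (app server, Postgres, ...) joined into the same structure.
span = {
    "trace_id": "a1b2c3",
    "name": "GET /payment",
    "start": 1546300800.000,       # full-precision timestamp only at the root
    "duration_ms": 182.4,
    "status": 200,
    "children": [
        {"name": "postgres: SELECT orders", "offset_ms": 12.1, "duration_ms": 34.0},
        {"name": "render template",         "offset_ms": 51.3, "duration_ms": 9.7},
    ],
}
```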
You can further shrink your data by sampling each top-level span and storing the sample rate against the ones you keep. For instance, my production setup only stores 0.5% of 200 OK requests that completed in under 400ms. Because the events which are kept store the sample rate, all the derived metrics/graphs can weight that event 200x higher. This means that I have more storage available for detailed info about slow/failed requests.
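The re-weighting is then just arithmetic over the stored rates - a sketch, assuming each kept event carries a `sample_rate` field:

```python
def estimated_request_count(kept_events):
    # Each kept event stands in for `sample_rate` original events, so a
    # 1-in-200 sampled event contributes 200 to the estimated total.
    return sum(event.get("sample_rate", 1) for event in kept_events)
```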
This is all sensible stuff you could build yourself, but Honeycomb already got it right and they're dirt cheap (their free plan is ample for my work, which is a pretty damn big site). The annual cost of their paid plan is less than what a day of my time costs.
This is par for the course at most places dysfunctional enough to pay for Splunk (at a past employer it cost more than a small dev team, and it took me months to get access to it).
Could you give some concrete examples of the difference between Splunk and Honeycomb? Specifically, what do you mean exactly by "known questions" vs. "new questions"?
When you configure a Splunk install, you set up indexing policies that determine which queries can be answered efficiently.
Most of the time, when I wanted to find something out in Splunk, I wrote a query and got a result in a second or two. These are "known questions", and Splunk uses its index to come up with answers quickly. For instance, if I wanted to see a breakdown of HTTP status codes by request route per hour, our Splunk install had zero trouble answering instantly.
If you want to answer a question that isn't handled by those indexing policies (a "new question"), Splunk has to read the log files in full to answer it. This results in Splunk taking a minute or two (depending on your hardware and load) to give you a results page.
For instance, I recently wanted to compare browser share stats between Google Analytics (which gets blocked by tracker blockers) and Honeycomb (which records bots as well as real users). So I did a breakdown by browser + version + IP address, then manually excluded IP addresses with abnormally high request rates and repeated the query with just browser + version. Took me a minute or so to put the final query together. If I'd tried that with Splunk, I'd have had a visit from the operations team asking me how long I planned on generating that much load on the Splunk server.
Observability - and the service Honeycomb provides - is about being able to observe or introspect your system, to ask unknown questions without having to recompile and redeploy.
Metrics and logs can be a part of observability, but metrics, for example, usually only give you a hint of what's going on over a given sampling period and are based on known questions. Similarly for logs: you can add as many log lines as you want, but they might not help you answer specific questions at runtime - and you have the overhead of having to process and store all those logs. Inevitably people downsample logs and metrics for long-term storage, which means you lose data.
If you have an observable system you can, at runtime, ask questions about what the system is doing at a given point. This generally takes the form of some kind of distributed tracing system. In tracing, while processing a request, each component of a system (including external dependencies) starts a span connected to a trace. The sum of all the spans and the metadata attached to them lets you see more detail about what's going on, such as durations for a certain type of request. This would be too expensive in a time-series system.
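As a concrete (hedged) example, emitting such a span with the OpenTelemetry Python API looks roughly like this; the tracer name and attribute names are made up, and the downstream call is a stub:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # name is illustrative

def payment_gateway_call(order):
    # Stand-in for an external dependency, which would carry its own span.
    return {"status": "approved"}

def charge_card(order):
    # Each component opens a span tied to the current trace; the metadata
    # attached here is what later lets you break down durations by attribute.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order["id"])
        span.set_attribute("order.total_cents", order["total_cents"])
        result = payment_gateway_call(order)
        span.set_attribute("payment.status", result["status"])
        return result
```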
Going back to the original question, ELK still works fine. For small setups like a home environment it's relatively straightforward. Structured logs (e.g. JSON-formatted) make ingestion into Elasticsearch easier; they're not required, but they are recommended.
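If you do go the structured-logs route, emitting one JSON object per line from the application is usually all Filebeat/Logstash need - a minimal Python sketch (the field names are arbitrary):

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

logging.getLogger("app").info("user logged in")
```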
I have the nagging feeling you're disagreeing with me as if I had tried to define observability incorrectly, but IMHO I'm not even trying to.
If it's just about the "mashup" sentence, that was simply my impression from the linked article, as I don't know Honeycomb.
It's not clear from this post what your goals are and what your current setup is, because you're conflating observability and centralized logging, but those are independent topics. Related, yes, but independent.
What is it you wish to observe, and to what ends?
Observability is a somewhat vague term, but I assume you mean it in the common sense of observing the current (and past) states of a system for the purposes of operating that system safely. That is: observability generally means operational observability, not business observability (or legal/financial/compliance observability). Operational observability generally entails some sort of alerting system. Most people would not put auditing for legal or compliance reasons under the umbrella of observability, even though auditing is literally the process of observing what has happened. Centralized logging is essential for dealing with issues of legal compliance, and it's fantastic for doing so. If you are trying to solve some sort of legal or compliance problem, the rest of my post will not help you. If you are, however, concerned with operating live systems, the rest will apply.
I've run ELK in the past and don't bother with it now; I just use dsh to ssh into my nodes, tail their logs directly, and pipe them into grep. Using
journalctl -f
is the same thing; sometimes I do that too. I log to file, logrotate my logs, and throw them out after a week, because I don't have a legal requirement to keep them and I'm never going to look at them. I don't personally believe in the efficacy of logs for the purposes of operational observability.

To clarify where I'm coming from and my experience with the ELK stack: I spent a few months working on ELK infrastructure full time at Etsy; it was my responsibility to productionize the log shipping from every node in the fleet (Elastic's "beats" did not exist yet; the predecessor to Filebeat was Lumberjack, and it was not production-ready). Not only did I find a large number of problems with how the ELK stack shipped logs, I also found a large number of problems with how Splunk ships logs (do you know what happens when one process continues to write log lines after another process has deleted the file handle?). We were able to scale this stack, and when I was there we were shipping many billions of log lines every day. I would guess that less than one one-hundredth of one percent of all of that data was ever read in some way. My conclusion is that centralized logging is primarily a concept that is pushed by people who are trying to sell things. If what you really want is some sort of graph or chart or alerting system, you probably don't want to get there by parsing logs.
My conclusion after all this time was that centralized logging is ill-suited to operational observability inasmuch as centralized logging is concerned with completeness; the metric of success for centralized logging systems is not losing log data. But observability, in the general ops parlance, is not concerned with the completeness of records so much as it is concerned with the timeliness of signals. Alerting systems are latency-bounded systems. Case in point: downsampling is the process of throwing away data to make your pipeline more efficient in terms of something (disk space, network utilization, CPU utilization, or end-to-end latency, take your pick).
Centralized logging systems nearly always involve queuing systems or backpressure; they're nearly always designed around the idea that you want to get all of the data eventually, not a current view of your system right now. The time-to-alert for alerts driven by centralized logging is in every case going to include the end-to-end latency of the entire pipeline, especially if your alert is defined as "some system is in some state for some amount of time", such as "this endpoint is not responding for 3 minutes", which is how a lot of production alerts are defined in order to avoid false positives. When some portion of your centralized logging infrastructure slows down, so does your logging pipeline, and so does your alerting system. If your logging system is delayed by five minutes and you want to alert on something being in some state for five minutes, you're looking at ten minutes from the time an event begins to the time an alert is fired. Is that OK?
I strongly prefer, for the purposes of operational observability, to use a time-series database and to focus on metrics instead of logs. I use InfluxDB and Grafana for this purpose, and my experience has been positive. I install Telegraf on every node and have my local processes talk to Telegraf, which performs aggregations at the source and sends them to an InfluxDB instance, which I view through Grafana; alerts are driven through PagerDuty. This is working well for me.
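For a concrete idea of the "processes talk to a local agent" part: assuming Telegraf is configured with a socket_listener input on udp://:8094 accepting InfluxDB line protocol (adjust to whatever input plugin you actually enable), an application can emit a point roughly like this:

```python
import socket
import time

def send_metric(measurement, fields, tags=None, host="127.0.0.1", port=8094):
    """Send one InfluxDB line-protocol point to a local Telegraf socket listener.

    Assumes an inputs.socket_listener on udp://:8094; the measurement and
    field names below are arbitrary examples.
    """
    tag_str = "".join(f",{k}={v}" for k, v in (tags or {}).items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    line = f"{measurement}{tag_str} {field_str} {int(time.time() * 1e9)}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode(), (host, port))

# e.g. send_metric("http_requests", {"duration_ms": 42.0},
#                  tags={"route": "/payment", "status": "200"})
```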
If we want to be exceptionally pedantic: technically any append-only, ordered stream of events is a log of some form, and technically writing to a time-series database is itself a form of logging. I assume what you mean is "I have textual application logs, I want to know how my system is doing both now and historically", and my post is written through that lens.
After reading several people talk about their lukewarm to negative experiences with Splunk, I wanted to chime in. A different team manages the Splunk infrastructure at my job, so I can't speak to that beyond this: the Splunk search head and indexer run atop OpenStack on an R630; they are not the only VMs running. For the most part, I just want to speak to my experience making heavy use of it as an analyst.
We have several indexes, the smallest of which has hundreds of thousands of events, and the largest of which has hundreds of millions of events. If memory serves, at least one has over a billion events. Most contain network data, but some have system events or web server logs. Only the most complex queries, on the largest data sets, ever take significant amounts of time. Most queries begin returning results in a few seconds. If administered according to best practices, and given enough resources, Splunk does a phenomenal job of allowing analysts to carve through and draw conclusions from massive data sets.
Based on the experiences some of you described, my guess is that your organizations are not following best practices, are not giving Splunk enough resources, or both. Is Splunk administration expertise hard to come by? Is an R630 really that hard to justify? I'm curious to hear back.
Last job: Amazon's Kinesis pipelines and Spark Streaming.
Datadog Log Collection. It's painless to set up and can do fancy things such as dumping older logs into S3 for long-term storage (which is a requirement in the industry I work in). Honestly, after using Kibana and Splunk in previous jobs: Splunk might have more powerful features, but it's stupidly expensive and hard to maintain. Datadog Logs hits the right balance of cheap, easy, and usable.
we generate a lot of logs. we were using sumologic, but a few events with runaway ingest and subsequent plan extortion led me to investigate options. we'd already moved to prometheus and grafana for monitoring and alerting, so using loki for log aggregation and search seemed feasible. loki is still only semi-stable, but they're iterating on it rapidly, so we were able to shut off sumo and transition entirely to loki. for now at least. so far it's been good.
I've been using the logging that we added to our Datadog account. Budget restrictions limited the logs we're ingesting (service, OS, and application logs plus CloudWatch). That restriction also means a short retention time. It works fine, and it's nice to have one place where logging, APM, metrics, and events are captured. I just wish we had the budget to log everything and have a slightly longer retention period. But as mentioned above, you can re-ingest logs that it has archived to S3.
Before Datadog we used Sumo Logic, which worked fine; the only reason we switched was budget. At smaller places I've used both Papertrail and Loggly (before SolarWinds bought either of them). Splunk was the solution I was happiest with, but you need the budget for it.
I've come to the opinion that if you only need basic logging for troubleshooting and the size of your environment is under 100 things, a managed service like Sumo Logic, Papertrail, Loggly, or Datadog will work fine. But it all depends on the volume of logs being created, which of them need indexing, and how long you need to retain them. Anything larger benefits from something like Splunk.
Also consider that things you may need to log for security or compliance (CCPA/GDPR) may call for something else. You might need to store them long term in something like S3 with Object Lock and lifecycle management set up. That gives you a way to reduce the chance the logs are changed, and to delete them once you're no longer required to keep them.
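A sketch of that kind of setup with boto3 (bucket name, key, and retention dates are placeholders, and Object Lock has to be enabled when the bucket is created):

```python
import datetime
import boto3

s3 = boto3.client("s3")
BUCKET = "example-audit-logs"  # placeholder; Object Lock enabled at bucket creation

# Write a log archive that cannot be overwritten or deleted until the retain date.
s3.put_object(
    Bucket=BUCKET,
    Key="2019/12/app.log.gz",
    Body=open("app.log.gz", "rb"),
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.datetime(2021, 1, 1),
)

# Expire objects automatically once the retention requirement has passed.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Expiration": {"Days": 400},
        }],
    },
)
```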
Making the log level something that can be changed at runtime to increase logging when troubleshooting is another useful approach.
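One low-tech way to do that in a long-running process is to flip the level on a signal - a Python sketch, assuming a Unix host:

```python
import logging
import signal

log = logging.getLogger("app")
log.setLevel(logging.INFO)

def toggle_debug(signum, frame):
    # `kill -USR1 <pid>` switches between INFO and DEBUG without a restart.
    new_level = logging.DEBUG if log.level != logging.DEBUG else logging.INFO
    log.setLevel(new_level)
    log.warning("log level is now %s", logging.getLevelName(new_level))

signal.signal(signal.SIGUSR1, toggle_debug)
```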
We use https://www.scalyr.com/ (for free, since we work for "good causes"). It was selected Before My Time, but it just works for what I need.
ELK stack running on AWS with a separate index per service. Kibana has some bad parts, but overall it's pretty usable for me. We have a team that owns it, but it's been running without much intervention for the past ~2 years. At my previous job we used Graylog and found the experience to be awful.
Syslog!
nomad[0] logs and some shell scripts to pull logs from Nomad for multiple instances of a job… plus grep (well, ripgrep, but whatever).
I set up graylog[1] and we used it for a while, and we even still occasionally use it, but it's generally more hassle than it's worth. Whenever it breaks, I don't expect anyone here will care enough to have me fix it.
For metrics, we use prometheus[2]; we are happy with it.
0: https://www.nomadproject.io/
1: https://www.graylog.org/
2: https://prometheus.io/
I'm a big fan of structured JSON in Splunk. You can configure it to generate indexes for every key and substructure in the JSON, which makes for some insanely fast queries. The query language has a learning curve, but I think if you're familiar with data pipelines or shell pipelines, it's pretty approachable. There are definitely some weird bugs in it, and the visualization side is a bit limited. I often have some simple Python scripts that query Splunk for data and then render a more advanced visualization or do some tricky math on things.
I think the dashboards are a bit limited in that it's hard to have complex data pipelines, like multiple queries that combine, or a single shared query that gets visualized or refined in multiple ways. It's not impossible, but it's not easy.
The API is very friendly, so it's easy to emit your logs to Splunk if you want to do it yourself or your language or platform doesn't support it natively. It's also really easy to consume data or just export the raw rows.
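For what it's worth, the helper scripts I mean are tiny. A sketch against Splunk's REST export endpoint (host, credentials, and the example query are placeholders):

```python
import json
import requests

SPLUNK = "https://splunk.example.com:8089"  # placeholder search head
AUTH = ("svc_account", "changeme")          # placeholder credentials

def export_search(query):
    """Stream results of an ad hoc search from the /search/jobs/export endpoint."""
    resp = requests.post(
        f"{SPLUNK}/services/search/jobs/export",
        auth=AUTH,
        data={"search": f"search {query}", "output_mode": "json"},
        stream=True,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            yield json.loads(line)  # one JSON document per result row

# e.g. rows = list(export_search("index=web status=500 | stats count by route"))
# ...then feed the rows into matplotlib/pandas for the fancier visualizations.
```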
In the end, I think it's the best experience with logging and queries I've found so far.
I've used Loggly for the last 5 years or so and find it much better to use than self-hosted Kibana.
Mozilla's MozDef is a pretty good starting point. It's an ELK stack on a stick plus some log inspection logic.
https://github.com/mozilla/MozDef
Does anybody have experience using Riemann (http://riemann.io/) in production?
I use nanomsg. All of the languages I use support nanomsg: Go, C, Python, etc. I have a simple server written in C that listens on a nanomsg socket and logs everything it receives to a file. It even works for CGI scripts (which typically can't do centralized logging).
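The shape of it is roughly this - sketched here in Python with pynng (bindings for nng, nanomsg's successor) rather than the actual C server, and assuming a push/pull topology with a made-up socket address and file path:

```python
import pynng

# Collect log lines from many writers over one PULL socket and append them
# to a single file - a rough stand-in for the C server described above.
with pynng.Pull0(listen="tcp://0.0.0.0:5555") as sock, \
        open("central.log", "ab") as out:
    while True:
        msg = sock.recv()       # one log line per message
        out.write(msg + b"\n")
        out.flush()
```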
I once tried to get systemd's remote logging set up, but I couldn't manage it. If someone has gotten it working, please let me know.
At $job, we use papertrail. I find it works pretty well, and am fairly happy with it.
We (@trivago) have several ELK clusters (3 small ones, one in each GCP region, and 1 "big" cluster in one of our on-premise datacenters). We ingest between 5 TB and 10 TB of logs daily, with different retention policies. We use Kafka as a central buffer for all logging pipelines, and most of our logs are protobuf-encoded.
We have an internal cloud (self-hosted) that runs Nomad (from HashiCorp), and on those nodes we use Filebeat (from Elastic) to ship the logs of the jobs to our ELK clusters. We recently wrote an autodiscover module that enriches the log events with information from the Nomad job/group/allocation. A nice bonus is that now we can use autodiscover hints (https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover-hints.html) to configure how the logs of a particular job should be parsed.