Sure, logs/traces are better for debugging, but they are slower to write, slower to query, and more expensive to store. Your alerts should usually be based on metrics, because repeatedly querying your logging system gets very expensive in a hurry, and logs take much longer from the moment they are emitted to the moment they are queryable.
“You most likely need more than metrics”
Logs are vastly slower to write/query & more expensive to store.
Depending on your sampling rate, traces have whatever amortized performance impact / storage cost you allocate to them (sketched below).
They don’t have to be expensive, and I’ve had far more actionable insights from sampled traces than from p95 metrics.
Terrible title.
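To make the sampling point above concrete, here’s a minimal sketch of head-based sampling with the OpenTelemetry Python SDK; the 1% ratio, the console exporter, and the tracer name are placeholders, not recommendations:

    # Keep ~1% of traces; ingest/storage cost scales with whatever ratio you pick.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("example.service")  # hypothetical name
    for _ in range(1000):
        with tracer.start_as_current_span("handle_request"):
            pass  # roughly 10 of these 1000 root spans get recorded and exported

Tail-based sampling (deciding after the trace finishes, e.g. keeping every error) just moves the same cost decision later in the pipeline.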
But yes it’s interesting that your logs give you what you need. Metrics are there so you can quickly check if you need to dig into your logs.
That and alerting. The role of metrics isn’t to tell you how your app is doing in a fine-grained manner. It’s more to answer “Is it up?” and “Am I about to have a bad page?”
Modern observability providers can compute SLIs from your events when they are received, effectively turning those traces into metrics within the provider.
Doing it this way means you have everything based on the same data, and it’s just as fast as metrics, since the SLI is computed at ingest rather than re-queried from the raw events every time. And when you need to switch from the alerting graph to the actual problematic requests, there’s no context switching.
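Not any particular provider’s API, but the idea of deriving an SLI from trace events at ingest (in the spirit of what e.g. the OpenTelemetry Collector’s spanmetrics connector does) looks roughly like this; the field names and thresholds are made up:

    from collections import Counter

    sli = Counter()  # running good/total counts, i.e. a metric

    def on_span_received(span: dict) -> None:
        # Applied once per event as it arrives at the provider.
        sli["total"] += 1
        if span["status_code"] < 500 and span["duration_ms"] <= 300:
            sli["good"] += 1

    def availability() -> float:
        # What the alert evaluates: a pre-computed ratio, not a fresh log query.
        return sli["good"] / sli["total"] if sli["total"] else 1.0

    for span in ({"status_code": 200, "duration_ms": 42},
                 {"status_code": 503, "duration_ms": 900}):
        on_span_received(span)
    print(availability())  # 0.5; the underlying spans are still there to drill into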
You need metrics. If you are using the right providers you can conceivably replace them with powerful log features.
The kind of coarse-grained application-reported metrics I was using to drive alerts a decade ago are a very different beast from the kind of metrics you can derive from the newer generation of span/trace systems.
I’m not interested in going back to “most pages are for a non-event”.
Ok (if often-repeated) point, bad title.
What I’d personally find interesting to read about is applying such concepts to scenarios that are not “we run large web services on some cloud platform”, which seems to be the more or less implicit assumption most of the time.
Something like logging commands / requests to a (shared-memory?) ring buffer, and dumping those to disk on a crash?
Systems like that definitely exist. I like - partly for other reasons - apenwarr’s writeup; the part relevant to this discussion starts at “Userspace and kernel messages, in a single stream”. (Note that that particular implementation is for embedded Linux; you may need to adapt it for non-embedded *nix or non-Linux embedded systems.)
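A toy, in-process sketch of the ring-buffer idea from the question above; a deque stands in for the shared-memory segment, and an excepthook stands in for a real crash handler that would survive hard crashes:

    import sys
    from collections import deque

    RING = deque(maxlen=1024)  # last 1024 events; older ones are overwritten for free

    def record(event: str) -> None:
        RING.append(event)

    def dump_ring(exc_type, exc, tb):
        # On an unhandled exception, persist the recent history, then fail as usual.
        with open("crash-ring.log", "w") as f:
            f.write("\n".join(RING) + "\n")
        sys.__excepthook__(exc_type, exc, tb)

    sys.excepthook = dump_ring

    record("cmd: open /dev/ttyUSB0")
    record("cmd: write 512 bytes")
    raise RuntimeError("boom")  # crash-ring.log now holds the recent commands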
People are usually not fans of software collecting information about itself and sending it off to some centralized collection system (aka telemetry). Distributed metrics are a complex problem, but mostly in a social sense rather than a technical one.
There are so many assumptions presented like they’re fundamental things. For example, you don’t have to aggregate your metrics; InfluxDB will let you save all the samples. You don’t need to have a split between metrics and logs: your metrics can be a projection (optimised, materialised if needed) over logs, and there are many tools that will do that for you (a toy sketch of the idea follows below). And sometimes you do want pre-aggregated metrics, because pushing everything as separate events would overwhelm your monitoring solution.
There’s so much context needed before you can decide between logs, metrics, and pre-aggregated batches. And before you can decide if you actually care, or if you just want multiple ways of querying the same thing.
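The “metrics as a projection over logs” point above, as a toy sketch; the log records and field names are invented:

    from collections import defaultdict

    # Incrementally maintained "metric" views over a stream of structured log
    # records, i.e. a materialised projection rather than a separate pipeline.
    requests = defaultdict(int)
    errors = defaultdict(int)

    def project(record: dict) -> None:
        requests[record["path"]] += 1
        if record["status"] >= 500:
            errors[record["path"]] += 1

    for record in (
        {"path": "/api/orders", "status": 200, "latency_ms": 31},
        {"path": "/api/orders", "status": 500, "latency_ms": 870},
        {"path": "/healthz", "status": 200, "latency_ms": 2},
    ):
        project(record)

    print(dict(requests))  # {'/api/orders': 2, '/healthz': 1}
    print(dict(errors))    # {'/api/orders': 1}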
My experience suggests that the best observability “pillar” in terms of ROI is contextual.
If your production data volume isn’t so large that you have to design around it (anything less than 10k RPS, certainly), then you don’t really feel the pain of logging or appreciate the benefits of metrics. So ElasticSearch or whatever can make sense.
But telemetry about a system is by its nature almost always far more data than the data of the system itself. (Given some event E, there is essentially no limit to the possible metadata about E that can be observed, derived, or inferred.) So if you’re anywhere near resource limits for your prod traffic, full-fidelity observability data, like logs, is usually gonna cost you at least as much in terms of OpEx as the prod system it’s observing. The ROI quickly goes negative.
That’s why metrics are great: unlike logs and traces, they scale more or less invariant to load. A counter is a single integer in memory, no matter how frequently you increment it. Do they solve every problem? No. But they solve an enormous class of problems very efficiently.
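To put the “single integer in memory” point in code, a minimal sketch with the Python prometheus_client; the metric and label names are illustrative:

    from prometheus_client import Counter, start_http_server

    REQUESTS = Counter("http_requests", "HTTP requests served", ["method", "status"])

    start_http_server(8000)  # exposes /metrics for the scraper to pull

    for _ in range(1_000_000):
        REQUESTS.labels(method="GET", status="200").inc()
    # Memory use is a handful of numbers per label combination, not per
    # increment, so the cost stays flat no matter how hot the code path is.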