1. 38
  1.  

    1. 18

      It’s nice to see an article on this topic that sees traces as a better version of logs and/or structured logs, instead of something you do alongside logging (i.e. the “three pillars of observability” model).

      1. 11

        I always say that tracing is structured logging with a duration and a parent link. It always confuses me when people treat it like something completely different.

        The parent link does require some extra coordination (especially if you want to have it work cross-process) but it is really just one extra bit of metadata.
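
        A minimal sketch of that idea, using only the Python standard library: a span is a structured log record that also carries a duration and a parent ID. The `span` helper, the `records` list and the field names are hypothetical and are not the OpenTelemetry API.

        ```python
        import json
        import time
        import uuid
        from contextlib import contextmanager
        from contextvars import ContextVar

        # Hypothetical helper, not a real tracing library.
        _current_span_id = ContextVar("current_span_id", default=None)
        records = []  # stand-in for a log sink


        @contextmanager
        def span(name, **attrs):
            span_id = uuid.uuid4().hex[:16]
            parent_id = _current_span_id.get()  # the "parent link"
            token = _current_span_id.set(span_id)
            start = time.monotonic()
            try:
                yield span_id
            finally:
                _current_span_id.reset(token)
                # Emit one structured record when the span ends.
                records.append({
                    "name": name,
                    "span_id": span_id,
                    "parent_id": parent_id,
                    "duration_ms": (time.monotonic() - start) * 1000,
                    **attrs,
                })


        with span("handle_request", route="/users"):
            with span("db_query", table="users"):
                pass

        for record in records:
            print(json.dumps(record))
        ```

        Within one process the parent link comes for free from the context variable; going cross-process would mean passing the span ID along with the request, which is the extra coordination mentioned above.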

        1. 4

          To quote the article:

          However, this has issues; a timeout once is “for later” or just info, but many timeouts might be “wake up”, how do you encode that logic somewhere?

          You encode that logic in aggregate log analysis. There are many bog-standard options here. Part of the reason to use both logging and tracing is that tracing is still in a phase of development where tool vendors are acting as gatekeepers. While you can attach ad-hoc attributes to traces, the analysis tools are often lacking. Most vendors are trying to upsell on anomaly detection and other similar features where they sell you their method of log analysis rather than enabling you to precisely define the analyses you’d like to be woken up for.
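
          As a rough illustration of what "encode that logic in aggregate analysis" could look like, here is a sliding-window counter that only signals "wake up" once enough timeouts land close together. The class name, threshold and window size are assumptions made up for the sketch.

          ```python
          from collections import deque


          class TimeoutAlert:
              """Signals only when `threshold` timeouts fall within `window_s` seconds."""

              def __init__(self, threshold: int, window_s: float):
                  self.threshold = threshold
                  self.window_s = window_s
                  self.seen = deque()

              def record(self, timestamp: float) -> bool:
                  """Record one timeout; return True if it should wake someone up."""
                  self.seen.append(timestamp)
                  # Drop timeouts that have aged out of the window.
                  while self.seen and self.seen[0] <= timestamp - self.window_s:
                      self.seen.popleft()
                  return len(self.seen) >= self.threshold


          alert = TimeoutAlert(threshold=3, window_s=60.0)
          for ts in (0.0, 10.0, 20.0):
              print(ts, alert.record(ts))
          ```

          A single timeout stays "for later"; only a burst crosses the threshold.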

        2. 2

          even if you’re lazy and don’t set up the parent, it’s still slightly better than a log. you only have to look in one place (traces), instead of two (traces and logs)

          1. 1

            This is not to be underestimated! Having to debug through multiple tools, or even the same tool with isolated UIs for traces and logs, is so painful.

      2. 4

        i hadn’t really heard of observability until i started working on a monitoring agent and a collector a few years ago

        i asked coworkers, but was never able to get a satisfying answer for “when would you ever use a log, instead of a trace?”

        my best guess is that they’re still included for compatibility with older logging libraries, or when you’re transitioning legacy software, but you still want your logs to show up in the same backend as the traces and metrics. but it could be that people’s whole mindset around observability is path-dependent, and they learned logs first

        1. 3

          IMHO, logs should be for things that aren’t time bound (that don’t have a beginning or an end). For example, if you need to express that the app has started binding on a specific port.

          As soon as you are within a request, a message processing, or anything which has a beginning and an end, they should be traces. And if something doesn’t have a beginning and an end but fits within one, OpenTelemetry has the notion of events.

          1. 1

            but why?

            you could replace any log/event with a root-span trace, and it’d carry the same information. so, it seems redundant to have both

            edit: another way of phrasing it, is that a log looks a lot like a trace that has zero elapsed time, and no parent. so i’m not sure why people don’t use those traces instead

            1. 2

              You absolutely can. It’s just weird to have spans that are started and ended right away. Those non time-bound events are also often interesting to have in stdout, but not so much in a tracing platform.

              1. 1

                That’s an interesting idea, but I’m struggling to think of an example where I have zero duration on an event.

                Even handling and ignoring an os signal has some duration, even if it’s tiny, and you’d still be able to attach data about why the app isn’t handling the event.

      3. 3

        This is a good point. Where I have seen a difference between traces and logs is that traces get sampled down heavily, while logs are fully retained. That isn’t an inherent property, obviously: we could also configure full retention for traces. To reduce data volume, though, what would be interesting is if a trace could have part of its data sampled off, and part retained for longer. This is good food for thought.

        1. 3

          one thing that’s interesting about traces, is that you sample at the root of the trace, and then trace/don’t for the entire tree. with logging, you get basically useless nonsense if you try to sample, because you’re typically relying on timestamps for correlation with logs

          you can also do it at the end of the trace, based on what’s in the trace. that doesn’t help with performance, but it does help with data volume. you can be clever and try to decide if anything “interesting” happened in the span, and (probabilistically) throw it out if it looks boring
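
          a rough sketch of the two approaches, with made-up function names and thresholds (real samplers, e.g. in OpenTelemetry, are more careful about how they hash the trace ID): head sampling decides once from the trace ID so every span in the tree agrees, tail sampling looks at the finished trace and keeps anything "interesting"

          ```python
          import random


          def head_sample(trace_id: str, rate: float) -> bool:
              """Decide at the root, from the trace ID alone, so all spans agree."""
              # Assumes trace_id is a hex string; the bucket scheme is illustrative.
              bucket = int(trace_id, 16) % 10_000
              return bucket < rate * 10_000


          def tail_sample(spans, boring_keep_rate, rng=random.random):
              """Decide after the trace has finished, based on what is in it."""
              # "Interesting" here means an error or a span slower than 500 ms;
              # both criteria are assumptions for the sketch.
              interesting = any(
                  s.get("error") or s["duration_ms"] > 500 for s in spans
              )
              return interesting or rng() < boring_keep_rate


          print(head_sample("00ff", 0.5))
          print(tail_sample([{"duration_ms": 12.0, "error": True}], 0.0))
          ```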

        2. 2

          One of the interesting ideas I’ve come across in this area is if you theoretically had unlimited data retention, you could purely store traces and calculate metrics from the data embedded in them. Logs, traces, and metrics would all be different displays of the same structured data.

          if a trace could have part of its data sampled off, and part retained for longer.

          There’s definitely something in that - I feel like there’s got to be some more interesting approaches than just picking 1 in N requests to trace.

          1. 5

            Logs, traces, and metrics would all be different displays of the same structured data.

            Yep:

            Metrics, tracing, and logging are actually emergent patterns of consumption of [the same] observability data

          2. 2

            Metrics are something I deliberately left out of this post as it was long enough already, and I have very little positive to say about them (useful for CPU/mem/resource usage over time is about all I use them for).

            However on CI Tracing, I’ve used the otel collector to also send the traces to a small service which builds very specific metrics from the traces. Its storage cost is low as it doesn’t retain the traces after processing, and with the throughput that CI has, I basically don’t have to worry about running out of space to store pretty granular metrics.
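
            This is not the actual service, just a sketch of the general shape under assumed field names (`name`, `duration_ms`, `error`): each span is folded into a few counters and can then be discarded.

            ```python
            from collections import defaultdict


            class SpanMetrics:
                """Folds spans into per-name counters so the span can be dropped."""

                def __init__(self):
                    self.count = defaultdict(int)
                    self.errors = defaultdict(int)
                    self.total_ms = defaultdict(float)

                def observe(self, span: dict) -> None:
                    name = span["name"]
                    self.count[name] += 1
                    self.total_ms[name] += span["duration_ms"]
                    if span.get("error"):
                        self.errors[name] += 1

                def mean_ms(self, name: str) -> float:
                    return self.total_ms[name] / self.count[name]


            metrics = SpanMetrics()
            for s in (
                {"name": "build", "duration_ms": 1200.0},
                {"name": "build", "duration_ms": 800.0, "error": True},
                {"name": "test", "duration_ms": 300.0},
            ):
                metrics.observe(s)  # nothing else keeps a reference to the span

            print(metrics.count["build"], metrics.mean_ms("build"))
            ```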

            1.  

              I have very little positive to say about them (useful for CPU/mem/resource usage over time is about all I use them for).

              No way! 😇 Metrics deliver the most value per dollar of all possible telemetry categories. You always start with metrics, and add other stuff as the need arises.

    2. 4

      This can be life changing when it comes to debugging/optimising complex systems.

    3. 4

      I’m relatively new to the world of tracing, so maybe someone can guide me here: it looks as if Honeycomb and Lightstep (and Sentry?) are the go-to solutions for consuming traces, but are there any decent free solutions that would require self-hosting? I’m looking for a way to test the viability of introducing traces across multiple services and languages.

      1. 2

        This, 100x this. It seems like everybody who says smart things about telemetry is also pointing me at 3 or more different SaaS solutions and has zero consideration for self-hosted on-prem solutions.

      2. 2

        Honeycomb at least has a free tier. Locally I’ve used jaeger and zipkin.

      3. 1

        You can self host both Jaeger and Grafana Tempo

    4. 3

      I’ve had my own confusion on this topic… I’m going to write this out, so I can be wrong on the Internet, and someone can correct me:

      I think “logging” means “capturing data to a log file” (where a log file is characterized by being written append-only – I get the feeling that log-analysis tooling breaks this? but any other time I think I’ve seen something characterized as “log structured”, it’s generally meant “append-only, non-branching”), while “tracing” means “recording a semantic flow through an application”. That is, “logging” reflects how data is captured, and “tracing” reflects what data is captured. One can capture traces using logging APIs, one can emit logs using tracing APIs, and one can log data that doesn’t include tracing data.

      The APIs and tooling that surround logs and traces tend to work very differently: My favorite log-analysis tool is awk (or Python’s json module for structured logging), which you can’t use against Jaeger. But that’s not an essential difference. I’ve gotten a lot of value from capturing trace data in log files.
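
      As a sketch of that last point, here is roughly what reading trace data back out of a JSON-lines log with Python’s json module can look like. The log contents and field names (`trace_id`, `span_id`, `parent_id`) are invented for the example, and an in-memory buffer stands in for the file.

      ```python
      import io
      import json
      from collections import defaultdict

      # Hypothetical JSON-lines log; a real one would be opened with open(path).
      log_file = io.StringIO(
          '{"trace_id": "a1", "span_id": "1", "parent_id": null, "name": "request", "duration_ms": 42.0}\n'
          '{"trace_id": "a1", "span_id": "2", "parent_id": "1", "name": "db_query", "duration_ms": 30.5}\n'
          '{"trace_id": "b2", "span_id": "3", "parent_id": null, "name": "request", "duration_ms": 7.0}\n'
      )

      # Index records by (trace, parent) so the tree can be walked from the root.
      children = defaultdict(list)
      for line in log_file:
          record = json.loads(line)
          children[(record["trace_id"], record["parent_id"])].append(record)


      def print_tree(trace_id, parent_id=None, depth=0):
          """Print one trace as an indented tree, following the parent links."""
          for record in children[(trace_id, parent_id)]:
              print("  " * depth + f'{record["name"]} ({record["duration_ms"]} ms)')
              print_tree(trace_id, record["span_id"], depth + 1)


      print_tree("a1")
      ```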