1. 7
  1. 3

    A bit high-level for my tastes…

    …but it almost stumbles onto a nifty trick I read about somewhen and adapted for my purposes.

    A correlation ID.

    When an event comes into your system… give it a unique ID.

    Either a global counter, or a per-source counter OR’d with a source ID.

    And then for every cascading event…

    i.e. if this event triggers another event… copy the parent event’s correlation ID to the child.

    This enables you to ask and answer such questions as “Which source of events is creating the most load on my system?”
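
    A minimal sketch of the idea in Python, not taken from any particular system. The 8-bit source field and the names `new_correlation_id`, `Event`, `child` and `load_by_source` are all made up for illustration; the scheme shown is the per-source counter OR’d with a source ID.

    ```python
    import itertools
    from collections import Counter
    from dataclasses import dataclass

    SOURCE_BITS = 8  # assumption: source IDs fit in 8 bits (0..255)

    _counters = {}  # one independent counter per source ID


    def new_correlation_id(source_id: int) -> int:
        """Per-source counter shifted left, OR'd with the source ID in the low bits."""
        counter = _counters.setdefault(source_id, itertools.count(1))
        return (next(counter) << SOURCE_BITS) | source_id


    @dataclass
    class Event:
        name: str
        correlation_id: int


    def incoming(name: str, source_id: int) -> Event:
        """An event entering the system gets a fresh correlation ID."""
        return Event(name, new_correlation_id(source_id))


    def child(parent: Event, name: str) -> Event:
        """A cascading event copies its parent's correlation ID."""
        return Event(name, parent.correlation_id)


    def source_of(event: Event) -> int:
        """Recover the source ID from the low bits of the correlation ID."""
        return event.correlation_id & ((1 << SOURCE_BITS) - 1)


    def load_by_source(events) -> Counter:
        """Count events (root and cascaded) per originating source."""
        return Counter(source_of(e) for e in events)
    ```

    Feeding every event you log through something like `load_by_source` answers the “which source creates the most load” question, because cascaded events are still attributed to whatever originally came in.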

    1. 3

      That’s how Honeycomb do it - every span is a tuple of (id, optional parent span id, start time, duration, arbitrary key/value data).

      Because you can query on the keys and values, this is enough to let you slice in all sorts of interesting ways (e.g. for database calls, I record the parameterized query and all the parameters; I can then find the unique SQL which has the highest sum(runtime), and then break that down according to a parameter value).
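
      Roughly what that slicing looks like, as a plain-Python sketch rather than Honeycomb’s actual API. The `Span` class mirrors the tuple above; the keys `sql` and `param.user_id` and the helper `sum_duration_by` are invented for the example.

      ```python
      from collections import defaultdict
      from dataclasses import dataclass, field
      from typing import Optional


      @dataclass
      class Span:
          id: int
          parent_id: Optional[int]
          start_time: float
          duration: float
          data: dict = field(default_factory=dict)  # arbitrary key/value data


      def sum_duration_by(spans, key):
          """Total duration grouped by the value stored under `key`."""
          totals = defaultdict(float)
          for span in spans:
              if key in span.data:
                  totals[span.data[key]] += span.duration
          return dict(totals)


      # Hypothetical database-call spans: parameterized SQL plus its parameters.
      spans = [
          Span(1, None, 0.0, 0.5, {"sql": "SELECT * FROM orders WHERE user_id = ?", "param.user_id": 7}),
          Span(2, None, 1.0, 0.25, {"sql": "SELECT * FROM orders WHERE user_id = ?", "param.user_id": 9}),
          Span(3, None, 2.0, 0.5, {"sql": "SELECT * FROM orders WHERE user_id = ?", "param.user_id": 7}),
          Span(4, None, 3.0, 0.125, {"sql": "SELECT 1"}),
      ]

      # Step 1: which unique SQL has the highest sum(runtime)?
      by_sql = sum_duration_by(spans, "sql")
      slowest_sql = max(by_sql, key=by_sql.get)

      # Step 2: break that one query down by a parameter value.
      slowest_spans = [s for s in spans if s.data.get("sql") == slowest_sql]
      by_user = sum_duration_by(slowest_spans, "param.user_id")
      ```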