This is a really helpful write-up on function call tracing, and the small runtime library is excellent for learning.
I have a couple of questions:
I'm curious about how trace_data::pos is handled when the buffer becomes full. I see it's initialized in allocate and incremented during function entry instrumentation, but I didn't spot where it's reset after a wraparound.
What's your take on the call_site parameter of __cyg_profile_func_enter? It seems wasteful in terms of code size and overhead.
GCC's x86 port introduced -minstrument-return=call in 2018. I'm curious which projects have used the option before funtrace…
Also, thanks for explaining LLVM XRay. I learned about it around the start of my US career and found it very interesting, as my day job did not bring me a lot of fun. I even contributed a bit to XRay and reviewed some features.
The throw issue (and similar functions like fork/exit) reminds me of related code coverage challenges. I contributed to Clang's -finstrument-functions, and this might be a nice improvement opportunity.
the pos pointer is reset by ANDing with wraparound_mask. it’s a bit weird to AND pointers like that; it’s allocated with posix_memalign to allow it.
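To make the trick concrete, here's a minimal sketch of AND-based wraparound - not funtrace's actual code, and it assumes (as one way to make the AND valid) that the buffer is aligned to twice its power-of-two size, so the address bit just above the buffer offset starts out clear:

    #include <cstdint>
    #include <cstdlib>

    struct event { uint64_t tsc; void* func; };

    constexpr std::size_t SIZE = 1 << 20;       // buffer size in bytes, power of two

    event* buf;
    event* pos;
    uintptr_t wraparound_mask;

    void allocate() {
        void* p = nullptr;
        if (posix_memalign(&p, SIZE * 2, SIZE) != 0) return;  // base % (2*SIZE) == 0
        buf = pos = static_cast<event*>(p);
        wraparound_mask = ~uintptr_t(SIZE);     // clears the bit that gets set when pos hits buf+SIZE
    }

    void record(uint64_t tsc, void* func) {
        *pos++ = event{tsc, func};              // write the entry, advance the pointer
        pos = reinterpret_cast<event*>(reinterpret_cast<uintptr_t>(pos) & wraparound_mask);
    }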
__cyg handlers are quite wasteful, i guess to let you neatly construct a call graph in portable code; for funtrace, rdtsc dominates the overhead, so -pg's more efficient hooks don't help much - it's the fact that they are inserted post-inlining that matters.
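For reference, these are the hooks -finstrument-functions makes GCC/Clang call; the signatures are fixed by the compilers, and call_site is the extra argument discussed above. The bodies below are just a sketch (not funtrace's handlers), reusing the record() helper from the buffer sketch:

    #include <cstdint>
    #include <x86intrin.h>

    void record(uint64_t tsc, void* fn);        // e.g. the ring buffer writer sketched above

    extern "C" {

    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void* this_fn, void* call_site) {
        (void)call_site;                        // passed by every instrumented call site, used or not
        record(__rdtsc(), this_fn);
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void* this_fn, void* call_site) {
        (void)call_site;
        record(__rdtsc(), this_fn);             // a real tracer would tag this as a return event
    }

    }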
One nice thing about -finstrument-functions on gcc is that the exit hook gets called upon stack unwinding - would be good if clang did this, too
i looked a bit into XRay's code here - your thoughts on its overhead might be interesting; i feel it isn't optimized very aggressively: https://x.com/YossiKreinin/status/1888900457046917300?t=MBUZDRhOrQo2vJBH8Ul_0w&s=19
Thx. I missed the (&=) use. When the buffer is full, there is no auto flushing?
Thanks for confirming my intuition. While the __cyg hooks offer a mechanism for constructing dynamic call graphs, I haven't come across any use cases where this capability is actually used…
I agree that the implemented log modes are too heavy. I believe that only one team at Google used LLVM XRay (the authors of the whitepaper) and they eventually removed their dependency on it a few years ago. Rewriting a significant portion of the open-source implementation seems reasonable, assuming there’s community interest in doing so.
The register saving code enables arbitrary C++ function calls. As you've noticed, this introduces significant overhead. For x86-64 and aarch64, we could probably require the preserve_all or preserve_most calling convention so that the register saving code can be removed. However, this requires simplifying the log function first, since we don't want a preserve_most callback to call other functions. Any customization of the log function works against this.
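As an illustration of the direction (made-up names, not XRay's code), here's a callback declared with Clang's preserve_most attribute, which makes the callee responsible for preserving nearly all registers so the patched caller doesn't have to spill them around the call:

    #include <cstdint>

    struct log_entry { int32_t func_id; uint64_t tsc; };
    log_entry g_log[1 << 16];
    unsigned g_pos;                             // a real tracer would keep this per thread

    __attribute__((preserve_most))
    void sketch_log_entry(int32_t func_id, uint64_t tsc) {
        // Deliberately no calls to other functions here: the whole point is that the
        // instrumented caller can skip saving caller-saved registers around this call.
        g_log[g_pos++ % (1 << 16)] = log_entry{func_id, tsc};
    }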
no auto flushing; you either read the data using the snapshotting api, or from a core dump, getting the latest events from the buffers. i feel that it makes sense to settle on a maximum buffer size and not bother with longer snapshots. in the post you can see a strategy for taking snapshots with very little I/O, and for that you need to settle on a maximum snapshot size; an arbitrary size means constant I/O, which i think is too high overhead, and then you need to look at all that data, also not fun… if the program knows when to take snapshots, it can limit their size enough to not need to worry about any of this
what do people use instead of XRay? something hw assisted like magic-trace, or a software based tool, or say kutrace and/or ftrace but no userspace tracing?
https://github.com/yosefk/funtrace/blob/e655c1782a1db58276a7bab48bce59bea61f6c39/funtrace.cpp#L400
Leaf 15h only exists since Intel Skylake, so anything older won't have it. Likewise, no AMD chip has it. AMD is particularly annoying: there is a way to get the TSC frequency, but only from an MSR, which requires help from the kernel.
absent leaf 15h or a dmesg printout telling the TSC frequency, funtrace falls back on sleeping for 100 ms and checking how many TSC cycles elapsed to estimate the frequency - not very accurate, but it will “work.”
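A rough sketch of the two approaches described above - leaf 15h when it's there, otherwise timing a ~100 ms sleep; this isn't funtrace's code, just the general idea:

    #include <cpuid.h>
    #include <x86intrin.h>
    #include <cstdint>
    #include <ctime>

    double tsc_hz_estimate() {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        // Leaf 15h: EBX/EAX is the TSC / crystal-clock ratio, ECX the crystal clock in Hz (may be 0).
        if (__get_cpuid_count(0x15, 0, &eax, &ebx, &ecx, &edx) && eax && ebx && ecx)
            return double(ecx) * ebx / eax;

        // Fallback: count TSC cycles across a ~100 ms sleep measured with CLOCK_MONOTONIC.
        timespec t0{}, t1{}, delay{0, 100 * 1000 * 1000};
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t c0 = __rdtsc();
        nanosleep(&delay, nullptr);
        uint64_t c1 = __rdtsc();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return (c1 - c0) / sec;
    }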
generally the wrong frequency scales the view but it doesn’t change the proportions of time ranges. LLVM XRay incorrectly scales the timestamps by a factor of around 2 in my tests and IMO is still usable in this sense; funtrace should do much better than that most of the time.
One key point I got out of this is that sampling profilers are really bad at catching intermittent latency spikes (or worst case latency, the author’s words).
Another argument for the idea that different performance use cases need different kinds of profiler measurement and different kinds of visualizations.
since this is lobste.rs, if anyone wants to port the tracing profiler to Rust, I’ll be glad to work with you on this! shouldn’t be hard - it’s 1200 LOC of C++ - but you need to be better at Rust than me to do it, quite possibly, since it’s not “normal” kind of code
Very cool stuff IMHO! One key takeaway for me is that I’d want my code to play nice alongside the tracer.
I’ve tried to get something like that going on ARM microcontroller CPUs. Most of them have an embedded trace unit called ETM which can pump out low-level trace info, and I think it can also be driven from the application code. However, I unfortunately never managed to get it running properly on the platform I tried it on, maybe because I ran out of time at some point. Not sure if that’s actually similar to the Intel tracing solution.
funtrace, the tracer from the post, is designed to be easy to port to an embedded platform (at work I deal with fairly large, but embedded, systems) and it traces straight to memory. it is way higher overhead than hw tracing, but actually the x86 rdtsc instruction is very slow and memory is very fast, and if your embedded CPU has a fast cycle counter (they often do) and slow memory, you can condition the writing of trace events on a call taking more than X cycles, cutting overhead a lot with no manual filtering. (the downside is that you need to maintain a stack, the details of which i’ll elide here for brevity, if you can call it that)
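A sketch of what that duration filter can look like - an illustration rather than funtrace's code; cycle_count() and emit_event() stand in for platform-specific pieces:

    #include <cstdint>

    uint32_t cycle_count();                                    // platform-specific, e.g. DWT->CYCCNT on a Cortex-M
    void emit_event(void* fn, uint32_t entry, uint32_t end);   // appends to the trace buffer

    constexpr uint32_t kMinCycles = 200;                       // arbitrary threshold, tune per platform
    constexpr int kMaxDepth = 256;

    struct frame { void* fn; uint32_t entry; };
    frame g_stack[kMaxDepth];                                  // the shadow stack mentioned above (one per thread/core)
    int g_depth;

    inline void on_enter(void* fn) {
        if (g_depth < kMaxDepth) g_stack[g_depth] = frame{fn, cycle_count()};
        ++g_depth;
    }

    inline void on_exit() {
        --g_depth;
        if (g_depth < 0 || g_depth >= kMaxDepth) return;
        uint32_t now = cycle_count();
        if (now - g_stack[g_depth].entry >= kMinCycles)        // only long calls get written out
            emit_event(g_stack[g_depth].fn, g_stack[g_depth].entry, now);
    }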
Having such a tracer on an embedded platform certainly would already have helped me a couple of times, that’s for sure. However, on our platform we don’t have a lot of memory available - I’m currently working on an STM32 with no more than 1MB of RAM, most of which we need for other functionality, so there’s not much room for trace data.
That’s where the Embedded Trace Macrocell (ETM) would’ve come in handy: it can stream trace data to the parallel trace port from which an external debugger or something like orbtrace can collect it and process it further.
Now that I think about it, it might even be possible to directly write to the parallel trace buffer, leveraging the software tracer while still not using too much RAM and streaming the data off-device for further analysis. Huh, that’d be cool.
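A sketch of that idea, assuming a CMSIS device header and that the debug tooling has already enabled the ITM and the trace port (the header name below is just an example): push each trace word into an ITM stimulus port so it streams off-chip instead of accumulating in RAM.

    #include "stm32f4xx.h"   // any CMSIS device header that defines ITM (name assumed)

    static inline void trace_word(uint32_t w) {
        // Only write if the ITM and stimulus port 0 are enabled (the debug tooling sets this up).
        if ((ITM->TCR & ITM_TCR_ITMENA_Msk) && (ITM->TER & 1UL)) {
            while (ITM->PORT[0].u32 == 0UL) {}   // wait for FIFO space
            ITM->PORT[0].u32 = w;                // streams out over SWO / the trace port
        }
    }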
I haven’t played a ton with it yet but Zephyr has some functionality around this. You can stream CTF trace data out via Segger RTT or with regular ITM. I haven’t looked too much into using ETM with it. The documentation mostly talks about consuming it with Percepio but I had no problem opening the saved trace data in Trace Compass (free, Eclipse-based, I don’t love it but it works) or in Python for scripted analysis.
https://docs.zephyrproject.org/latest/services/tracing/index.html
The address-to-symbol (and source line) lookup seems to be Linux-specific, i.e. relies on parsing ELF files. Any love for us Mach-O types?
That said, I’ve always been pretty happy with Apple’s sampling profiler, and the Instruments app now displays flame charts, which I’m finding super useful.
The whole thing is x86/Linux-specific, like it says on the tin. But since it’s very small - less than 3K LOC of compile time, runtime and offline code - it should be easy to port, if you wish to do it, which I guess you don’t since you are happy with a sampling profiler, inapplicable as it might be when it comes to worst case latency.