
Low/Minimal Allocation version of tracing_tracy #42

Closed · james7132 opened this issue Jun 11, 2022 · 9 comments

@james7132

We've been extensively using tracing_tracy to profile and optimize bevy. However, one potential pain point with our usage of the crate is the overhead incurred when entering a span. This was reported in bevyengine/bevy#4892, where I'm seeing ~10µs of overhead per span. Using spans in tight loops artificially inflates the measured size of the surrounding span, leading to inaccurate measurements.

[screenshot: profiler capture illustrating the per-span overhead]

@davidbarsky suggested that it might be a good idea to write a tracing subscriber (rather than a Layer), using sharded_slab and caching the allocation keys in the span itself instead of going through extensions.
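
For illustration, a minimal sketch of that direction, assuming a hypothetical subscriber that keeps per-span Tracy state in a sharded_slab::Slab keyed by the span ID rather than in Registry extensions. All names here are invented; this is not tracing-tracy's actual code:

use sharded_slab::Slab;

// Hypothetical per-span state; a real subscriber would hold the Tracy
// srcloc handle (or zone context) needed to begin/end the zone.
struct SpanState {
    srcloc: u64,
}

struct TracySubscriber {
    spans: Slab<SpanState>,
}

impl TracySubscriber {
    fn new() -> Self {
        Self { spans: Slab::new() }
    }

    // On new_span: stash the state once. The returned slab key would be
    // encoded into the tracing::span::Id handed back to callers.
    fn register_span(&self, srcloc: u64) -> usize {
        self.spans.insert(SpanState { srcloc }).expect("slab at capacity")
    }

    // On enter: look the state up by key; no allocation required.
    fn enter_span(&self, key: usize) {
        if let Some(state) = self.spans.get(key) {
            let _ = state.srcloc; // begin the Tracy zone here
        }
    }

    // On close: release the slot.
    fn close_span(&self, key: usize) {
        self.spans.remove(key);
    }
}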

@nagisa (Owner) commented Jun 11, 2022

So, while tracing_subscriber::Registry probably isn't entirely blameless, I would be reluctant to attribute all of this overhead to it alone. I believe that the majority of the overhead actually comes from our use of the Tracy client and the limitations imposed by it. In particular, it provides two different kinds of APIs to emit a span:

  1. ___tracy_emit_zone_begin with ___tracy_source_location_data, which requires that strings are 0-terminated and have a 'static lifetime (more accurately – remain valid until the profiler application reads them out, which can happen late in execution). While tracing's Metadata is 'static right now, the data therein is not 0-terminated, so we're in a tricky spot where we'd need to allocate to 0-terminate the strings and leak that data into the heap so that it becomes 'static anyway. This is not particularly onerous and is something I ended up doing in gstreamer (see the sketch after this list). I think it could be interesting to replicate it here as well, but it would only give an incremental improvement, and only help where the same span is emitted multiple times. The first span with a specific callsite would still be slow.
  2. ___tracy_alloc_srcloc and ___tracy_emit_zone_begin_alloc is what we're using right now. These, as the names imply, allocate, but if you look closely at the first function you will notice that it allows specifying the sizes of the strings passed in, letting us pass non-0-terminated &strs directly to Tracy.
    • Note that these srclocs must be allocated for each span, as Tracy frees them internally after the zone/span is delivered to the profiler. So srclocs are not so much cached as kept around to maintain a stack of live zones.

In the past I experimented with making ___tracy_emit_zone_begin (the non-allocating version) not require 0-termination, but I remember the change being quite involved, and it isn't clear whether Bartosz would be interested in accepting such a change upstream. There are also some discussions about relaxing the 'static lifetime requirement on the tracing side.
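
For concreteness, here's a minimal sketch of the leak-and-cache approach from point 1: intern a 0-terminated copy of each string the first time a callsite is seen, so only the first span at a callsite pays for the allocation. The cache shape here is an assumption for illustration, not the actual gstreamer or tracing-tracy code:

use std::collections::BTreeMap;
use std::ffi::{CStr, CString};
use std::sync::Mutex;

// Hypothetical interning cache, keyed by the 'static name from Metadata.
static INTERNED: Mutex<BTreeMap<&'static str, &'static CStr>> =
    Mutex::new(BTreeMap::new());

fn intern_c_str(name: &'static str) -> &'static CStr {
    let mut map = INTERNED.lock().unwrap();
    *map.entry(name).or_insert_with(|| {
        // Allocate a NUL-terminated copy once and leak it so the pointer
        // remains valid for as long as the profiler might read it.
        let c = CString::new(name).expect("span name has no interior NULs");
        Box::leak(c.into_boxed_c_str())
    })
}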

@nagisa (Owner) commented Jun 11, 2022

One thing to note: by default the tracing_tracy layer collects callstacks (backtraces). After benchmarking a little, it turns out this takes a significant chunk of the time here, so you may want to disable it with with_callstack(0):

span/callstack          time:   [29.862 us 31.755 us 34.008 us]
span/no_callstack       time:   [3.0617 us 3.3857 us 3.9540 us]
event/callstack         time:   [27.972 us 30.645 us 33.725 us]
event/no_callstack      time:   [796.49 ns 893.48 ns 1.0469 us]

I wonder if we should flip the default here. Noticing that callstacks are desired and then finding documentation on how to enable them seems like a much easier way to surface this performance tradeoff than expecting users to realize they need to disable callstacks for more performant tracing.
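
For reference, a hedged sketch of wiring this up. The with_callstack(0) call comes from the comment above; the rest of the wiring (TracyLayer::new(), the registry setup) is an assumption about the contemporary API, so check the tracing-tracy docs for your version:

use tracing_subscriber::layer::SubscriberExt;

fn main() {
    // Build the Tracy layer with callstack collection disabled (depth 0),
    // per the benchmark numbers above.
    let tracy = tracing_tracy::TracyLayer::new().with_callstack(0);
    let subscriber = tracing_subscriber::registry().with(tracy);
    tracing::subscriber::set_global_default(subscriber)
        .expect("failed to install subscriber");
}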

@nagisa (Owner) commented Jun 11, 2022

And then benchmarking the tracy_client APIs directly:

span_alloc_callstack/0  time:   [270.43 ns 279.41 ns 288.75 ns]
span_alloc_callstack/100 time:   [21.930 us 23.806 us 25.818 us]
span_callstack/0        time:   [193.58 ns 199.51 ns 206.07 ns]
span_callstack/100      time:   [22.169 us 24.199 us 26.536 us]

So it does indeed seem like the majority of the overhead is in the implementation of the TracyLayer after all.

@nagisa (Owner) commented Jun 11, 2022

Messages

Here's a profile for messages with callstacks disabled. One thing that figures somewhat prominently (~17% of the runtime) in the profile is this String::with_capacity allocation:

fn on_event(&self, event: &Event, _: Context<'_, S>) {
    let mut visitor = TracyEventFieldVisitor {
        dest: String::with_capacity(64),
        first: true,
        frame_mark: false,
    };
    event.record(&mut visitor);

This allocation is used to construct and format the message. In theory, for simple events that only ever specify a message field, this could avoid allocation entirely, but the design of tracing::Visit and valuable::Visit prevents this from being implemented today. Hopefully this can be resolved for tracing 0.2, though – here's an issue I filed against valuable to that effect: tokio-rs/valuable#97
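
One stopgap that needs no API changes – a sketch, and not what tracing-tracy actually does – is to reuse a thread-local buffer instead of allocating a fresh String per event, so the allocation cost is paid once per thread rather than once per event:

use std::cell::RefCell;

thread_local! {
    // Reusable per-thread scratch buffer for formatting event messages.
    static EVENT_BUF: RefCell<String> = RefCell::new(String::with_capacity(64));
}

fn with_event_buf<R>(f: impl FnOnce(&mut String) -> R) -> R {
    EVENT_BUF.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear(); // drop the old contents but keep the capacity
        f(&mut buf)
    })
}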

There's also a fair amount of time (~32%) being spent just formatting the message and fields. Here we must choose between including more information in the emitted events and making them faster. This seems like it could be made configurable in the subscriber.
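
Such a knob might look like the following hypothetical configuration struct (invented for illustration; not an actual tracing-tracy API):

// Hypothetical configuration; the type and field names are invented.
pub struct TracyLayerConfig {
    // When false, forward only the `message` field and skip formatting
    // the remaining fields, trading detail for speed.
    pub format_fields: bool,
}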

Finally, the remaining chunk, about 40%, is spent in Tracy itself, which we can do nothing about. All told, tracing-tracy is about 2 times slower than sending messages directly.

Spans

Here's a profile for spans with callstacks and fibers disabled. Overall, the ___tracy functions appear to take ~12% of the time, with the rest being overhead from tracing and tracing-tracy – confirming the benchmark results above. 30% of the time in particular is spent formatting the span names (this could also be made optionally cheaper, in ways similar to events, when possible), and the rest appears to be overhead in tracing_subscriber.


I went ahead and submitted #43 and #44, which should improve the situation here substantially, but I won't have the resources in the foreseeable future to make any significant changes to the structure or architecture here.

@SUPERCILEX (Contributor) commented Dec 27, 2023

tokio-rs/tracing#428 will soon be the last remaining item here.

@SUPERCILEX (Contributor)

The tracing side of things will be fixed by tokio-rs/tracing#2840.

@SUPERCILEX (Contributor)

So I think this is pretty much fixed on the tracy side once #87 goes through.

@nagisa (Owner) commented Jan 6, 2024

Thanks for the effort! This work has been released to crates.io.

nagisa closed this as completed Jan 6, 2024
@SUPERCILEX (Contributor)

🤩
