
Low/Minimal Allocation version of tracing_tracy #42

Closed · james7132 opened this issue Jun 11, 2022 · 9 comments

@james7132

We've been extensively using tracing_tracy to profile and optimize bevy. However, one potential pain point with our usage of the crate is the overhead incurred when entering a span. This was reported in bevyengine/bevy#4892, where I'm seeing ~10µs of overhead per span. Using spans in tight loops artificially inflates the measured size of the surrounding span, leading to inaccurate measurements.

[screenshot: profiler capture illustrating the per-span overhead]

@davidbarsky suggested that it might be a good idea to write a tracing subscriber (rather than a Layer), using sharded_slab and caching the allocation keys in the span itself instead of going through extensions.
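
For illustration, a minimal sketch of that direction, assuming a hypothetical subscriber that keeps per-span Tracy state in a sharded_slab::Slab keyed by the span ID rather than in Registry extensions. All names here are invented; this is not tracing-tracy's actual code:

use sharded_slab::Slab;

// Hypothetical per-span state; a real subscriber would hold the Tracy
// srcloc handle (or zone context) needed to begin/end the zone.
struct SpanState {
    srcloc: u64,
}

struct TracySubscriber {
    spans: Slab<SpanState>,
}

impl TracySubscriber {
    fn new() -> Self {
        Self { spans: Slab::new() }
    }

    // On new_span: stash the state once. The returned slab key would be
    // encoded into the tracing::span::Id handed back to callers.
    fn register_span(&self, srcloc: u64) -> usize {
        self.spans.insert(SpanState { srcloc }).expect("slab at capacity")
    }

    // On enter: look the state up by key; no allocation required.
    fn enter_span(&self, key: usize) {
        if let Some(state) = self.spans.get(key) {
            let _ = state.srcloc; // begin the Tracy zone here
        }
    }

    // On close: release the slot.
    fn close_span(&self, key: usize) {
        self.spans.remove(key);
    }
}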

@nagisa (Owner) commented Jun 11, 2022

So, while tracing_subscriber::Registry probably isn't entirely blameless, I would be reluctant to attribute all of this overhead to it alone. I believe that the majority of the overhead actually comes from our use of the Tracy client and the limitations imposed by it. In particular, it provides two different kinds of APIs to emit a span:

  1. ___tracy_emit_zone_begin with ___tracy_source_location_data, which requires that strings are 0-terminated and have a 'static lifetime (more accurately – remain valid until the profiler application reads them out, which can happen late in execution). While tracing's Metadata is 'static right now, the data therein is not 0-terminated, so we're in a tricky spot where we'd need to allocate to 0-terminate the strings and leak that data into the heap so that it becomes 'static anyway. This is not particularly onerous and is something I ended up doing in gstreamer (see the sketch after this list). I think it could be interesting to replicate it here as well, but it would only give an incremental improvement, and only help where the same span is emitted multiple times. The first span with a specific callsite would still be slow.
  2. ___tracy_alloc_srcloc and ___tracy_emit_zone_begin_alloc is what we're using right now. These, as the names imply, allocate, but if you look closely at the first function you will notice that it allows specifying the sizes of the strings passed in, letting us pass non-0-terminated &strs directly to Tracy.
    • Note that these srclocs must be allocated for each span, as Tracy frees them internally after the zone/span is delivered to the profiler. So srclocs are not so much cached as kept around to maintain a stack of live zones.

In the past I experimented with making ___tracy_emit_zone_begin (the non-allocating version) not require 0-termination, but I remember the change being quite involved, and it isn't clear whether Bartosz would be interested in accepting such a change upstream. There are also some discussions about relaxing the 'static lifetime requirement on the tracing side.
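
For concreteness, here's a minimal sketch of the leak-and-cache approach from point 1: intern a 0-terminated copy of each string the first time a callsite is seen, so only the first span at a callsite pays for the allocation. The cache shape here is an assumption for illustration, not the actual gstreamer or tracing-tracy code:

use std::collections::BTreeMap;
use std::ffi::{CStr, CString};
use std::sync::Mutex;

// Hypothetical interning cache, keyed by the 'static name from Metadata.
static INTERNED: Mutex<BTreeMap<&'static str, &'static CStr>> =
    Mutex::new(BTreeMap::new());

fn intern_c_str(name: &'static str) -> &'static CStr {
    let mut map = INTERNED.lock().unwrap();
    *map.entry(name).or_insert_with(|| {
        // Allocate a NUL-terminated copy once and leak it so the pointer
        // remains valid for as long as the profiler might read it.
        let c = CString::new(name).expect("span name has no interior NULs");
        Box::leak(c.into_boxed_c_str())
    })
}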

@nagisa (Owner) commented Jun 11, 2022

One thing to note: by default the tracing_tracy layer collects callstacks (backtraces). After benchmarking a little, it turns out this takes a significant chunk of the time here, so you may want to disable it with with_callstack(0):

span/callstack          time:   [29.862 us 31.755 us 34.008 us]
span/no_callstack       time:   [3.0617 us 3.3857 us 3.9540 us]
event/callstack         time:   [27.972 us 30.645 us 33.725 us]
event/no_callstack      time:   [796.49 ns 893.48 ns 1.0469 us]

I wonder if we should flip the default here. Noticing that callstacks are desired and then finding documentation on how to enable them seems like a much easier way to surface this performance tradeoff than expecting users to realize they need to disable callstacks for more performant tracing.
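
For reference, a hedged sketch of wiring this up. The with_callstack(0) call comes from the comment above; the rest of the wiring (TracyLayer::new(), the registry setup) is an assumption about the contemporary API, so check the tracing-tracy docs for your version:

use tracing_subscriber::layer::SubscriberExt;

fn main() {
    // Build the Tracy layer with callstack collection disabled (depth 0),
    // per the benchmark numbers above.
    let tracy = tracing_tracy::TracyLayer::new().with_callstack(0);
    let subscriber = tracing_subscriber::registry().with(tracy);
    tracing::subscriber::set_global_default(subscriber)
        .expect("failed to install subscriber");
}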

@nagisa (Owner) commented Jun 11, 2022

And then benchmarking the tracy_client APIs directly:

span_alloc_callstack/0  time:   [270.43 ns 279.41 ns 288.75 ns]
span_alloc_callstack/100 time:   [21.930 us 23.806 us 25.818 us]
span_callstack/0        time:   [193.58 ns 199.51 ns 206.07 ns]
span_callstack/100      time:   [22.169 us 24.199 us 26.536 us]

So it does indeed seem like the majority of the overhead is in the implementation of the TracyLayer after all.

@nagisa (Owner) commented Jun 11, 2022

Messages

Here's a profile for messages with callstacks disabled. One thing that figures somewhat prominently (~17% of the runtime) in the profile is this String::with_capacity allocation:

fn on_event(&self, event: &Event, _: Context<'_, S>) {
    let mut visitor = TracyEventFieldVisitor {
        dest: String::with_capacity(64),
        first: true,
        frame_mark: false,
    };
    event.record(&mut visitor);

This allocation is used to construct and format the message. In theory, for simple events that only ever specify a message field, this could avoid allocation entirely, but the design of tracing::Visit and valuable::Visit prevents this from being implemented today. Hopefully this can be resolved for tracing 0.2, though – here's an issue I filed against valuable to that effect: tokio-rs/valuable#97
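
One stopgap that needs no API changes – a sketch, and not what tracing-tracy actually does – is to reuse a thread-local buffer instead of allocating a fresh String per event, so the allocation cost is paid once per thread rather than once per event:

use std::cell::RefCell;

thread_local! {
    // Reusable per-thread scratch buffer for formatting event messages.
    static EVENT_BUF: RefCell<String> = RefCell::new(String::with_capacity(64));
}

fn with_event_buf<R>(f: impl FnOnce(&mut String) -> R) -> R {
    EVENT_BUF.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear(); // drop the old contents but keep the capacity
        f(&mut buf)
    })
}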

There's also a fair amount of time (~32%) being spent just formatting the message and fields. Here we must choose between including more information in the emitted events and making them faster. This seems like it could be made configurable in the subscriber.
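
Such a knob might look like the following hypothetical configuration struct (invented for illustration; not an actual tracing-tracy API):

// Hypothetical configuration; the type and field names are invented.
pub struct TracyLayerConfig {
    // When false, forward only the `message` field and skip formatting
    // the remaining fields, trading detail for speed.
    pub format_fields: bool,
}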

Finally, the remaining chunk, about 40%, is spent in Tracy itself, which we can do nothing about. All told, tracing-tracy is about 2 times slower than sending messages directly.

Spans

Here's a profile for spans with callstacks and fibers disabled. Overall, the ___tracy functions appear to take ~12% of the time, with the rest being overhead from tracing and tracing-tracy – confirming the benchmark results above. 30% of the time in particular is spent formatting the span names (this could also be made optionally cheaper, in ways similar to events, when possible), and the rest appears to be overhead in tracing_subscriber.


I went ahead and submitted #43 and #44, which should improve the situation here substantially, but I won't have the resources in the foreseeable future to make any significant changes to the structure or architecture here.

@SUPERCILEX (Contributor) commented Dec 27, 2023

tokio-rs/tracing#428 will soon be the last remaining item here.

@SUPERCILEX (Contributor)

The tracing side of things will be fixed by tokio-rs/tracing#2840.

@SUPERCILEX (Contributor)

So I think this is pretty much fixed on the tracy side once #87 goes through.

@nagisa (Owner) commented Jan 6, 2024

Thanks for the effort! This work has been released to crates.io.

nagisa closed this as completed Jan 6, 2024
@SUPERCILEX (Contributor)

🤩
