
Perf bottleneck when Loggers upgrade weak ref to LoggerProvider #1209

Closed
cijothomas opened this issue Aug 17, 2023 · 11 comments
Labels: A-log (Area: Issues related to logs), A-trace (Area: issues related to tracing), priority:p2 (Medium priority issues and bugs), triage:accepted (Has been triaged and accepted)

Comments

@cijothomas (Member):

Noticed while stress testing Logs, but this should apply to Tracing as well, as they follow the same pattern: the Logger (Tracer) holds a Weak ref to the LoggerProvider (TracerProvider), and in the hot path (logger.EmitLog, span.Record) the weak reference to the provider is upgraded to an Arc to obtain things like the processor list, resource, etc. from the provider. This Weak -> Arc upgrade appears to be the bottleneck.
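For illustration, here is a minimal sketch of that pattern, using simplified, hypothetical names (ProviderInner, Logger) rather than the crate's actual types:

```rust
use std::sync::{Arc, Weak};

// Simplified, hypothetical stand-ins for LoggerProviderInner / Logger.
struct ProviderInner {
    // processors, resource, ...
}

struct Logger {
    provider: Weak<ProviderInner>, // each Logger/Tracer holds only a weak ref
}

impl Logger {
    fn emit(&self) {
        // Hot path: every emit has to upgrade the Weak to an Arc (an atomic
        // ref-count operation) before it can reach the processors/resource.
        if let Some(provider) = self.provider.upgrade() {
            // ... use the provider's processor list, resource, etc. ...
            let _ = provider;
        }
    }
}
```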

Here's how I tested:

  1. Ran the stress test for logs on my box; it shows ~7M/sec throughput.
  2. Modified the NoOpLogProcessor in the above test to return false, which should make the throughput skyrocket, as we no longer have to do anything like creating a LogRecord or invoking processors. This did increase the throughput to ~35M/sec, but I was expecting 200+M/sec, as I get similar throughput when using tracing + no-op-tracing-subscriber, without any OpenTelemetry component.
  3. After some manual exploration, I found that upgrading the Weak to an Arc in the hot path (when checking whether an event is enabled) is the bottleneck.
  4. I refactored the code so that Logger holds an Arc instead of a Weak (draft code), so there is no longer any need to upgrade to an Arc in the hot path; see the sketch after this list. With this change, the throughput shot up to ~250M/sec (in line with what was seen using tracing + no-op subscriber alone).
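
For reference, a minimal sketch of the direction the draft code takes (again with hypothetical names, not the actual types): the Logger holds a strong Arc, so the hot path is just a pointer dereference.

```rust
use std::sync::Arc;

// Hypothetical stand-ins; the draft changes the real Logger along these lines.
struct ProviderInner {
    // processors, resource, ...
}

struct Logger {
    provider: Arc<ProviderInner>, // strong ref: no per-event upgrade needed
}

impl Logger {
    fn emit(&self) {
        // Hot path: plain Arc dereference, no atomic upgrade per event.
        let _provider = &*self.provider;
        // ... check enabled / invoke processors ...
    }
}
```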

The weak ref to the provider was introduced for Tracing (and then followed for Logging) in this PR. Using an Arc does mean that the provider won't be dropped and shutdown won't be signaled until the last Logger is dropped, but that seems okay to me. In the case of Logging, Loggers are held only by the appenders (tokio-tracing subscribers, etc.), which, when dropped, will drop their Logger as well, allowing the provider to be dropped.
I do not know if this has any further implications, especially for the Tracer/Span case.

The (perf) issue applies to Span as well, as this upgrade occurs (I think) twice: when the span begins and when it ends.

Opening this issue to get feedback from experts and to check whether the draft code is a reasonable direction to explore further.

@cijothomas (Member, Author):

@frigus02 Tagging you as you mentioned something along these lines in the original; I want to see whether those points are still applicable.

@TommyCpp added the A-trace (Area: issues related to tracing), priority:p2 (Medium priority issues and bugs), A-log (Area: Issues related to logs), and triage:accepted (Has been triaged and accepted) labels on Aug 17, 2023
@TommyCpp (Contributor):

I think for Tracer it's a little easier to leak it into some blocking task, which would prevent the TracerProvider from being dropped if we switch to Arc.

@djc (Contributor) commented Aug 17, 2023:

@shaun-cox this seems like it might be of interest to you.

@cijothomas (Member, Author):

> @shaun-cox this seems like it might be of interest to you.

Yes!! I did consult (offline) @shaun-cox and @lalitb a lot to narrow down the bottleneck to this point and also for the potential fix!

@cijothomas (Member, Author):

> I think for Tracer it's a little easier to leak it into some blocking task, which would prevent the TracerProvider from being dropped if we switch to Arc.

What is the implication of that? i.e., what happens if the TracerProvider is not dropped? Is the impact limited to losing the spans/metrics still in memory, or is it something more severe, like the app never exiting?

Also, is that something users can resolve by ensuring they call shutdown themselves at app exit, similar to what otel-cpp (https://github.com/open-telemetry/opentelemetry-cpp/blob/main/examples/logs_simple/main.cc#L77-L78) and other languages do?
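
For context, the Rust equivalent would be something along these lines (a sketch assuming the global::shutdown_tracer_provider helper available in the crate at the time):

```rust
use opentelemetry::global;

fn main() {
    // ... install a TracerProvider / LoggerProvider and run the app ...

    // Explicit shutdown at app exit, mirroring the otel-cpp example linked
    // above: flush pending telemetry and shut down processors instead of
    // relying solely on the provider being dropped.
    global::shutdown_tracer_provider();
}
```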

@lalitb (Member) commented Aug 17, 2023:

> Also, is that something users can resolve by ensuring they call shutdown themselves at app exit?

In the current design, the shutdown_tracer_provider method removes the reference to the currently active TracerProvider from the global singleton, which in turn causes drop to be called for the removed TracerProvider instance, which then shuts down all the processors. With the change to Arc, we can add a new TracerProvider::shutdown() method and explicitly invoke it from within shutdown_tracer_provider. This new method will invoke shutdown on all processors, ensuring that all existing events/spans are flushed and that no new spans can be exported once the processors are in the shutdown state.
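
A rough sketch of that proposal (hypothetical, simplified types; not the crate's actual API):

```rust
use std::sync::Arc;

// Hypothetical, simplified types to illustrate the proposed explicit shutdown.
trait SpanProcessor {
    fn shutdown(&self);
}

struct TracerProvider {
    processors: Vec<Arc<dyn SpanProcessor>>,
}

impl TracerProvider {
    // Proposed TracerProvider::shutdown(): shut down every processor so that
    // pending spans are flushed, even while Tracers still hold strong (Arc)
    // references to the provider.
    fn shutdown(&self) {
        for processor in &self.processors {
            processor.shutdown();
        }
        // After this point the provider remains alive (shared via Arc) but is
        // in a shut-down state; processors reject/ignore new spans.
    }
}
```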

> I think for Tracer it's a little easier to leak it into some blocking task, which would prevent the TracerProvider from being dropped if we switch to Arc.

In otel-cpp, the TracerProvider maintains a list of all the created Tracers, so it also controls their lifetime. But I think this is purely a design choice. In Rust, we can let the TracerProvider instance remain alive in a shutdown state.

@bIgBV (Contributor) commented Dec 12, 2023:

I can take a look at this.

@bIgBV (Contributor) commented Dec 13, 2023:

After talking to @cijothomas and doing some of my own digging, it looks like LoggerProvider was modeled off of the TracerProvider.

#229 was the PR that introduced Weak references from the Tracer to the TracerProvider so that the object could be dropped when the pipeline was shutting down. This pattern was then carried over to the LoggerProvider.

I don't see why we cannot use the Arc directly in both cases, as we can still use the existing shutdown mechanism without paying the cost of the upgrade call. I can update the branch with @cijothomas's draft code and open a PR.

@cijothomas (Member, Author):

Thanks! Could you open a PR with the proposed changes for logs first, and then we can extend it to Traces? The gains would be even more relevant for Spans, as this bottleneck is hit twice: at span start and at span end.

@hdost (Contributor) commented Dec 26, 2023:

> This Weak -> Arc upgrade appears to be the bottleneck.

I am not sure we should be upgrading the Weak to an Arc in the hot path; instead, we might want to look at keeping the reference Weak in the hot path as well.

hdost added commits to hdost/opentelemetry-rust referencing this issue on Dec 26, 2023, Dec 27, 2023, Feb 6, 2024, and Feb 7, 2024, with the commit message:

The interface to an item shouldn't take an inner value, since it's considered inner; this also allows for further optimizations in the future, as it hides the complexity from the user.

Rationale:

This removes exposing the inner value, which doesn't need to be provided outside of the type. The advantage of this approach is that it's a cleaner implementation. It also removes a weak reference upgrade from the hot path, since we need a strong reference in order to access the information.

Relates open-telemetry#1209
@cijothomas (Member, Author):

Closing this issue for Logs. Will open a separate one for making the same change for Traces.
