
Perf bottleneck when Loggers upgrade weak ref to LoggerProvider #1209

Closed
cijothomas opened this issue Aug 17, 2023 · 11 comments
Labels: A-log (Area: Issues related to logs), A-trace (Area: issues related to tracing), priority:p2 (Medium priority issues and bugs), triage:accepted (Has been triaged and accepted)

Comments

@cijothomas (Member):

Noticed while stress testing Logs, but this should apply to Tracing as well, as they follow the same pattern: the Logger (Tracer) holds a Weak ref to the LoggerProvider (TracerProvider), and in the hot path (logger.EmitLog, span.Record) the weak reference to the provider is upgraded to an Arc to obtain things like the processor list, resource, etc. from the provider. This Weak -> Arc upgrade appears to be the bottleneck.
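For illustration, here is a minimal sketch of that pattern, using simplified, hypothetical names (ProviderInner, Logger) rather than the crate's actual types:

```rust
use std::sync::{Arc, Weak};

// Simplified, hypothetical stand-ins for LoggerProviderInner / Logger.
struct ProviderInner {
    // processors, resource, ...
}

struct Logger {
    provider: Weak<ProviderInner>, // each Logger/Tracer holds only a weak ref
}

impl Logger {
    fn emit(&self) {
        // Hot path: every emit has to upgrade the Weak to an Arc (an atomic
        // ref-count operation) before it can reach the processors/resource.
        if let Some(provider) = self.provider.upgrade() {
            // ... use the provider's processor list, resource, etc. ...
            let _ = provider;
        }
    }
}
```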

Here's how I tested:

  1. Ran the stress test for logs on my box; it shows ~7M/sec throughput.
  2. Modified the NoOpLogProcessor in the above test to return false, which should make the throughput skyrocket, as we no longer have to do anything like creating a LogRecord or invoking processors. This did increase the throughput to ~35M/sec, but I was expecting 200+M/sec, as I get similar throughput when using tracing + no-op-tracing-subscriber, without any OpenTelemetry component.
  3. After some manual exploration, I found that upgrading the Weak to an Arc in the hot path (when checking whether an event is enabled) is the bottleneck.
  4. I refactored the code so that Logger holds an Arc instead of a Weak (draft code), so there is no longer any need to upgrade to an Arc in the hot path; see the sketch after this list. With this change, the throughput shot up to ~250M/sec (in line with what was seen using tracing + no-op subscriber alone).
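
For reference, a minimal sketch of the direction the draft code takes (again with hypothetical names, not the actual types): the Logger holds a strong Arc, so the hot path is just a pointer dereference.

```rust
use std::sync::Arc;

// Hypothetical stand-ins; the draft changes the real Logger along these lines.
struct ProviderInner {
    // processors, resource, ...
}

struct Logger {
    provider: Arc<ProviderInner>, // strong ref: no per-event upgrade needed
}

impl Logger {
    fn emit(&self) {
        // Hot path: plain Arc dereference, no atomic upgrade per event.
        let _provider = &*self.provider;
        // ... check enabled / invoke processors ...
    }
}
```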

The weak ref to the provider was introduced for Tracing (and then followed for Logging) in this PR. Using an Arc does mean that the provider won't be dropped and shutdown won't be signaled until the last Logger is dropped, but that seems okay to me. In the case of Logging, Loggers are held only by the appenders (tokio-tracing subscribers, etc.), which, when dropped, will drop their Logger as well, allowing the provider to be dropped.
I do not know if this has any further implications, especially for the Tracer/Span case.

The (perf) issue applies to Span as well, as this upgrade occurs (I think) twice: when the span begins and when it ends.

Opening this issue to get feedback from experts and to check whether the draft code is a reasonable direction to explore further.

@cijothomas (Member, Author):

@frigus02 Tagging you as you mentioned something along these lines in the original; I want to see whether those points are still applicable.

@TommyCpp added the A-trace (Area: issues related to tracing), priority:p2 (Medium priority issues and bugs), A-log (Area: Issues related to logs), and triage:accepted (Has been triaged and accepted) labels on Aug 17, 2023
@TommyCpp (Contributor):

I think for Tracer it's a little easier to leak it into some blocking task, which would prevent the TracerProvider from being dropped if we switch to Arc.

@djc (Contributor) commented Aug 17, 2023:

@shaun-cox this seems like it might be of interest to you.

@cijothomas (Member, Author):

> @shaun-cox this seems like it might be of interest to you.

Yes!! I did consult (offline) @shaun-cox and @lalitb a lot to narrow down the bottleneck to this point and also for the potential fix!

@cijothomas (Member, Author):

> I think for Tracer it's a little easier to leak it into some blocking task, which would prevent the TracerProvider from being dropped if we switch to Arc.

What is the implication of that? i.e., what happens if the TracerProvider is not dropped? Is the impact limited to losing the spans/metrics still in memory, or is it something more severe, like the app never exiting?

Also, is that something users can resolve by ensuring they call shutdown themselves at app exit, similar to what otel-cpp (https://github.com/open-telemetry/opentelemetry-cpp/blob/main/examples/logs_simple/main.cc#L77-L78) and other languages do?
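
For context, the Rust equivalent would be something along these lines (a sketch assuming the global::shutdown_tracer_provider helper available in the crate at the time):

```rust
use opentelemetry::global;

fn main() {
    // ... install a TracerProvider / LoggerProvider and run the app ...

    // Explicit shutdown at app exit, mirroring the otel-cpp example linked
    // above: flush pending telemetry and shut down processors instead of
    // relying solely on the provider being dropped.
    global::shutdown_tracer_provider();
}
```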

@lalitb (Member) commented Aug 17, 2023:

> Also, is that something users can resolve by ensuring they call shutdown themselves at app exit?

In the current design, the shutdown_tracer_provider method removes the reference to the currently active TracerProvider from the global singleton, which in turn causes drop to be called for the removed TracerProvider instance, which then shuts down all the processors. With the change to Arc, we can add a new TracerProvider::shutdown() method and explicitly invoke it from within shutdown_tracer_provider. This new method will invoke shutdown on all processors, ensuring that all existing events/spans are flushed and that no new spans can be exported once the processors are in the shutdown state.
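
A rough sketch of that proposal (hypothetical, simplified types; not the crate's actual API):

```rust
use std::sync::Arc;

// Hypothetical, simplified types to illustrate the proposed explicit shutdown.
trait SpanProcessor {
    fn shutdown(&self);
}

struct TracerProvider {
    processors: Vec<Arc<dyn SpanProcessor>>,
}

impl TracerProvider {
    // Proposed TracerProvider::shutdown(): shut down every processor so that
    // pending spans are flushed, even while Tracers still hold strong (Arc)
    // references to the provider.
    fn shutdown(&self) {
        for processor in &self.processors {
            processor.shutdown();
        }
        // After this point the provider remains alive (shared via Arc) but is
        // in a shut-down state; processors reject/ignore new spans.
    }
}
```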

> I think for Tracer it's a little easier to leak it into some blocking task, which would prevent the TracerProvider from being dropped if we switch to Arc.

In otel-cpp, the TracerProvider maintains a list of all the created Tracers, so it also controls their lifetime. But I think this is purely a design choice. In Rust, we can let the TracerProvider instance remain alive in a shutdown state.

@bIgBV (Contributor) commented Dec 12, 2023:

I can take a look at this.

@bIgBV (Contributor) commented Dec 13, 2023:

After talking to @cijothomas and doing some of my own digging, it looks like LoggerProvider was modeled off of the TracerProvider.

#229 was the PR that introduced Weak references from the Tracer to the TracerProvider so that the object could be dropped when the pipeline was shutting down. This pattern was then carried over to the LoggerProvider.

I don't see why we cannot use the Arc directly in both cases, as we can still use the existing shutdown mechanism without paying the cost of the upgrade call. I can update the branch with @cijothomas's draft code and open a PR.

@cijothomas (Member, Author):

Thanks! Could you open a PR with the proposed changes for logs first, and then we can extend it to Traces? The gains would be even more relevant for Spans, as this bottleneck is hit twice: at span start and at span end.

@hdost (Contributor) commented Dec 26, 2023:

> This Weak -> Arc upgrade appears to be the bottleneck.

I am not sure we should be upgrading the Weak to an Arc in the hot path; instead, we might want to look at keeping the reference Weak in the hot path as well.

hdost added commits to hdost/opentelemetry-rust referencing this issue on Dec 26, 2023, Dec 27, 2023, Feb 6, 2024, and Feb 7, 2024, with the commit message:

The interface to an item shouldn't take an inner value, since it's considered inner; this also allows for further optimizations in the future, as it hides the complexity from the user.

Rationale:

This removes exposing the inner value, which doesn't need to be provided outside of the type. The advantage of this approach is that it's a cleaner implementation. It also removes a weak reference upgrade from the hot path, since we need a strong reference in order to access the information.

Relates open-telemetry#1209
@cijothomas (Member, Author):

Closing this issue for Logs. Will open a separate one for making the same change for Traces.
