Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove is_shutdown flag from processors. And fix logger::emit() to check for the flag before emit. #2462

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

lalitb
Copy link
Member

@lalitb lalitb commented Dec 20, 2024

Changes

To discuss the change suggested be @cijothomas here - #2381 (comment)

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@@ -268,6 +268,14 @@ impl opentelemetry::logs::Logger for Logger {

/// Emit a `LogRecord`.
fn emit(&self, mut record: Self::LogRecord) {
if self.provider.inner.is_shutdown.load(Ordering::Relaxed) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with this is, this can affect throughput due to the contention introduced here. (Logs so far has no contention when using etw/user_events)....
Can you check stress test before and after?

I am unsure of a solution. Maybe don't check shutdown anywhere except in stdout like non-prod processors, and rely on mechanisms like export client/ etw etc. returning errors..
what harm can it cause 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, it doesn’t make sense to introduce the contention in the hot path, even if the contention is at atomic level.

Copy link
Contributor

@utpilla utpilla Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are only reading the atomic variable and the variable's value does not change for the most part of application lifetime, it most likely should not have any visible effect on the throughput.

We should be able to confirm that with the stress test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the effect of uncontested atomic would be negligible enough to be noticed over stress test. It is slightly more than normal bool-check, but much less than the perf associated in case of contention. I added the benchmark over logger:emit() in this PR, which shows the latency of 1-2ns:
main branch:

logger_emit             time:   [37.916 ns 37.977 ns 38.077 ns]
                        change: [-0.2733% -0.0648% +0.1328%] (p = 0.58 > 0.05)
                        No change in performance detected.

PR branch:

logger_emit             time:   [38.941 ns 39.027 ns 39.172 ns]
                        change: [+2.6756% +2.9861% +3.3292%] (p = 0.00 < 0.05)
                        Performance has regressed.

Copy link
Member Author

@lalitb lalitb Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure of a solution. Maybe don't check shutdown anywhere except in stdout like non-prod processors, and rely on mechanisms like export client/ etw etc. returning errors..
what harm can it cause

The custom exporter which can be connected to reentrant and simple processor need to handle the shutdown properly in that case. As of now, etw and user_events don't do anything, so they will continue to emit even after shutdown invoked by user.

Copy link
Contributor

@utpilla utpilla Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding this "if check" in Logger::emit() is not only a more efficient short-circuiting operation but it also simplifies things. It lets us offer a more robust guarantee:
"SDK would not do any wasteful computation once its shutdown (regardless of what processors/exporters are used)"

Copy link
Member

@cijothomas cijothomas Dec 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think it simplifies things. How do we determine the best place to do the is_shutdown check? Logger::emit() is one place.. how about the processor? how about the exporter? or even at the appender itself?

"SDK would not do any wasteful computation once its shutdown (regardless of what processors/exporters are used)"

I don't know if we need to make such a statement? If user want to achieve that, they can drop the providers. (even then, we need to redesign something in Metrics to make it truly no-ops(#2442)

Copy link
Contributor

@utpilla utpilla Dec 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think it simplifies things.

The simplification I'm referring to is the effort required to achieving the expected shutdown behavior. Why call emit() on each processor, if the provider is shutdown. With having the "if check" in Loggger::emit() we effectively make it a no-op SDK, when the provider is shutdown.

How do we determine the best place to do the is_shutdown check? Logger::emit() is one place.. how about the processor? how about the exporter? or even at the appender itself?

Since the shutdown is an SDK concept, it would only make sense to have the check in an SDK component. Appenders do not depend on SDK, so they are out of question. Maybe we can update the LoggerProvider trait to add another method is_shutdown which the appenders can use to avoid creating log records altogether. That's a different discussion though as it involves updating the public trait from API.

With the current state of code, among the SDK components: Logger, Processor, and Exporter, I think the best place to check for shutdown would be Logger as that's the first SDK component that interacts with the appender. By checking it in the Logger, we can avoid making calls to processors altogether.

I don't know if we need to make such a statement?

We should have some stance about shutdown behavior. It could either be

  1. The effectiveness of the shutdown depends on the processors/exporters used
  2. Or that shutdown works regardless, of the configured pipeline.

I think it's nice to have # 2 and having the "if check" is a very easy way to achieve that. I don't see any disadvantage to doing that.

If user want to achieve that, they can drop the providers.

I'm not sure if it's as straightforward. Since the provider is essentially an Arc<LoggerProviderInner>, the inner value would not be dropped until there are no more references to it. The appender layer keeps a clone of the provider which then moved to the tracing_sbuscriber::registry(). So simply dropping the provider might not help.

Copy link
Member Author

@lalitb lalitb Dec 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have some stance about shutdown behavior. It could either be
The effectiveness of the shutdown depends on the processors/exporters used
Or that shutdown works regardless, of the configured pipeline.
I think it's nice to have # 2 and having the "if check" is a very easy way to achieve that. I don't see any disadvantage to doing that.

Thanks, these are good discussions.

I am also inclined with #2, as it ensures consistent shutdown behavior regardless of pipeline configuration or internal component details. Relying on assumptions, such as the batch processor closing its channel during shutdown, creates unnecessary dependencies that could break with future changes. An explicit provider/logger level check provides a more reliable solution, more importantly since cost of adding it is small.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simplification I'm referring to is the effort required to achieving the expected shutdown behavior. Why call emit() on each processor, if the provider is shutdown. With having the "if check" in Loggger::emit() we effectively make it a no-op SDK, when the provider is shutdown.

Without further redesign/refactoring, this won't make it a no-op sdk. There is non-trivial amount of work done even before emit() is invoked, to prepare LogRecord, and that'll continue even after shutdown. Yes, we save some work, but not quite no-op.

We should have some stance about shutdown behavior. It could either be

1 The effectiveness of the shutdown depends on the processors/exporters used
2 Or that shutdown works regardless, of the configured pipeline.

So far, we have been mostly sticking with 1 - we expect components to behave as expected. Example: We expect ObservableCallbacks to return in reasonable time, and if they don't return, Metrics SDK never recovers. Similarly we expect Exporters to return in reasonable time.
Not a strict comparison, but if we want to make a stance that Otel SDK works predictable, irrespective of the components plugged in, then it need a lot of redesign.
If we generally expect other components to behave properly, then why not continue rely on them to deal with shutdown too?

Another interesting thing we can do here (Similar to what Otel .NET does for its console exporter). If we make shutdown a pure exporter problem, we can have Stdout exporter continue to work even after shutdown is invoked, but can do something in the Stdout exporter to let user know that they have incorrect lifecycle management. IMHO, this is better than short-circuiting at up-stream sdk.(Of course internal logs can help, but unless users take care of avoiding self-logging, it can cause stack overflow due to cyclic loop)

I'm not sure if it's as straightforward. Since the provider is essentially an Arc, the inner value would not be dropped until there are no more references to it. The appender layer keeps a clone of the provider which then moved to the tracing_sbuscriber::registry(). So simply dropping the provider might not help.

Excellent point! This is something we need to have some guidance. (Even more challenging is the Metric case referred earlier).
If user decided to setup a tracing::subscriber globally (via init()), then they won't be able to really drop anything, unless tracing::subscriber provides a way. In other words, user is signing up to a global subscriber, so they should not expect to have the ability to drop things. tracing has alternate solutions to set a subscriber for a scope only, and they can drop the scope which in-turn can drop the logger_provider.

(#1660 demonstrates this issue already)

@@ -201,15 +187,6 @@ impl Debug for BatchLogProcessor {

impl LogProcessor for BatchLogProcessor {
fn emit(&self, record: &mut LogRecord, instrumentation: &InstrumentationScope) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if an emit is called after shutdown, even if we don't do is_shutdown check, the channel would be already closed, so the error it returned is good enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like

  1. Don't check for is_shutdown.
  2. Just export as usual.
  3. Since the channel is closed, it'll error out.
  4. Log that error.

No contention/perf cost for normal path. If logs are still emitted after shutdown, it clearly indicates some issue with user managing the lifetimes.

Copy link

codecov bot commented Dec 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.2%. Comparing base (1f35467) to head (f99657b).

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #2462   +/-   ##
=====================================
  Coverage   77.2%   77.2%           
=====================================
  Files        124     124           
  Lines      23009   22976   -33     
=====================================
- Hits       17765   17745   -20     
+ Misses      5244    5231   -13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants