Remove is_shutdown flag from processors. And fix logger::emit() to check for the flag before emit. #2462

lalitb · 2024-12-20T18:15:00Z

Changes

To discuss the change suggested be @cijothomas here - #2381 (comment)

Merge requirement checklist

CONTRIBUTING guidelines followed
Unit tests added/updated (if applicable)
Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
Changes in public API reviewed (if applicable)

cijothomas · 2024-12-20T18:25:57Z

opentelemetry-sdk/src/logs/log_emitter.rs

@@ -268,6 +268,14 @@ impl opentelemetry::logs::Logger for Logger {

    /// Emit a `LogRecord`.
    fn emit(&self, mut record: Self::LogRecord) {
+        if self.provider.inner.is_shutdown.load(Ordering::Relaxed) {


The issue with this is, this can affect throughput due to the contention introduced here. (Logs so far has no contention when using etw/user_events)....
Can you check stress test before and after?

I am unsure of a solution. Maybe don't check shutdown anywhere except in stdout like non-prod processors, and rely on mechanisms like export client/ etw etc. returning errors..
what harm can it cause 🤔

Oh yes, it doesn’t make sense to introduce the contention in the hot path, even if the contention is at atomic level.

Since we are only reading the atomic variable and the variable's value does not change for the most part of application lifetime, it most likely should not have any visible effect on the throughput.

We should be able to confirm that with the stress test.

Yes, the effect of uncontested atomic would be negligible enough to be noticed over stress test. It is slightly more than normal bool-check, but much less than the perf associated in case of contention. I added the benchmark over logger:emit() in this PR, which shows the latency of 1-2ns:
main branch:

logger_emit time: [37.916 ns 37.977 ns 38.077 ns] change: [-0.2733% -0.0648% +0.1328%] (p = 0.58 > 0.05) No change in performance detected.

PR branch:

logger_emit time: [38.941 ns 39.027 ns 39.172 ns] change: [+2.6756% +2.9861% +3.3292%] (p = 0.00 < 0.05) Performance has regressed.

I am unsure of a solution. Maybe don't check shutdown anywhere except in stdout like non-prod processors, and rely on mechanisms like export client/ etw etc. returning errors..
what harm can it cause

The custom exporter which can be connected to reentrant and simple processor need to handle the shutdown properly in that case. As of now, etw and user_events don't do anything, so they will continue to emit even after shutdown invoked by user.

I think adding this "if check" in Logger::emit() is not only a more efficient short-circuiting operation but it also simplifies things. It lets us offer a more robust guarantee:
"SDK would not do any wasteful computation once its shutdown (regardless of what processors/exporters are used)"

Don't think it simplifies things. How do we determine the best place to do the is_shutdown check? Logger::emit() is one place.. how about the processor? how about the exporter? or even at the appender itself?

"SDK would not do any wasteful computation once its shutdown (regardless of what processors/exporters are used)"

I don't know if we need to make such a statement? If user want to achieve that, they can drop the providers. (even then, we need to redesign something in Metrics to make it truly no-ops(#2442)

Don't think it simplifies things.

The simplification I'm referring to is the effort required to achieving the expected shutdown behavior. Why call emit() on each processor, if the provider is shutdown. With having the "if check" in Loggger::emit() we effectively make it a no-op SDK, when the provider is shutdown.

How do we determine the best place to do the is_shutdown check? Logger::emit() is one place.. how about the processor? how about the exporter? or even at the appender itself?

Since the shutdown is an SDK concept, it would only make sense to have the check in an SDK component. Appenders do not depend on SDK, so they are out of question. Maybe we can update the LoggerProvider trait to add another method is_shutdown which the appenders can use to avoid creating log records altogether. That's a different discussion though as it involves updating the public trait from API.

With the current state of code, among the SDK components: Logger, Processor, and Exporter, I think the best place to check for shutdown would be Logger as that's the first SDK component that interacts with the appender. By checking it in the Logger, we can avoid making calls to processors altogether.

I don't know if we need to make such a statement?

We should have some stance about shutdown behavior. It could either be

The effectiveness of the shutdown depends on the processors/exporters used

Or that shutdown works regardless, of the configured pipeline.

I think it's nice to have # 2 and having the "if check" is a very easy way to achieve that. I don't see any disadvantage to doing that.

If user want to achieve that, they can drop the providers.

I'm not sure if it's as straightforward. Since the provider is essentially an Arc<LoggerProviderInner>, the inner value would not be dropped until there are no more references to it. The appender layer keeps a clone of the provider which then moved to the tracing_sbuscriber::registry(). So simply dropping the provider might not help.

We should have some stance about shutdown behavior. It could either be
The effectiveness of the shutdown depends on the processors/exporters used
Or that shutdown works regardless, of the configured pipeline.
I think it's nice to have # 2 and having the "if check" is a very easy way to achieve that. I don't see any disadvantage to doing that.

Thanks, these are good discussions.

I am also inclined with #2, as it ensures consistent shutdown behavior regardless of pipeline configuration or internal component details. Relying on assumptions, such as the batch processor closing its channel during shutdown, creates unnecessary dependencies that could break with future changes. An explicit provider/logger level check provides a more reliable solution, more importantly since cost of adding it is small.

The simplification I'm referring to is the effort required to achieving the expected shutdown behavior. Why call emit() on each processor, if the provider is shutdown. With having the "if check" in Loggger::emit() we effectively make it a no-op SDK, when the provider is shutdown.

Without further redesign/refactoring, this won't make it a no-op sdk. There is non-trivial amount of work done even before emit() is invoked, to prepare LogRecord, and that'll continue even after shutdown. Yes, we save some work, but not quite no-op.

We should have some stance about shutdown behavior. It could either be

1 The effectiveness of the shutdown depends on the processors/exporters used
2 Or that shutdown works regardless, of the configured pipeline.

So far, we have been mostly sticking with 1 - we expect components to behave as expected. Example: We expect ObservableCallbacks to return in reasonable time, and if they don't return, Metrics SDK never recovers. Similarly we expect Exporters to return in reasonable time.
Not a strict comparison, but if we want to make a stance that Otel SDK works predictable, irrespective of the components plugged in, then it need a lot of redesign.
If we generally expect other components to behave properly, then why not continue rely on them to deal with shutdown too?

Another interesting thing we can do here (Similar to what Otel .NET does for its console exporter). If we make shutdown a pure exporter problem, we can have Stdout exporter continue to work even after shutdown is invoked, but can do something in the Stdout exporter to let user know that they have incorrect lifecycle management. IMHO, this is better than short-circuiting at up-stream sdk.(Of course internal logs can help, but unless users take care of avoiding self-logging, it can cause stack overflow due to cyclic loop)

I'm not sure if it's as straightforward. Since the provider is essentially an Arc, the inner value would not be dropped until there are no more references to it. The appender layer keeps a clone of the provider which then moved to the tracing_sbuscriber::registry(). So simply dropping the provider might not help.

Excellent point! This is something we need to have some guidance. (Even more challenging is the Metric case referred earlier).
If user decided to setup a tracing::subscriber globally (via init()), then they won't be able to really drop anything, unless tracing::subscriber provides a way. In other words, user is signing up to a global subscriber, so they should not expect to have the ability to drop things. tracing has alternate solutions to set a subscriber for a scope only, and they can drop the scope which in-turn can drop the logger_provider.

(#1660 demonstrates this issue already)

cijothomas · 2024-12-20T18:29:38Z

opentelemetry-sdk/src/logs/log_processor.rs

@@ -201,15 +187,6 @@ impl Debug for BatchLogProcessor {

 impl LogProcessor for BatchLogProcessor {
    fn emit(&self, record: &mut LogRecord, instrumentation: &InstrumentationScope) {


if an emit is called after shutdown, even if we don't do is_shutdown check, the channel would be already closed, so the error it returned is good enough?

something like

Don't check for is_shutdown.

Just export as usual.

Since the channel is closed, it'll error out.

Log that error.

No contention/perf cost for normal path. If logs are still emitted after shutdown, it clearly indicates some issue with user managing the lifetimes.

codecov · 2024-12-21T08:12:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.2%. Comparing base (1f35467) to head (f99657b).

Additional details and impacted files

@@          Coverage Diff          @@
##            main   #2462   +/-   ##
=====================================
  Coverage   77.2%   77.2%           
=====================================
  Files        124     124           
  Lines      23009   22976   -33     
=====================================
- Hits       17765   17745   -20     
+ Misses      5244    5231   -13

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

lalitb added 2 commits December 20, 2024 10:10

initial commit

da42d53

fix error name

cabb4da

lalitb requested a review from a team as a code owner December 20, 2024 18:15

lalitb mentioned this pull request Dec 20, 2024

chore: modify LogExporter and TraceExporter interfaces to support returning failure #2381

Open

4 tasks

cijothomas reviewed Dec 20, 2024

View reviewed changes

add benchmark

54543bc

lalitb added 2 commits December 21, 2024 15:05

Merge branch 'main' into remove_is_shutdown_flag_processor

5028be3

Merge branch 'main' into remove_is_shutdown_flag_processor

f99657b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove is_shutdown flag from processors. And fix logger::emit() to check for the flag before emit. #2462

Remove is_shutdown flag from processors. And fix logger::emit() to check for the flag before emit. #2462

lalitb commented Dec 20, 2024

cijothomas Dec 20, 2024

lalitb Dec 21, 2024

utpilla Dec 21, 2024 •

edited

Loading

lalitb Dec 21, 2024

lalitb Dec 21, 2024 •

edited

Loading

utpilla Dec 23, 2024 •

edited

Loading

cijothomas Dec 24, 2024 •

edited

Loading

utpilla Dec 24, 2024 •

edited

Loading

lalitb Dec 24, 2024 •

edited

Loading

cijothomas Dec 24, 2024

cijothomas Dec 20, 2024

cijothomas Dec 20, 2024

codecov bot commented Dec 21, 2024 •

edited

Loading

		@@ -201,15 +187,6 @@ impl Debug for BatchLogProcessor {

		impl LogProcessor for BatchLogProcessor {
		fn emit(&self, record: &mut LogRecord, instrumentation: &InstrumentationScope) {

Remove is_shutdown flag from processors. And fix logger::emit() to check for the flag before emit. #2462

Are you sure you want to change the base?

Remove is_shutdown flag from processors. And fix logger::emit() to check for the flag before emit. #2462

Conversation

lalitb commented Dec 20, 2024

Changes

Merge requirement checklist

cijothomas Dec 20, 2024

Choose a reason for hiding this comment

lalitb Dec 21, 2024

Choose a reason for hiding this comment

utpilla Dec 21, 2024 • edited Loading

Choose a reason for hiding this comment

lalitb Dec 21, 2024

Choose a reason for hiding this comment

lalitb Dec 21, 2024 • edited Loading

Choose a reason for hiding this comment

utpilla Dec 23, 2024 • edited Loading

Choose a reason for hiding this comment

cijothomas Dec 24, 2024 • edited Loading

Choose a reason for hiding this comment

utpilla Dec 24, 2024 • edited Loading

Choose a reason for hiding this comment

lalitb Dec 24, 2024 • edited Loading

Choose a reason for hiding this comment

cijothomas Dec 24, 2024

Choose a reason for hiding this comment

cijothomas Dec 20, 2024

Choose a reason for hiding this comment

cijothomas Dec 20, 2024

Choose a reason for hiding this comment

codecov bot commented Dec 21, 2024 • edited Loading

Codecov Report

utpilla Dec 21, 2024 •

edited

Loading

lalitb Dec 21, 2024 •

edited

Loading

utpilla Dec 23, 2024 •

edited

Loading

cijothomas Dec 24, 2024 •

edited

Loading

utpilla Dec 24, 2024 •

edited

Loading

lalitb Dec 24, 2024 •

edited

Loading

codecov bot commented Dec 21, 2024 •

edited

Loading