
chore: modify LogExporter and TraceExporter interfaces to support returning failure #2381

Open

wants to merge 15 commits into main from chore/cleanup-shutdown

Conversation

scottgerring
Contributor

Fixes #2262

Changes

Please provide a brief description of the changes here.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)


codecov bot commented Dec 4, 2024

Codecov Report

Attention: Patch coverage is 59.25926% with 33 lines in your changes missing coverage. Please review.

Project coverage is 77.1%. Comparing base (8d5f222) to head (de49068).

Files with missing lines Patch % Lines
...lemetry-sdk/src/testing/logs/in_memory_exporter.rs 73.0% 7 Missing ⚠️
opentelemetry-otlp/src/exporter/http/logs.rs 0.0% 3 Missing ⚠️
opentelemetry-otlp/src/exporter/http/trace.rs 0.0% 3 Missing ⚠️
opentelemetry-sdk/src/error.rs 0.0% 3 Missing ⚠️
opentelemetry-sdk/src/trace/span_processor.rs 57.1% 3 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/logs.rs 0.0% 2 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/trace.rs 0.0% 2 Missing ⚠️
...y-sdk/src/logs/log_processor_with_async_runtime.rs 71.4% 2 Missing ⚠️
...sdk/src/trace/span_processor_with_async_runtime.rs 50.0% 2 Missing ⚠️
opentelemetry-stdout/src/logs/exporter.rs 0.0% 2 Missing ⚠️
... and 3 more
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2381     +/-   ##
=======================================
- Coverage   77.1%   77.1%   -0.1%     
=======================================
  Files        124     124             
  Lines      23021   23059     +38     
=======================================
+ Hits       17771   17794     +23     
- Misses      5250    5265     +15     

☔ View full report in Codecov by Sentry.

@scottgerring
Contributor Author

@lalitb 👋

Here's a variant where I've simply propagated the errors through, like MetricsExporter. As I mentioned on the issue, I think triggering a lint in users' projects when they upgrade ("do something with this unused return value") is likely preferable to multiplying out the interfaces to provide a shutdown and shutdown_clean variant. What do you think?
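For illustration, the lint in question comes from Result being #[must_use]; a minimal sketch of what callers would see (the names here are placeholders, not the actual SDK types):

pub struct ShutdownError;

pub trait LogExporter {
    /// Sketch only: returning a Result makes the shutdown outcome visible.
    /// Because Result is #[must_use], callers that ignore it get a warning.
    fn shutdown(&mut self) -> Result<(), ShutdownError> {
        Ok(())
    }
}

struct MyExporter;
impl LogExporter for MyExporter {}

fn main() {
    let mut exporter = MyExporter;
    exporter.shutdown(); // warning: unused `Result` that must be used
    let _ = exporter.shutdown(); // explicitly acknowledge the result
}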

@lalitb
Member

lalitb commented Dec 4, 2024

As I mentioned on the issue, I think triggering a lint in users' projects when they upgrade ("do something with this unused return value") is likely preferable to multiplying out the interfaces to provide a shutdown and shutdown_clean variant. What do you think?

I believe it's acceptable to introduce this as a BREAKING change at this stage, especially since we anticipate the SDK to undergo more breaking changes, at least until the next release. The CHANGELOG can include migration steps to guide users on handling the return status. Initially, I considered deprecating the existing shutdown() method and introducing a new shutdown_new() method with a plan to eventually rename it. However, I don't think that approach would be straightforward or practical.

@scottgerring scottgerring force-pushed the chore/cleanup-shutdown branch from 33cb755 to 2df2353 on December 5, 2024 12:10
@scottgerring scottgerring marked this pull request as ready for review December 5, 2024 12:10
@scottgerring scottgerring requested a review from a team as a code owner December 5, 2024 12:10
@scottgerring
Contributor Author

@cijothomas & @lalitb I think this should be good to go

@@ -10,6 +10,9 @@ use std::time::SystemTime;
/// Describes the result of an export.
pub type ExportResult = Result<(), TraceError>;

/// Describes the results of other operations on the trace API.
pub type TraceResult<T> = Result<T, TraceError>;
Member

Need to ensure consistency here. LogExporter uses the same LogResult for both export and shutdown, but TraceExporter uses different Results.

Is this something we can fix too?

Contributor Author

@scottgerring scottgerring Dec 10, 2024

We can use a type alias so that export is marked as returning TraceResult<()>, but the existing callers (all 38 of them in our codebase) don't have to change.
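Roughly, the alias trick is the following (a sketch only; TraceError is the existing SDK error type):

/// Describes the results of fallible operations on the trace API.
pub type TraceResult<T> = Result<T, TraceError>;

/// The old name stays as a thin alias, so every existing
/// `fn export(...) -> ExportResult` keeps compiling unchanged.
pub type ExportResult = TraceResult<()>;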

I've pushed a change that works this way. I can also just pull off the band-aid and fix all our existing internal usages to use TraceResult directly.

As we are looking at consistency here, another thing that jumps out is that we're using TraceError (which I've introduced TraceResult to match), despite the trait itself being SpanExporter. Better would be SpanError and SpanResult. To give an indication of scale, we have 72 refs in this codebase to TraceError alone (plus external usages of course).

My suggestion is:

  • Leave TraceError alone
  • Change our internal usages of ExportResult to TraceResult<()> (not yet done, but easy enough)
  • Leave the ExportResult alias for any external usages but mark it as deprecated

Thoughts @cijothomas ?

Member

I'll need some time to think through this. I am unsure if the current state is what we finally want. I am thinking that we should have very specific Enums for each Result returned to the user.
If shutdown() is called on an already-shut-down provider, then the Result should be an Error variant, but a specific one like Error::ShutdownAlreadyInvoked, so users know exactly what failed and why, instead of relying on "Other("something went wrong")".

This will require more refactoring, but it's something we should be doing before Stable.
The "Other" should truly be for "others" - i.e. something went wrong that is not accounted for - and it should be rarely used, if it's not possible to avoid entirely.

Member

Another quick thought - validate whether it is considered breaking to expand the enum with new variants. Do we need to provide Into/From for std Error, and is that sufficient?

Contributor Author

@scottgerring scottgerring Dec 17, 2024

Ok - so we're happy to take some breaking changes now and make things clean!

I think you'd typically see error-type-per-function, with explicitly mapped errors where it makes sense and thiserror mapped ones where the caller should handle the passed-through error (e.g., network issues).

I think this would be pretty idiomatic (note - errors and logic for illustration purposes only):

use std::io;
use thiserror::Error;

#[derive(Error, Debug)]
pub enum ExportError {
    #[error("Socket IO Error")]
    SocketError(#[from] io::Error),

    #[error("Exporter already shut down")]
    ExporterShutdown,
}

pub trait SpanExporter {
    async fn export(&mut self, batch: Vec<SpanData>) -> Result<(), ExportError> {
        // explicit error
        if already_shutdown {
            Err(ExportError::ExporterShutdown)
        } else {
            // by way of example, imagine this may return an IO error;
            // thiserror has set up the error mapping for us, so we can just use ?
            let ack = my_socket.send()?;

            Ok(())
        }
    }

    async fn shutdown(&mut self) -> Result<(), ShutdownError>;
}

I don't think the extra typedefs add much value.

Which is to say, I think we are on the right path? As @lalitb points out, we're already using thiserror; we'd just want to make sure we generally have error-per-func, rather than error-per-module. This is important for the caller, as they will be able to write reasonable matches over the top of the result to handle errors, and not be forced to handle errors not present in the particular function.

If we add another variant, exhaustive matches will break, so I think it'd be a breaking change.
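For reference, #[non_exhaustive] is the usual mitigation: downstream matches are forced to carry a wildcard arm, so adding a variant later stops being a compile error. A sketch (variant names are illustrative only):

#[non_exhaustive]
#[derive(Debug)]
pub enum ShutdownError {
    AlreadyShutdown,
    ClientFailed(String),
}

fn handle(err: &ShutdownError) {
    match err {
        ShutdownError::AlreadyShutdown => { /* benign, maybe log */ }
        ShutdownError::ClientFailed(msg) => eprintln!("shutdown failed: {msg}"),
        // Required in downstream crates because of #[non_exhaustive]; it also
        // absorbs any variants added in a later release.
        _ => {}
    }
}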

What's harder for me to think about is: what about custom implementations of these interfaces? The error types are to a large extent a function of the concrete implementation... maybe this is something that pushes more in the direction of higher-level errors only (AlreadyShutdown) rather than mapped-through errors (io::ReadFailed or whatever)?

That way we roll things up into a "logical language of failure", and each impl has to map its arbitrary error domain back onto these.
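One way to picture that: the trait only speaks a shared vocabulary of failures, and each implementation maps its own error domain into it. A sketch (all names illustrative):

#[derive(Debug)]
pub enum ShutdownError {
    AlreadyShutdown,
    // Carries implementation-specific detail as a message rather than
    // leaking concrete transport error types into the trait.
    ClientFailed(String),
}

// An OTLP-ish exporter mapping a transport failure into that shared vocabulary.
fn close_transport() -> Result<(), std::io::Error> {
    Ok(())
}

fn shutdown() -> Result<(), ShutdownError> {
    close_transport().map_err(|e| ShutdownError::ClientFailed(e.to_string()))
}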

Thoughts?

Member

Not sure if I understood this fully, but the code already uses thiserror nicely to get the error status from exporters, with the skeleton below. Are you proposing to change this in a better way?

#[derive(thiserror::Error, Debug)]
#[non_exhaustive]
pub enum TraceError {
    #[error("Exporter {0} encountered the following error(s): {name}", name = .0.exporter_name())]
    ExportFailed(Box<dyn ExportError>),
    // Rest of variants
}
/// Trait for errors returned by exporters
pub trait ExportError: std::error::Error + Send + Sync + 'static {
    /// The name of exporter that returned this error
    fn exporter_name(&self) -> &'static str;
}
// and then in OTLP  crate
/// Wrap type for errors from this crate.
#[derive(thiserror::Error, Debug)]
pub enum Error {
    // OTLP specific error variants
}
impl opentelemetry::trace::ExportError for Error {
    fn exporter_name(&self) -> &'static str {
        "otlp"
    }
}

Contributor Author

Whoops, so it does. I'll fix my suggestion so it doesn't include adding thiserror, as it's already there.

Member

// explicit error
if already_shutdown {
    Err(ExportError::ExporterShutdown)
}

^ I actually meant to suggest we use explicit ones like this, instead of "Other("Shutdown already invoked")", consistently throughout.

Contributor Author

@scottgerring scottgerring Dec 18, 2024

@cijothomas @lalitb I think this makes sense.

Are we happy for me to go through and apply this? It will touch a fair bit, so I want to make sure before doing it! It would look like:

shutdown(...) -> Result<(), ShutdownError>
export(...) -> Result<(), ExportError>

and an extension of the error types following the explicit-named-errors style already established in LogError, e.g.:

pub enum LogError {
    /// Export failed with the error returned by the exporter.
    #[error("Exporter {0} encountered the following errors: {name}", name = .0.exporter_name())]
    ExportFailed(Box<dyn ExportError>),

    /// Export failed to finish after certain period and processor stopped the export.
    #[error("Exporter timed out after {} seconds", .0.as_secs())]
    ExportTimedOut(Duration),

    /// Processor is already shutdown
    #[error("{0} already shutdown")]
    AlreadyShutdown(String),

    /// Mutex lock poisoning
    #[error("mutex lock poisioning for {0}")]
    MutexPoisoned(String),

@@ -10,6 +10,8 @@
[#2338](https://github.com/open-telemetry/opentelemetry-rust/pull/2338)
- `ResourceDetector.detect()` no longer supports timeout option.
- `opentelemetry::global::shutdown_tracer_provider()` Removed from the API, should now use `tracer_provider.shutdown()` see [#2369](https://github.com/open-telemetry/opentelemetry-rust/pull/2369) for a migration example. "Tracer provider" is cheaply cloneable, so users are encouraged to set a clone of it as the global (ex: `global::set_tracer_provider(provider.clone()))`, so that instrumentations and other components can obtain tracers from `global::tracer()`. The tracer_provider must be kept around to call shutdown on it at the end of application (ex: `tracer_provider.shutdown()`)
- The trait functions `LogExporter.shutdown` and `TraceExporter.shutdown` now explicitly return a result. The
Member

nit: Could we add something for the custom authors as well?
If you are an exporter author, shutdown now requires returning a result. Also, implementing shutdown is optional, as the trait already provides a default implementation that returns Ok().

^ something like this...
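i.e., roughly this shape (a sketch, not the exact trait definition or error type):

// Placeholder for the error type discussed in this thread.
pub struct ShutdownError;

pub trait LogExporter {
    // ...export and friends elided...

    /// Default implementation: exporter authors who have nothing to clean up
    /// don't need to override shutdown; it just reports success.
    fn shutdown(&mut self) -> Result<(), ShutdownError> {
        Ok(())
    }
}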

Contributor Author

@scottgerring scottgerring Dec 10, 2024

Based on your comment further down ("The exporter() shutdown() is only called from the SDK, so end user wouldn't have to deal with the return type from it") I think we only need to address the exporter authors, right? I've changed the comment accordingly.

@cijothomas
Member

As I mentioned on the issue, I think triggering a lint in users' projects when they upgrade ("do something with this unused return value") is likely preferable to multiplying out the interfaces to provide a shutdown and shutdown_clean variant. What do you think?

I believe it's acceptable to introduce this as a BREAKING change at this stage, especially since we anticipate the SDK to undergo more breaking changes, at least until the next release. The CHANGELOG can include migration steps to guide users on handling the return status. Initially, I considered deprecating the existing shutdown() method and introducing a new shutdown_new() method with a plan to eventually rename it. However, I don't think that approach would be straightforward or practical.

Is this going to affect end users, or only the SDK itself (and custom exporter authors)? The exporter's shutdown() is only called from the SDK, so end users wouldn't have to deal with the return type from it, right?

@@ -2,6 +2,9 @@

## vNext

- `OtlpHttpClient.shutdown`, `TonicLogsClient.shutdown`, and `TonicTracesClient.shutdown` now explicitly return a result. The
semantics of the method have not changed, but you will have a new lint encouraging you to consume these results.
Member

The 2nd sentence may be omitted, as these are expected to be called from the upstream SDK anyway.

@scottgerring scottgerring force-pushed the chore/cleanup-shutdown branch from f1be822 to 5b243a6 on December 19, 2024 14:02
@scottgerring
Contributor Author

@cijothomas I've separated out ShutdownResult for both trace and log as discussed. This was more fiddly than I expected!
IMHO it'd be good to get this in soon, as it touches the interfaces (and I will be off for a few weeks 🎄).

pub enum ShutdownError {
    /// Mutex lock poisoning
    #[error("mutex lock poisioning for {0}")]
    MutexPoisoned(String),
Member

Unsure if we should make these public... These are strictly internal details, and a mutex poison is not user-actionable at all.

Contributor Author

These were public before in the old type; but I agree with you!
I can hide it behind the GenericError or something?

Member

Sorry if I am missing some discussion, but why do we need a new error enum for shutdown? Can't we reuse LogError?

Contributor Author

If we keep one uber-error and the caller wants to handle it, they are forced to handle variants that have nothing to do with the operation they are calling.
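Concretely, with a dedicated shutdown error the caller's match stays small (a sketch; trait and variant names illustrative, not the final API):

pub enum ShutdownError {
    AlreadyShutdown,
    ClientFailed(String),
}

pub trait LogExporter {
    fn shutdown(&mut self) -> Result<(), ShutdownError>;
}

fn close(exporter: &mut dyn LogExporter) {
    // The caller only has to reason about shutdown outcomes...
    match exporter.shutdown() {
        Ok(()) => {}
        Err(ShutdownError::AlreadyShutdown) => { /* benign; maybe log */ }
        Err(ShutdownError::ClientFailed(msg)) => eprintln!("shutdown failed: {msg}"),
    }
    // ...whereas with one uber-LogError this match would also be forced to
    // handle ExportFailed, ExportTimedOut, etc., which shutdown can never return.
}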

#[error("Shutdown timed out after {0:?}")]
Timeout(Duration),

/// The export client failed while holding the client lock. It is not
Contributor Author

@cijothomas I reckon this is a reasonable "API user summary" of what's happened with the mutex poisoning, and advice on dealing with it


impl<T> From<PoisonError<T>> for ShutdownError {
    fn from(err: PoisonError<T>) -> Self {
        ShutdownError::ClientFailed(format!("Mutex poisoned during shutdown: {}", err))
Contributor Author

I think wrapping up the formatted inner error is plenty of context here, given the narrow failure mode. I also considered wrapping up Box<Err>, but I think this is 1/ confusing in the interface and 2/ not at all actionable.

There's some good stuff in Canonical's Rust best-practices doc on this.

@@ -149,14 +149,10 @@ impl SpanProcessor for SimpleSpanProcessor {
}

fn shutdown(&self) -> TraceResult<()> {
Contributor Author

@scottgerring scottgerring Dec 20, 2024

I think we should (in another issue!) chase this distinction between TraceError and ShutdownError further down the rabbit hole. For the same reasons I don't think it makes sense to use an über-error in the exporter trait

#[derive(Error, Debug)]
#[non_exhaustive]
pub enum ShutdownError {
    /// The exporter has already been shut down.
Member

Still unsure of this... I can understand that a Provider's shutdown method, if invoked more than once, can return an AlreadyShutdown error. But how would an Exporter's shutdown be invoked more than once, since the provider protects the processor/exporter from entering that situation?

I am not even sure the processor/exporter should have a variable like is_shutdown at all. If its shutdown is invoked, it'll kill the transports etc.; any further attempt to export will naturally fail, as the transports are killed.

(Sorry not really part of this PR, but this is something that I need some time to think through....)

Member

@lalitb lalitb Dec 20, 2024

I am not even sure the processor/exporter should have a variable like is_shutdown at all. If its shutdown is invoked, it'll kill the transports etc.; any further attempt to export will naturally fail, as the transports are killed.

I feel the processor should have this flag. It can't assume how the exporters are implemented. Maybe an exporter's shutdown method is just a no-op, and in that case exports may continue even after shutdown is invoked, if the processor doesn't have the is_shutdown safeguard.

Member

What is the need for the processor to have the flag? If shutdown was already signaled, then no further shutdown signal will ever arrive at the LogProcessor, as the Provider takes care of that, and the Provider owns the Processors.

How would the log processor ever get a 2nd shutdown call? Or how does it ever get an emit() call after shutdown is done?

Member

How would log processor ever get a 2nd shutdown call? Or how does it ever get an emit() call after shutdown is done?

Yes, correct: is_shutdown at the provider level will take care of that. I missed that. However, logger::emit() currently doesn't check the provider flag, which needs to be fixed.

Member

raised a PR to discuss this further - #2462

Member

I feel like it'd be cool if shutdown was idempotent - you call it, it shuts down, you call it again, it no-ops.

you call it again, it no-ops.

This may be okay. But if a user is calling shutdown more than once, it indicates there is a bug in the way they orchestrate telemetry pipelines, so isn't it best to let them know about it...?
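For reference, the idempotent flavour is just a small piece of state; whether the second call should be silent, log, or return an error is exactly the open question here. A sketch (not a proposed API):

use std::sync::atomic::{AtomicBool, Ordering};

pub struct ShutdownError; // placeholder error type for the sketch

pub struct MyExporter {
    is_shutdown: AtomicBool,
}

impl MyExporter {
    pub fn shutdown(&self) -> Result<(), ShutdownError> {
        if self.is_shutdown.swap(true, Ordering::SeqCst) {
            // Second call: currently a no-op. Alternatives: log a warning, or
            // return an AlreadyShutdown-style error so the caller sees the bug.
            return Ok(());
        }
        // ...tear down transports, flush buffers, etc....
        Ok(())
    }
}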

Contributor Author

Is logging it sufficient?

Member

It's debatable, I guess... We can technically make almost every return type go away in favor of logging, so we need to find a balance for when to resort to logging vs. an explicit return type.

Related: OTel .NET decided to specially handle logs emitted after shutdown, by printing this nice message only in the StdOut exporter:
https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.Console/ConsoleLogRecordExporter.cs#L39-L42

Contributor Author

I reckon there’s a clear line there between errors that actually reflect some failure of the system, and errors that have no impact and are just hinting that the caller has done something funny.

By way of example, a flush that fails because the connection's gone is very different to "probably you didn't mean to call this twice".

That stdout error one is interesting. Is there a story there?

Contributor Author

Also, and regardless: it feels like something we should have a consistent view on (when to error, when not to) that we can point at and apply uniformly. That consistency is maybe even more important than what the policy itself is 🤷‍♂️

@@ -110,10 +110,6 @@ pub struct SpanData {
#[derive(Error, Debug)]
#[non_exhaustive]
pub enum ShutdownError {
/// The exporter has already been shut down.
#[error("Shutdown already performed")]
Contributor Author

Removed this as discussed

@@ -148,8 +148,17 @@ impl InMemoryLogExporter {
/// ```
///
pub fn get_emitted_logs(&self) -> LogResult<Vec<LogDataWithResource>> {
let logs_guard = self.logs.lock().map_err(LogError::from)?;
let resource_guard = self.resource.lock().map_err(LogError::from)?;
let logs_guard = self.logs.lock().map_err(|_| {
Contributor Author

The mutex-poisoning errors here are an implementation detail of InMemoryLogExporter - that is, they have nothing to do with the common shape of a LogExporter.

So, I removed the automatic From impl for poisoned locks on the LogError type and explicitly map detailed errors through here.
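The resulting pattern is roughly this (a sketch; the LogError shape is simplified here, the real change lives in in_memory_exporter.rs):

use std::sync::Mutex;

// Simplified stand-in for the SDK's LogError.
#[derive(Debug)]
pub enum LogError {
    Other(String),
}

pub struct InMemoryLogExporter {
    logs: Mutex<Vec<String>>, // element type simplified for the sketch
}

impl InMemoryLogExporter {
    pub fn get_emitted_logs(&self) -> Result<Vec<String>, LogError> {
        // No blanket From<PoisonError> on LogError any more; the exporter maps
        // its own implementation detail into a descriptive error explicitly.
        let logs_guard = self
            .logs
            .lock()
            .map_err(|e| LogError::Other(format!("logs mutex poisoned: {e}")))?;
        Ok(logs_guard.clone())
    }
}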

@scottgerring
Contributor Author

scottgerring commented Dec 23, 2024

@cijothomas @lalitb I've established what I think is an appropriate shape for the error handling in the SpanExporter and LogExporter types. The things I think matter are 1/ error-per-method or logical method grouping (export/flush), 2/ sensible errors that make sense for all impls of the traits, and 3/ transparent error boxing for everything else.
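In code, that shape is roughly the following (a sketch only; names and variants are illustrative, not the final API):

use std::error::Error;
use std::time::Duration;

// 1/ one error type per method (or logical grouping such as export/flush)...
#[derive(Debug)]
pub enum ExportError {
    // 2/ ...whose variants make sense for every implementation of the trait...
    Timeout(Duration),
    // 3/ ...plus transparent boxing for anything implementation-specific.
    Other(Box<dyn Error + Send + Sync>),
}

#[derive(Debug)]
pub enum ShutdownError {
    Timeout(Duration),
    ClientFailed(String),
}

pub trait SpanExporterSketch {
    fn export(&mut self) -> Result<(), ExportError>;
    fn shutdown(&mut self) -> Result<(), ShutdownError> {
        Ok(())
    }
}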

I've tried to do this with as few flow-on effects into the rest of the code as possible, because 1/ I think agreeing on the shape of the interface is the most important thing, 2/ there's so much flux in this area that merge conflicts are a pain, and 3/ I'm still not sure we agree, so I don't want to burn time.

IMHO we should agree on the shape, merge it, and then smooth out the flow-on effects in subsequent PRs. I will rebase and deal with merge conflicts again once we are all happy!

What do you both think?

@lalitb
Member

lalitb commented Dec 23, 2024

@scottgerring Could ShutdownError be a common enum shared across all three signals, or perhaps a common enum wrapped within signal-specific shutdown enums? It seems like error types for shutdown are mostly common for all signals.

Maybe for now, move it to sdk/error.rs, so it can be shared across all signals? It could then be re-exported in signal-specific modules, allowing users to access it as sdk::logs::ShutdownError, sdk::trace::ShutdownError, etc. Thinking out loud if it makes sense?
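Something like this layout, I think (a sketch; module paths and variants illustrative):

// opentelemetry-sdk/src/error.rs
#[derive(Debug)]
pub enum ShutdownError {
    Timeout(std::time::Duration),
    ClientFailed(String),
}

// opentelemetry-sdk/src/logs/mod.rs
pub use crate::error::ShutdownError;

// opentelemetry-sdk/src/trace/mod.rs
pub use crate::error::ShutdownError;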

@scottgerring
Contributor Author

scottgerring commented Dec 24, 2024

@scottgerring Could ShutdownError be a common enum shared across all three signals, or perhaps a common enum wrapped within signal-specific shutdown enums? It seems like error types for shutdown are mostly common for all signals.

Maybe for now, move it to sdk/error.rs, so it can be shared across all signals? It could then be re-exported in signal-specific modules, allowing users to access it as sdk::logs::ShutdownError, sdk::trace::ShutdownError, etc. Thinking out loud if it makes sense?

Good morning @lalitb 👋 !

I had a play with this just now. Good idea! I like it; it keeps it DRY, we've already got precedent with the common opentelemetry_sdk::error::Error type in there, and with the re-export we maintain flexibility should we need to extend one signal's variant later.

I extended opentelemetry_sdk::error::Error to contain a variant for this too, but it is worth mentioning that there are no concrete usages of the opentelemetry_sdk::error::Error type yet, and we'd really only want to use it when a more "targeted" error can't be used and the impl emitting the error can't meaningfully handle the error itself.

Relevant bits in the last commit

Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Different Exporter::shutdown() interface across signals
3 participants