-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move partialsuccess code to internal package #3146
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #3146 +/- ##
=====================================
Coverage 76.4% 76.5%
=====================================
Files 180 180
Lines 12014 12014
=====================================
+ Hits 9190 9192 +2
+ Misses 2583 2581 -2
Partials 241 241
|
I'll say that I didn't understand the reason to make the error private in the first place. @MadVikingGod had asked for a way to access the structure of the error, which is lost when it's internal. |
@MrAlias why not expose the new error? I can't understand what you're trying to avoid making public. There are good reasons for uses to want to count the number of dropped metric points, which is exactly what this new structure gave you. |
Can you help explain the good reasons a users would want this for? I do not understand what a user will do with this information when it is sent to the global error handler. They cannot retry the send, but all I can see is they can log this information. You don't need all these new exported types to do that. |
When it comes time to do what I'd like done, it can be done from inside the OTLP exporter with access to the internal package, however it would be great to be able to (a) write alternative instrumentation and (b) experiment with such instrumentation before standardizing it. The metric I'd like produced is to count the number of spans, metric data points, and logs that are accepted by the server successfully. To do that I want to subtract the number that are dropped due to repeated failure and the number of that were rejected by the server as being malformed in some way. The field contained in the partial-success structure is one of the inputs. I'm fiercely opposed to excessive logging. Handling errors in this way is already bad, in my opinion. By leaving a publicly-accessible struct here, I'm able to tailor my error handler to suppress partial-success errors, perhaps, meaning I could decide to sample them or rate-limit them differently than unrecognized errors, or even simply just count the number of points rejected. |
Could we split the difference, make an interface that the error will implement, but keep the implementation internal. This allows this information to be represented in some way beyond just a log, but also minimizes the exposed API surface. |
Thanks for providing the use-cases, they are insightful.
Adding metrics to the OTLP exporters is a great idea. I definitely agree we should add this into the exporter itself as the project progresses. However, I do not think the global error handler should be used in the prototyping of this data. Due to the global nature it will not be able to decern where these errors came from and how to allocate what error to what exporter. I would expect wrapping the exporter with a span processor that exposed these metrics to be a better approach (similar to this but it would wrap the In that situation, the error would need to be returned from the
I can see the desire to not have excessive logging. But I think we have made a misstep then in using the global error handler here instead of the global logger. If we were to log these things with the structured logging interface it provides it would natively allow loggers with rate limiters to handle this. As it is now we need to export these types, send types along the global error handler, parse the types with a registered error handler, and then send to a rate limited logger. I think based on these use-cases we may need to not only rethink these types, but I would like to consider again how we are "exporting" the partial success response itself. I think we could refactor these types and use them as return values from |
Also, I think if we do this it should be done at the SDK trace level. That way any exporter would be able to report these errors. |
This probably shouldn't be part of the SDK, because this is an applicaiton level error for OTLP. Not every exporter will have this kind of error. I support this PR because it allows us time to make the "right" decision after this is released. Currently OTLP only returns protocol level errors because there wasn't any application errors. It would be a change of behavior to start returning the new application errors, so I think we need to understand how this might break current usage, if at all. At some practical level we will have to expose some API to make use of these errors, whether they are exposed via an ErrorHandler or directly from the ExportSpans. I would personally prefer to use the errors.Is(), but an interface with FailedSpanCount() or something similar would be just as effective. After we expose the error in some way I can see us having a migration path of first expose via otel.Handle() inside the exporter. Next, if we accept the change in behavior, expose the error via returning it and allow the SpanProcessor to call otel.Handle(). And finally, create some tool to wrap the exporter that can measure this with the advantage of know how many spans were sent. |
Not sure I agree. If there's a dedicated logger for logging errors produced inside the SDK, then maybe. Without more semantic conventions on errors produced by OTel SDKs, I think these have to be handled specially.
I was expecting code in a handler like:
Admittedly, use of a global handler is not ideal here. I would want a per-exporter handler. You used the return value from FWIW I argued against an unstructured error message being returned in the first place. The actual partial success errors being produced in the Lightstep metrics ingest path are structured to begin with, will say which metrics are failing and for which reasons. However, since the information is so dreadfully repetitive, it responds with one one example at most per response. Then, because OTLP doesn't support that structure, it ends up as a single example of formatted error message. 🤷 I just want to count the number of successful/failed metric points. |
I don't think that is correct way to think about the problem. If more than one exporter reports partial success, which seems likely, they should both report the same error type. Otherwise code interpreting the OTLP error will now need to be update for every exporter that reports this type of error. This would follow suit in the same way we have
I don't follow this. An error returned from
Reporting the error with the global error handler now means we will double report the error later or stop reporting the error with the handler in the future. Either are not ideal. |
There is, namely this.
Right, this is what I was expecting. But if you used the error logger to log this event that code would not be needed, nor the new
I must be missing something. The caller of the
Why would this not work in the metric SDK? An error is similarly returned from opentelemetry-go/sdk/metric/exporter.go Line 44 in bdb917e
I think this is key! I want that as well (and so did @dashpole). I see returning this information as an error from the call to |
I think you mean to replace the PartialSuccessError with either something like:
or maybe
Those are OK with me. I don't think we should be RETURNING these as errors because the export itself is not failing, so I don't expect to have to provide a new span-exporter, metrics-exporter, etc., just in order to get these messages to show on the console. Later, someone in OTel will organize a way to report on each signal -- at that point, where a non-nil error means a total failure, we will I assume have to inject some kind of error-handler to count the total loss of a batch of spans/metrics/logs, etc. |
Right. This was the approach I would defer to if we wanted to support logging (rate limited, or otherwise) of the partial success.
I disagree. The export network call may not have failed, but the call to Errors in Go are used to communicate when aberrant and unexpected events happen. Similar to an io.Writer, io.Reader, "html/template".ExecuteTemplate, or "text/template".ExecuteTemplate, returning an error explaining why part of the passed data was not handled is a common Go practice.
I don't follow this. Our current behavior of the simple span processor and the batch span processor send |
I created a proof-of-concept for how returning the error from a call to |
Addresses this unresolved comment originally posted by @MrAlias in #3106 (comment)