[exporterhelper] batchsender causes indefinitely blocking in-flight requests to block collector shutdown #10166
Comments
@dmitryax One of the cases where this block happens is when the component implements retries in its own logic. One option could be to not have retries in the component logic and instead rely on retry-sender, but it looks like retry-sender is disabled when batch-sender is in the pipeline (ref) -- not sure if this is intentional, but I couldn't find it documented anywhere. Another option could be to handle canceling the retry in the component, but that couldn't happen because the component will not know about the retry until its
@lahsivjar That's why we have the timeout configuration. If the timeout is disabled (set to 0), the collector shutdown is expected to get stuck in such cases.
Not all, but only those that are currently being batched. Once the current batch is delivered (or canceled due to timeout), the batch processor shutdown is complete and it behaves as a pass-through to allow the memory queue to flush.
If you don't have timeout set, I believe it'll be exactly the same problem even without the batch processor. Especially if the memory queue is enabled.
Ah, that makes sense to me, thanks for the pointer. I do think the behaviour needs to be documented though -- I will open a PR for that. @dmitryax What do you think about the interaction of retry-sender and batch-sender?
Not quite: since the retry-sender is shut down before the batch sender, it will only send the request once.
Hmm, looks like the timeout sender is the last in the chain (ref). Since the batch sender doesn't forward to any next sender in the happy path (ref), I don't see how this will be enforced. Am I missing something here?
The timeout sender doesn't need to be shut down. It just cancels requests that take more than the pre-defined timeout duration. It prevents requests from taking more than that duration, assuming that the exporter handles the context cancellations.
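As a concrete illustration of "handles the context cancellations": below is a minimal sketch (not collector code; attemptDelivery is a hypothetical helper standing in for one call to the backend) of an export routine that stops retrying as soon as its context is canceled, whether by the timeout sender or by shutdown.

```go
package example

import (
	"context"
	"time"
)

// attemptDelivery stands in for a single call to the backend. It is a
// placeholder for this sketch, not part of the collector API.
func attemptDelivery(ctx context.Context) error {
	return nil
}

// exportWithRetry keeps retrying until delivery succeeds, but gives up as
// soon as the context is canceled, e.g. by the timeout sender or by
// collector shutdown.
func exportWithRetry(ctx context.Context) error {
	const backoff = time.Second
	for {
		if err := attemptDelivery(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Cancellation observed: stop retrying instead of blocking shutdown.
			return ctx.Err()
		case <-time.After(backoff):
			// Wait before the next attempt.
		}
	}
}
```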
I'm not sure I understand this. The batch sender always sends the requests (batched or not) to the next sender, which is typically the retry or the timeout sender.
Maybe I am missing something obvious here, but the code in send basically only calls request.Export directly.
@lahsivjar you're right! Now I see your point. We are completely ignoring the next senders in the chain. Let me work on a fix.
Sorry if I didn't describe the problem clearly enough in the issue. Yes, using the next senders in the chain will fix it. With the fix, retrySender will retry a shutdownErr on shutdown (which wasn't possible before because it is from an internal package), and shutdown can complete while retaining the entry in the persistent queue.
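To make the distinction concrete, here is a heavily simplified sketch of the two behaviors being discussed. The request/sender types, fields, and method names are illustrative only, not the actual exporterhelper internals.

```go
package example

import "context"

// request and sender are simplified stand-ins for the exporterhelper
// internals; the real types live inside the exporterhelper package.
type request interface {
	Export(ctx context.Context) error
}

type sender interface {
	send(ctx context.Context, req request) error
}

type batchSender struct {
	next sender // e.g. the retry or timeout sender
}

// sendDirect models the behavior discussed above: the batch sender exports
// the batch itself, skipping the senders that come after it in the chain.
func (bs *batchSender) sendDirect(ctx context.Context, req request) error {
	return req.Export(ctx)
}

// sendViaChain models the proposed behavior: the batch sender hands the
// batch to the next sender, so timeout/retry handling still applies.
func (bs *batchSender) sendViaChain(ctx context.Context, req request) error {
	return bs.next.send(ctx, req)
}
```

With the second form, batched requests still pass through the timeout (and, where applicable, retry) senders, which is the behavior the discussion above converges on.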
Submitted #10287
Describe the bug
With batchsender, in-flight requests that block indefinitely (e.g. due to retries) will block batchsender shutdown and, in turn, the collector shutdown.
Batchsender shutdown is blocked by https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/batch_sender.go#L220, which in turn blocks queue sender and exporter shutdown in https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/common.go#L330C1-L334C33
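A simplified model of the blocking behavior (not the actual batch sender code; the type and field names are invented for this sketch): shutdown waits for every in-flight request, so a single export call that never observes cancellation never lets it finish.

```go
package example

import (
	"context"
	"sync"
)

// simpleBatchSender is a simplified model of the blocking behavior described
// above; it is not the actual batch sender implementation.
type simpleBatchSender struct {
	activeRequests sync.WaitGroup
}

// send tracks the request as in-flight for the duration of its export call.
func (bs *simpleBatchSender) send(ctx context.Context, export func(context.Context) error) error {
	bs.activeRequests.Add(1)
	defer bs.activeRequests.Done()
	// If export ignores ctx and never returns, Done is never reached.
	return export(ctx)
}

// Shutdown waits for every in-flight request to finish, so one export call
// that blocks forever blocks shutdown of the whole exporter (and collector).
func (bs *simpleBatchSender) Shutdown(ctx context.Context) error {
	bs.activeRequests.Wait()
	return nil
}
```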
This is in contrast to the case without batchsender, where retries are handled by retry_sender (which is bypassed by batchsender since batchsender calls request.Export directly), and on shutdown, retry_sender will avoid retrying and return experr.shutdownErr if downstream senders return an error.
Steps to reproduce

Write a request with Export that blocks indefinitely. Trigger collector shutdown. Observe that the collector hangs.
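For illustration, a minimal sketch of such a request could look like the following; the type name is made up, and the Request shape (Export plus ItemsCount) is assumed for this sketch.

```go
package example

import "context"

// stuckRequest reproduces the hang: Export blocks forever and deliberately
// ignores context cancellation. The type name is invented for this sketch.
type stuckRequest struct{}

// Export never returns and never checks ctx.Done(), mimicking an exporter
// that retries indefinitely without honoring cancellation.
func (r *stuckRequest) Export(ctx context.Context) error {
	select {}
}

// ItemsCount is part of the assumed Request shape.
func (r *stuckRequest) ItemsCount() int { return 1 }
```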
What did you expect to see?

Collector should not hang on shutdown.
In-flight requests should be signaled that the collector is shutting down, possibly returning an experr.shutdownErr in the call chain, so that the collector can exit cleanly without removing the request from the persistent queue.

What did you see instead?
Collector hangs as batchsender waits for all active requests to return.
What version did you use?
v0.100.0
What config did you use?
Environment
Additional context
Q: when be.ShutdownFunc.Shutdown(ctx) is called, how can the collector exit cleanly without losing events in the persistent queue? experr.shutdownErr is in an internal package, and contrib exporter code does not have a way to avoid queue item removal from the persistent queue aside from blocking queueConsume.