[exporterhelper] batchsender causes indefinitely blocking in-flight requests to block collector shutdown #10166
Comments
@dmitryax One of the cases where this block happens is when the component implements retries in its own logic. One option could be to not have retries in the component logic and instead rely on retry-sender, but it looks like retry-sender is disabled when batch-sender is in the pipeline (ref) -- not sure if this is intentional, but I couldn't find it documented anywhere. Another option could be to handle canceling the retry in the component, but that couldn't happen because the component will not know about the retry until its
@lahsivjar That's why we have the timeout configuration. If the timeout is disabled (set to 0), the collector shutdown is expected to get stuck in such cases.
Not all, but only those that are currently being batched. Once the current batch is delivered (or canceled due to timeout), the batch processor shutdown is complete and it behaves as a pass-through to allow the memory queue to flush.
If you don't have timeout set, I believe it'll be exactly the same problem even without the batch processor. Especially if the memory queue is enabled.
Ah, that makes sense to me, thanks for the pointer. I do think the behaviour needs to be documented though -- I will open a PR for that. @dmitryax What do you think about the interaction of retry-sender and batch-sender?
Not quite: since the retry-sender is shut down before the batch sender, it will only send the request once.
Hmm, looks like the timeout sender is the last in the chain (ref). Since the batch sender doesn't forward to any next sender in the happy path (ref), I don't see how this will be enforced. Am I missing something here?
The timeout sender doesn't need to be shut down. It just cancels requests that take more than the pre-defined timeout duration. It prevents requests from taking more than that duration, assuming that the exporter handles the context cancellations.
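As a concrete illustration of "handles the context cancellations": below is a minimal sketch (not collector code; attemptDelivery is a hypothetical helper standing in for one call to the backend) of an export routine that stops retrying as soon as its context is canceled, whether by the timeout sender or by shutdown.

```go
package example

import (
	"context"
	"time"
)

// attemptDelivery stands in for a single call to the backend. It is a
// placeholder for this sketch, not part of the collector API.
func attemptDelivery(ctx context.Context) error {
	return nil
}

// exportWithRetry keeps retrying until delivery succeeds, but gives up as
// soon as the context is canceled, e.g. by the timeout sender or by
// collector shutdown.
func exportWithRetry(ctx context.Context) error {
	const backoff = time.Second
	for {
		if err := attemptDelivery(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Cancellation observed: stop retrying instead of blocking shutdown.
			return ctx.Err()
		case <-time.After(backoff):
			// Wait before the next attempt.
		}
	}
}
```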
I'm not sure I understand this. The batch sender always sends the requests (batched or not) to the next sender, which is typically the retry or the timeout sender.
Maybe I am missing something obvious here, but the code in send basically only calls request.Export directly.
@lahsivjar you're right! Now I see your point. We are completely ignoring the next senders in the chain. Let me work on a fix.
Sorry if I didn't describe the problem clearly enough in the issue. Yes, using the next senders in the chain will fix it. With the fix, retrySender will retry a shutdownErr on shutdown (which wasn't possible before because it is from an internal package), and shutdown can complete while retaining the entry in the persistent queue.
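To make the distinction concrete, here is a heavily simplified sketch of the two behaviors being discussed. The request/sender types, fields, and method names are illustrative only, not the actual exporterhelper internals.

```go
package example

import "context"

// request and sender are simplified stand-ins for the exporterhelper
// internals; the real types live inside the exporterhelper package.
type request interface {
	Export(ctx context.Context) error
}

type sender interface {
	send(ctx context.Context, req request) error
}

type batchSender struct {
	next sender // e.g. the retry or timeout sender
}

// sendDirect models the behavior discussed above: the batch sender exports
// the batch itself, skipping the senders that come after it in the chain.
func (bs *batchSender) sendDirect(ctx context.Context, req request) error {
	return req.Export(ctx)
}

// sendViaChain models the proposed behavior: the batch sender hands the
// batch to the next sender, so timeout/retry handling still applies.
func (bs *batchSender) sendViaChain(ctx context.Context, req request) error {
	return bs.next.send(ctx, req)
}
```

With the second form, batched requests still pass through the timeout (and, where applicable, retry) senders, which is the behavior the discussion above converges on.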
Submitted #10287
Describe the bug
With batchsender, in-flight requests that block indefinitely (e.g. due to retries) will block batchsender shutdown and, in turn, the collector shutdown.
Batchsender shutdown is blocked by https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/batch_sender.go#L220, which in turn blocks queue sender and exporter shutdown in https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/common.go#L330C1-L334C33
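A simplified model of the blocking behavior (not the actual batch sender code; the type and field names are invented for this sketch): shutdown waits for every in-flight request, so a single export call that never observes cancellation never lets it finish.

```go
package example

import (
	"context"
	"sync"
)

// simpleBatchSender is a simplified model of the blocking behavior described
// above; it is not the actual batch sender implementation.
type simpleBatchSender struct {
	activeRequests sync.WaitGroup
}

// send tracks the request as in-flight for the duration of its export call.
func (bs *simpleBatchSender) send(ctx context.Context, export func(context.Context) error) error {
	bs.activeRequests.Add(1)
	defer bs.activeRequests.Done()
	// If export ignores ctx and never returns, Done is never reached.
	return export(ctx)
}

// Shutdown waits for every in-flight request to finish, so one export call
// that blocks forever blocks shutdown of the whole exporter (and collector).
func (bs *simpleBatchSender) Shutdown(ctx context.Context) error {
	bs.activeRequests.Wait()
	return nil
}
```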
This is in contrast to the case without batchsender, where retries are handled by retry_sender (which is bypassed by batchsender since batchsender calls request.Export directly), and on shutdown, retry_sender will avoid retrying and return experr.shutdownErr if downstream senders return an error.
Steps to reproduce

Write a request with Export that blocks indefinitely. Trigger collector shutdown. Observe that the collector hangs.
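For illustration, a minimal sketch of such a request could look like the following; the type name is made up, and the Request shape (Export plus ItemsCount) is assumed for this sketch.

```go
package example

import "context"

// stuckRequest reproduces the hang: Export blocks forever and deliberately
// ignores context cancellation. The type name is invented for this sketch.
type stuckRequest struct{}

// Export never returns and never checks ctx.Done(), mimicking an exporter
// that retries indefinitely without honoring cancellation.
func (r *stuckRequest) Export(ctx context.Context) error {
	select {}
}

// ItemsCount is part of the assumed Request shape.
func (r *stuckRequest) ItemsCount() int { return 1 }
```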
What did you expect to see?

Collector should not hang on shutdown.
In-flight requests should be signaled that the collector is shutting down, possibly returning an experr.shutdownErr in the call chain, so that the collector can exit cleanly without removing the request from the persistent queue.

What did you see instead?
Collector hangs as batchsender waits for all active requests to return.
What version did you use?
v0.100.0
What config did you use?
Environment
Additional context
Q: when be.ShutdownFunc.Shutdown(ctx) is called, how can the collector exit cleanly without losing events in the persistent queue? experr.shutdownErr is in an internal package, and contrib exporter code does not have a way to avoid queue item removal from the persistent queue aside from blocking queueConsume.