[processor/groupbytrace] Deadlock when eventMachineWorker's events queue is full #33719
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@vijayaggarwal, I can totally see what you mean, but would you please create a test case for this? Perhaps we should split the state machine to deal with deletion events separately?
@jpkrohling I can't think of how to capture this in a unit test case, as there are multiple functions involved and the initial condition is also non-trivial. If you can point me to some other test case which is even partly similar to this one, I will be happy to give it a shot. That said, here's a simple description:

Given: the worker's ring buffer is full, and the worker's events channel buffer is also full (of `traceReceived` events).

When: the worker processes one of those `traceReceived` events.

Then: putting the new trace into the full ring buffer evicts an older trace, which makes the worker fire a `traceRemoved` event into its own events channel. The channel is still full, so the send blocks, and since the blocked worker is the channel's only consumer, the queue is never drained again.

Fundamentally speaking, I guess the problem is that the worker fires events into the very queue that only it consumes, which is what makes this deadlock possible.

PS: There's one bit of detail I have skipped above for simplicity: the firing of the `traceRemoved` event happens within the handling of the `traceReceived` event, i.e. on the worker's own goroutine.
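To make the scenario concrete, here is a minimal, self-contained Go sketch of the pattern, using hypothetical, generic names rather than the processor's actual types and functions: a single worker drains a bounded events channel, and handling a received-trace event can fire a removed-trace event into that same channel; a burst keeps the channel full, so the worker ends up blocked on its own send.

```go
// Sketch of the self-deadlock pattern described above. Not the processor's
// code: the types, names, and sizes here are illustrative assumptions.
package main

import (
	"fmt"
	"time"
)

type eventType int

const (
	traceReceived eventType = iota
	traceRemoved
)

type event struct{ typ eventType }

func main() {
	const capacity = 4 // stands in for 10_000 / num_workers
	events := make(chan event, capacity)

	// A burst of incoming traces: far more than the queue can hold, so any
	// slot the worker frees is immediately reclaimed by this blocked producer.
	go func() {
		for i := 0; i < 1_000; i++ {
			events <- event{typ: traceReceived}
		}
	}()

	// The single worker: the only consumer of `events`.
	go func() {
		for e := range events {
			if e.typ == traceReceived {
				// Pretend the ring buffer is full: storing this trace evicts
				// an older one, which requires firing traceRemoved. The queue
				// has already been refilled by the producer, and the only
				// goroutine that could drain it is this one, so this send
				// blocks forever.
				events <- event{typ: traceRemoved}
			}
		}
	}()

	time.Sleep(500 * time.Millisecond)
	fmt.Printf("events queue holds %d/%d entries after 500ms; the worker is blocked firing traceRemoved into its own queue\n",
		len(events), capacity)
}
```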
That's helpful, thank you! Are you experiencing this in a particular scenario? As a workaround, can you increase the buffer size?
@jpkrohling My configuration does make me more susceptible. Particularly, I run with `num_traces=100_000` (smaller than the default) and `wait_duration=10s` (longer than the default), with `num_workers=1`, as described in the issue. This significantly increases the chances of the buffer getting full. That said, even with this configuration, I face the deadlock only infrequently (like once a week). For now, I have configured an alert on one of the processor's metrics as a stopgap.
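For reference, the configuration described in this thread corresponds to a collector snippet along these lines (`num_traces`, `wait_duration`, and `num_workers` are the groupbytrace processor's documented options; the values are the ones reported in this issue):

```yaml
processors:
  groupbytrace:
    # Smaller than the default of 1_000_000, to cap worst-case memory use.
    num_traces: 100000
    # Longer than the default of 1s, to tolerate larger gaps between the spans of a trace.
    wait_duration: 10s
    # The default: all events funnel through a single worker and its bounded queue.
    num_workers: 1
```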
@vijayaggarwal, I appreciate your feedback, that's very helpful, thank you!
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners.
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Component(s)
processor/groupbytrace
What happened?
Description
I run this processor with `num_traces=100_000` (default is `1_000_000`) to reduce the worst-case memory requirement of this component. Also, I run this processor with `wait_duration=10s` (default is `1s`) to allow the processor to handle larger time intervals between receiving the spans of a trace. Lastly, I use `num_workers=1`, which is also the default.

If the component gets a burst of traffic and more than `num_traces=100_000` traces get into the processor in a short span of time, then the `ringBuffer` will get filled up and there will be evictions. It is also quite likely that the `eventMachineWorker.events` channel's buffer (capacity = 10_000 / num_workers) will get filled up with `traceReceived` events.

Now, when both the ring buffer and the events buffer are full, processing a `traceReceived` event will lead to firing of a `traceRemoved` event (due to eviction), and the processor will get deadlocked, as the worker will get blocked on pushing the `traceRemoved` event (since the worker is blocked, the events buffer will never get consumed).

Expected Result
The processor should not get deadlocked.
Actual Result
The processor gets deadlocked.
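For illustration only, below is a minimal Go sketch of one way to avoid this class of deadlock, along the lines of the suggestion earlier in the thread to handle deletion events separately: removal events go to a dedicated queue drained by a dedicated goroutine, so the main worker never blocks sending into the queue that only it consumes. The names and structure are generic assumptions, not the processor's actual code or the fix that was eventually adopted.

```go
// Sketch: route eviction/removal events to a separate queue with its own
// consumer, so the worker that produces them can never block on them.
package main

import (
	"fmt"
	"sync"
)

func main() {
	received := make(chan string, 4) // incoming-trace events
	removed := make(chan string, 4)  // eviction events, drained independently

	var wg sync.WaitGroup

	// Deletion worker: its only job is to drain `removed`, so a full
	// `received` queue can never prevent evictions from being processed.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for id := range removed {
			_ = id // delete the trace's state here
		}
	}()

	// Main worker: handling a received trace may evict an older one, but the
	// resulting removal event goes to a channel this goroutine never reads,
	// so it cannot deadlock on its own queue.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for id := range received {
			evicted := "evicted-instead-of-" + id // pretend the ring buffer evicted something
			removed <- evicted
		}
		close(removed)
	}()

	for i := 0; i < 100; i++ {
		received <- fmt.Sprintf("trace-%d", i)
	}
	close(received)

	wg.Wait()
	fmt.Println("all received and removed events processed without deadlock")
}
```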
Collector version
0.95.0, but the issue affects all recent versions of the collector
Environment information
The issue is environment agnostic
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response