[BUG] The thread context is not properly cleared and messes up the traces #10789
Comments
Let me take a look.
@Gaganjuneja please let me know if you need help.
@reta This issue seems to be caused by the ThreadContext::stashContext calls below, which keep the state when triggered from the create index cluster operation. The same thing happens in a couple of other places as well:
OpenSearch/server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java Line 380 in a09047a
OpenSearch/server/src/main/java/org/opensearch/cluster/service/MasterService.java Line 982 in a09047a
OpenSearch/server/src/main/java/org/opensearch/index/seqno/RetentionLeaseBackgroundSyncAction.java Line 123 in a09047a
Thanks @Gaganjuneja, so it seems like the thread context and the tracing state are in conflict. One option to explore: the stash could move the trace context from the current context to the new context (though that could cause other issues). I will be looking into that this week.
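To make the option above concrete, here is a minimal sketch of a stash that either drops everything or carries the span over into the fresh context. It is only an illustration: `ToyThreadContext`, `Stored`, `SPAN_KEY`, and the `carrySpan` flag are hypothetical names invented for this example and are not the OpenSearch `ThreadContext` API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a thread context; NOT the OpenSearch ThreadContext class.
final class ToyThreadContext {
    static final String SPAN_KEY = "current_span"; // hypothetical key name

    private final ThreadLocal<Map<String, Object>> state =
        ThreadLocal.withInitial(HashMap::new);

    void put(String key, Object value) {
        state.get().put(key, value);
    }

    Object get(String key) {
        return state.get().get(key);
    }

    // Replaces the current state with a fresh one and returns a handle
    // that restores the previous state when closed.
    Stored stash(boolean carrySpan) {
        Map<String, Object> previous = state.get();
        Map<String, Object> fresh = new HashMap<>();
        if (carrySpan && previous.containsKey(SPAN_KEY)) {
            // The option under discussion: move the trace entry to the new context.
            fresh.put(SPAN_KEY, previous.get(SPAN_KEY));
        }
        state.set(fresh);
        return () -> state.set(previous);
    }

    // AutoCloseable without a checked exception, for try-with-resources.
    interface Stored extends AutoCloseable {
        @Override
        void close();
    }
}

final class StashDemo {
    // Returns the span that is visible inside the stashed block.
    static Object spanInsideStash(boolean carrySpan) {
        ToyThreadContext ctx = new ToyThreadContext();
        ctx.put(ToyThreadContext.SPAN_KEY, "span-create-index");
        ctx.put("header", "value");
        try (ToyThreadContext.Stored ignored = ctx.stash(carrySpan)) {
            return ctx.get(ToyThreadContext.SPAN_KEY);
        }
    }

    public static void main(String[] args) {
        System.out.println(spanInsideStash(false)); // span hidden by the stash
        System.out.println(spanInsideStash(true));  // span carried over
    }
}
```

Whether carrying the span over is desirable depends on the call site: for background work started from a request it would keep attaching spans to that request's trace, which is the kind of "other issue" mentioned above.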
@Gaganjuneja it took me a lot of time, but I think I clearly understand what is happening. The culprit is #10291: the local transport hands off the request using a thread pool, and in one of the places (still hunting the exact one) it captures the span from the current thread and never cleans it up, which causes this issue to manifest. The fix I did for now:
Please see #10873
@reta, thanks for putting this up. I am on vacation and will take a deeper look once I am back. I have had a glance, and it looks like we need to remove the ThreadContext state in most cases, except for headers.
Thanks @Gaganjuneja, the thread context is not necessarily the problem (I think), its management is: it is based on thread locals, so we would do similar things anyway. I think it will become clearer when we pick up #10291.
@reta, were you able to find the place where the span is not getting cleaned up? I want to understand better why you think the issue is caused by the local transport. If a similar hand-off happens in the non-local transport, won't we end up in the same situation (after #10873) if we are still storing the span inside the ThreadContext?
@Gaganjuneja I think I know the suspect (TransportService::sendLocalRequest), and it looks to me that in the case of local transport the callbacks we expect to be called aren't called (I suspect this is because of
I don't think this is a generic handoff problem but really specific to local transport (at least, I haven't seen any messed traces after the fix but surely we are instrumenting less now). |
@reta, I did a deep dive and found the issue, but I still need to find the fix for it. This happens during index creation, specifically during createShard. The flow is IndexShard:syncRetentionLeases -> retentionLeaseSyncer:backgroundSync -> RetentionLeaseBackgroundSyncAction:backgroundSync. Here, OpenSearch/server/src/main/java/org/opensearch/index/seqno/RetentionLeaseBackgroundSyncAction.java Line 122 in 675dd41
Here it schedules the task: OpenSearch/server/src/main/java/org/opensearch/common/util/concurrent/AbstractAsyncTask.java Line 109 in 675dd41
The scheduled task gets executed in the current thread's context, which copies the stale span and keeps it. These scheduled tasks run periodically and keep using the stale span as their parent span.
I think this is a specific case where the hand-off is not working as expected, but we should definitely handle this in the framework itself. Looking forward to your thoughts on this.
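The mechanism described in this comment can be reproduced in isolation. The sketch below is not OpenSearch code: `CURRENT_SPAN`, `PeriodicTask`, and the span name `PUT /test-index` are hypothetical stand-ins, and the periodic scheduling is replaced by three direct calls to keep the example deterministic. It only shows how a task that snapshots the caller's context once keeps reporting the same parent span long after the request has finished.

```java
import java.util.ArrayList;
import java.util.List;

final class StaleSpanCapture {
    // Hypothetical stand-in for "the span stored in the thread context".
    static final ThreadLocal<String> CURRENT_SPAN = new ThreadLocal<>();

    // Mimics a periodic task that snapshots the caller's context once, at creation.
    static final class PeriodicTask implements Runnable {
        private final String capturedParent = CURRENT_SPAN.get();
        final List<String> parentsSeen = new ArrayList<>();

        @Override
        public void run() {
            // Every run re-installs the snapshot, so the parent never changes.
            String previous = CURRENT_SPAN.get();
            CURRENT_SPAN.set(capturedParent);
            try {
                parentsSeen.add(String.valueOf(CURRENT_SPAN.get()));
            } finally {
                restore(previous);
            }
        }
    }

    static void restore(String previous) {
        if (previous == null) {
            CURRENT_SPAN.remove();
        } else {
            CURRENT_SPAN.set(previous);
        }
    }

    static List<String> simulate() {
        CURRENT_SPAN.set("PUT /test-index");    // the request span is active
        PeriodicTask task = new PeriodicTask(); // task created while the span is active
        CURRENT_SPAN.remove();                  // the request has finished

        for (int i = 0; i < 3; i++) {
            task.run();                         // later periodic executions
        }
        return task.parentsSeen;
    }

    public static void main(String[] args) {
        // Prints [PUT /test-index, PUT /test-index, PUT /test-index]
        System.out.println(simulate());
    }
}
```

Each run restores the thread's state correctly afterwards, yet the trace still grows, because the stale parent lives in the task's own snapshot rather than in the executing thread.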
Thanks @Gaganjuneja
I believe you are seeing the consequences of the problem: the
Have you tried to see what happens when using the code from #10873?
RetentionLeaseBackgroundSyncAction is started by a createShard call, and at start it takes the calling thread's context and stores it for all further scheduled executions. The following code creates the AsyncRetentionLeaseSyncTask task, which internally schedules this action; at this point the current thread's context holds the span from the incoming indexing request.
Yes, I tried this fix. With it the issue is not visible, but if I debug the ThreadContext state, it still holds the stale state; it just does not show up because we are no longer instrumenting the local transport.
Cool, thank you. I think we could merge it (since at least it does not mess up the traces) and work on the proper fix as part of #10291; right now the feature is unusable.
Yes, we can meanwhile go ahead with #10873 as it contains some good code refactoring and isolation. |
Describe the bug
We have an issue with cleaning up the thread context upon certain transport action invocations: the thread context keeps holding the spans from previous invocations, messing up the traces.
To Reproduce
Consider this simple PUT request to create an index:
It generates the following trace:
Now wait just a bit and observe the same trace is growing:
And growing:
The reason is that the thread context was not cleaned up, so the background tasks keep picking up the last span as their parent and attach more and more spans to it.
Expected behavior
The thread context must be properly cleaned up.
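As a rough illustration of what "properly cleaned up" could look like, the sketch below runs a request and then a background task on the same pooled thread. All names (`CURRENT_SPAN`, `SpanScope`, `spanSeenByNextTask`, the span name `create-index`) are hypothetical and chosen for this example; it assumes the span is kept in a thread local, as discussed in the comments, and is not the actual fix from #10873.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ContextCleanupDemo {
    // Hypothetical stand-in for the span held in the thread context.
    static final ThreadLocal<String> CURRENT_SPAN = new ThreadLocal<>();

    // Sets the span for the duration of a block and restores the previous value on close.
    static final class SpanScope implements AutoCloseable {
        private final String previous = CURRENT_SPAN.get();

        SpanScope(String span) {
            CURRENT_SPAN.set(span);
        }

        @Override
        public void close() {
            if (previous == null) {
                CURRENT_SPAN.remove();
            } else {
                CURRENT_SPAN.set(previous);
            }
        }
    }

    // Runs a "request" and then a "background task" on the same pooled thread
    // and returns the span the background task observes.
    static String spanSeenByNextTask(boolean cleanUp) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable request = () -> {
                if (cleanUp) {
                    try (SpanScope scope = new SpanScope("create-index")) {
                        // handle the request while the span is current
                    }
                } else {
                    CURRENT_SPAN.set("create-index"); // never cleared
                }
            };
            Callable<String> backgroundTask = CURRENT_SPAN::get;

            pool.submit(request).get();
            return pool.submit(backgroundTask).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(spanSeenByNextTask(false)); // stale span leaks: create-index
        System.out.println(spanSeenByNextTask(true));  // cleaned up: null
    }
}
```

Tying the cleanup to try-with-resources means the previous state is restored even when the request handler throws, which is why a scope object is used here instead of a manual reset at the end of the method.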
Plugins
OpenTelemetry
Additional context
CC @Gaganjuneja this is a serious one