-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EventSource Activity ID isn't always reset across async await #51608
Comments
Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti Issue DetailsDescriptionThe TplEventSource offers Activity tracking which is intended correlate how work moves across threads and how it is nested within pairs of EventSource Start/Stop events. If a thread transitions from having two or more levels of activity nesting to having none the Activity information isn't properly reset. In the common case of async work happening on the threadpool the incorrect activity information makes it appear that work items that happen to follow each other sequentially on the same thread are part of the same operation when they are actually unrelated. Reprousing System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Threading.Tasks;
namespace ConsoleApp22
{
class Program
{
static void Main(string[] args)
{
MyEventListener l = new MyEventListener();
List<Task> tasks = new List<Task>();
for(int i = 0; i < 50; i++)
{
tasks.Add(Task.Run(DoWork));
}
Task.WaitAll(tasks.ToArray());
}
static async Task DoWork()
{
Guid id1 = EventSource.CurrentThreadActivityId;
Trace.Assert(id1 == Guid.Empty); // this will fail on some (not all) invocations of DoWork()
MyEventSource.Log.FooStart("starting");
MyEventSource.Log.FooStart("starting nested");
await Task.Yield();
MyEventSource.Log.FooStop("Stopping nested");
MyEventSource.Log.FooStop("Stopping");
}
}
[EventSource(Name ="MyEventSource")]
class MyEventSource : EventSource
{
public static MyEventSource Log = new MyEventSource();
[Event(1)]
public void Message1(string arg1) => WriteEvent(1, arg1);
[Event(2, ActivityOptions = EventActivityOptions.Recursive)]
public void FooStart(string arg1) => WriteEvent(2, arg1);
[Event(3, ActivityOptions = EventActivityOptions.Recursive)]
public void FooStop(string arg1) => WriteEvent(3, arg1);
}
class MyEventListener : EventListener
{
protected override void OnEventSourceCreated(EventSource eventSource)
{
if(eventSource.Name == "System.Threading.Tasks.TplEventSource")
{
EnableEvents(eventSource, EventLevel.LogAlways, (EventKeywords)0x80);
}
else if(eventSource.Name == "MyEventSource")
{
EnableEvents(eventSource, EventLevel.Informational);
}
}
}
} Note: if you run this under VS debugger on Windows an ETW session is started that sets some different keywords on TplEventSource and may interfere with Activity tracing. That behavior is a separate issue, #39353. If needed you can undo the debugger modifications by stopping just inside of Main and evaluating Expected BehaviorThe assert never fires. There is no enclosing scope of Start/Stop events that should have set a non-zero ActivityID. Actual BehaviorThe assert will fail once a threadpool thread is reused to run another invocation of DoWork(). The ID will be the same one that was present at the time await Task.Yield() was run in the previous work item on the threadpool thread. This makes it appear that the new work item is the continuation of the previous one. CauseThere is code in ActivityTracer.cs that keeps the ActivityInfo chain stored in AsyncLocal m_current synchronized with the Windows ETW thread-local ActivityID. There are three kinds of updates this method might get:
The 3rd case is not fully handled. In the repro above at the point a threadpool thread executes the 'await Task.Yield()' it will switch from m_current pointing at a stack of two ActivityInfos back to the thread's default ExecutionContext which has m_current null. Because there are >= 2 ActivityInfos, prev.m_creator != null and the code does not enter the the first block. Because cur == null the code also does not enter the second block. The final comment says that the code is intentionally leaving the id alone rather than reseting it though it doesn't provide any rationale why that would be a better outcome. So far I haven't come up with any scenario that appears better served by the current behavior. cc @davmason @sywhang @josalem @brianrob
|
@brianrob - would it concern you if I changed this code to set activityID = 0 in the case we are switching the thread back to a default ExecutionContext? |
It would seem right to me to set the activity ID back to 0. To make sure I understand properly: In this case, we're essentially pausing the current task and "freeing" the thread to do other work. Freeing the thread should clear the activity ID for any thread pool thread, since that thread logically belongs to the thread pool and could dequeue work that is completely unrelated to what it was just doing. We also wouldn't want the cost of dequeueing work to be associated with the previous work item. The one scenario that I'm trying to construct, though I'm not sure I can, is if we're on a non-thread pool thread and "pausing" a Task. I guess in this case if we switch back to the default ExecutionContext, then the activity ID should still be 0, because if it was set to something non-zero through an API that should flow the ID then we'd have an activity chain present, and could refer back to it. Does that make sense? I'm not 100% sure about my last statement here... |
@brianrob - thanks and yep that is an accurate description.
It depends how it was set which is a murkier part of the story. Currently there are multiple ways you could set ActivityID:
The first one follows async flow (this bug aside). The 2nd and 3rd are thread-local (not async-local) mechanisms. However if you mix the first mechanism with either of the latter two things start getting interesting. I haven't seen the intended behavior documented anywhere, but based on my understanding of the code what happens in practice is using an EventSource start event converts the last set activity ID from thread-local behavior to async-local behavior. Ie both these tests pass: static async Task PropagateActivityID()
{
Guid id = Guid.NewGuid();
EventSource.SetCurrentThreadActivityId(id);
MyEventSource.Log.FooStart("starting");
await Task.Yield();
MyEventSource.Log.FooStop("Stopping");
Debug.Assert(EventSource.CurrentThreadActivityId == id);
}
static async Task NoPropagateActivityID()
{
Guid id = Guid.NewGuid();
EventSource.SetCurrentThreadActivityId(id);
await Task.Yield();
Debug.Assert(EventSource.CurrentThreadActivityId != id);
} |
Fixes dotnet#51608 Previously running an async task with two or more levels of nesting on the EventSource activity stack would leave the ActivityID set on the old thread after being swapped out at a yield point. This meant that threadpool threads would often have stale ActivityIDs present when they began a new work item. This in turn produced non-sensical logs where it appeared that unrelated work items were nested within each other. Most of this change is adding tests. The ActivityIdIsZeroedOnThreadSwitchOut test is the one being fixed. The remainder of the new tests pass both before and after the change. I added those to better describe some of the existing behavior and add a little confidence that the change didn't have unintended effects elsewhere. As best I can tell the Activity tracking feature didn't have any testing previously and there is certainly still room for improvement on test coverage.
Prevent ActivityId leak across threads Fixes #51608 Previously running an async task with two or more levels of nesting on the EventSource activity stack would leave the ActivityID set on the old thread after being swapped out at a yield point. This meant that threadpool threads would often have stale ActivityIDs present when they began a new work item. This in turn produced non-sensical logs where it appeared that unrelated work items were nested within each other. Most of this change is adding tests. The ActivityIdIsZeroedOnThreadSwitchOut test is the one being fixed. The remainder of the new tests pass both before and after the change. I added those to better describe some of the existing behavior and add a little confidence that the change didn't have unintended effects elsewhere. As best I can tell the Activity tracking feature didn't have any testing previously and there is certainly still room for improvement on test coverage.
Description
The TplEventSource offers Activity tracking which is intended correlate how work moves across threads and how it is nested within pairs of EventSource Start/Stop events. If a thread transitions from having two or more levels of activity nesting to having none the Activity information isn't properly reset. In the common case of async work happening on the threadpool the incorrect activity information makes it appear that work items that happen to follow each other sequentially on the same thread are part of the same operation when they are actually unrelated.
Repro
Note: if you run this under VS debugger on Windows an ETW session is started that sets some different keywords on TplEventSource and may interfere with Activity tracing. That behavior is a separate issue, #39353. If needed you can undo the debugger modifications by stopping just inside of Main and evaluating
System.Threading.Tasks.TplEventSource.Log.TasksSetActivityIds = false
in the immediate window before continuing.Expected Behavior
The assert never fires. There is no enclosing scope of Start/Stop events that should have set a non-zero ActivityID.
Actual Behavior
The assert will fail once a threadpool thread is reused to run another invocation of DoWork(). The ID will be the same one that was present at the time await Task.Yield() was run in the previous work item on the threadpool thread. This makes it appear that the new work item is the continuation of the previous one.
Cause
There is code in ActivityTracer.cs that keeps the ActivityInfo chain stored in AsyncLocal m_current synchronized with the Windows ETW thread-local ActivityID. There are three kinds of updates this method might get:
The 3rd case is not fully handled. In the repro above at the point a threadpool thread executes the 'await Task.Yield()' it will switch from m_current pointing at a stack of two ActivityInfos back to the thread's default ExecutionContext which has m_current null. Because there are >= 2 ActivityInfos, prev.m_creator != null and the code does not enter the the first block. Because cur == null the code also does not enter the second block. The final comment says that the code is intentionally leaving the id alone rather than reseting it though it doesn't provide any rationale why that would be a better outcome. So far I haven't come up with any scenario that appears better served by the current behavior.
cc @davmason @sywhang @josalem @brianrob
The text was updated successfully, but these errors were encountered: