EventSource Activity ID isn't always reset across async await #51608

noahfalk · 2021-04-21T02:41:12Z

Description

The TplEventSource offers Activity tracking which is intended correlate how work moves across threads and how it is nested within pairs of EventSource Start/Stop events. If a thread transitions from having two or more levels of activity nesting to having none the Activity information isn't properly reset. In the common case of async work happening on the threadpool the incorrect activity information makes it appear that work items that happen to follow each other sequentially on the same thread are part of the same operation when they are actually unrelated.

Repro

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Threading.Tasks;

namespace ConsoleApp22
{
    class Program
    {
        static void Main(string[] args)
        {
            MyEventListener l = new MyEventListener();
            List<Task> tasks = new List<Task>();
            for(int i = 0; i < 50; i++)
            {
                tasks.Add(Task.Run(DoWork));
            }

            Task.WaitAll(tasks.ToArray());
        }

        static async Task DoWork()
        {
            Guid id1 = EventSource.CurrentThreadActivityId;
            Trace.Assert(id1 == Guid.Empty);                                    // this will fail on some (not all) invocations of DoWork()
            MyEventSource.Log.FooStart("starting");
            MyEventSource.Log.FooStart("starting nested");
            await Task.Yield();
            MyEventSource.Log.FooStop("Stopping nested");
            MyEventSource.Log.FooStop("Stopping");
        }
    }

    [EventSource(Name ="MyEventSource")]
    class MyEventSource : EventSource
    {
        public static MyEventSource Log = new MyEventSource();

        [Event(1)]
        public void Message1(string arg1) => WriteEvent(1, arg1);

        [Event(2, ActivityOptions = EventActivityOptions.Recursive)]
        public void FooStart(string arg1) => WriteEvent(2, arg1);

        [Event(3, ActivityOptions = EventActivityOptions.Recursive)]
        public void FooStop(string arg1) => WriteEvent(3, arg1);
    }

    class MyEventListener : EventListener
    {
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            if(eventSource.Name == "System.Threading.Tasks.TplEventSource")
            {
                EnableEvents(eventSource, EventLevel.LogAlways, (EventKeywords)0x80);
            }
            else if(eventSource.Name == "MyEventSource")
            {
                EnableEvents(eventSource, EventLevel.Informational);
            }
        }
    }
}

Note: if you run this under VS debugger on Windows an ETW session is started that sets some different keywords on TplEventSource and may interfere with Activity tracing. That behavior is a separate issue, #39353. If needed you can undo the debugger modifications by stopping just inside of Main and evaluating System.Threading.Tasks.TplEventSource.Log.TasksSetActivityIds = false in the immediate window before continuing.

Expected Behavior

The assert never fires. There is no enclosing scope of Start/Stop events that should have set a non-zero ActivityID.

Actual Behavior

The assert will fail once a threadpool thread is reused to run another invocation of DoWork(). The ID will be the same one that was present at the time await Task.Yield() was run in the previous work item on the threadpool thread. This makes it appear that the new work item is the continuation of the previous one.

Cause

There is code in ActivityTracer.cs that keeps the ActivityInfo chain stored in AsyncLocal m_current synchronized with the Windows ETW thread-local ActivityID. There are three kinds of updates this method might get:

A new ActivityInfo is pushed on the stack
An existing ActivityInfo is popped off the stack
The underlying thread switches to running code with a new ExecutionContext which may have a completely different value of the m_current.

The 3rd case is not fully handled. In the repro above at the point a threadpool thread executes the 'await Task.Yield()' it will switch from m_current pointing at a stack of two ActivityInfos back to the thread's default ExecutionContext which has m_current null. Because there are >= 2 ActivityInfos, prev.m_creator != null and the code does not enter the the first block. Because cur == null the code also does not enter the second block. The final comment says that the code is intentionally leaving the id alone rather than reseting it though it doesn't provide any rationale why that would be a better outcome. So far I haven't come up with any scenario that appears better served by the current behavior.

cc @davmason @sywhang @josalem @brianrob

The text was updated successfully, but these errors were encountered:

ghost · 2021-04-21T02:41:16Z

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

The TplEventSource offers Activity tracking which is intended correlate how work moves across threads and how it is nested within pairs of EventSource Start/Stop events. If a thread transitions from having two or more levels of activity nesting to having none the Activity information isn't properly reset. In the common case of async work happening on the threadpool the incorrect activity information makes it appear that work items that happen to follow each other sequentially on the same thread are part of the same operation when they are actually unrelated.

Repro

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Threading.Tasks;

namespace ConsoleApp22
{
    class Program
    {
        static void Main(string[] args)
        {
            MyEventListener l = new MyEventListener();
            List<Task> tasks = new List<Task>();
            for(int i = 0; i < 50; i++)
            {
                tasks.Add(Task.Run(DoWork));
            }

            Task.WaitAll(tasks.ToArray());
        }

        static async Task DoWork()
        {
            Guid id1 = EventSource.CurrentThreadActivityId;
            Trace.Assert(id1 == Guid.Empty);                                    // this will fail on some (not all) invocations of DoWork()
            MyEventSource.Log.FooStart("starting");
            MyEventSource.Log.FooStart("starting nested");
            await Task.Yield();
            MyEventSource.Log.FooStop("Stopping nested");
            MyEventSource.Log.FooStop("Stopping");
        }
    }

    [EventSource(Name ="MyEventSource")]
    class MyEventSource : EventSource
    {
        public static MyEventSource Log = new MyEventSource();

        [Event(1)]
        public void Message1(string arg1) => WriteEvent(1, arg1);

        [Event(2, ActivityOptions = EventActivityOptions.Recursive)]
        public void FooStart(string arg1) => WriteEvent(2, arg1);

        [Event(3, ActivityOptions = EventActivityOptions.Recursive)]
        public void FooStop(string arg1) => WriteEvent(3, arg1);
    }

    class MyEventListener : EventListener
    {
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            if(eventSource.Name == "System.Threading.Tasks.TplEventSource")
            {
                EnableEvents(eventSource, EventLevel.LogAlways, (EventKeywords)0x80);
            }
            else if(eventSource.Name == "MyEventSource")
            {
                EnableEvents(eventSource, EventLevel.Informational);
            }
        }
    }
}

Note: if you run this under VS debugger on Windows an ETW session is started that sets some different keywords on TplEventSource and may interfere with Activity tracing. That behavior is a separate issue, #39353. If needed you can undo the debugger modifications by stopping just inside of Main and evaluating System.Threading.Tasks.TplEventSource.Log.TasksSetActivityIds = false in the immediate window before continuing.

Expected Behavior

The assert never fires. There is no enclosing scope of Start/Stop events that should have set a non-zero ActivityID.

Actual Behavior

The assert will fail once a threadpool thread is reused to run another invocation of DoWork(). The ID will be the same one that was present at the time await Task.Yield() was run in the previous work item on the threadpool thread. This makes it appear that the new work item is the continuation of the previous one.

Cause

There is code in ActivityTracer.cs that keeps the ActivityInfo chain stored in AsyncLocal m_current synchronized with the Windows ETW thread-local ActivityID. There are three kinds of updates this method might get:

A new ActivityInfo is pushed on the stack
An existing ActivityInfo is popped off the stack
The underlying thread switches to running code with a new ExecutionContext which may have a completely different value of the m_current.

The 3rd case is not fully handled. In the repro above at the point a threadpool thread executes the 'await Task.Yield()' it will switch from m_current pointing at a stack of two ActivityInfos back to the thread's default ExecutionContext which has m_current null. Because there are >= 2 ActivityInfos, prev.m_creator != null and the code does not enter the the first block. Because cur == null the code also does not enter the second block. The final comment says that the code is intentionally leaving the id alone rather than reseting it though it doesn't provide any rationale why that would be a better outcome. So far I haven't come up with any scenario that appears better served by the current behavior.

cc @davmason @sywhang @josalem @brianrob

Author:	noahfalk
Assignees:	-
Labels:	`area-System.Diagnostics.Tracing`
Milestone:	6.0.0

noahfalk · 2021-04-22T00:53:39Z

@brianrob - would it concern you if I changed this code to set activityID = 0 in the case we are switching the thread back to a default ExecutionContext?

brianrob · 2021-04-22T23:19:44Z

It would seem right to me to set the activity ID back to 0. To make sure I understand properly: In this case, we're essentially pausing the current task and "freeing" the thread to do other work. Freeing the thread should clear the activity ID for any thread pool thread, since that thread logically belongs to the thread pool and could dequeue work that is completely unrelated to what it was just doing. We also wouldn't want the cost of dequeueing work to be associated with the previous work item.

The one scenario that I'm trying to construct, though I'm not sure I can, is if we're on a non-thread pool thread and "pausing" a Task. I guess in this case if we switch back to the default ExecutionContext, then the activity ID should still be 0, because if it was set to something non-zero through an API that should flow the ID then we'd have an activity chain present, and could refer back to it. Does that make sense? I'm not 100% sure about my last statement here...

noahfalk · 2021-04-23T23:03:52Z

To make sure I understand properly:

@brianrob - thanks and yep that is an accurate description.

because if it was set to something non-zero through an API that should flow the ID then we'd have an activity chain present, and could refer back to it.

It depends how it was set which is a murkier part of the story. Currently there are multiple ways you could set ActivityID:

Use EventSource events with Start/Stop + TplEventSource keyword 0x80
Use EventSource.CurrentThreadActivityID
Use native win32 ETW APIs

The first one follows async flow (this bug aside). The 2nd and 3rd are thread-local (not async-local) mechanisms. However if you mix the first mechanism with either of the latter two things start getting interesting. I haven't seen the intended behavior documented anywhere, but based on my understanding of the code what happens in practice is using an EventSource start event converts the last set activity ID from thread-local behavior to async-local behavior. Ie both these tests pass:

        static async Task PropagateActivityID()
        {
            Guid id = Guid.NewGuid();
            EventSource.SetCurrentThreadActivityId(id);
            MyEventSource.Log.FooStart("starting");
            await Task.Yield();
            MyEventSource.Log.FooStop("Stopping");
            Debug.Assert(EventSource.CurrentThreadActivityId == id);
        }

        static async Task NoPropagateActivityID()
        {
            Guid id = Guid.NewGuid();
            EventSource.SetCurrentThreadActivityId(id);
            await Task.Yield();
            Debug.Assert(EventSource.CurrentThreadActivityId != id);
        }

Fixes dotnet#51608 Previously running an async task with two or more levels of nesting on the EventSource activity stack would leave the ActivityID set on the old thread after being swapped out at a yield point. This meant that threadpool threads would often have stale ActivityIDs present when they began a new work item. This in turn produced non-sensical logs where it appeared that unrelated work items were nested within each other. Most of this change is adding tests. The ActivityIdIsZeroedOnThreadSwitchOut test is the one being fixed. The remainder of the new tests pass both before and after the change. I added those to better describe some of the existing behavior and add a little confidence that the change didn't have unintended effects elsewhere. As best I can tell the Activity tracking feature didn't have any testing previously and there is certainly still room for improvement on test coverage.

Prevent ActivityId leak across threads Fixes #51608 Previously running an async task with two or more levels of nesting on the EventSource activity stack would leave the ActivityID set on the old thread after being swapped out at a yield point. This meant that threadpool threads would often have stale ActivityIDs present when they began a new work item. This in turn produced non-sensical logs where it appeared that unrelated work items were nested within each other. Most of this change is adding tests. The ActivityIdIsZeroedOnThreadSwitchOut test is the one being fixed. The remainder of the new tests pass both before and after the change. I added those to better describe some of the existing behavior and add a little confidence that the change didn't have unintended effects elsewhere. As best I can tell the Activity tracking feature didn't have any testing previously and there is certainly still room for improvement on test coverage.

noahfalk added the area-System.Diagnostics.Tracing label Apr 21, 2021

noahfalk added this to the 6.0.0 milestone Apr 21, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 21, 2021

tommcdon removed the untriaged New issue has not been triaged by the area owner label Apr 22, 2021

noahfalk mentioned this issue Jul 14, 2021

Prevent ActivityId leak across threads #55625

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 14, 2021

noahfalk self-assigned this Jul 14, 2021

noahfalk closed this as completed in #55625 Jul 15, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 15, 2021

ghost locked as resolved and limited conversation to collaborators Aug 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EventSource Activity ID isn't always reset across async await #51608

EventSource Activity ID isn't always reset across async await #51608

noahfalk commented Apr 21, 2021

ghost commented Apr 21, 2021

Description

Repro

Expected Behavior

Actual Behavior

Cause

noahfalk commented Apr 22, 2021

brianrob commented Apr 22, 2021

noahfalk commented Apr 23, 2021

EventSource Activity ID isn't always reset across async await #51608

EventSource Activity ID isn't always reset across async await #51608

Comments

noahfalk commented Apr 21, 2021

Description

Repro

Expected Behavior

Actual Behavior

Cause

ghost commented Apr 21, 2021

Description

Repro

Expected Behavior

Actual Behavior

Cause

noahfalk commented Apr 22, 2021

brianrob commented Apr 22, 2021

noahfalk commented Apr 23, 2021