Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EventSource Activity ID isn't always reset across async await #51608

Closed
noahfalk opened this issue Apr 21, 2021 · 4 comments · Fixed by #55625
Closed

EventSource Activity ID isn't always reset across async await #51608

noahfalk opened this issue Apr 21, 2021 · 4 comments · Fixed by #55625

Comments

@noahfalk
Copy link
Member

Description

The TplEventSource offers Activity tracking which is intended correlate how work moves across threads and how it is nested within pairs of EventSource Start/Stop events. If a thread transitions from having two or more levels of activity nesting to having none the Activity information isn't properly reset. In the common case of async work happening on the threadpool the incorrect activity information makes it appear that work items that happen to follow each other sequentially on the same thread are part of the same operation when they are actually unrelated.

Repro

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Threading.Tasks;

namespace ConsoleApp22
{
    class Program
    {
        static void Main(string[] args)
        {
            MyEventListener l = new MyEventListener();
            List<Task> tasks = new List<Task>();
            for(int i = 0; i < 50; i++)
            {
                tasks.Add(Task.Run(DoWork));
            }

            Task.WaitAll(tasks.ToArray());
        }

        static async Task DoWork()
        {
            Guid id1 = EventSource.CurrentThreadActivityId;
            Trace.Assert(id1 == Guid.Empty);                                    // this will fail on some (not all) invocations of DoWork()
            MyEventSource.Log.FooStart("starting");
            MyEventSource.Log.FooStart("starting nested");
            await Task.Yield();
            MyEventSource.Log.FooStop("Stopping nested");
            MyEventSource.Log.FooStop("Stopping");
        }
    }

    [EventSource(Name ="MyEventSource")]
    class MyEventSource : EventSource
    {
        public static MyEventSource Log = new MyEventSource();

        [Event(1)]
        public void Message1(string arg1) => WriteEvent(1, arg1);

        [Event(2, ActivityOptions = EventActivityOptions.Recursive)]
        public void FooStart(string arg1) => WriteEvent(2, arg1);

        [Event(3, ActivityOptions = EventActivityOptions.Recursive)]
        public void FooStop(string arg1) => WriteEvent(3, arg1);
    }

    class MyEventListener : EventListener
    {
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            if(eventSource.Name == "System.Threading.Tasks.TplEventSource")
            {
                EnableEvents(eventSource, EventLevel.LogAlways, (EventKeywords)0x80);
            }
            else if(eventSource.Name == "MyEventSource")
            {
                EnableEvents(eventSource, EventLevel.Informational);
            }
        }
    }
}

Note: if you run this under VS debugger on Windows an ETW session is started that sets some different keywords on TplEventSource and may interfere with Activity tracing. That behavior is a separate issue, #39353. If needed you can undo the debugger modifications by stopping just inside of Main and evaluating System.Threading.Tasks.TplEventSource.Log.TasksSetActivityIds = false in the immediate window before continuing.

Expected Behavior

The assert never fires. There is no enclosing scope of Start/Stop events that should have set a non-zero ActivityID.

Actual Behavior

The assert will fail once a threadpool thread is reused to run another invocation of DoWork(). The ID will be the same one that was present at the time await Task.Yield() was run in the previous work item on the threadpool thread. This makes it appear that the new work item is the continuation of the previous one.

Cause

There is code in ActivityTracer.cs that keeps the ActivityInfo chain stored in AsyncLocal m_current synchronized with the Windows ETW thread-local ActivityID. There are three kinds of updates this method might get:

  1. A new ActivityInfo is pushed on the stack
  2. An existing ActivityInfo is popped off the stack
  3. The underlying thread switches to running code with a new ExecutionContext which may have a completely different value of the m_current.

The 3rd case is not fully handled. In the repro above at the point a threadpool thread executes the 'await Task.Yield()' it will switch from m_current pointing at a stack of two ActivityInfos back to the thread's default ExecutionContext which has m_current null. Because there are >= 2 ActivityInfos, prev.m_creator != null and the code does not enter the the first block. Because cur == null the code also does not enter the second block. The final comment says that the code is intentionally leaving the id alone rather than reseting it though it doesn't provide any rationale why that would be a better outcome. So far I haven't come up with any scenario that appears better served by the current behavior.

cc @davmason @sywhang @josalem @brianrob

@noahfalk noahfalk added this to the 6.0.0 milestone Apr 21, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 21, 2021
@ghost
Copy link

ghost commented Apr 21, 2021

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

The TplEventSource offers Activity tracking which is intended correlate how work moves across threads and how it is nested within pairs of EventSource Start/Stop events. If a thread transitions from having two or more levels of activity nesting to having none the Activity information isn't properly reset. In the common case of async work happening on the threadpool the incorrect activity information makes it appear that work items that happen to follow each other sequentially on the same thread are part of the same operation when they are actually unrelated.

Repro

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Threading.Tasks;

namespace ConsoleApp22
{
    class Program
    {
        static void Main(string[] args)
        {
            MyEventListener l = new MyEventListener();
            List<Task> tasks = new List<Task>();
            for(int i = 0; i < 50; i++)
            {
                tasks.Add(Task.Run(DoWork));
            }

            Task.WaitAll(tasks.ToArray());
        }

        static async Task DoWork()
        {
            Guid id1 = EventSource.CurrentThreadActivityId;
            Trace.Assert(id1 == Guid.Empty);                                    // this will fail on some (not all) invocations of DoWork()
            MyEventSource.Log.FooStart("starting");
            MyEventSource.Log.FooStart("starting nested");
            await Task.Yield();
            MyEventSource.Log.FooStop("Stopping nested");
            MyEventSource.Log.FooStop("Stopping");
        }
    }

    [EventSource(Name ="MyEventSource")]
    class MyEventSource : EventSource
    {
        public static MyEventSource Log = new MyEventSource();

        [Event(1)]
        public void Message1(string arg1) => WriteEvent(1, arg1);

        [Event(2, ActivityOptions = EventActivityOptions.Recursive)]
        public void FooStart(string arg1) => WriteEvent(2, arg1);

        [Event(3, ActivityOptions = EventActivityOptions.Recursive)]
        public void FooStop(string arg1) => WriteEvent(3, arg1);
    }

    class MyEventListener : EventListener
    {
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            if(eventSource.Name == "System.Threading.Tasks.TplEventSource")
            {
                EnableEvents(eventSource, EventLevel.LogAlways, (EventKeywords)0x80);
            }
            else if(eventSource.Name == "MyEventSource")
            {
                EnableEvents(eventSource, EventLevel.Informational);
            }
        }
    }
}

Note: if you run this under VS debugger on Windows an ETW session is started that sets some different keywords on TplEventSource and may interfere with Activity tracing. That behavior is a separate issue, #39353. If needed you can undo the debugger modifications by stopping just inside of Main and evaluating System.Threading.Tasks.TplEventSource.Log.TasksSetActivityIds = false in the immediate window before continuing.

Expected Behavior

The assert never fires. There is no enclosing scope of Start/Stop events that should have set a non-zero ActivityID.

Actual Behavior

The assert will fail once a threadpool thread is reused to run another invocation of DoWork(). The ID will be the same one that was present at the time await Task.Yield() was run in the previous work item on the threadpool thread. This makes it appear that the new work item is the continuation of the previous one.

Cause

There is code in ActivityTracer.cs that keeps the ActivityInfo chain stored in AsyncLocal m_current synchronized with the Windows ETW thread-local ActivityID. There are three kinds of updates this method might get:

  1. A new ActivityInfo is pushed on the stack
  2. An existing ActivityInfo is popped off the stack
  3. The underlying thread switches to running code with a new ExecutionContext which may have a completely different value of the m_current.

The 3rd case is not fully handled. In the repro above at the point a threadpool thread executes the 'await Task.Yield()' it will switch from m_current pointing at a stack of two ActivityInfos back to the thread's default ExecutionContext which has m_current null. Because there are >= 2 ActivityInfos, prev.m_creator != null and the code does not enter the the first block. Because cur == null the code also does not enter the second block. The final comment says that the code is intentionally leaving the id alone rather than reseting it though it doesn't provide any rationale why that would be a better outcome. So far I haven't come up with any scenario that appears better served by the current behavior.

cc @davmason @sywhang @josalem @brianrob

Author: noahfalk
Assignees: -
Labels:

area-System.Diagnostics.Tracing

Milestone: 6.0.0

@noahfalk
Copy link
Member Author

@brianrob - would it concern you if I changed this code to set activityID = 0 in the case we are switching the thread back to a default ExecutionContext?

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Apr 22, 2021
@brianrob
Copy link
Member

It would seem right to me to set the activity ID back to 0. To make sure I understand properly: In this case, we're essentially pausing the current task and "freeing" the thread to do other work. Freeing the thread should clear the activity ID for any thread pool thread, since that thread logically belongs to the thread pool and could dequeue work that is completely unrelated to what it was just doing. We also wouldn't want the cost of dequeueing work to be associated with the previous work item.

The one scenario that I'm trying to construct, though I'm not sure I can, is if we're on a non-thread pool thread and "pausing" a Task. I guess in this case if we switch back to the default ExecutionContext, then the activity ID should still be 0, because if it was set to something non-zero through an API that should flow the ID then we'd have an activity chain present, and could refer back to it. Does that make sense? I'm not 100% sure about my last statement here...

@noahfalk
Copy link
Member Author

To make sure I understand properly:

@brianrob - thanks and yep that is an accurate description.

because if it was set to something non-zero through an API that should flow the ID then we'd have an activity chain present, and could refer back to it.

It depends how it was set which is a murkier part of the story. Currently there are multiple ways you could set ActivityID:

  1. Use EventSource events with Start/Stop + TplEventSource keyword 0x80
  2. Use EventSource.CurrentThreadActivityID
  3. Use native win32 ETW APIs

The first one follows async flow (this bug aside). The 2nd and 3rd are thread-local (not async-local) mechanisms. However if you mix the first mechanism with either of the latter two things start getting interesting. I haven't seen the intended behavior documented anywhere, but based on my understanding of the code what happens in practice is using an EventSource start event converts the last set activity ID from thread-local behavior to async-local behavior. Ie both these tests pass:

        static async Task PropagateActivityID()
        {
            Guid id = Guid.NewGuid();
            EventSource.SetCurrentThreadActivityId(id);
            MyEventSource.Log.FooStart("starting");
            await Task.Yield();
            MyEventSource.Log.FooStop("Stopping");
            Debug.Assert(EventSource.CurrentThreadActivityId == id);
        }

        static async Task NoPropagateActivityID()
        {
            Guid id = Guid.NewGuid();
            EventSource.SetCurrentThreadActivityId(id);
            await Task.Yield();
            Debug.Assert(EventSource.CurrentThreadActivityId != id);
        }

noahfalk added a commit to noahfalk/runtime that referenced this issue Jul 14, 2021
Fixes dotnet#51608

Previously running an async task with two or more levels of nesting on the EventSource activity stack would leave the ActivityID set
on the old thread after being swapped out at a yield point. This meant that threadpool threads would often have stale ActivityIDs
present when they began a new work item. This in turn produced non-sensical logs where it appeared that unrelated work items were
nested within each other.

Most of this change is adding tests. The ActivityIdIsZeroedOnThreadSwitchOut test is the one being fixed. The remainder of the new
tests pass both before and after the change. I added those to better describe some of the existing behavior and add a little confidence
that the change didn't have unintended effects elsewhere. As best I can tell the Activity tracking feature didn't have any testing previously
and there is certainly still room for improvement on test coverage.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 14, 2021
@noahfalk noahfalk self-assigned this Jul 14, 2021
noahfalk added a commit that referenced this issue Jul 15, 2021
Prevent ActivityId leak across threads

Fixes #51608

Previously running an async task with two or more levels of nesting on the EventSource activity stack would leave the ActivityID set on the old thread after being swapped out at a yield point. This meant that threadpool threads would often have stale ActivityIDs present when they began a new work item. This in turn produced non-sensical logs where it appeared that unrelated work items were nested within each other.

Most of this change is adding tests. The ActivityIdIsZeroedOnThreadSwitchOut test is the one being fixed. The remainder of the new tests pass both before and after the change. I added those to better describe some of the existing behavior and add a little confidence that the change didn't have unintended effects elsewhere. As best I can tell the Activity tracking feature didn't have any testing previously and there is certainly still room for improvement on test coverage.
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 15, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants