Support EventPipe high core count CPU scalability #1412

Closed
noahfalk opened this issue Apr 17, 2019 · 12 comments
@noahfalk (Member)

We had a report from a user on a machine with a 20-core CPU group that a hot lock in EventPipe (EventPipeBufferManager::m_lock) was causing excessive spinning in EventPipeBufferManager::AllocateBufferForThread. There was conflicting evidence about whether the system was sustaining a high event load. Regardless of the specifics of that case, we need to validate that EventPipe scales well.

We should create a performance test that exercises EventPipe from a large number of cores (>= 20), reading events with both EventListener and IPC. We should then compare a low-load case (1 event/sec/thread) with a high-load case (the maximum sustainable event rate on all threads) by looking at the latency distribution of calls to WriteEvent. A first stab at a goal is that P50 latency increases by no more than 10% and P99 latency by no more than 50%.
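A rough sketch of the shape such a test could take is below. This is only an illustration, not an existing test in the repo: the EventSource/EventListener names, event counts, and buffer of latency samples are placeholders, and the IPC variant would attach an out-of-process EventPipe session (e.g. via dotnet-trace) instead of the in-process listener.

```csharp
// Hypothetical sketch of a WriteEvent latency scalability test; not an existing test in this repo.
using System;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Linq;
using System.Threading;

[EventSource(Name = "ScalabilityTestSource")]
sealed class ScalabilityTestSource : EventSource
{
    public static readonly ScalabilityTestSource Log = new ScalabilityTestSource();

    [Event(1)]
    public void TestEvent(int payload) => WriteEvent(1, payload);
}

// In-process reader; the IPC variant would instead attach an out-of-process EventPipe session.
sealed class EnablingListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "ScalabilityTestSource")
            EnableEvents(source, EventLevel.Informational);
    }
}

class Program
{
    static void Main()
    {
        using var listener = new EnablingListener();
        int threadCount = Environment.ProcessorCount;   // the scenario of interest is >= 20 cores
        int eventsPerThread = 100_000;
        var latenciesTicks = new long[threadCount][];

        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            int idx = t;
            threads[t] = new Thread(() =>
            {
                var samples = new long[eventsPerThread];
                for (int i = 0; i < eventsPerThread; i++)
                {
                    long start = Stopwatch.GetTimestamp();
                    ScalabilityTestSource.Log.TestEvent(i);     // the call whose latency we care about
                    samples[i] = Stopwatch.GetTimestamp() - start;
                    // Low-load case: sleep here to throttle to ~1 event/sec/thread.
                }
                latenciesTicks[idx] = samples;
            });
            threads[t].Start();
        }
        foreach (var th in threads) th.Join();

        var all = latenciesTicks.SelectMany(x => x).OrderBy(x => x).ToArray();
        double ticksToUs = 1_000_000.0 / Stopwatch.Frequency;
        Console.WriteLine($"P50: {all[all.Length / 2] * ticksToUs:F2} us");
        Console.WriteLine($"P99: {all[(int)(all.Length * 0.99)] * ticksToUs:F2} us");
    }
}
```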

@brianrob (Member)

One thing to consider when investigating this: if there is in fact high contention on EventPipeBufferManager::m_lock in EventPipeBufferManager::AllocateBufferForThread, it's likely because the event load is high enough that many threads are trying to allocate buffers at the same time and getting stuck behind this lock. The simplest way to address this is to increase the buffer size or modify the buffer size growth algorithm, so that threads come back through this slow path less often.

@msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@noahfalk (Member, Author)

We had another report of this issue come in. I suspected it would be a case where the buffer was saturated, but looking over the traces, events aren't being dropped. Our best guess is a lock convoy as the number of threads rises, and because this is a spin lock, waiting is very CPU intensive. We either need to decrease contention on the lock, eliminate it completely using lockless strategies, or replace it with a lock that has bounded spinning.
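For illustration, the third option (a lock that spins only a bounded amount before blocking) is roughly the pattern below. This is a conceptual C# sketch with made-up names; the actual EventPipe lock is native code.

```csharp
// Conceptual sketch of the "bounded spinning" option; the actual EventPipe lock is native code.
using System.Threading;

sealed class BoundedSpinLock
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public void Enter()
    {
        var spinner = new SpinWait();
        // Spin only a bounded number of times; SpinWait yields the CPU as it progresses.
        for (int i = 0; i < 100; i++)
        {
            if (_gate.Wait(0))
                return;              // acquired without blocking
            spinner.SpinOnce();
        }
        _gate.Wait();                // give up spinning and block instead of burning CPU
    }

    public void Exit() => _gate.Release();
}
```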

@gleocadie commented Aug 26, 2020

@noahfalk @brianrob
I experienced the same issue, though the cause may be specific to us: the number of threads varies between low- and high-traffic periods.

In our setup, we have a companion application that sets up the EventPipe session and retrieves events from the stream/socket. The companion application configures the EventPipe session to receive informational events for the GC, Contention, and Exception keywords. The main application runs in a container with 30 CPUs allocated.
The application can live several days before a restart, a new version deployment, etc. After a few days, application latency gets worse and worse.
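For reference, an EventPipe session with those keywords can be enabled out of process roughly like this. It is an illustrative sketch using the Microsoft.Diagnostics.NETCore.Client library, not the companion application's actual code.

```csharp
// Illustrative sketch only (not the companion application's actual code).
// Requires the Microsoft.Diagnostics.NETCore.Client NuGet package.
using System.Collections.Generic;
using System.Diagnostics.Tracing;
using Microsoft.Diagnostics.NETCore.Client;

static class SessionStarter
{
    public static EventPipeSession Start(int targetPid)
    {
        // Standard CLR runtime keywords: GC = 0x1, Contention = 0x4000, Exception = 0x8000.
        const long keywords = 0x1 | 0x4000 | 0x8000;
        var providers = new List<EventPipeProvider>
        {
            new EventPipeProvider("Microsoft-Windows-DotNETRuntime",
                                  EventLevel.Informational, keywords)
        };
        var client = new DiagnosticsClient(targetPid);
        // The returned session exposes an EventStream that the companion process reads from.
        return client.StartEventPipeSession(providers);
    }
}
```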

When event emission was deactivated, the issue did not occur. We took CPU samples and many of them looked like this:

dotnet    84/2403  1747745.955015:    1001001 cpu-clock:u:
            7f764b9e73b7 SpinLock::GetLock
            7f764bbab656 EventPipeBufferManager::AllocateBufferForThread
            7f764bbac055 EventPipeBufferManager::WriteEvent
            7f764ba5041c EventPipeSession::WriteEventBuffered
            7f764ba4a41d EventPipe::WriteEventInternal
            7f764ba4a162 EventPipe::WriteEvent
            7f764bdb9b8a EventPipeWriteEventContentionStart_V1
            7f764b9f13f2 AwareLock::EnterEpilogHelper
            7f764b9f10de AwareLock::Enter
            7f764ba66286 JIT_MonEnter_Helper
            7f764ba666a8 JIT_MonReliableEnter_Portable
            7f75d6ef4ee9 instance void [System.Private.CoreLib] System.Threading.TimerQueueTimer::Fire(bool)[OptimizedTier1]
            7f75d6e8125c bool [System.Private.CoreLib] System.Threading.ThreadPoolWorkQueue::Dispatch()[OptimizedTier1]
            7f764baf51d3 CallDescrWorkerInternal
            7f764ba26d8b MethodDescCallSite::CallTargetWorker
            7f764bb9e580 QueueUserWorkItemManagedCallback
            7f764b9fa61a ManagedThreadBase_DispatchOuter
            7f764b9fac7d ManagedThreadBase::ThreadPool
            7f764bb88a15 ManagedPerAppDomainTPCount::DispatchWorkItem
            7f764ba16559 ThreadpoolMgr::WorkerThreadStart
            7f764bd9b312 CorUnix::CPalThread::ThreadEntry
            7f764d4edea5 start_thread

This means that every event emission was impacted by this lock. We measure the time between when an event is emitted and when it is actually processed by the companion application, and we can see that this delay grows over time.

I exposed the m_pThreadSessionStateList size and saw that this list was continuously growing: new threads were added, but dead threads were never removed from the list.
As a workaround, our companion application, which reads the events remotely, restarts the session every 24h. This seems to fix the issue for now.

Hope it helps.

@noahfalk (Member, Author)

Thanks @gleocadie! The intended operation is that a thread reuses a thread-local buffer for ~99.9% of events, and only in the remaining ~0.1% of cases does it exhaust the local buffer and need to acquire the global lock to get a new block of memory. Your report is part of growing evidence that either that 0.1% still causes too much contention, or we have some bad behavior causing the lock to be acquired more frequently than intended.

I exposed the m_pThreadSessionStateList size and I saw that this list was continuously growing

That matches my understanding too. As far as I can tell, the time the lock is held shouldn't vary with the size of that list, so I wasn't expecting it to be the root cause here. However, it's always possible I've overlooked a connection. Even if it isn't the root cause, that growing list could cause other issues and deserves to be fixed.
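For readers following the thread, the intended write path described above has roughly the following shape. This is a simplified C# sketch of the pattern only; the real implementation is native code in EventPipeBufferManager, and the 64 KB buffer size is an arbitrary placeholder.

```csharp
// Simplified sketch of the pattern only; the real implementation is native code
// in EventPipeBufferManager, and 64 KB is an arbitrary placeholder size.
using System;

sealed class BufferManagerSketch
{
    [ThreadStatic] private static byte[] t_localBuffer;
    [ThreadStatic] private static int t_offset;

    private readonly object _globalLock = new object();   // stands in for EventPipeBufferManager::m_lock

    public void WriteEvent(byte[] payload)
    {
        // Fast path (~99.9% of events): copy into the thread-local buffer without taking any lock.
        if (t_localBuffer != null && t_offset + payload.Length <= t_localBuffer.Length)
        {
            payload.CopyTo(t_localBuffer, t_offset);
            t_offset += payload.Length;
            return;
        }

        // Slow path (~0.1% of events): the local buffer is exhausted, so take the global
        // lock to get a new block of memory. Under high core counts this is the lock that
        // shows up as SpinLock::GetLock in the stacks above.
        lock (_globalLock)
        {
            t_localBuffer = new byte[64 * 1024];
            t_offset = 0;
        }
        payload.CopyTo(t_localBuffer, 0);
        t_offset = payload.Length;
    }
}
```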

@gleocadie commented Aug 27, 2020

@noahfalk
Thanks for the explanation, it helps confirm the intent I read in the code.

That matches my understanding too. As far as I can tell, the time the lock is held shouldn't vary with the size of that list, so I wasn't expecting it to be the root cause here. However, it's always possible I've overlooked a connection. Even if it isn't the root cause, that growing list could cause other issues and deserves to be fixed.

I agree it's not the root cause and does deserve to be fixed.
I read that when events are written to the wire, a call is made to MoveNextEventAnyThread. In a first pass, the lock is taken and a loop goes over the whole list to find the next buffer to send. This makes the time the lock is held dependent on the size of the list, which will eventually impact the thread(s) waiting on the lock if the list keeps growing.
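Roughly, the pattern being described looks like the simplified sketch below (not the actual native code; the type names are placeholders): the reader scans every per-thread entry while holding the same lock that writers need on their slow path, so the hold time grows with the size of the list.

```csharp
// Simplified sketch of the pattern only; the real MoveNextEventAnyThread is native code,
// and these type names are placeholders.
using System.Collections.Generic;

sealed class PendingEvent
{
    public long Timestamp;
}

sealed class ThreadSessionState
{
    public PendingEvent PeekOldestEvent() => null;   // placeholder for the sketch
}

sealed class ReaderSketch
{
    private readonly object _lock = new object();    // the same lock writers need on their slow path
    private readonly List<ThreadSessionState> _threadStates = new List<ThreadSessionState>();

    public PendingEvent MoveNextEventAnyThread()
    {
        lock (_lock)
        {
            PendingEvent oldest = null;
            // O(number of thread states) work while the lock is held; entries for dead
            // threads that are never pruned keep making this loop longer.
            foreach (var state in _threadStates)
            {
                var candidate = state.PeekOldestEvent();
                if (candidate != null && (oldest == null || candidate.Timestamp < oldest.Timestamp))
                    oldest = candidate;
            }
            return oldest;
        }
    }
}
```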

@noahfalk (Member, Author)

@gleocadie - Yep, I stand corrected! That definitely looks like a spot where the duration the lock is held is proportional to the number of threads on that list : ) Even though that acquisition isn't on the event-writing path, it still contends with the lock acquisitions that are. Nice code spelunking!

@gleocadie

@noahfalk - I'm currently testing a fix where I filter out the session state associated with dead threads. I was thinking that, after validating the test, I would push my fix upstream as a PR. What do you think?
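Illustratively, such a fix would prune entries whose owning thread has exited, for example whenever the list is walked. The sketch below is a placeholder with made-up names, not the actual change that was eventually submitted.

```csharp
// Illustrative sketch of pruning session state for threads that have exited; not the actual fix.
using System.Collections.Generic;
using System.Threading;

sealed class SessionStateList
{
    private sealed class Entry
    {
        public Thread OwningThread;
        public bool HasUnreadEvents;
    }

    private readonly object _lock = new object();
    private readonly List<Entry> _entries = new List<Entry>();

    // Called under the session lock (e.g. when the list is walked) so that entries
    // for dead threads stop inflating the list that the lock holder iterates.
    public void PruneDeadThreads()
    {
        lock (_lock)
        {
            _entries.RemoveAll(e => !e.OwningThread.IsAlive && !e.HasUnreadEvents);
        }
    }
}
```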

@noahfalk (Member, Author)

Please do : ) I can't evaluate the merits of a specific implementation without seeing it, of course, but in principle this is the kind of thing we want to fix. Feel free to push a PR before you have validated it as well; just mark it as a draft and note that you are still testing it.

FYI, I might be heads down for a few days here but looping in some other folks on my team who can take a look - @sywhang @davmason @josalem

@gleocadie

@noahfalk - Sorry, I have not pushed the PR yet; my current fix is missing a few things (the test was inconclusive) and I'm working on it. I might push it anyway so we can discuss it (maybe next week). I will ping you and your team when it's done.

@noahfalk (Member, Author)

No worries @gleocadie, we can do it whenever you are ready. After next week I'll be on vacation for the remainder of the year but some of my teammates should still be around.

@gleocadie

Thanks, no problem.

@tommcdon added this to the 6.0 milestone Dec 18, 2020
@tommcdon added the "testing" (Test work item) label Dec 18, 2020
@tommcdon changed the title from "Verify EventPipe scalability" to "Support EventPipe high core count CPU scalability" Jan 23, 2021
@tommcdon removed the "testing" (Test work item) label Jan 23, 2021
@josalem (Contributor) commented May 24, 2021

solved via dotnet/runtime#48435

@josalem closed this as completed May 24, 2021
@ghost locked as resolved and limited conversation to collaborators Jun 27, 2023