Prevent unbounded lock holds in BufferManager of EventPipe #48435

Merged · 16 commits · Mar 16, 2021

Conversation

@josalem (Contributor) commented Feb 18, 2021:

Fixes #43985.
This should help with, but not solve, dotnet/diagnostics#1412.

These changes limit the time the BufferManager lock is held to O(N), where N is the number of currently active threads. They do not alleviate the lock caravan behavior we see when N becomes sufficiently large.

Belongs to #45518

Every physical thread in a process has an EventPipeThread created the first time an event is written on that thread. These EventPipeThread objects are thread-local, ref-counted instances that can live longer than the physical thread. Each of these objects has a fixed-size array (64 entries, the maximum number of EventPipe sessions) of ThreadSessionStates that hold references to the buffers for a given thread. In addition, for every EventPipe session there is one BufferManager that manages the ThreadSessionStates on the EventPipeThreads; it holds a lock that protects access to the buffers and sequence point data for that session. Each ThreadSessionState has a strong reference to the EventPipeThread it belongs to, keeping it alive even if the physical thread dies.

This results in a relation that looks like this:

Physical Thread (A) <1--1 EventPipeThread (B) <1--* ThreadSessionState (B) <*--1 BufferManager (B) <1--1 session (B)

A: lifetime of the physical thread
B: lifetime of the session
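
To make these relations concrete, here is a minimal C sketch of the structures involved. The type and field names are simplified stand-ins for illustration only; the real, macro-heavy definitions live under src/native/eventpipe (ep-thread.h, ep-buffer-manager.h) and differ in detail.

```c
#include <stdint.h>

#define MAX_SESSIONS 64   /* fixed-size: the maximum number of EventPipe sessions */

typedef struct ThreadSessionState ThreadSessionState;

/* Created the first time an event is written on a physical thread;
 * ref-counted and thread-local, so it can outlive the physical thread. */
typedef struct EventPipeThread {
    uint32_t ref_count;
    ThreadSessionState *session_states[MAX_SESSIONS]; /* one slot per session */
} EventPipeThread;

/* Lifetime B: tied to the session that owns it via the BufferManager. */
struct ThreadSessionState {
    EventPipeThread *thread;        /* strong ref: keeps the EventPipeThread alive */
    void *buffer_list;              /* buffers written by this thread for this session */
    uint32_t last_sequence_number;  /* last sequence value seen for this thread */
    ThreadSessionState *next;       /* the BufferManager chains these in a list */
};

/* One per session (lifetime B); its lock is the one discussed in this PR. */
typedef struct BufferManager {
    volatile int lock;                              /* spin lock guarding buffers and sequence points */
    ThreadSessionState *thread_session_state_list;  /* grows with every thread that ever wrote an event */
} BufferManager;
```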

This means that the list of ThreadSessionStates grows without bound over the length of a session. For a sufficiently long session in a multithreaded process (or a pathological case such as my stress tests), the list can become problematically long. For example, a process that creates a thread and then joins it in a loop will create problems for a single, long-lived session.

The problem arises from the spin lock on the BufferManager. This lock can be taken in a few different situations (non-exhaustive list):

  • A thread is writing its first event or has exhausted a previously allocated buffer
    • This can trigger the extra work of initializing a new sequence point
  • The streaming thread is attempting to flush the oldest events from each thread
    • The streaming thread writes a sequence point
    • The streaming thread iterates all ThreadSessionStates looking for the oldest event

The scenarios above have work time of O(N), where N is the number of ThreadSessionStates.

This means that lock hold times scale with the cumulative number of threads ever created during a session.
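
As an illustration of why hold times grow, here is a sketch of the kind of O(N) walk the streaming thread performs while the lock is held when looking for the oldest event. It reuses the simplified types from the sketch above and is not the actual ep-buffer-manager.c code; oldest_event_timestamp is a hypothetical callback standing in for the real buffer inspection.

```c
#include <stddef.h>
#include <stdint.h>

/* Caller holds the BufferManager lock for the whole walk, so the hold time is
 * proportional to the number of ThreadSessionStates ever created in the session. */
static ThreadSessionState *
find_state_with_oldest_event (BufferManager *buffer_manager,
                              uint64_t (*oldest_event_timestamp) (ThreadSessionState *))
{
    ThreadSessionState *oldest_state = NULL;
    uint64_t oldest_timestamp = UINT64_MAX;

    for (ThreadSessionState *state = buffer_manager->thread_session_state_list;
         state != NULL;
         state = state->next) {
        uint64_t timestamp = oldest_event_timestamp (state);  /* peek at this thread's oldest unread event */
        if (timestamp < oldest_timestamp) {
            oldest_timestamp = timestamp;
            oldest_state = state;
        }
    }
    return oldest_state;  /* NULL when the list is empty */
}
```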

To remedy this, I've added logic that culls ThreadSessionStates belonging to EventPipeThreads that have been orphaned and that have had all their events flushed. In short, these ThreadSessionStates will never receive new events again. The logic guarantees that at least one SequencePoint will have the last known sequence value for events on that thread.
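
Here is a rough sketch of the culling idea in terms of the same simplified types, not the actual patch. The is_orphaned, all_events_flushed, record_final_sequence_number, and free_session_state callbacks are hypothetical stand-ins for the real ref-count, buffer, and SequencePoint bookkeeping in the PR.

```c
#include <stddef.h>

/* Runs with the BufferManager lock held. Unlinks and frees session states that
 * belong to dead threads and have nothing left to flush, so the list length
 * tracks live threads instead of every thread ever created. */
static void
cull_dead_thread_session_states (BufferManager *buffer_manager,
                                 int  (*is_orphaned) (EventPipeThread *),
                                 int  (*all_events_flushed) (ThreadSessionState *),
                                 void (*record_final_sequence_number) (ThreadSessionState *),
                                 void (*free_session_state) (ThreadSessionState *))
{
    ThreadSessionState **link = &buffer_manager->thread_session_state_list;
    while (*link != NULL) {
        ThreadSessionState *state = *link;
        if (is_orphaned (state->thread) && all_events_flushed (state)) {
            /* The physical thread is gone and all its events were flushed, so
             * this state can never receive new events. Keep its last sequence
             * value in at least one SequencePoint, then drop the state. */
            record_final_sequence_number (state);
            *link = state->next;           /* unlink in place */
            free_session_state (state);    /* releases the strong ref on the EventPipeThread */
        } else {
            link = &state->next;
        }
    }
}
```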

I tested this by writing a simple stress application. In pseudo-code it did the following:

function write_events:
	for i in 0..3:
		EventSource.Write(...)

function main:
	while not finished:
		t <- new thread(write_events)
		t.join()

This creates thousands of threads per second, but only one thread should ever be writing events at a time. That minimizes the impact of having many threads contending for the lock at the same time (a separate issue) while still exemplifying the underlying problem.

To measure the impact of these changes, I added some simple code during development that calculated histograms of how long the lock was held and how long it was waited on; a rough sketch of the bucketing follows the notes below. With that, I collected the following basic data points.

Note: The histograms are written as bin upper bound -> count, with -1 being the unbounded overflow bin.

Note: The time measurements are in nanoseconds.

Note: Each thread sent 3 events before being joined.
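
For reference, here is a minimal sketch of the kind of bucketed histogram such throwaway instrumentation could use. The bucket bounds mirror the bins in the output below, but the names and shape are assumptions, not the actual code that produced these numbers.

```c
#include <stddef.h>
#include <stdint.h>

/* Upper bounds in nanoseconds; -1 marks the unbounded overflow bin. */
static const int64_t histogram_bounds[] = {
    10, 50, 100, 500, 1000, 2500, 5000, 10000, 15000, 50000, 100000, -1
};
#define HISTOGRAM_BUCKETS (sizeof (histogram_bounds) / sizeof (histogram_bounds[0]))

static uint64_t lock_hold_histogram[HISTOGRAM_BUCKETS];
static uint64_t lock_wait_histogram[HISTOGRAM_BUCKETS];

/* Adds one sample to the first bucket whose upper bound it does not exceed.
 * Not thread-safe; adequate for rough development-time measurements. */
static void
histogram_record (uint64_t *histogram, int64_t elapsed_ns)
{
    for (size_t i = 0; i < HISTOGRAM_BUCKETS; ++i) {
        if (histogram_bounds[i] < 0 || elapsed_ns <= histogram_bounds[i]) {
            ++histogram[i];
            return;
        }
    }
}
```

A caller would time the lock wait and the critical section separately, call histogram_record(lock_wait_histogram, wait_ns) and histogram_record(lock_hold_histogram, hold_ns), and dump the counts every so many lock iterations, as in the output below.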

Baseline (master + histogram code):

Buffer Manager Hold Histogram at lock iteration 200000
    10 -> 24120
    50 -> 29795
    100 -> 30391
    500 -> 59079
    1000 -> 3903
    2500 -> 8968
    5000 -> 14181
    10000 -> 24780
    15000 -> 3275
    50000 -> 1154
    100000 -> 229
    -1 -> 125
Buffer Manager Wait Histogram at lock iteration 200000
    10 -> 124123
    50 -> 30324
    100 -> 1196
    500 -> 3208
    1000 -> 3866
    2500 -> 10209
    5000 -> 13306
    10000 -> 12116
    15000 -> 513
    50000 -> 789
    100000 -> 225
    -1 -> 125
>81,000 threads created
========================================================
**** Summary ****
iteration 1: 207,544.00 events collected, 12,227,958.00 events dropped in 64.221364 seconds - (1.67% throughput)
        (3,231.70 events/s) (4,524,376.00 bytes/s)

|-----------------------------------|--------------------|--------------------|--------------------|--------------------|
| stat                              | Min                | Max                | Average            | Standard Deviation |
|-----------------------------------|--------------------|--------------------|--------------------|--------------------|
| Events Read                       |          207,544.00|          207,544.00|          207,544.00|                0.00|
| Events Dropped                    |       12,227,958.00|       12,227,958.00|       12,227,958.00|                0.00|
| Throughput Efficiency (%)         |                1.67|                1.67|                1.67|                0.00|
| Event Throughput (events/sec)     |            3,231.70|            3,231.70|            3,231.70|                0.00|
| Data Throughput (Bytes/sec)       |        4,524,376.00|        4,524,376.00|        4,524,376.00|                0.00|
| Duration (seconds)                |           64.221364|           64.221364|           64.221364|            0.000000|
|-----------------------------------|--------------------|--------------------|--------------------|--------------------|

Patch:

Buffer Manager Hold Histogram at lock iteration 500000
    10 -> 425
    50 -> 231429
    100 -> 199302
    500 -> 67218
    1000 -> 1318
    2500 -> 164
    5000 -> 52
    10000 -> 92
    15000 -> 0
    50000 -> 0
    100000 -> 0
    -1 -> 0
Buffer Manager Wait Histogram at lock iteration 500000
    10 -> 485677
    50 -> 7712
    100 -> 5772
    500 -> 699
    1000 -> 97
    2500 -> 34
    5000 -> 2
    10000 -> 7
    15000 -> 0
    50000 -> 0
    100000 -> 0
    -1 -> 0
>196,000 threads created
========================================================
**** Summary ****
iteration 1: 589,059.00 events collected, 0.00 events dropped in 60.251668 seconds - (100.00% throughput)
        (9,776.64 events/s) (13,687,299.01 bytes/s)

|-----------------------------------|--------------------|--------------------|--------------------|--------------------|
| stat                              | Min                | Max                | Average            | Standard Deviation |
|-----------------------------------|--------------------|--------------------|--------------------|--------------------|
| Events Read                       |          589,059.00|          589,059.00|          589,059.00|                0.00|
| Events Dropped                    |                0.00|                0.00|                0.00|                0.00|
| Throughput Efficiency (%)         |              100.00|              100.00|              100.00|                0.00|
| Event Throughput (events/sec)     |            9,776.64|            9,776.64|            9,776.64|                0.00|
| Data Throughput (Bytes/sec)       |       13,687,299.01|       13,687,299.01|       13,687,299.01|                0.00|
| Duration (seconds)                |           60.251668|           60.251668|           60.251668|            0.000000|
|-----------------------------------|--------------------|--------------------|--------------------|--------------------|
|----------------------------------------------------|--------------------|--------------------|
| stat                                                | baseline           | patch              |
|----------------------------------------------------|--------------------|--------------------|
| number of times the lock was taken                  | ~200,000           | ~500,000           |
| number of threads in 60s                            | ~81,000            | ~196,000           |
| number of events                                    | 12.4 million       | 589,059            |
| events dropped                                      | 12.2 million       | 0                  |
| events received                                     | 207,544            | 589,059            |
| average lock hold time, not counting the -1 bucket  | 1.6 μs             | 0.34 μs            |
|----------------------------------------------------|--------------------|--------------------|

One observation about the above data: the 12.4MM number may be erroneous. It is possible this issue also affects the lost-event count in the SequencePoints or the one that EventPipeEventSource calculates. The actual event count for the baseline should be ~240K, which does match expectations (~81K threads * 3 events per thread). I have not investigated this further while working on this patch, but it is worth a second investigation to make sure those numbers are accurate.

CC @tommcdon @sywhang @noahfalk @lateralusX @shirhatti

@noahfalk (Member) commented:
In your description of issues resolved, I think you mean dotnet/diagnostics#1412 instead of #1412?

@josalem (Contributor, author) commented Feb 22, 2021:

The remaining test failures appear unrelated.

@josalem josalem self-assigned this Feb 22, 2021
@josalem josalem marked this pull request as ready for review February 22, 2021 18:18
@noahfalk (Member) commented:
@davmason - do you have a bit of time to look more closely at this?

@josalem - in my cursory look this seemed fine, but I'm definitely hitting a learning curve now that this code has been converted to C, and I can't easily find which macros are generating some of the code elements. I'm going to need to come back and run this through the preprocessor to better understand how the code works post-C-conversion. None of that is specific to your change; it just slows me down from being able to evaluate it.

Review comment on this snippet (the second pass that removes culled session states while still holding the lock):

    ep_rt_thread_session_state_array_iterator_next(&thread_session_state_array_iterator)) {

    EventPipeThreadSessionState * session_state = ep_rt_thread_session_state_array_iterator_value(&thread_session_state_array_iterator);
    ep_rt_thread_session_state_list_remove(&buffer_manager->thread_session_state_list, session_state);

@lateralusX (Member) commented Feb 23, 2021:

Maybe it would be worth adding support for removing an item from the list based on an iterator and returning a new iterator as the result (similar to std::list::erase); then there wouldn't be a need for this second loop while still holding the lock.

@josalem (Contributor, author) replied:

I'd need to go take a look at the implementations of both Mono's list and CoreCLR's SList, but I agree. I think in the long run we should implement a set of container classes that are shared between the two runtimes, or at least used in shared native code like this.

A member replied:

For the list I believe it could be done, since the iterator just needs to move to a position in the list that continues to be valid after the item has been removed. I believe we should be able to implement it in both list implementations and drop this second loop.
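
For illustration, here is a minimal sketch of what such an erase-style helper could look like, written against a plain C singly linked list rather than the macro-generated ep_rt_* containers; the name and signature are hypothetical.

```c
#include <stdlib.h>

typedef struct ListNode {
    void *value;
    struct ListNode *next;
} ListNode;

/* Removes *position from the list and advances *position to the following node,
 * similar in spirit to std::list::erase returning the next iterator. prev_link
 * must point at the pointer that currently refers to *position (the list head
 * pointer or the previous node's next field). */
static void
list_erase_and_advance (ListNode **prev_link, ListNode **position,
                        void (*free_value) (void *))
{
    ListNode *removed = *position;
    *prev_link = removed->next;   /* unlink the node in place */
    *position  = removed->next;   /* the caller's iterator now refers to the next element */
    if (free_value)
        free_value (removed->value);
    free (removed);
}
```

With a helper along these lines in both runtimes' list implementations, the culling code could unlink eligible ThreadSessionStates during its first pass instead of collecting them and removing them in a second loop while still holding the lock.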

@davmason (Member) commented:
@noahfalk I am starting to take a look, but so far I have the same issue as you. It will be slower than normal.

@noahfalk (Member) commented:
@josalem @lateralusX @davmason - A sidebar, but given the general learning curve to understand the structure/conventions of the code post-C-conversion it may make sense for us to start a little README.md with useful notes? If anything like that already exists let me know but if not sometimes it can be useful to have newcomers (like myself) write it as they learn.

@lateralusX (Member) commented Feb 24, 2021:

> @josalem @lateralusX @davmason - A sidebar, but given the general learning curve to understand the structure/conventions of the code post-C-conversion it may make sense for us to start a little README.md with useful notes? If anything like that already exists let me know but if not sometimes it can be useful to have newcomers (like myself) write it as they learn.

There is some description of the high-level structure of the library in the initial PRs (e.g. #34600), and there is more info in the talk/slides on the EventPipe C port from the .NET platform talks in November. Writing up a file to collect this information sounds like a good idea; feel free to ping me with any questions you might have.

@sywhang (Contributor) left a review comment:

Had the same issue as Noah and David, so it took a while - coming together nicely!

Review threads (resolved): src/native/eventpipe/ep-buffer-manager.c (3, one outdated), src/native/eventpipe/ep-thread.h
John Salem added 3 commits February 24, 2021 16:58
* make the sequence point maps removable
* clear session state from sequence points before deletion
@josalem josalem force-pushed the dev/josalem/eventpipe-lock-caravan branch from ccce44b to 26e34ec Compare February 25, 2021 01:00
@josalem (Contributor, author) commented Feb 26, 2021:

The remaining test failure appears to be a known issue.

Base automatically changed from master to main March 1, 2021 09:08
@noahfalk (Member) left a review comment:

I think the change is good 👍

The learning curve for conventions coming into this code felt pretty steep but I captured a few notes that helped me and might help others: #49542

Given that I don't spend as much time in this code as others, I don't want to put undue pressure on adjusting conventions, but even after I knew what the patterns were, it still felt as though I was spending a lot of mental energy to grok it. I didn't get the impression I was alone, so it may be worth thinking about potential adjustments for easier readability?

@lateralusX (Member) commented Mar 12, 2021:

> Given that I don't spend as much time in this code as others I don't want to make undue pressure to adjust conventions, but even after I knew what the patterns were it still felt as though I was spending a lot of mental energy to grok it. I didn't get the impression I was alone so it may be worth thinking about potential adjustments for easier readability?

Any suggestions are highly welcome (since I know this code and architecture inside out, I'm sort of adjusted to it). It would be great to get a list of the top "pain points", and I'm certain we can improve going forward. Parts of the library have some complexity from bringing C++ concepts into the C world while still trying to keep the type safety of the underlying runtime implementation. For example, the container definitions make it possible to hook up C++ artifacts in the runtime implementations, and when the library is built with a C++ compiler (as in CoreCLR) the compiler provides type-safety checks in the container classes used through the C library. The library also implements getters/setters and virtual inheritance, which can complicate things, produce long statements, and sometimes make function calls hard to track (due to the virtual inheritance, primarily in the serializer parts of the library). I tried really hard to set up a convention and follow it thoroughly to keep the code as readable as possible, but since another ambition was to stay close to the C++ naming (making it easier to transition and maintain both code bases during the transition period), names tended to be long. Together with the getters and the compactness of C code, that can sometimes produce long statements that are harder to eyeball.

@noahfalk (Member) commented:
> Any suggestions are highly welcomed

I added a few thoughts over at #49542 (comment). I think a lot of it aligns with things you were already considering. I agree with your goal of making the original translation very mechanical, and now that we've completed the port and know everything is working well, we probably have more safety margin to make refactoring adjustments. Thanks!

@josalem (Contributor, author) commented Mar 15, 2021:

If there are no opinions otherwise, I will merge this later today.

@josalem josalem merged commit 1f213ad into dotnet:main Mar 16, 2021
@josalem josalem deleted the dev/josalem/eventpipe-lock-caravan branch March 16, 2021 17:36
@ghost ghost locked as resolved and limited conversation to collaborators Apr 15, 2021
Successfully merging this pull request may close these issues.

libcoreclr.so!EventPipeInternal::GetNextEvent high CPU use