Don't always wake up sleeping schedulers #11325

Closed
wants to merge 1 commit

Conversation

alexcrichton
Member

I recently created a benchmark that increments a global variable inside an
`extra::sync::Mutex`, and it turned out to be horribly slow. For 100K
increments (per thread), the timings I got were:

    1 green thread: 10.73ms
    8 green threads: 10916.14ms
    1 native thread: 9.19ms
    8 native threads: 4610.18ms
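
Roughly, the benchmark looks like the sketch below. It's written here with present-day `std::sync::Mutex` and plain OS threads just for illustration, rather than the original `extra::sync::Mutex` and green/native task spawns:

    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Instant;

    // Each thread bumps a shared counter `iters` times while holding the
    // lock, and we time how long all the threads take in total.
    fn bench(threads: usize, iters: usize) {
        let counter = Arc::new(Mutex::new(0u64));
        let start = Instant::now();
        let handles: Vec<_> = (0..threads)
            .map(|_| {
                let counter = counter.clone();
                thread::spawn(move || {
                    for _ in 0..iters {
                        *counter.lock().unwrap() += 1;
                    }
                })
            })
            .collect();
        for handle in handles {
            handle.join().unwrap();
        }
        println!("{} threads: {:?}", threads, start.elapsed());
    }

    fn main() {
        bench(1, 100_000);
        bench(8, 100_000);
    }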

Upon profiling the test, most of the time was spent in `kevent()` (I'm on OSX)
and `write()`. I thought this was because we were falling into epoll too much,
but changing the scheduler to fall back to `epoll()` only when there is no work
and no active I/O handles didn't fix the problem.

The problem actually turned out to be that the schedulers were in high
contention over the tasks being run. With RUST_TASKS=1, this test is blazingly
fast (78ms), and with RUST_TASKS=2, it's incredibly slow (3824ms). The reason I
found for this is that the tasks being enqueued are constantly stolen by other
schedulers, so tasks just get ping-ponged back and forth between schedulers
while the schedulers spend *a lot* of time in `kevent` and `write` waking each
other up.

This optimization only wakes up a sleeping scheduler on every 8th task that is
enqueued. I have found this number to be the "low sweet spot" for maximizing
performance. The numbers after I made this change are:

    1 green thread: 13.96ms
    8 green threads: 80.86ms
    1 native thread: 13.59ms
    8 native threads: 4239.25ms

This brings the 8-thread performance up to the same level as RUST_TASKS=1,
while the other numbers stayed essentially the same.

In other words, this is a 136x improvement in highly contentious green programs.
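
A minimal sketch of the idea is below. The names and types are illustrative stand-ins rather than the actual libgreen structures, but it shows the shape of the change: count enqueues and only poke a sleeper on every 8th one.

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Mutex;

    // How many enqueues go by between wake-ups; 8 was the "low sweet spot".
    const WAKEUP_INTERVAL: usize = 8;

    // Illustrative stand-in for a scheduler's work queue; the real deques,
    // sleeper list, and wake-up pipes are elided.
    struct WorkQueue {
        enqueued: AtomicUsize,
        tasks: Mutex<Vec<String>>,
    }

    impl WorkQueue {
        fn push_task(&self, task: String) {
            self.tasks.lock().unwrap().push(task);
            // Previously every enqueue woke a sleeping scheduler; now only
            // every WAKEUP_INTERVAL-th enqueue does.
            let n = self.enqueued.fetch_add(1, Ordering::Relaxed);
            if n % WAKEUP_INTERVAL == 0 {
                self.wake_one_sleeper();
            }
        }

        fn wake_one_sleeper(&self) {
            // In libgreen this would pop a handle off the sleeper list and
            // write to its wake-up pipe; here it is just a stub.
            println!("waking a sleeping scheduler");
        }
    }

    fn main() {
        let queue = WorkQueue {
            enqueued: AtomicUsize::new(0),
            tasks: Mutex::new(Vec::new()),
        };
        for i in 0..20 {
            queue.push_task(format!("task {}", i));
        }
    }
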
@pcwalton
Contributor

pcwalton commented Jan 5, 2014

Interesting. What kind of numbers are we looking at with pthread mutexes in 1:1 mode, incidentally?

@alexcrichton
Member Author

I'm testing out writing our own mutex implementation (which is how I ran across this), and these are the numbers I'm getting. It's the same test, just incrementing a variable a lot inside of a lock:

    // extra::sync::Mutex
    1 green thread: 10.26ms
    8 green threads: 82.66ms
    1 native thread: 9.16ms
    8 native threads: 4312.01ms
    // my mutex
    1 green thread: 1.62ms
    8 green threads: 12.99ms
    1 native thread: 1.64ms
    8 native threads: 3078.81ms
    // unstable::mutex::Mutex
    1 native thread: 4.10ms
    8 native threads: 1408.71ms

Those numbers are all from OSX, and the numbers on Linux are a lot worse in terms of pthreads vs. our libraries. The numbers I get with a local Ubuntu VM are:

    // linux pthreads
    1 thread: 2.66ms
    8 threads: 72.13ms
    // linux my mutex
    1 native thread: 1.58ms
    8 native threads: 8558.39ms

Still trying to pin down what's going on.

@brson
Contributor

brson commented Jan 8, 2014

I mentioned offline that I would rather try to solve the problem of doing too much waking by having stealers do some exponential backoff before giving up completely. I am worried that this solution solves the problem well for this benchmark but would leave too many cores empty on a more realistic workload.
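
Roughly, I'd imagine the stealer backoff looking something like this (purely an illustrative sketch, not code from the scheduler): retry the steal with exponentially growing pauses, and only go to sleep once the pause exceeds some cap.

    use std::thread;
    use std::time::Duration;

    // Try to steal work, backing off exponentially between failed attempts
    // before giving up and letting the scheduler go to sleep.
    fn steal_with_backoff<T>(try_steal: impl Fn() -> Option<T>) -> Option<T> {
        let mut wait = Duration::from_micros(1);
        let max_wait = Duration::from_millis(1);
        loop {
            if let Some(task) = try_steal() {
                return Some(task);
            }
            if wait > max_wait {
                return None; // give up; the scheduler can now sleep
            }
            thread::sleep(wait);
            wait *= 2;
        }
    }

    fn main() {
        // A "queue" that never yields work, so the stealer backs off a few
        // times and then gives up.
        let stolen: Option<()> = steal_with_backoff(|| None);
        assert!(stolen.is_none());
    }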

@alexcrichton
Member Author

Closing for now; I'm going to get back to this later.

@alexcrichton alexcrichton deleted the faster-green branch February 5, 2014 00:09