Don't always wake up sleeping schedulers #11325

Closed
wants to merge 1 commit

Conversation

alexcrichton
Member

I recently created a benchmark that increments a global variable inside an
`extra::sync::Mutex`, and it turned out to be horribly slow. For 100K
increments (per thread), the timings I got were:

    1 green thread: 10.73ms
    8 green threads: 10916.14ms
    1 native thread: 9.19ms
    8 native threads: 4610.18ms
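
Roughly, the benchmark looks like the sketch below. It's written here with present-day `std::sync::Mutex` and plain OS threads just for illustration, rather than the original `extra::sync::Mutex` and green/native task spawns:

    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Instant;

    // Each thread bumps a shared counter `iters` times while holding the
    // lock, and we time how long all the threads take in total.
    fn bench(threads: usize, iters: usize) {
        let counter = Arc::new(Mutex::new(0u64));
        let start = Instant::now();
        let handles: Vec<_> = (0..threads)
            .map(|_| {
                let counter = counter.clone();
                thread::spawn(move || {
                    for _ in 0..iters {
                        *counter.lock().unwrap() += 1;
                    }
                })
            })
            .collect();
        for handle in handles {
            handle.join().unwrap();
        }
        println!("{} threads: {:?}", threads, start.elapsed());
    }

    fn main() {
        bench(1, 100_000);
        bench(8, 100_000);
    }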

Upon profiling the test, most of the time was spent in `kevent()` (I'm on OSX)
and `write()`. I thought this was because we were falling into epoll too much,
but changing the scheduler to fall back to `epoll()` only when there is no work
and no active I/O handles didn't fix the problem.

The problem actually turned out to be that the schedulers were in high
contention over the tasks being run. With RUST_TASKS=1, this test is blazingly
fast (78ms), and with RUST_TASKS=2, it's incredibly slow (3824ms). The reason I
found for this is that the tasks being enqueued are constantly stolen by other
schedulers, so tasks just get ping-ponged back and forth between schedulers
while the schedulers spend *a lot* of time in `kevent` and `write` waking each
other up.

This optimization only wakes up a sleeping scheduler on every 8th task that is
enqueued. I have found this number to be the "low sweet spot" for maximizing
performance. The numbers after I made this change are:

    1 green thread: 13.96ms
    8 green threads: 80.86ms
    1 native thread: 13.59ms
    8 native threads: 4239.25ms

This brings the 8-thread performance up to the same level as RUST_TASKS=1,
while the other numbers stayed essentially the same.

In other words, this is a 136x improvement in highly contentious green programs.
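
A minimal sketch of the idea is below. The names and types are illustrative stand-ins rather than the actual libgreen structures, but it shows the shape of the change: count enqueues and only poke a sleeper on every 8th one.

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Mutex;

    // How many enqueues go by between wake-ups; 8 was the "low sweet spot".
    const WAKEUP_INTERVAL: usize = 8;

    // Illustrative stand-in for a scheduler's work queue; the real deques,
    // sleeper list, and wake-up pipes are elided.
    struct WorkQueue {
        enqueued: AtomicUsize,
        tasks: Mutex<Vec<String>>,
    }

    impl WorkQueue {
        fn push_task(&self, task: String) {
            self.tasks.lock().unwrap().push(task);
            // Previously every enqueue woke a sleeping scheduler; now only
            // every WAKEUP_INTERVAL-th enqueue does.
            let n = self.enqueued.fetch_add(1, Ordering::Relaxed);
            if n % WAKEUP_INTERVAL == 0 {
                self.wake_one_sleeper();
            }
        }

        fn wake_one_sleeper(&self) {
            // In libgreen this would pop a handle off the sleeper list and
            // write to its wake-up pipe; here it is just a stub.
            println!("waking a sleeping scheduler");
        }
    }

    fn main() {
        let queue = WorkQueue {
            enqueued: AtomicUsize::new(0),
            tasks: Mutex::new(Vec::new()),
        };
        for i in 0..20 {
            queue.push_task(format!("task {}", i));
        }
    }
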
@pcwalton
Contributor

pcwalton commented Jan 5, 2014

Interesting. What kind of numbers are we looking at with pthread mutexes in 1:1 mode, incidentally?

@alexcrichton
Member Author

I'm testing out writing our own mutex implementation (which is how I ran across this), and these are the numbers I'm getting. It's the same test, just incrementing a variable a lot inside of a lock:

    // extra::sync::Mutex
    1 green thread: 10.26ms
    8 green threads: 82.66ms
    1 native thread: 9.16ms
    8 native threads: 4312.01ms
    // my mutex
    1 green thread: 1.62ms
    8 green threads: 12.99ms
    1 native thread: 1.64ms
    8 native threads: 3078.81ms
    // unstable::mutex::Mutex
    1 native thread: 4.10ms
    8 native threads: 1408.71ms

Those numbers are all from OSX, and the numbers on Linux are a lot worse in terms of pthreads vs. our libraries. The numbers I get with a local Ubuntu VM are:

    // linux pthreads
    1 thread: 2.66ms
    8 threads: 72.13ms
    // linux my mutex
    1 native thread: 1.58ms
    8 native threads: 8558.39ms

Still trying to pin down what's going on.

@brson
Contributor

brson commented Jan 8, 2014

I mentioned offline that I would rather try to solve the problem of doing too much waking by having stealers do some exponential backoff before giving up completely. I am worried that this solution solves the problem well for this benchmark but would leave too many cores empty on a more realistic workload.
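
Roughly, I'd imagine the stealer backoff looking something like this (purely an illustrative sketch, not code from the scheduler): retry the steal with exponentially growing pauses, and only go to sleep once the pause exceeds some cap.

    use std::thread;
    use std::time::Duration;

    // Try to steal work, backing off exponentially between failed attempts
    // before giving up and letting the scheduler go to sleep.
    fn steal_with_backoff<T>(try_steal: impl Fn() -> Option<T>) -> Option<T> {
        let mut wait = Duration::from_micros(1);
        let max_wait = Duration::from_millis(1);
        loop {
            if let Some(task) = try_steal() {
                return Some(task);
            }
            if wait > max_wait {
                return None; // give up; the scheduler can now sleep
            }
            thread::sleep(wait);
            wait *= 2;
        }
    }

    fn main() {
        // A "queue" that never yields work, so the stealer backs off a few
        // times and then gives up.
        let stolen: Option<()> = steal_with_backoff(|| None);
        assert!(stolen.is_none());
    }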

@alexcrichton
Member Author

Closing for now; I'm going to get back to this later.

@alexcrichton alexcrichton deleted the faster-green branch February 5, 2014 00:09