Performance drop on OTP 27.0-rc1 (caused by bitstring refactor) #8229

mkuratczyk · 2024-03-07T09:45:58Z

Describe the bug
We've started testing RabbitMQ with OTP 27.0-rc1 and we immediately saw a pretty consistent ~40% performance drop
throughout our performance tests. We've identified the it's caused by the bitstring changes introduced in
#7828.

Specifically, OTP built from SHA 49024e8 performs similar to OTP26, while 24ef4cb shows a significant performance drop (also present in OTP27.0-rc1 and master as of today).

There was a follow-up PR #7935 which mentions

There's also a few things that have become slightly less efficient as a result of this. The plan is to fix them before RC1

but I think that doesn't really explain the difference we see.

The most obvious observable difference is the amount of garbage collection and time spent in pwrite_nif.
Before #7828 (faster):

After #7828 (slower):

To Reproduce
Currently my reproduction steps involve running a RabbitMQ node. If a small test case is needed, I'd appreciate some suggestions for what kind of operations specifically we could try in that test case.

To run a RabbitMQ node, you can either use docker or you can build and run locally. Details below.

To run a node with OTP 26:

docker run -p 5672:5672 pivotalrabbitmq/rabbitmq:otp27-support-otp-min-bazel

To run a node with OTP 27.0-rc1:

docker run -p 5672:5672 pivotalrabbitmq/rabbitmq:otp27-support-otp-max-bazel

note: these images don't work on Apple Silicon.

To build and run locally:

git clone https://github.com/rabbitmq/rabbitmq-server.git
cd rabbitmq-server
git checkout otp27-support
bazel run broker

note:make run-broker doesn't yet work on this branch so bazel is needed.

With a RabbitMQ running, run our load testing tool:

java -jar perf-test.jar -x 2 -y 0 -qa x-queue-version=2 -c 1000 -u q -ad false -f persistent -s 100 -z 10

This publishes, without consuming, 100B messages to a single queue for 10 seconds. With OTP26 and versions prior to #7828 that gives me about 115 000 messages per second. After that PR I get about 65 000 messages per second.

**Affected versions**

24ef4cb and newer (27.0-rc1, master).

The text was updated successfully, but these errors were encountered:

jhogberg · 2024-03-07T13:40:44Z

Thanks for your report!

I can't think of anything that would, suddenly, cause pwrite_nif to be much slower as a result of that change. The vast majority of the time spent appears to be within the underlying syscall, which should not have been affected at all.

How large are the binaries involved (ballpark), are there tons and tons of tiny ones (=< 64b)? Large (>64b) ones?

mkuratczyk · 2024-03-07T14:09:51Z

I was surprised by that pwrite_nif as well, there's a chance this is a red herring and I just happened to catch a bit different moment with more flushing to disk, etc. I can run some additional tests.

More time spent in GC seems pretty consistent - I also used msacct and saw a significant increase in full GC (from under 1% to almost 8%).

In the particular test as reported above I was publishing 100B messages, so there was a significant number of 100B binaries but also a bunch of small ones for sure (headers). I've also tested with 20B messages and there was a similar difference. With 1kb and 5kb messages there seems to be less of a difference (10-15%) but still consistently noticeable.

It's not completely clear to me what happens with these binaries within the Beam. They enter as TCP packets so even though the messages can be small (like in this 20B test), they can effectively be parts of the larger binary created at the TCP buffer layer.

okeuday · 2024-03-07T14:11:41Z

This could be related to #8141 (the increase in refc binary (>64B) memory consumption in 27.0-rc1).

@jhogberg Please make sure to mention the increase in refc binary size in the 27.0 release notes so that has more visibility (it isn't clear what change it relates to).

mkuratczyk · 2024-03-07T14:30:21Z

This could be a "cleaner" example. This time consumers are present (perf-test -x 2 -y 2 -qa x-queue-version=2 -c 1000 -u q -ad false -f persistent -s 100) so RabbitMQ doesn't write these messages to disk at all - just passes them from a publisher to a consumer (with a few processes in-between). Therefore, pwrite_nif is not present at all. Once again, the most visible difference is time spent in GC.

In this case I had ~90k msgs/s with OTP built before bitstring refactor was merged and ~55k msgs/s with the refactor merged (as before, I'm comparing OTP builds with just 1 commit of difference - 49024e83a2377a15f651fa95d39c43e71834a54a vs 24ef4cbaeda9b9c26682cba75f2f15b0c58722aa).

Before:

After:

lpgauth · 2024-03-07T14:39:01Z

Can you share the VM arguments you're using? Are you setting +hms? Maybe you need to bump due to the increased memory usage in OTP-27?

mkuratczyk · 2024-03-07T15:00:04Z

Here are the arguments. I'm not setting them up for this particular tests - these are some defaults set by RabbitMQ startup scripts (likely some could be reconsidered/tuned). Same flags used with both OTP versions, no +hms, we set min_heap_size for a few processes but none of those involved here I believe:

> ps auxwww | rg beam                                                                                                                                                                                                                                            
kura      596507  385  1.3 5070952 447640 pts/2  Sl+  14:41  12:10 /home/kura/.kerl/49024e83a2377a15f651fa95d39c43e71834a54a/erts-14.1.1/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -pc unicode -P 1048576 -t 5000000 -s
tbt db -zdbbl 128000 -sbwt none -sbwtdcpu none -sbwtdio none -JPperf true -JPperf true -- -root /home/kura/.kerl/49024e83a2377a15f651fa95d39c43e71834a54a -bindir /home/kura/.kerl/49024e83a2377a15f651fa95d39c43e71834a54a/erts-14.1.1/bin -progname erl -- -hom
e /home/kura -- -kernel shell_history enabled -- -pa  -s rabbit boot -boot start_sasl -ra wal_sync_method sync -syslog logger [] -syslog syslog_error_logger false -kernel prevent_overlapping_partitions false -- --

I didn't notice this before but indeed the memory usage is ~25% higher after the bitstring refactor. This is the memory used with the same workload as before, first with OTP 24ef4cbaeda9b9c26682cba75f2f15b0c58722aa and then OTP 49024e83a2377a15f651fa95d39c43e71834a54a:

jhogberg · 2024-03-07T15:33:07Z

I can't think of anything that would, suddenly, cause pwrite_nif to be much slower as a result of that change. The vast majority of the time spent appears to be within the underlying syscall, which should not have been affected at all.

... after thinking a bit more about it, now I can, along with higher memory usage (far beyond the extra word @okeuday mentioned).

How does the data you're passing to pwrite/2,3 look?

lhoguin · 2024-03-07T16:21:06Z

The data that ends up written looks like this:

First, for each message we build this data (term_to_iovec + small binary in a list): https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_classic_queue_store_v2.erl#L272-L284

Then we do some consolidation: if two messages end up contiguous, we consolidate them with an Acc here: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_classic_queue_store_v2.erl#L237-L270

Finally we do an open/pwrite/close per file with the consolidated data: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_classic_queue_store_v2.erl#L227

Note that before getting built when the time comes to write them to disk, the messages are kept in a map as a record that contains a bunch of binaries and other values.

jhogberg · 2024-03-08T09:41:13Z

Thanks, that tracks with what I think it could be. term_to_iovec is one of the few places in the emulator where it's possible to create off-heap binaries that are smaller than 64b, and pwrite_nif is preceded by iolist_to_iovec which used to coalesce such off-heap binaries up to a certain point (similar things may happen elsewhere). If it's that, it'll be pretty easy to restore performance and perhaps even improve on it.

Unfortunately it took forever for rabbitmq-server to build because I hit bazelbuild/bazel#20886, so I probably won't have any news until Monday. :-/

lhoguin · 2024-03-08T09:56:50Z

It might be easier to replace Bazel usage with make run-broker if preferred (but local repo must be fully cleaned up first).

mkuratczyk · 2024-03-08T10:03:11Z

this branch is not yet compatible with make but should be very shortly. I will update the issue when it is

jhogberg · 2024-03-11T09:11:20Z

Fixing the awkward interaction between term_to_iovec, iolist_to_iovec, and pwrite removed that part of the slowdown altogether: the amount of garbage and rescheduling to/from dirty IO schedulers has been cut by several orders of magnitude. Once the rest is fixed, the tests that hit disk should be faster than before.

I haven't restored all performance yet, but I've ruled out the slightly increased size of some binary structures as the cause. I'll try to get this fixed before RC2.

jhogberg · 2024-03-11T16:27:42Z

I've found the problem: the GC pressure of off-heap binaries ("vheap") was vastly under-counted prior to 24ef4cb. In these tests this caused it to GC less often and, crucially, when there was less live data to keep. The larger packet sizes were less affected because they weren't under-counted by as much.

Tuning min_bin_vheap_size restores performance, but we're currently looking at fixing this on the producer side (inet_drv) to stop referencing unnecessarily large binaries.

I'll make a PR with the changes to the iovec stuff later in the week. :)

mkuratczyk · 2024-03-11T16:39:52Z

Great, thank you!

If you are using RabbitMQ for testing, I just pushed some changes to the otp27-support branch so that make run-broker should work now (no need to use bazel anymore, but if you made it work, you can keep using this as well).

mkuratczyk added the bug Issue is reported as a bug label Mar 7, 2024

jhogberg self-assigned this Mar 7, 2024

jhogberg added the team:VM Assigned to OTP team VM label Mar 7, 2024

jhogberg mentioned this issue Mar 12, 2024

erts: Improve iolist_to_iovec/1 performance and yielding #8254

Merged

jhogberg closed this as completed in #8254 Mar 15, 2024

mkuratczyk mentioned this issue Mar 28, 2024

OTP27 performance regression #8322

Open

ThomasArts mentioned this issue Apr 2, 2024

Performance OTP27 on binary_to_term slower #8329

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance drop on OTP 27.0-rc1 (caused by bitstring refactor) #8229

Performance drop on OTP 27.0-rc1 (caused by bitstring refactor) #8229

mkuratczyk commented Mar 7, 2024

jhogberg commented Mar 7, 2024

mkuratczyk commented Mar 7, 2024

okeuday commented Mar 7, 2024 •

edited

Loading

mkuratczyk commented Mar 7, 2024

lpgauth commented Mar 7, 2024 •

edited

Loading

mkuratczyk commented Mar 7, 2024

jhogberg commented Mar 7, 2024

lhoguin commented Mar 7, 2024 •

edited

Loading

jhogberg commented Mar 8, 2024

lhoguin commented Mar 8, 2024

mkuratczyk commented Mar 8, 2024

jhogberg commented Mar 11, 2024 •

edited

Loading

jhogberg commented Mar 11, 2024 •

edited

Loading

mkuratczyk commented Mar 11, 2024

Performance drop on OTP 27.0-rc1 (caused by bitstring refactor) #8229

Performance drop on OTP 27.0-rc1 (caused by bitstring refactor) #8229

Comments

mkuratczyk commented Mar 7, 2024

jhogberg commented Mar 7, 2024

mkuratczyk commented Mar 7, 2024

okeuday commented Mar 7, 2024 • edited Loading

mkuratczyk commented Mar 7, 2024

lpgauth commented Mar 7, 2024 • edited Loading

mkuratczyk commented Mar 7, 2024

jhogberg commented Mar 7, 2024

lhoguin commented Mar 7, 2024 • edited Loading

jhogberg commented Mar 8, 2024

lhoguin commented Mar 8, 2024

mkuratczyk commented Mar 8, 2024

jhogberg commented Mar 11, 2024 • edited Loading

jhogberg commented Mar 11, 2024 • edited Loading

mkuratczyk commented Mar 11, 2024

okeuday commented Mar 7, 2024 •

edited

Loading

lpgauth commented Mar 7, 2024 •

edited

Loading

lhoguin commented Mar 7, 2024 •

edited

Loading

jhogberg commented Mar 11, 2024 •

edited

Loading

jhogberg commented Mar 11, 2024 •

edited

Loading