
Performance drop on OTP 27.0-rc1 (caused by bitstring refactor) #8229

Closed
mkuratczyk opened this issue Mar 7, 2024 · 14 comments · Fixed by #8254
Assignees: jhogberg
Labels: bug (Issue is reported as a bug), team:VM (Assigned to OTP team VM)

Comments

@mkuratczyk
Contributor

Describe the bug
We've started testing RabbitMQ with OTP 27.0-rc1 and immediately saw a pretty consistent ~40% performance drop throughout our performance tests. We've identified that it's caused by the bitstring changes introduced in #7828.

Specifically, OTP built from SHA 49024e8 performs similarly to OTP 26, while 24ef4cb shows a significant performance drop (also present in OTP 27.0-rc1 and master as of today).

There was a follow-up PR #7935 which mentions

There's also a few things that have become slightly less efficient as a result of this. The plan is to fix them before RC1

but I think that doesn't really explain the difference we see.

The most obvious observable difference is the amount of garbage collection and time spent in pwrite_nif.
Before #7828 (faster):
49024e83a2377a15f651fa95d39c43e71834a54a

After #7828 (slower):
24ef4cbaeda9b9c26682cba75f2f15b0c58722aa

To Reproduce
Currently my reproduction steps involve running a RabbitMQ node. If a small test case is needed, I'd appreciate some suggestions for what kind of operations specifically we could try in that test case.

To run a RabbitMQ node, you can either use docker or you can build and run locally. Details below.

To run a node with OTP 26:
docker run -p 5672:5672 pivotalrabbitmq/rabbitmq:otp27-support-otp-min-bazel

To run a node with OTP 27.0-rc1:

docker run -p 5672:5672 pivotalrabbitmq/rabbitmq:otp27-support-otp-max-bazel

note: these images don't work on Apple Silicon.

To build and run locally:

git clone https://github.com/rabbitmq/rabbitmq-server.git
cd rabbitmq-server
git checkout otp27-support
bazel run broker

note: make run-broker doesn't yet work on this branch, so bazel is needed.

With a RabbitMQ running, run our load testing tool:

java -jar perf-test.jar -x 2 -y 0 -qa x-queue-version=2 -c 1000 -u q -ad false -f persistent -s 100 -z 10

This publishes, without consuming, 100-byte messages to a single queue for 10 seconds. With OTP 26 and versions prior to #7828 that gives me about 115,000 messages per second. After that PR I get about 65,000 messages per second.

Affected versions

24ef4cb and newer (27.0-rc1, master).

@mkuratczyk mkuratczyk added the bug Issue is reported as a bug label Mar 7, 2024
@jhogberg jhogberg self-assigned this Mar 7, 2024
@jhogberg jhogberg added the team:VM Assigned to OTP team VM label Mar 7, 2024
@jhogberg
Contributor

jhogberg commented Mar 7, 2024

Thanks for your report!

I can't think of anything that would, suddenly, cause pwrite_nif to be much slower as a result of that change. The vast majority of the time spent appears to be within the underlying syscall, which should not have been affected at all.

How large are the binaries involved (ballpark), are there tons and tons of tiny ones (=< 64b)? Large (>64b) ones?

@mkuratczyk
Contributor Author

I was surprised by that pwrite_nif as well; there's a chance this is a red herring and I just happened to catch a slightly different moment with more flushing to disk, etc. I can run some additional tests.

More time spent in GC seems pretty consistent - I also used msacct and saw a significant increase in full GC (from under 1% to almost 8%).
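(For reference, those numbers can be gathered with the msacc module from runtime_tools; a minimal sketch:)

%% Collect microstate accounting for one second, then print the summary;
%% the output includes the share of time each thread spends in GC.
msacc:start(1000),
msacc:print().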

In the particular test as reported above I was publishing 100B messages, so there was a significant number of 100B binaries but also a bunch of small ones for sure (headers). I've also tested with 20B messages and there was a similar difference. With 1kb and 5kb messages there seems to be less of a difference (10-15%) but still consistently noticeable.

It's not completely clear to me what happens with these binaries within the Beam. They enter as TCP packets, so even though the messages can be small (like in this 20B test), they can effectively be parts of a larger binary created at the TCP buffer layer.
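To illustrate the last point, a small sketch (Packet here stands in for a hypothetical received TCP buffer, not RabbitMQ code): a message matched out of a larger buffer is a sub-binary that keeps the whole parent referenced.

%% Packet stands in for a large received TCP buffer (hypothetical).
Packet = binary:copy(<<0>>, 65536),
%% Matching 20 bytes out of it yields a sub-binary that references the
%% whole parent, keeping all 64 KB alive until the sub-binary dies.
<<Msg:20/binary, _Rest/binary>> = Packet,
byte_size(Msg),                       %% 20
binary:referenced_byte_size(Msg).     %% 65536, not 20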

@okeuday
Contributor

okeuday commented Mar 7, 2024

This could be related to #8141 (the increase in refc binary (>64B) memory consumption in 27.0-rc1).

@jhogberg Please make sure to mention the increase in refc binary size in the 27.0 release notes so that has more visibility (it isn't clear what change it relates to).

@mkuratczyk
Contributor Author

This could be a "cleaner" example. This time consumers are present (perf-test -x 2 -y 2 -qa x-queue-version=2 -c 1000 -u q -ad false -f persistent -s 100) so RabbitMQ doesn't write these messages to disk at all - just passes them from a publisher to a consumer (with a few processes in-between). Therefore, pwrite_nif is not present at all. Once again, the most visible difference is time spent in GC.

In this case I had ~90k msgs/s with OTP built before bitstring refactor was merged and ~55k msgs/s with the refactor merged (as before, I'm comparing OTP builds with just 1 commit of difference - 49024e83a2377a15f651fa95d39c43e71834a54a vs 24ef4cbaeda9b9c26682cba75f2f15b0c58722aa).

Before: [screenshot: 49024e83a2377a15f651fa95d39c43e71834a54a-x2-y2-s100]

After: [screenshot: 24ef4cbaeda9b9c26682cba75f2f15b0c58722aa-x2-y2-s100]

@lpgauth
Contributor

lpgauth commented Mar 7, 2024

Can you share the VM arguments you're using? Are you setting +hms? Maybe you need to bump due to the increased memory usage in OTP-27?

@mkuratczyk
Contributor Author

Here are the arguments. I'm not setting any of them for this particular test - these are the defaults set by the RabbitMQ startup scripts (likely some could be reconsidered/tuned). The same flags are used with both OTP versions, no +hms; we do set min_heap_size for a few processes, but none of those are involved here, I believe:

> ps auxwww | rg beam                                                                                                                                                                                                                                            
kura      596507  385  1.3 5070952 447640 pts/2  Sl+  14:41  12:10 /home/kura/.kerl/49024e83a2377a15f651fa95d39c43e71834a54a/erts-14.1.1/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -pc unicode -P 1048576 -t 5000000 -s
tbt db -zdbbl 128000 -sbwt none -sbwtdcpu none -sbwtdio none -JPperf true -JPperf true -- -root /home/kura/.kerl/49024e83a2377a15f651fa95d39c43e71834a54a -bindir /home/kura/.kerl/49024e83a2377a15f651fa95d39c43e71834a54a/erts-14.1.1/bin -progname erl -- -hom
e /home/kura -- -kernel shell_history enabled -- -pa  -s rabbit boot -boot start_sasl -ra wal_sync_method sync -syslog logger [] -syslog syslog_error_logger false -kernel prevent_overlapping_partitions false -- -- 

I didn't notice this before but indeed the memory usage is ~25% higher after the bitstring refactor. This is the memory used with the same workload as before, first with OTP 24ef4cbaeda9b9c26682cba75f2f15b0c58722aa and then OTP 49024e83a2377a15f651fa95d39c43e71834a54a:

[Screenshot: memory usage comparison]

@jhogberg
Contributor

jhogberg commented Mar 7, 2024

I can't think of anything that would, suddenly, cause pwrite_nif to be much slower as a result of that change. The vast majority of the time spent appears to be within the underlying syscall, which should not have been affected at all.

... after thinking a bit more about it, now I can, along with higher memory usage (far beyond the extra word @okeuday mentioned).

How does the data you're passing to pwrite/2,3 look?

@lhoguin
Contributor

lhoguin commented Mar 7, 2024

The data that ends up written looks like this:

First, for each message we build this data (term_to_iovec + small binary in a list): https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_classic_queue_store_v2.erl#L272-L284

Then we do some consolidation: if two messages end up contiguous, we consolidate them with an Acc here: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_classic_queue_store_v2.erl#L237-L270

Finally we do an open/pwrite/close per file with the consolidated data: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_classic_queue_store_v2.erl#L227

Note that, before this data gets built when the time comes to write to disk, the messages are kept in a map as records containing a bunch of binaries and other values.
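Roughly, and only as a hypothetical sketch of the shape (not the actual rabbit_classic_queue_store_v2 code), each entry is term_to_iovec output plus a small binary, and the consolidated entries go to a single pwrite per file:

%% Hypothetical shape only; the message properties, checksum, file name
%% and offset are placeholders, not what RabbitMQ actually writes.
Entry = [erlang:term_to_iovec(#{id => 1, props => []}), <<16#DEADBEEF:32>>],
{ok, Fd} = file:open("segment.tmp", [raw, binary, write]),
ok = file:pwrite(Fd, [{0, Entry}]),
ok = file:close(Fd).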

@jhogberg
Contributor

jhogberg commented Mar 8, 2024

Thanks, that tracks with what I think it could be. term_to_iovec is one of the few places in the emulator where it's possible to create off-heap binaries that are smaller than 64b, and pwrite_nif is preceded by iolist_to_iovec which used to coalesce such off-heap binaries up to a certain point (similar things may happen elsewhere). If it's that, it'll be pretty easy to restore performance and perhaps even improve on it.
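A minimal way to see the kind of data involved (an illustration of the mechanism described above, not the fix itself; the term and sizes are made up):

%% term_to_iovec/1 returns an iovec (a list of binaries) in which
%% off-heap binaries are kept as separate elements rather than copied;
%% iolist_to_iovec/1 is what the pwrite path runs over such data.
Big   = binary:copy(<<$x>>, 200),                 % > 64 bytes, off-heap
IoVec = erlang:term_to_iovec({header, 1, Big}),
Flat  = erlang:iolist_to_iovec([IoVec, <<"small trailer">>]),
[byte_size(B) || B <- Flat].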

Unfortunately it took forever for rabbitmq-server to build because I hit bazelbuild/bazel#20886, so I probably won't have any news until Monday. :-/

@lhoguin
Contributor

lhoguin commented Mar 8, 2024

It might be easier to replace the Bazel usage with make run-broker if preferred (but the local repo must be fully cleaned up first).

@mkuratczyk
Contributor Author

This branch is not yet compatible with make, but it should be very shortly. I will update the issue when it is.

@jhogberg
Contributor

jhogberg commented Mar 11, 2024

Fixing the awkward interaction between term_to_iovec, iolist_to_iovec, and pwrite removed that part of the slowdown altogether: the amount of garbage and rescheduling to/from dirty IO schedulers has been cut by several orders of magnitude. Once the rest is fixed, the tests that hit disk should be faster than before.

I haven't restored all performance yet, but I've ruled out the slightly increased size of some binary structures as the cause. I'll try to get this fixed before RC2.

@jhogberg
Contributor

jhogberg commented Mar 11, 2024

I've found the problem: the GC pressure of off-heap binaries ("vheap") was vastly under-counted prior to 24ef4cb. In these tests this caused it to GC less often and, crucially, when there was less live data to keep. The larger packet sizes were less affected because they weren't under-counted by as much.

Tuning min_bin_vheap_size restores performance, but we're currently looking at fixing this on the producer side (inet_drv) to stop referencing unnecessarily large binaries.
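For anyone needing a stopgap in the meantime, the threshold can be raised per process or VM-wide; a sketch (the size is an arbitrary example in words, not a recommendation):

%% Per process, at spawn time (worker_loop/0 is a placeholder):
_Pid = spawn_opt(fun() -> worker_loop() end, [{min_bin_vheap_size, 1000000}]),
%% Or for the calling process itself (returns the old value):
_OldVHeap = erlang:process_flag(min_bin_vheap_size, 1000000).
%% A VM-wide default can be set at startup with: erl +hmbs 1000000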

I'll make a PR with the changes to the iovec stuff later in the week. :)

@mkuratczyk
Contributor Author

Great, thank you!

If you are using RabbitMQ for testing, I just pushed some changes to the otp27-support branch so that make run-broker should work now (no need to use bazel anymore, but if you made it work, you can keep using this as well).
