
Dead lettering en masse can overload the DLQ #5312

Closed
illotum opened this issue Jul 21, 2022 · 38 comments · Fixed by #6467
Comments

@illotum

illotum commented Jul 21, 2022

We observe a potential bug in our brokers.

A classic mirrored DLQ replica, at a certain load, flips into CPU and memory churn and cannot recover. This was first reported here and later confirmed on multiple RabbitMQ versions.

With a sufficiently large data set the instance is OOM-killed, but even with smaller sets I had the broker crossing into the red and back for a long time after the test.

[Screenshot: Screen Shot 2022-07-20 at 7.58.47 PM]

Reproduction

Environment

> cat ../rmq1.conf
listeners.tcp.default = 127.0.0.1:5671
management.tcp.port = 15671
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rmq2@3c22fb6527ee

> cat ../rmq2.conf
listeners.tcp.default = 127.0.0.1:5672
management.tcp.port = 15672
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rmq1@3c22fb6527ee

> echo $RABBITMQ_ENABLED_PLUGINS
rabbitmq_management

> alias brun "bazel run --config=local"

Two brokers

brun broker RABBITMQ_CONFIG_FILE=(realpath ../rmq1.conf) RABBITMQ_NODENAME=rmq1 RABBITMQ_DIST_PORT=25671
brun broker RABBITMQ_CONFIG_FILE=(realpath ../rmq2.conf) RABBITMQ_NODENAME=rmq2 RABBITMQ_DIST_PORT=25672
brun -- rabbitmqctl -n rmq1@3c22fb6527ee set_policy ha "." '{"ha-mode":"all","ha-sync-mode":"automatic","queue-mode":"lazy"}' --apply-to queues

Long message backlog nacked between queues

./bin/runjava com.rabbitmq.perf.PerfTest -u "input" -h amqp://localhost:5671 --auto-delete false -f persistent --queue-args x-dead-letter-exchange=,x-dead-letter-routing-key=dlq -x 4 -y 0 -s 1000 -pmessages 250000

# "dlq" created beforehand, say in web console
./bin/runjava com.rabbitmq.perf.PerfTest -u "input" -h amqp://localhost:5671 --auto-delete false -f persistent --queue-args x-dead-letter-exchange=,x-dead-letter-routing-key=dlq -x 0 -y 10 -na -re false

Observations

The slave replica's memory footprint stays abnormally high, oscillating by +/- 2 GB, for many hours (so far up to 4) after the test. There is a significant uptick in CPU as well.
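
A breakdown like the one below can be captured with rabbitmq-diagnostics on the affected node; the node name here is an assumption taken from the reproduction environment above:

```
# Presumed invocation for the memory breakdown below; adjust the node name as needed.
rabbitmq-diagnostics -n rmq2@3c22fb6527ee memory_breakdown --unit gb
```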

queue_slave_procs: 13.3358 gb (62.38%)
binary: 6.2803 gb (29.37%)
allocated_unused: 1.6457 gb (7.7%)
other_proc: 0.0529 gb (0.25%)
code: 0.0348 gb (0.16%)
other_system: 0.0195 gb (0.09%)
other_ets: 0.0037 gb (0.02%)
plugins: 0.0031 gb (0.01%)
atom: 0.0014 gb (0.01%)
metrics: 0.0014 gb (0.01%)
mgmt_db: 0.001 gb (0.0%)
connection_other: 0.0001 gb (0.0%)
mnesia: 0.0001 gb (0.0%)
msg_index: 0.0 gb (0.0%)
quorum_ets: 0.0 gb (0.0%)
queue_procs: 0.0 gb (0.0%)
quorum_queue_dlx_procs: 0.0 gb (0.0%)
stream_queue_procs: 0.0 gb (0.0%)
stream_queue_replica_reader_procs: 0.0 gb (0.0%)
connection_readers: 0.0 gb (0.0%)
connection_writers: 0.0 gb (0.0%)
connection_channels: 0.0 gb (0.0%)
quorum_queue_procs: 0.0 gb (0.0%)
stream_queue_coordinator_procs: 0.0 gb (0.0%)
reserved_unallocated: 0.0 gb (0.0%)

From the couple of hours I spent on this:

  1. Confirmed with as few as 1M messages. On my laptop, 3M is enough to trigger the memory alarm.
  2. Confirmed for 1-20 consumers.
  3. Confirmed for messages as small as 1 byte.
  4. The broker recovers after 1-2 minutes of churn:
    • at 250k messages
    • with non-lazy queues
    • after deleting "dlq"
@illotum
Author

illotum commented Jul 21, 2022

And a note from @thuandb (the queue depth mismatch is due to a different environment):

It is confirmed that IO is the bottleneck causing high memory utilization during queue sync.
Using gp3 with maximum throughput (1000MB/s) helps with large messages.
Message size: 640k bytes
Queue depth: 115k messages
Sync batch size: 256.
Memory only slightly increases at the mirrors.

@lukebakken
Collaborator

This may or may not be related; the symptoms here remind me of this message: https://groups.google.com/g/rabbitmq-users/c/-_J_K4T7yTI/

@michaelklishin
Member

What RabbitMQ version was used?

@illotum
Author

illotum commented Jul 21, 2022

My test was run from a somewhat stale master, at e2edb1b. In prod this was observed on 3.8.22 and 3.8.27.

@illotum
Author

illotum commented Jul 21, 2022

Updated the description to remove the mention of allocs: I haven't actually looked at allocations yet. Memory may fluctuate because of paging.

@lukebakken
Collaborator

lukebakken commented Jul 23, 2022

I have put together a reproduction project here:

https://github.com/lukebakken/rabbitmq-server-5312

It uses RMQ 3.10.6 and the latest PerfTest.

I can reproduce the issue as described.

Workarounds that address the issue:

@michaelklishin
Member

Using CQv2 or quorum queues solves the problem => closing.
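
For reference, a minimal sketch of switching the reproduction's queues to CQv2 via policy; the queue-version policy key is assumed here, and since only one policy applies per queue, it is folded into the "ha" policy from the reproduction steps:

```
# Hypothetical: replace the "ha" policy from the reproduction steps so that the
# matching classic queues run as version 2 (CQv2).
rabbitmqctl -n rmq1@3c22fb6527ee set_policy ha "." '{"ha-mode":"all","ha-sync-mode":"automatic","queue-version":2}' --apply-to queues
```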

@johanrhodin
Contributor

We still have this issue with v2 classic queues. Will try to reproduce next week.

@lukebakken
Collaborator

Putting this here since the issue seems to be the same - https://groups.google.com/g/rabbitmq-users/c/Kr2QZx1TREA

@illotum
Author

illotum commented Sep 7, 2022

So in total, two reports of this still happening with CMQv2? @johanrhodin, looking forward to the reproduction steps as well.

@lukebakken lukebakken self-assigned this Sep 7, 2022
@lukebakken lukebakken reopened this Sep 7, 2022
@danielllek

I've also found another workaround: switching queue-mode from normal to lazy for the queues that were causing problems solved the issue for me.
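
A minimal sketch of that workaround applied via policy, without redeclaring the queues (the policy name and queue pattern below are placeholders):

```
# Hypothetical: switch the matching queues to lazy mode at run time.
# Only one policy applies per queue, so an existing policy may need to gain
# the "queue-mode" key instead of defining a new one.
rabbitmqctl set_policy lazy-queues "^problematic\." '{"queue-mode":"lazy"}' --apply-to queues
```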

@lukebakken
Collaborator

lukebakken commented Oct 20, 2022

FWIW, I just revisited this using the latest main code for RabbitMQ and PerfTest. It turns out I can now reproduce this using mirrored classic v2 queues:

[screenshot]

As expected quorum queues do not show this issue (using the latest main for RabbitMQ and PerfTest as well).

cc @michaelklishin any reason to keep this open at this point?

@michaelklishin
Member

We can leave it open if some CMQ user would like to investigate and contribute a fix.

What immediately stands out on the screenshot is the gen_server2 buffer size for that CMQ mirror replica.

@mkuratczyk
Contributor

I believe the problem comes from the way at-most-once dead lettering works. For quorum queues, we added an at-least-once strategy that uses publisher confirms internally, but classic queues don't support that. When a lot of messages are dead-lettered, we are effectively publishing to a mirrored queue without any flow-control mechanism, which overloads the DLQ process and causes a memory spike.

Assuming that's true, we would need classic queues to support at-least-once dead-lettering, which is something CQv2 should probably support at some point; but whether the same implementation would work for CQv1, and how much work would be needed to support it with mirroring, I'm not sure.

Alternatively, perhaps it's possible to apply back-pressure using some other mechanism to prevent the DLQ from getting overloaded.

Feel free to have a look at the QQ at-least-once support PR for inspiration:
#3100
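
For illustration, a hedged sketch of what the quorum-queue at-least-once setup looks like when plugged into this issue's PerfTest reproduction; the queue arguments are assumptions, and at-least-once dead-lettering requires the source queue to use reject-publish overflow:

```
# Hypothetical variant of the repro with a quorum input queue and at-least-once
# dead-lettering; "dlq" is still created beforehand, as in the original steps.
./bin/runjava com.rabbitmq.perf.PerfTest -u "input" -h amqp://localhost:5671 \
  --auto-delete false -f persistent \
  --queue-args x-queue-type=quorum,x-overflow=reject-publish,x-dead-letter-strategy=at-least-once,x-dead-letter-exchange=,x-dead-letter-routing-key=dlq \
  -x 4 -y 0 -s 1000 -C 250000
```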

@lukebakken
Collaborator

@michaelklishin @mkuratczyk interestingly enough, the gen_server2 internal priority queue (gen_server2 buffer in the top UI) does eventually drain, restoring memory usage to normal levels.

@mkuratczyk
Contributor

mkuratczyk commented Oct 21, 2022

Sorry, I think I jumped to a conclusion here. The process that takes so much memory is the input queue, not the dlq. Given that, my theory that we are overloading the dlq by publishing to it without any flow control is wrong. Which doesn't mean the current naive CQ dead-lettering process isn't the culprit, but it's definitely not what I thought it was.

Scratch that. I must have run the test incorrectly (I guess I was still publishing to the input queue). Now I see the dlq queue using lots of memory, which is what I'd expect.

mkuratczyk added a commit that referenced this issue Oct 24, 2022
This doubles the publish rate using the repro steps from
#5312
and somewhat reduces the memory usage.

```
rabbitmqctl -n rabbit-0 set_policy --apply-to queues ha "cmq.*" '{"ha-mode":"exactly", "ha-params": 3, "ha-sync-mode": "automatic"}'

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=1 -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v1
BEFORE: id: test-164723-973, sending rate avg: 6156 msg/s
AFTER: id: test-165708-401, sending rate avg: 11572 msg/s

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=1,x-queue-mode=lazy -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v1lazy
BEFORE: id: test-165020-228, sending rate avg: 6239 msg/s
AFTER: id: test-165848-489, sending rate avg: 11758 msg/s

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=2 -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v2
BEFORE: id: test-165314-597, sending rate avg: 6818 msg/s
AFTER: id: test-170027-308, sending rate avg: 13891 msg/s
```
mkuratczyk added a commit that referenced this issue Oct 25, 2022
This seems to be a large contributor to memory spikes observed in
#5312.
Initial testing shows significantly lower memory usage with the issue 5312 workload.
@mkuratczyk
Contributor

@illotum @johanrhodin could you please test the cmq-optimisations branch, also available as an OCI image: pivotalrabbitmq/rabbitmq:cmq-optimisations-otp-max-bazel?

For me it roughly doubles the publish rate during the first phase of the test and, I assume, when messages are published internally to the dlq. Thanks to this, it also seems to lower the memory usage by at least half. In my system, that's the difference between an OOM kill and smooth operation, albeit still with relatively high memory usage. Is such an improvement sufficient?

Note: when using CQv2, it's also useful to increase dist_buf_busy_limit. CQv2 is so much faster that it can easily overload the distribution channel, and then we get a "hiccup" during the consumption phase.
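
For reference, a sketch of raising that limit; dist_buf_busy_limit maps to the distribution buffer size environment variable (in kilobytes), and the value below is an arbitrary example:

```
# Hypothetical: raise the inter-node distribution buffer from the default
# 128000 kB to roughly 1 GB before starting the node.
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=1024000 rabbitmq-server
```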

Tested with CQv1, CQv1 lazy, and CQv2. Each of these hit a memory alarm/OOM kill using the main branch, while the results with this branch are better (although, again, the memory used is still high):
[Screenshot 2022-10-25 at 15.59.11]

@mkuratczyk mkuratczyk changed the title from "Churn in classic mirrored DLQ" to "Dead lettering en masse can overload the DLQ" Oct 26, 2022
@mkuratczyk
Contributor

mkuratczyk commented Oct 26, 2022

One more thing worth sharing: ultimately the problem is that we don't use publisher confirms when internally publishing to classic queues (which is what happens during dead-lettering). I came up with a similar scenario without mirroring: put the input queue on one node and the dlq on another node and follow the same steps, just without the mirroring policy. Since mirroring makes this even worse (it is responsible for part of the memory usage), you will need to increase the number/size of the messages, but there are combinations of values that will lead to the dead-letter queue being overloaded even without mirroring.
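
A rough sketch of that cross-node variant, reusing the two-broker environment from the reproduction steps; the message count is an arbitrary guess at "increase the number/size of the messages", and "dlq" is assumed to be created beforehand against the second node (e.g. via its management UI on port 15672):

```
# Hypothetical non-mirrored variant: no ha policy is set. With client-local
# queue placement, "input" lives on the node behind port 5671 while "dlq"
# (declared against the other node) lives behind port 5672, so dead-lettering
# publishes across nodes.
./bin/runjava com.rabbitmq.perf.PerfTest -u "input" -h amqp://localhost:5671 \
  --auto-delete false -f persistent \
  --queue-args x-queue-master-locator=client-local,x-dead-letter-exchange=,x-dead-letter-routing-key=dlq \
  -x 4 -y 0 -s 1000 -C 1000000

# Then drain with nacks, as in the original reproduction:
./bin/runjava com.rabbitmq.perf.PerfTest -u "input" -h amqp://localhost:5671 \
  --auto-delete false -f persistent \
  --queue-args x-queue-master-locator=client-local,x-dead-letter-exchange=,x-dead-letter-routing-key=dlq \
  -x 0 -y 10 -na -re false
```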

I think all of this just proves what we knew: we need publisher confirms / flow control for internal publishing/dead-lettering to classic queues, just like we now have for quorum queues.

@illotum
Author

illotum commented Oct 26, 2022

@mkuratczyk your branch shows significant improvement across all my tests. I can still see the spike, but it's much harder to cross into alarm now.

@mkuratczyk
Contributor

I have made one more commit to the branch to bring back some parts I removed previously, as the change showed a performance regression with larger (non-embedded) messages. I will continue testing next week, but I'd also appreciate another round of testing on your side, especially with workloads that differ from the reproduction steps of this issue. The memory usage reduction in this scenario is significant and consistent, but the change affects mirroring in general, so I want to be careful with it, as lots of users may be affected.

@illotum
Author

illotum commented Nov 2, 2022

Not too rigorous, but I've been running various sync scenarios over the last couple of days: cluster member loss, DLQ, lazy and not, various message sizes. Tests show a similar or better profile in all cases (the difference is often insignificant); so far I see no degradation.

@illotum
Author

illotum commented Nov 22, 2022

@mkuratczyk is there anything else I can help with regarding your performance branch?

@michaelklishin
Member

@illotum the core team will discuss what's ahead for that branch. Thank you for the details and for not letting this slip.

@mkuratczyk
Contributor

@illotum What versions would you expect this to be back-ported to? While we didn't find any regressions, we know RabbitMQ is used in so many ways/configurations that the change could have a negative impact on some of them, so we are wary of back-porting this and releasing it in a patch release (especially for "older" versions like 3.9, where very few changes happen at this point).

@illotum
Author

illotum commented Nov 23, 2022

I'd be happy with 3.10, and fine with 3.11. We are only now rolling out 3.11, so this would be a good release to pick first.

@michaelklishin
Member

3.11 is what the core team had in mind.

@danielllek

If it's possible for 3.10, I'd be very happy :)

@mkuratczyk
Contributor

@illotum My suggestion would be to ship it in 3.11 for now. Start using it, keep an eye on these instances and then, if you see the benefits and don't see regressions, we could consider backporting to 3.10.

mkuratczyk added a commit that referenced this issue Nov 24, 2022
This doubles the publish rate using the repro steps from
#5312
and somewhat reduces the memory usage.

```
rabbitmqctl -n rabbit-0 set_policy --apply-to queues ha "cmq.*" '{"ha-mode":"exactly", "ha-params": 3, "ha-sync-mode": "automatic"}'

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=1 -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v1
BEFORE: id: test-164723-973, sending rate avg: 6156 msg/s
AFTER: id: test-165708-401, sending rate avg: 11572 msg/s

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=1,x-queue-mode=lazy -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v1lazy
BEFORE: id: test-165020-228, sending rate avg: 6239 msg/s
AFTER: id: test-165848-489, sending rate avg: 11758 msg/s

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=2 -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v2
BEFORE: id: test-165314-597, sending rate avg: 6818 msg/s
AFTER: id: test-170027-308, sending rate avg: 13891 msg/s
```
mkuratczyk added a commit that referenced this issue Nov 24, 2022
This seems to be a large contributor to memory spikes observed in
#5312.
Initial testing shows significantly lower memory usage with the issue 5312 workload.
@mkuratczyk
Contributor

I've opened a PR and shared some results of today's tests #6467

mergify bot pushed a commit that referenced this issue Nov 25, 2022
This doubles the publish rate using the repro steps from
#5312
and somewhat reduces the memory usage.

```
rabbitmqctl -n rabbit-0 set_policy --apply-to queues ha "cmq.*" '{"ha-mode":"exactly", "ha-params": 3, "ha-sync-mode": "automatic"}'

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=1 -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v1
BEFORE: id: test-164723-973, sending rate avg: 6156 msg/s
AFTER: id: test-165708-401, sending rate avg: 11572 msg/s

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=1,x-queue-mode=lazy -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v1lazy
BEFORE: id: test-165020-228, sending rate avg: 6239 msg/s
AFTER: id: test-165848-489, sending rate avg: 11758 msg/s

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=2 -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v2
BEFORE: id: test-165314-597, sending rate avg: 6818 msg/s
AFTER: id: test-170027-308, sending rate avg: 13891 msg/s
```

(cherry picked from commit b5226c1)
mergify bot pushed a commit that referenced this issue Nov 25, 2022
This seems to be a large contributor to memory spikes observed in
#5312.
Initial testing shows significantly lower memory usage with the issue 5312 workload.

(cherry picked from commit d7d129c)
@michaelklishin michaelklishin added this to the 3.11.4 milestone Nov 25, 2022
@SimonUnge
Member

@illotum My suggestion would be to ship it in 3.11 for now. Start using it, keep an eye on these instances and then, if you see the benefits and don't see regressions, we could consider backporting to 3.10.

@mkuratczyk I would like to revisit the idea of back-porting this to the 3.10 branch! As for getting feedback from us (AWS) on regressions (and benefits), that feedback will be easier to provide if this lands in the 3.10 branch, since we currently have much larger adoption of 3.10 in our fleet of brokers.

Is there anything in particular you require in terms of feedback from us in order to backport to 3.10 sooner rather than later?

@mkuratczyk
Contributor

I understand you have more deployments on 3.10, but do you also already have 3.11.4+ running and behaving well?

@SimonUnge
Member

@mkuratczyk We actually do not have any brokers on 3.11 yet, as we currently only support 3.10.*, hence the difficulty collecting data. We will go live with 3.11 early next year, but as of now, if we want live data, it will need to come from a 3.10 version. We have not seen any issues in our local testing, but we cannot collect live data.

Hence the wish for a 3.10 backport, so we can get feedback more quickly and, of course, address some of the issues we see on our brokers.

@SimonUnge
Member

Hi, just wanted to give this issue a bump in the hope of getting it back-ported to 3.10. In our fleet we weekly encounter issues that we think will be resolved by this fix, and we would be able to provide feedback quickly, but it would need to be back-ported to a 3.10 release before we can do that. So I hope a 3.10 backport before real-world data feedback is available would be an option?

We still have seen no regression issues with 3.11, but that is based only on local testing.
@mkuratczyk, what would be a good, convincing argument from us to get this backported to 3.10 sooner rather than later? :)

@michaelklishin
Member

#6467 cherry-picks cleanly to v3.10.x. The changes seem node-local, so I don't find them risky.

@michaelklishin
Member

Another observation: #6467 relies on sets v2, which were introduced in Erlang/OTP 24. RabbitMQ 3.10.x already requires 24.x, and later this year it will require 25.

@michaelklishin
Member

#6867

@michaelklishin
Member

Shipped in 3.10.14.

michaelklishin added a commit that referenced this issue Jan 17, 2023
michaelklishin added a commit that referenced this issue Jan 17, 2023
michaelklishin added a commit that referenced this issue Jan 17, 2023
(cherry picked from commit 4419707)
michaelklishin added a commit that referenced this issue Jan 17, 2023
(cherry picked from commit e3a0304)
michaelklishin added a commit that referenced this issue Jan 17, 2023
(cherry picked from commit 4419707)
@SimonUnge
Member

Shipped in 3.10.14.

Awesome, thank you!
