Dead lettering en masse can overload the DLQ #5312
Comments
And a note from @thuandb (queue depth mismatch because of a different env):
This may or may not be related; the symptoms here remind me of this message - https://groups.google.com/g/rabbitmq-users/c/-_J_K4T7yTI/
What RabbitMQ version was used?
My test was run from a somewhat stale master, at e2edb1b. In prod this was observed on 3.8.22 and 3.8.27.
Updated description to remove the mention of allocs: I haven't actually looked at allocations yet. Memory may sway because of paging. |
I have put together a reproduction project here: https://github.com/lukebakken/rabbitmq-server-5312 (it uses RMQ 3.10.6 and the latest PerfTest). I can reproduce the issue as described. Workarounds that address the issue:
Using CQv2 or a quorum queue solves the problem => closing.
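For anyone who wants to try that workaround on an existing deployment, a minimal sketch, assuming the `queue-version` policy key available for classic queues on RabbitMQ 3.10+ (the policy name and queue pattern below are illustrative, not from the original reports):

```
# Switch matching classic queues to CQv2 via a policy; deleting the
# policy reverts them to their declared version.
rabbitmqctl set_policy --apply-to queues cq-version "cmq.*" '{"queue-version": 2}'
```

Declaring the queues as quorum queues instead (queue argument `x-queue-type=quorum`) is the other option mentioned above.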
We still have this issue with v2 classic queues. Will try to reproduce next week.
Putting this here since the issue seems to be the same - https://groups.google.com/g/rabbitmq-users/c/Kr2QZx1TREA
So in total, two reports of this still happening with CMQv2? @johanrhodin, looking forward to the reproduction steps as well.
I've also found another workaround - switching queue-mode from
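For reference, the queue mode of classic queues can be flipped at runtime with a policy rather than redeclaring the queue; a sketch with an illustrative policy name and pattern:

```
# Apply lazy mode to matching classic queues; remove the policy to go
# back to the default mode.
rabbitmqctl set_policy --apply-to queues cq-mode "cmq.*" '{"queue-mode": "lazy"}'
```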
FWIW I just revisited this using the latest
As expected quorum queues do not show this issue (using the latest
cc @michaelklishin any reason to keep this open at this point?
We can leave it open if some CMQ user would like to investigate and contribute a fix. What immediately stands out on the screenshot is the
I believe the problem comes from the way at-most-once dead lettering works. For quorum queues, we added an at-least-once strategy that uses publisher confirms internally, but classic queues don't support that. When a lot of messages are dead-lettered, we are effectively publishing to a mirrored queue without any flow-control mechanism, which overloads the DLQ process and causes a memory spike.

Assuming that's true, we would need classic queues to support that as well. Alternatively, perhaps it's possible to apply back-pressure using some other mechanism, to prevent the DLQ from getting overloaded. Feel free to have a look at the QQ implementation for reference.
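For comparison, a sketch of how the quorum-queue at-least-once strategy is enabled, assuming the `dead-letter-strategy` policy key (the `x-dead-letter-strategy` queue argument is the declaration-time equivalent); names and pattern are illustrative, and as far as I know the strategy is only honoured by quorum queues and requires overflow set to reject-publish:

```
# Matching quorum queues dead-letter with the at-least-once strategy,
# which uses publisher confirms internally for the dead-letter publishes.
rabbitmqctl set_policy --apply-to queues qq-dlx "qq.*" \
  '{"dead-letter-strategy": "at-least-once", "overflow": "reject-publish", "dead-letter-exchange": "", "dead-letter-routing-key": "qq-dlq"}'
```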
@michaelklishin @mkuratczyk interestingly enough, the
Scratch that. I must have run the test incorrectly (I guess I was still publishing to the
This doubles the publish rate using the repro steps from #5312 and somewhat reduces the memory usage.
```
rabbitmqctl -n rabbit-0 set_policy --apply-to queues ha "cmq.*" '{"ha-mode":"exactly", "ha-params": 3, "ha-sync-mode": "automatic"}'

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=1 -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v1
BEFORE: id: test-164723-973, sending rate avg: 6156 msg/s
AFTER:  id: test-165708-401, sending rate avg: 11572 msg/s

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=1,x-queue-mode=lazy -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v1lazy
BEFORE: id: test-165020-228, sending rate avg: 6239 msg/s
AFTER:  id: test-165848-489, sending rate avg: 11758 msg/s

$ perf-test -ad false -f persistent -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-master-locator=client-local,x-queue-version=2 -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v2
BEFORE: id: test-165314-597, sending rate avg: 6818 msg/s
AFTER:  id: test-170027-308, sending rate avg: 13891 msg/s
```
This seems to be a large contributor to the memory spikes observed in #5312. Initial testing shows significantly lower memory usage with the issue 5312 workload.
@illotum @johanrhodin could you please test the cmq-optimisations branch, also available as an OCI image? For me it roughly doubles the publish rate during the first phase of the test and, I assume, when messages are published internally to the DLQ.

Note: when using CQv2, it's also useful to increase the
Tested with CQv1, CQv1 lazy and CQv2. Each of these had a memory alarm/oomkill using the
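The exact OCI image coordinates are not quoted above; purely as a hypothetical illustration (both the repository and the tag below are assumptions, not taken from this thread), running such a test image could look like:

```
# Hypothetical image name/tag for the cmq-optimisations build.
docker run -d --name rmq-cmq-opt -p 5672:5672 -p 15672:15672 \
  pivotalrabbitmq/rabbitmq:cmq-optimisations
```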
One more thing worth sharing: ultimately the problem is that we don't use publisher confirms when internally publishing to classic queues (which is what happens during dead-lettering). I came up with a similar scenario without mirroring - put the

I think all of this just proves what we knew: we need publisher confirms / flow control for internal publishing/dead-lettering to classic queues, just like we now do for quorum queues.
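For the client-side analogue of that missing flow control, PerfTest can cap the number of unconfirmed publishes with its confirm option; a sketch based on the repro command above (this only throttles the external publisher, not the internal dead-letter publishing discussed here):

```
# Same workload as the repro, but the publisher keeps at most 100
# unconfirmed messages in flight (-c/--confirm), letting the broker
# exert back-pressure on the client.
perf-test -ad false -f persistent -c 100 \
  -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-version=2 \
  -x 4 -y 0 -s 1000 -C 250000 -u cmq-input-v2
```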
@mkuratczyk your branch shows significant improvement across all my tests. I can still see the spike, but it's much harder to cross into alarm now.
I have made one more commit to the branch to bring back some parts I removed previously, as it showed some performance regression with larger (not embedded) messages. I will continue testing next week, but I'd also appreciate another round of testing on your side, especially with workloads that differ from the reproduction steps of this issue. The memory usage reduction in this scenario is significant and consistent, but the change affects mirroring in general, so I want to be careful with it, as lots of users may be affected.
Not too rigorous, but I've been running various sync scenarios over the last couple of days: cluster member loss, DLQ, lazy and not, various message sizes. Tests show a similar or better profile in all cases (differences are often insignificant); so far I see no degradation.
@mkuratczyk anything else I can help with re your performance branch?
@illotum the core team will discuss what's ahead for that branch. Thank you for the details and for not letting this slip.
@illotum What versions would you expect this to be back-ported to? While we didn't find any regressions, we know RabbitMQ is used in so many ways/configurations that it could have a negative impact in some of them, so we are wary of back-porting this or releasing it in a patch release (especially for "older" versions like 3.9, where very few changes happen at this point).
I'd be happy with 3.10, and fine with 3.11. We are only now rolling out 3.11, so this would be a good release to pick as the first.
3.11 is what the core team had in mind.
If it's possible for 3.10, I'd be very happy :)
@illotum My suggestion would be to ship it in 3.11 for now. Start using it, keep an eye on these instances and then, if you see the benefits and don't see regressions, we could consider backporting to 3.10.
I've opened a PR and shared some results of today's tests: #6467
(cherry picked from commit b5226c1)
@mkuratczyk I would like to revisit the idea of back-porting this to the 3.10 branch! Feedback from us (AWS) on regressions (and benefits) will be easier to provide if this lands in the 3.10 branch, as we currently have much larger adoption of 3.10 in our fleet of brokers. Is there anything in particular you require from us, in terms of feedback, in order to backport to 3.10 sooner rather than later?
I understand you have more 3.10 deployments, but do you have 3.11.4+ already running as well, and behaving well?
@mkuratczyk We actually do not have any brokers on 3.11 yet, as we currently only support 3.10.*, hence the difficulty collecting data. We will go live with 3.11 early next year, but as of now, if we want live data, it will need to come from a 3.10 version. Hence the wish for a 3.10 backport, so we can get feedback more quickly and, of course, address some of the issues we see on our brokers.
Hi, just wanted to give this issue a bump, in the hope of getting it back-ported to 3.10 - in our fleet we encounter issues weekly that we think this fix will resolve, and we would be able to provide feedback quickly. But it would need to be back-ported to a 3.10 release before we can do that. So I hope a 3.10 backport before real-world data feedback is available would be an option? We have still seen no regression issues with 3.11, but that is only based on local testing.
#6467 cherry-picks cleanly to
Another observation: #6467 relies on Sets v2 that were introduced in Erlang 24. RabbitMQ 3.10.x already requires 24.x, and later this year will require 25. |
Shipped in
(cherry picked from commit 4419707)
(cherry picked from commit e3a0304)
Awesome, thank you!
We observe a potential bug in our brokers:
A classic mirrored DLQ replica, at a certain load, flips into CPU and memory churn and cannot recover. First reported here and later confirmed on multiple RabbitMQ versions.
With an appropriate data set the instance is OOM-killed, but even on smaller sets I had the broker crossing into the red and back for a long time after the test.
Reproduction
Environment:
- Two brokers
- Long message backlog nacked between queues
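A condensed sketch of the reproduction, based on the perf-test commands shared later in this thread (the trigger for the dead lettering itself, e.g. nacks or TTL, is not shown here; ha-params is set to 2 to match the two-broker environment):

```
# Mirror the input queue and its DLQ across the cluster.
rabbitmqctl -n rabbit-0 set_policy --apply-to queues ha "cmq.*" \
  '{"ha-mode":"exactly", "ha-params": 2, "ha-sync-mode": "automatic"}'

# Build up a large persistent backlog in a classic mirrored queue that
# dead-letters into cmq-dlq: 4 publishers, no consumers, 1 kB messages.
perf-test -ad false -f persistent \
  -qa x-dead-letter-exchange=,x-dead-letter-routing-key=cmq-dlq,x-queue-version=1 \
  -x 4 -y 0 -s 1000 -C 250000 -u cmq-input
```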
Observations
The slave replica's memory footprint stays abnormally high, oscillating by +/- 2 GB, for many hours (up to 4 so far) after the test. There is a significant uptick in CPU as well.
From the couple of hours I spent on this: