This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

RabbitMQ high CPU usage on idle VM #3855

Closed

TomaszUrugOlszewski opened this issue Feb 23, 2018 · 42 comments · Fixed by #13044 or #10377

Comments

@TomaszUrugOlszewski

Is this a request for help?: Yes


Is this a BUG REPORT or FEATURE REQUEST? (choose one): FEATURE REQUEST

Version of Helm and Kubernetes:
Helm: 2.8.0
Kubectl server: 1.7.12-gke.1 (current) (It's GCP)

Which chart:
stable/rabbitmq

What happened:
High CPU usage on an idle VM with only RabbitMQ running, generated by the readiness/liveness probes. Based on the Stackdriver charts I see 100% CPU usage on an n1-standard-2 VM. After forking the chart and replacing the probes with a simple tcpSocket check on port 15672, usage dropped to ~0%.

[Screenshot: Stackdriver CPU usage chart]

What you expected to happen:
The chart should allow customizing the health checks, e.g. using tcpSocket or httpGet probes instead of exec probes.
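
For illustration, such a probe could look roughly like the sketch below (plain Kubernetes tcpSocket probes against the management port; the timing values are assumptions, not the chart's defaults):

  livenessProbe:
    tcpSocket:
      port: 15672          # management port; 5672 (AMQP) would also work
    initialDelaySeconds: 30
    periodSeconds: 10
  readinessProbe:
    tcpSocket:
      port: 15672
    initialDelaySeconds: 10
    periodSeconds: 10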

How to reproduce it (as minimally and precisely as possible):
Run it on GCP on an n1-standard-2 VM and watch the CPU usage.

@sbnl

sbnl commented Mar 15, 2018

I'd like to add that I'm also seeing this on GCP K8s 1.8.7-gke.1: approx. 60% CPU usage at idle on an n1-standard-1 (3.7 GB) free tier instance.
erl_child_setup and beam.smp are the main consumers, at around 25% each.

Edit: chart rabbitmq-0.6.17.

Edit: upgraded to 0.6.25, no change.

@macropin

macropin commented Apr 5, 2018

We're seeing this as well (k8s 1.8.10, rabbitmq-0.6.25). This is caused by a longstanding Erlang issue related to the nofile ulimit, which has been known since at least 2014.

If you disable the liveness and readiness probes, you will find that the idle usage comes down a lot.

See https://github.com/bitnami/bitnami-docker-rabbitmq/pull/63
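
For reference, with this chart disabling the probes amounts to roughly the following values (a sketch; the exact enabled flags depend on the chart version, so treat the key names as assumptions):

  livenessProbe:
    enabled: false
  readinessProbe:
    enabled: false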

@robermorales
Contributor

Cool! The solution of disabling the readiness and liveness probes has worked so far. But is there any option to change the ulimit in the Docker image, the chart, or the deployment itself?

@thomas-riccardi
Contributor

@robermorales the Docker image including the fix (https://github.com/bitnami/bitnami-docker-rabbitmq/pull/69) is being prepared by Bitnami; once it is released the default value should be OK, but we should still expose it in the chart values.

@robermorales
Contributor

thanks!

@thomas-riccardi
Contributor

The fix has been released in docker images 3.7.4-r4 and 3.6.15-r4 (and their aliases: 3.7.4, 3.7, 3.6.15 and 3.6).

Since #4591 the values.yaml became non-prod, and the default image tag is the floating tag 3.7.4 instead of 3.7.4-r1 (values-production.yaml still refers to 3.7.4-r1).
Alas, we also default to pullPolicy: IfNotPresent, so in practice the floating tag is not a great idea...

I'm not sure values-production.yaml is a good pattern; maybe it could just override some values instead of redefining everything (only redis and rabbitmq use it).

Maybe we should remove the floating tag and use immutable tags: 3.7.4-r4.

In any case, we should bump to -r4 to get the high CPU usage fix.
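
A sketch of what pinning an immutable tag could look like in values.yaml (the image block layout is an assumption based on the usual chart conventions):

  image:
    registry: docker.io
    repository: bitnami/rabbitmq
    tag: 3.7.4-r4            # immutable tag that includes the ulimit fix
    pullPolicy: IfNotPresent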

@rips-hb

rips-hb commented Jun 20, 2018

I still have the exact same issue in 3.7.6-r8.

@thomas-riccardi
Contributor

@rips-hb I also see some high CPU usage, but it's periodic, not constant. I found out that it's the probes (rabbitmqctl status) that use that much CPU periodically; it's a different issue (a separate ticket should probably be created here), and I'm not sure how to fix it.

@rips-hb

rips-hb commented Jun 20, 2018

@thomas-riccardi for me it is unfortunately constant, and I could resolve it by disabling the liveness and readiness probes as suggested in this thread. Since it is only a test system that is not a problem, but I would rather have these checks on a production system. I will investigate a bit more, and if I find something else I will create a new ticket.

@stale

stale bot commented Aug 19, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot added the lifecycle/stale label Aug 19, 2018
@nerumo
Contributor

nerumo commented Aug 19, 2018

still an issue

stale bot removed the lifecycle/stale label Aug 19, 2018
@stale

stale bot commented Sep 18, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot added the lifecycle/stale label Sep 18, 2018
@macropin

macropin commented Sep 25, 2018

The fix https://github.com/bitnami/bitnami-docker-rabbitmq/pull/63 is incomplete and does not set the ulimit for the liveness/readiness probes. So this is still an issue.

stale bot removed the lifecycle/stale label Sep 25, 2018
@thomas-riccardi
Contributor

@macropin you are right.
However, I did not find a difference in execution time for rabbitmqctl status (or rabbitmqctl node_health_check) with and without ulimit -n 1024 (or ulimit -n 65536), so adding ulimit -n to the health checks will probably not help.

@thomas-riccardi
Contributor

@macropin see also rabbitmq-ha advancements: #7378 (and #7752).

@intelfx
Contributor

intelfx commented Oct 2, 2018

@macropin @thomas-riccardi As I see it, the fix in bitnami/bitnami-docker-rabbitmq#63 is not only incomplete, it is completely inapplicable, because it modifies a Docker entrypoint that we do not use.

@intelfx
Contributor

intelfx commented Oct 2, 2018

Also I concur with @thomas-riccardi's findings: I did some testing too, and it turns out that even setting a ridiculously low ulimit -n 128 does not help to reduce either the probes' CPU time or the overall Pod's CPU usage.

@alexsandro-xpt

Could this be fixed at helm install time? I have the same problem here.

@stale

stale bot commented Nov 5, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot added the lifecycle/stale label Nov 5, 2018
@thomas-riccardi
Contributor

@TomaszUrugOlszewski It should be fixed by #8140 (and subsequent fixes), modulo the issue about disk or memory alarms (see #8635).

@leixu26

leixu26 commented Nov 9, 2018

3.5.7: same issue, high CPU load.

@leixu26

leixu26 commented Nov 9, 2018

RabbitMQ 3.7.8, Erlang 21.1: same issue, high CPU usage. QPS is about 200/s.

@alexsandro-xpt

But was this fixed?

@dnetguru

I'm seeing the exact same behavior on 3.7.8 (Erlang 21.1), with very high CPU usage when idle. As a workaround, disabling both the readiness and liveness checks seems to fix the issue.

@javsalgar
Collaborator

Hi,

Thanks for the feedback. If this issue is constant, then maybe it makes sense to change the readiness/liveness probes to simple TCP port checks. Thoughts on that?

@dnetguru

dnetguru commented Dec 8, 2018

@javsalgar I tried again with the latest version of the chart (rabbitmq-4.0.1) with RabbitMQ 3.7.9 (Erlang 21.1), and this problem does not seem to happen anymore.

@juan131
Collaborator

juan131 commented Jan 2, 2019

Hi everyone,

Do you still suffer from high CPU usage because of the readiness/liveness probes in the latest versions of the chart?

I agree with @javsalgar: we can make the probes simpler (such as a TCP port check) or decrease their frequency if you're running into issues because of that.

@desaintmartin
Collaborator

In my case, I greatly reduced the frequency of the probes.

@juan131
Collaborator

juan131 commented Jan 3, 2019

What values did you use @desaintmartin ?

@desaintmartin
Collaborator

  livenessProbe:
    timeoutSeconds: 30
    periodSeconds: 30
  readinessProbe:
    timeoutSeconds: 30
    periodSeconds: 30

@macropin

macropin commented Jan 4, 2019

So the "fix" isn't a fix, rather a workaround. This should be reopened. Were the ulimits ever updated for the probes?

@juan131
Collaborator

juan131 commented Jan 4, 2019

Hi @macropin

Currently the liveness/readiness probes use curl instead of rabbitmqctl. Why do you consider it necessary to update the ulimits on the probes?
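
For context, a curl-based exec probe looks roughly like this (a sketch only; the exact endpoint, credential environment variables and timings used by the chart are assumptions here):

  livenessProbe:
    exec:
      command:
        - sh
        - -c
        - curl --fail --user "$RABBITMQ_USERNAME:$RABBITMQ_PASSWORD" http://127.0.0.1:15672/api/healthchecks/node
    periodSeconds: 30
    timeoutSeconds: 20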

@macropin

macropin commented Jan 5, 2019

Oh, that's great. I missed that change. The high CPU usage of rabbitmqctl was due to it not inheriting the entrypoint's ulimit, which caused a CPU usage issue with the Erlang VM.

@juan131
Collaborator

juan131 commented Jan 8, 2019

Yes, it was one of the reasons why the CPU usage was so high. That's why the probes were moved to curl in #8140.

@Artimi

Artimi commented Apr 1, 2019

Hi, I've found another reason why RabbitMQ can have noticeable CPU usage when idle or under a light load. RabbitMQ runs on Erlang and uses its scheduling capabilities. To schedule processes, Erlang uses scheduler threads, and by default their number depends on the number of logical cores. This is a problem when running in Docker/Kubernetes, because RabbitMQ will think it has more resources than it actually has. In our case, a RabbitMQ node runs on a server with 40 cores, but we limit it to 1 core in Kubernetes. Erlang will run 40 scheduler threads that are constantly context switching, which generates the CPU usage. When I set the number of scheduler threads to 1, the CPU usage dropped from 23% to 3%.
You can check how many scheduler threads you are using with rabbitmqctl status:

$ rabbitmqctl status
...
{erlang_version,
     "Erlang/OTP 20 [erts-9.3.3.3] [source] [64-bit] [smp:40:40] [ds:40:40:10] [async-threads:640] [hipe] [kernel-poll:true]\n"},
...

The numbers after smp are the number of scheduler threads. For more info see the Erlang scheduler details. You can set the value using an environment variable:

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+S 1:1"
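
In a plain Kubernetes Deployment (the setup described in this comment), that could be wired up roughly like this; the container name and image are placeholders:

  containers:
    - name: rabbitmq          # placeholder name
      image: rabbitmq:3.7     # placeholder image
      env:
        - name: RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS
          value: "+S 1:1"     # 1 scheduler thread, 1 online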

@infa-ddeore

@Artimi how did you set "RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS" with the rabbitmq Helm chart?

@Artimi

Artimi commented Apr 11, 2019

@infa-ddeore We are actually not using Helm, just a custom-made Kubernetes deployment. I wrote it here because I had problems similar to the ones in this issue.

@thomas-riccardi
Contributor

thomas-riccardi commented Apr 11, 2019

@infa-ddeore the feature indeed seems to be missing.
It could easily be added, like I did in #12908 for the metrics container.

We could also use the downward API to get the CPU requests or limits and automatically generate the correct value for RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS in the command. Or just use Helm templating.
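
A rough sketch of the downward API approach (untested; the container name is a placeholder, and the divisor rounds the CPU limit up to whole cores):

  env:
    - name: CPU_LIMIT
      valueFrom:
        resourceFieldRef:
          containerName: rabbitmq   # placeholder
          resource: limits.cpu
          divisor: "1"              # whole cores, rounded up
    - name: RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS
      value: "+S $(CPU_LIMIT):$(CPU_LIMIT)"   # dependent env var expansion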

@infa-ddeore

@thomas-riccardi thanks for the pointers. For now I will update the stable/rabbitmq chart locally with this variable and deploy that.

@infa-ddeore

@Artimi setting "RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS" to "+S 1:1" doesn't seem to help me; I will try disabling the liveness and readiness checks. My RabbitMQ is 3.7.8 and Erlang is 21.

@juan131
Collaborator

juan131 commented Apr 15, 2019

Thanks for reporting it @Artimi

I just created a PR so the user has a couple of parameters to limit the number of scheduler threads.

@zzguang520

Same issue in Helm chart rabbitmq-ha-1.47.1.
