[bitnami/rabbitmq] RabbitMQ high CPU usage while idle #11116

Closed
orgads opened this issue Jul 10, 2022 · 21 comments · Fixed by #11117 or #16082

@orgads
Contributor

orgads commented Jul 10, 2022

Name and Version

bitnami/rabbitmq 10.1.11

What steps will reproduce the bug?

Just run it and watch the CPU usage.
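For reference, a minimal reproduction might look like this (chart version as above; the pod label selector is an assumption based on the chart's standard labels):

# install the chart with default values and watch idle CPU
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-rabbitmq bitnami/rabbitmq --version 10.1.11
kubectl top pods -l app.kubernetes.io/name=rabbitmq   # requires metrics-server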

What do you see instead?

High CPU usage while idle

Additional information

The liveness and readiness probes each spawn a separate Erlang process, which is very expensive.

This was reported (and fixed) in helm/charts#3855 but not to bitnami.
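For context, a sketch of what the chart's exec-style probes look like (commands taken from the summary later in this thread; the exact template wording varies between chart versions):

# sketch of the default exec probes (simplified)
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -ec
      - rabbitmq-diagnostics -q ping
readinessProbe:
  exec:
    command:
      - /bin/bash
      - -ec
      - rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms

Each invocation starts a fresh Erlang VM just to run the check, which is where the idle CPU goes.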

@javsalgar
Contributor

Hi,

It looks to me like this is more related to RabbitMQ itself, so the upstream developers should recommend which settings suit best to avoid the issue (in case it is only a matter of settings and not a bug).

fmulero pushed a commit that referenced this issue Jul 12, 2022
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Fixes #11116

Signed-off-by: Orgad Shaneh <[email protected]>
vaggeliskls pushed a commit to vaggeliskls/charts that referenced this issue Jul 21, 2022
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Fixes bitnami#11116

Signed-off-by: Orgad Shaneh <[email protected]>
Signed-off-by: vaggeliskls <[email protected]>
FraPazGal pushed a commit that referenced this issue Jul 25, 2022
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Fixes #11116

Signed-off-by: Orgad Shaneh <[email protected]>
@Igor-lkm

Igor-lkm commented Apr 4, 2023

Same issue here.

Environment: GCP GKE: 1.25.7-gke.1000

Chart versions: bitnami/rabbitmq 10.1.5 and 11.12.2

After a restart it works fine for some time, and then CPU goes to 100% of what is available:

[Screenshot 2023-04-04 at 09:54:59: CPU usage graph reaching 100%]

Debug output from the node with the issue:

> top

Tasks:  11 total,   1 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s): 34.2 us, 15.7 sy,  0.0 ni, 47.4 id,  0.7 wa,  0.0 hi,  2.0 si,  0.0 st
MiB Mem :   7450.7 total,    324.0 free,   2586.8 used,   4540.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   4515.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  46152 1001      20   0 2118592  56728  22068 S  51.1   0.7 257:17.79 beam.smp
    163 1001      20   0 2250684 147216  71152 S   8.5   1.9  63:48.97 beam.smp
      1 1001      20   0    2508   1664   1496 S   0.0   0.0   0:00.44 rabbitmq-server
     95 1001      20   0    5940   3368   1180 S   0.0   0.0   0:01.95 epmd
    169 1001      20   0    2396   1336   1232 S   0.0   0.0   0:00.27 erl_child_setup
    221 1001      20   0    3740    920    812 S   0.0   0.0   0:00.37 inet_gethost
    222 1001      20   0    3956   1896   1728 S   0.0   0.0   0:00.61 inet_gethost
    225 1001      20   0    2508    560    448 S   0.0   0.0   0:01.19 sh
  46166 1001      20   0    2396    592    500 S   0.0   0.0   0:00.00 erl_child_setup
 133259 1001      20   0    2508    552    448 S   0.0   0.0   0:00.01 sh

So beam.smp takes 50+% CPU on the affected node (which is the full requested CPU resource).

Healthy node shows:

> top

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    164 1001      20   0 2271084 168556  71148 S   7.0   2.2  67:11.16 beam.smp
 141808 1001      20   0    9996   3804   3300 R   0.3   0.0   0:00.02 top
...
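To confirm which beam.smp belongs to a probe-spawned CLI rather than the broker, one could inspect its command line (the PID is taken from the top output above; the escript path is the one @frivoire reports later in this thread):

# print the full command line of the suspicious beam.smp process
cat /proc/46152/cmdline | tr '\0' ' '; echo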

Yaml:

replicaCount: 3
resources:
  requests:
    cpu: 500m
    memory: 700Mi
  limits:
    cpu: 600m
    memory: 900Mi
...
metrics:
  enabled: true
clustering:
  forceBoot: true
readinessProbe:
  periodSeconds: 60
  timeoutSeconds: 40
livenessProbe:
  periodSeconds: 60
  timeoutSeconds: 40

@github-actions github-actions bot added triage Triage is needed and removed solved labels Apr 4, 2023
@javsalgar javsalgar changed the title RabbitMQ high CPU usage while idle [bitnami/rabbitmq] RabbitMQ high CPU usage while idle Apr 4, 2023
@frivoire

frivoire commented Apr 11, 2023

I have observed this unusual CPU usage on chart v10.3.9 (RabbitMQ 3.10.20) running on GKE: ~0.4 CPU used per pod while idle.
For me it starts right after pod startup (without any delay), so it is fully reproducible.

At least for now, here is my workaround:
=> patch the chart to add RABBITMQ_CTL_ERL_ARGS="+S 1:1" to the liveness & readiness probe commands (in templates/statefulset.yaml).

The idea is to ensure that the Erlang VM uses only 1 scheduler (and not many more) for the "CTL" commands, i.e. the CLI tools like rabbitmq-diagnostics that are used for the probes.
The high CPU usage is really caused by the CLI tool's own process, not the broker process (for which we already have the Helm values maxAvailableSchedulers & onlineSchedulers); compare with time rabbitmq-diagnostics -q ping.
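A minimal sketch of the patched liveness command, assuming the exec-style probes sketched earlier in this thread:

# limit the CLI's Erlang VM to a single scheduler
RABBITMQ_CTL_ERL_ARGS="+S 1:1" rabbitmq-diagnostics -q ping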

EDIT: it doesn't solve everything, see my comment below.

@Igor-lkm

Setting RABBITMQ_CTL_ERL_ARGS (and/or maxAvailableSchedulers, onlineSchedulers) did not really work for me.

We can set extra env variables via values.yaml:

extraEnvVars:
  - name: RABBITMQ_CTL_ERL_ARGS
    value: "+S 1:1"

However, as far as I understand, RabbitMQ does not recommend using CLI tools for probes:

“Running the CLI as a probe is a bad idea and certainly contributes to the CPU usage …. we use a TCP check for the readinessProbe … ”

So, something like this seems to fix CPU usage:

customReadinessProbe:
  tcpSocket:
    port: amqp
customLivenessProbe:
  tcpSocket:
    port: amqp

(the liveness probe loses functionality this way, so we might want something better here)

@frivoire

frivoire commented Apr 14, 2023

My above comment was too quick: RABBITMQ_CTL_ERL_ARGS does give an immediate improvement (0.4 CPU -> 0.1 CPU for me), but after a few hours I still end up with high CPU usage (~1 CPU used on a pod) 😭

According to my first analysis, it's the rabbitmq-diagnostics process itself that uses it: one instance of the probe stays running (I found one beam.smp .... /opt/bitnami/rabbitmq/escript/rabbitmq-diagnostics -q check_running process that had been running for more than 24 hours), and I don't know why.

I agree with @Igor-lkm's remark: we should avoid the CLI tool for probes.
I'm probably going to test replacing it with a curl on the API, and I'll share the results.
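For reference, a sketch of what such a curl-based check could look like (port, credentials, and endpoint are assumptions, based on the HTTP probes suggested later in this thread):

# -f makes curl exit non-zero on an HTTP error, which fails the probe
curl -fsS -u user:p4ssw0rd http://localhost:15672/api/health/checks/virtual-hosts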

@pruthvi-itribe

We are also facing this issue. What would be the best solution for it?

@orgads
Contributor Author

orgads commented Apr 16, 2023

Here's an example. The Authorization header contains the base64 encoding of user:password (in this example, user:p4ssw0rd).

customLivenessProbe:
  failureThreshold: 6
  httpGet:
    httpHeaders:
    - name: "Authorization"
      value: "Basic dXNlcjpwNHNzdzByZA=="
    path: "/api/health/checks/virtual-hosts"
    port: 15672
  initialDelaySeconds: 120
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 20
customReadinessProbe:
  failureThreshold: 3
  httpGet:
    httpHeaders:
    - "name": "Authorization"
      value: "Basic dXNlcjpwNHNzdzByZA=="
    path: "/api/health/checks/local-alarms"
    port: 15672
  initialDelaySeconds: 10
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 20
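For reference, the Authorization value above can be generated with:

# base64-encode the example credentials for the Basic auth header
echo -n 'user:p4ssw0rd' | base64
# -> dXNlcjpwNHNzdzByZA==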

@orgads
Contributor Author

orgads commented Apr 16, 2023

And this is the same in Terraform:

resource "random_password" "root" {
  length      = 16
  min_lower   = 1
  min_upper   = 1
  min_numeric = 1
  special     = false
}

resource "random_password" "erlang_cookie" {
  length      = 16
  min_lower   = 1
  min_upper   = 1
  min_numeric = 1
  special     = false
}

locals {
  root-user = {
    username = "Admin"
    password = random_password.root.result
  }
}

resource "helm_release" "rabbitmq" {
  name       = "my-rabbitmq"
  namespace  = "namespace"
  repository = "https://charts.bitnami.com/bitnami"
  chart      = "rabbitmq"
  version    = "11.13.0"
  timeout    = 240

  values = [yamlencode({
    replicaCount = 1
    auth = {
      username     = local.root-user.username
      password     = local.root-user.password
      erlangCookie = random_password.erlang_cookie.result
    }
    customLivenessProbe = {
      httpGet = {
        path = "/api/health/checks/virtual-hosts"
        port = 15672
        httpHeaders = [{
          name  = "Authorization"
          value = "Basic ${base64encode("${local.root-user.username}:${local.root-user.password}")}"
        }]
      }
      initialDelaySeconds = 120
      periodSeconds       = 30
      timeoutSeconds      = 20
      failureThreshold    = 6
      successThreshold    = 1
    }
    customReadinessProbe = {
      httpGet = {
        path = "/api/health/checks/local-alarms"
        port = 15672
        httpHeaders = [{
          name  = "Authorization"
          value = "Basic ${base64encode("${local.root-user.username}:${local.root-user.password}")}"
        }]
      }
      initialDelaySeconds = 10
      periodSeconds       = 30
      timeoutSeconds      = 20
      failureThreshold    = 3
      successThreshold    = 1
    }
  })]
}

@pruthvi-itribe

pruthvi-itribe commented Apr 16, 2023

I actually tried this

customReadinessProbe:
  tcpSocket:
    port: amqp
customLivenessProbe:
  tcpSocket:
    port: amqp
And for the last 12 hours I haven't seen it spike to 100% CPU.

@javsalgar javsalgar reopened this Apr 17, 2023
@github-actions github-actions bot removed the triage Triage is needed label Apr 17, 2023
@carrodher
Member

Thanks for updating this issue and creating the associated PR. The team will review it and provide feedback. Once the PR is merged, this issue will be automatically closed.

orgads added a commit to orgads/charts that referenced this issue Apr 27, 2023
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Reapply of bitnami#11117 and bitnami#11180.

Fixes bitnami#11116.

Signed-off-by: Orgad Shaneh <[email protected]>
(cherry picked from commit 73966c6)
jotamartos added a commit that referenced this issue May 2, 2023
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Reapply of #11117 and #11180.

Fixes #11116.

Signed-off-by: Orgad Shaneh <[email protected]>
(cherry picked from commit 73966c6)

Signed-off-by: Juan José Martos <[email protected]>
Co-authored-by: Juan José Martos <[email protected]>
@github-actions github-actions bot added the solved label May 2, 2023
Yaytay pushed a commit to Yaytay/charts that referenced this issue May 5, 2023
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Reapply of bitnami#11117 and bitnami#11180.

Fixes bitnami#11116.

Signed-off-by: Orgad Shaneh <[email protected]>
(cherry picked from commit 73966c6)

Signed-off-by: Juan José Martos <[email protected]>
Co-authored-by: Juan José Martos <[email protected]>
Mauraza pushed a commit that referenced this issue May 9, 2023
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Reapply of #11117 and #11180.

Fixes #11116.

Signed-off-by: Orgad Shaneh <[email protected]>
(cherry picked from commit 73966c6)

Signed-off-by: Juan José Martos <[email protected]>
Co-authored-by: Juan José Martos <[email protected]>
@github-actions github-actions bot removed the solved label May 11, 2023
@orgads
Contributor Author

orgads commented May 12, 2023

Thanks for reporting. I'll try to look into this next week.

@github-actions github-actions bot removed the solved label May 12, 2023
@Jojoooo1

Jojoooo1 commented May 12, 2023

@orgads I think it was a mix of unexpected events. I went through many upgrades and never had any CPU peak or problem (which is why the upgrade was my first suspect), but downgrading did not actually solve the problem. I then updated customLivenessProbe and customReadinessProbe, since I was using a load definition, and the CPU is now very low even when idle. I will monitor it over the next few hours, but it seems this solved the problem.

@Igor-lkm

Igor-lkm commented May 12, 2023

So my summary:

What causes the high CPU usage is RabbitMQ's CLI tool rabbitmq-diagnostics being run in the liveness and readiness probes.

Example:
the livenessProbe runs rabbitmq-diagnostics -q ping
the readinessProbe runs rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms

RabbitMQ actually does not recommend using CLI tools for probes.

PR #16082 fixed this.

However, it only helps when loadDefinition.enabled is set to false. If it is true, the old CLI-based probes are still used, which causes the CPU usage. If you have loadDefinition.enabled set to true, you need to override the default probes with something that makes sense for you, for example:

customReadinessProbe:
  tcpSocket:
    port: amqp
customLivenessProbe:
  tcpSocket:
    port: amqp

@orgads
Contributor Author

orgads commented May 12, 2023

Great summary @Igor-lkm, thanks!

@ImTemporaryHere

Great summary @Igor-lkm, thanks!

@Dunge

Dunge commented Jul 3, 2024

I've updated from RabbitMQ 3.11.5 to 3.13.3 (chart 11.2.0 to 14.4.4) and noticed my CPU usage tripled. I have loadDefinition.enabled: true because that's the only way I managed to preset users/vhosts/permissions/policies.

Setting the probes to a TCP check on the amqp port seems to fix the CPU usage, but @Igor-lkm above mentions it also loses functionality? That doesn't seem good. I also see that with this snippet alone the probes lose all the other default timing values; should we put them back? Is @orgads' solution with httpGet and headers a better one?

@Dunge

Dunge commented Jul 3, 2024

In fact, just setting the amqp TCP check (which falls back to the default timings) caused my second and third nodes to keep getting killed. @orgads' solution worked.
