[bitnami/rabbitmq] RabbitMQ high CPU usage while idle #11116

Closed
orgads opened this issue Jul 10, 2022 · 21 comments · Fixed by #11117 or #16082

@orgads
Contributor

orgads commented Jul 10, 2022

Name and Version

bitnami/rabbitmq 10.1.11

What steps will reproduce the bug?

Just run it and watch the CPU usage.
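For reference, a minimal reproduction might look like this (chart version as above; the pod label selector is an assumption based on the chart's standard labels):

# install the chart with default values and watch idle CPU
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-rabbitmq bitnami/rabbitmq --version 10.1.11
kubectl top pods -l app.kubernetes.io/name=rabbitmq   # requires metrics-server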

What do you see instead?

High CPU usage while idle

Additional information

The liveness and readiness probes each spawn a separate Erlang process, which is very expensive.

This was reported (and fixed) in helm/charts#3855 but not to bitnami.
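For context, a sketch of what the chart's exec-style probes look like (commands taken from the summary later in this thread; the exact template wording varies between chart versions):

# sketch of the default exec probes (simplified)
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -ec
      - rabbitmq-diagnostics -q ping
readinessProbe:
  exec:
    command:
      - /bin/bash
      - -ec
      - rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms

Each invocation starts a fresh Erlang VM just to run the check, which is where the idle CPU goes.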

@javsalgar
Contributor

Hi,

It looks to me like this is more related to RabbitMQ itself, so the upstream developers should recommend which settings suit best to avoid the issue (in case it is only a matter of settings and not a bug).

fmulero pushed a commit that referenced this issue Jul 12, 2022
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Fixes #11116

Signed-off-by: Orgad Shaneh <[email protected]>
vaggeliskls pushed a commit to vaggeliskls/charts that referenced this issue Jul 21, 2022
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Fixes bitnami#11116

Signed-off-by: Orgad Shaneh <[email protected]>
Signed-off-by: vaggeliskls <[email protected]>
FraPazGal pushed a commit that referenced this issue Jul 25, 2022
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Fixes #11116

Signed-off-by: Orgad Shaneh <[email protected]>
@Igor-lkm

Igor-lkm commented Apr 4, 2023

Same issue here.

Environment: GCP GKE: 1.25.7-gke.1000

Chart versions: bitnami/rabbitmq 10.1.5 and 11.12.2

After a restart it works fine for some time, and then CPU goes to 100% of what is available:

[Screenshot 2023-04-04 at 09:54:59: CPU usage graph reaching 100%]

Debug output from the node with the issue:

> top

Tasks:  11 total,   1 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s): 34.2 us, 15.7 sy,  0.0 ni, 47.4 id,  0.7 wa,  0.0 hi,  2.0 si,  0.0 st
MiB Mem :   7450.7 total,    324.0 free,   2586.8 used,   4540.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   4515.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  46152 1001      20   0 2118592  56728  22068 S  51.1   0.7 257:17.79 beam.smp
    163 1001      20   0 2250684 147216  71152 S   8.5   1.9  63:48.97 beam.smp
      1 1001      20   0    2508   1664   1496 S   0.0   0.0   0:00.44 rabbitmq-server
     95 1001      20   0    5940   3368   1180 S   0.0   0.0   0:01.95 epmd
    169 1001      20   0    2396   1336   1232 S   0.0   0.0   0:00.27 erl_child_setup
    221 1001      20   0    3740    920    812 S   0.0   0.0   0:00.37 inet_gethost
    222 1001      20   0    3956   1896   1728 S   0.0   0.0   0:00.61 inet_gethost
    225 1001      20   0    2508    560    448 S   0.0   0.0   0:01.19 sh
  46166 1001      20   0    2396    592    500 S   0.0   0.0   0:00.00 erl_child_setup
 133259 1001      20   0    2508    552    448 S   0.0   0.0   0:00.01 sh

So beam.smp takes 50+% CPU on the affected node (which is the full requested CPU resource).

Healthy node shows:

> top

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    164 1001      20   0 2271084 168556  71148 S   7.0   2.2  67:11.16 beam.smp
 141808 1001      20   0    9996   3804   3300 R   0.3   0.0   0:00.02 top
...
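To confirm which beam.smp belongs to a probe-spawned CLI rather than the broker, one could inspect its command line (the PID is taken from the top output above; the escript path is the one @frivoire reports later in this thread):

# print the full command line of the suspicious beam.smp process
cat /proc/46152/cmdline | tr '\0' ' '; echo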

Yaml:

replicaCount: 3
resources:
  requests:
    cpu: 500m
    memory: 700Mi
  limits:
    cpu: 600m
    memory: 900Mi
...
metrics:
  enabled: true
clustering:
  forceBoot: true
readinessProbe:
  periodSeconds: 60
  timeoutSeconds: 40
livenessProbe:
  periodSeconds: 60
  timeoutSeconds: 40

@github-actions github-actions bot added triage Triage is needed and removed solved labels Apr 4, 2023
@javsalgar javsalgar changed the title RabbitMQ high CPU usage while idle [bitnami/rabbitmq] RabbitMQ high CPU usage while idle Apr 4, 2023
@frivoire

frivoire commented Apr 11, 2023

I have observed this unusual CPU usage on chart v10.3.9 (RabbitMQ 3.10.20) running on GKE: ~0.4 CPU used per pod while idle.
For me it starts right after pod startup (without any delay), so it is fully reproducible.

At least for now, here is my workaround:
=> patch the chart to add RABBITMQ_CTL_ERL_ARGS="+S 1:1" to the liveness & readiness probe commands (in templates/statefulset.yaml).

The idea is to ensure that the Erlang VM uses only 1 scheduler (and not many more) for the "CTL" commands, i.e. the CLI tools like rabbitmq-diagnostics that are used for the probes.
The high CPU usage is really caused by the CLI tool's own process, not the broker process (for which we already have the Helm values maxAvailableSchedulers & onlineSchedulers); compare with time rabbitmq-diagnostics -q ping.
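A minimal sketch of the patched liveness command, assuming the exec-style probes sketched earlier in this thread:

# limit the CLI's Erlang VM to a single scheduler
RABBITMQ_CTL_ERL_ARGS="+S 1:1" rabbitmq-diagnostics -q ping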

EDIT: it doesn't solve everything, see my comment below.

@Igor-lkm

Setting RABBITMQ_CTL_ERL_ARGS (and/or maxAvailableSchedulers, onlineSchedulers) did not really work for me.

We can set extra env variables via values.yaml:

extraEnvVars:
  - name: RABBITMQ_CTL_ERL_ARGS
    value: "+S 1:1"

However, as far as I understand, RabbitMQ does not recommend using CLI tools for probes:

“Running the CLI as a probe is a bad idea and certainly contributes to the CPU usage …. we use a TCP check for the readinessProbe … ”

So, something like this seems to fix CPU usage:

customReadinessProbe:
  tcpSocket:
    port: amqp
customLivenessProbe:
  tcpSocket:
    port: amqp

(the liveness probe loses functionality this way, so we might want something better here)

@frivoire

frivoire commented Apr 14, 2023

My above comment was too quick: RABBITMQ_CTL_ERL_ARGS does give an immediate improvement (0.4 CPU -> 0.1 CPU for me), but after a few hours I still end up with high CPU usage (~1 CPU used on a pod) 😭

According to my first analysis, it's the rabbitmq-diagnostics process itself that uses it: one instance of the probe stays running (I found one beam.smp .... /opt/bitnami/rabbitmq/escript/rabbitmq-diagnostics -q check_running process that had been running for more than 24 hours), and I don't know why.

I agree with @Igor-lkm's remark: we should avoid the CLI tool for probes.
I'm probably going to test replacing it with a curl on the API, and I'll share the results.
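For reference, a sketch of what such a curl-based check could look like (port, credentials, and endpoint are assumptions, based on the HTTP probes suggested later in this thread):

# -f makes curl exit non-zero on an HTTP error, which fails the probe
curl -fsS -u user:p4ssw0rd http://localhost:15672/api/health/checks/virtual-hosts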

@pruthvi-itribe

We are also facing this issue. What would be the best solution for it?

@orgads
Contributor Author

orgads commented Apr 16, 2023

Here's an example. The Authorization header contains the base64 encoding of user:password (in this example, user:p4ssw0rd).

customLivenessProbe:
  failureThreshold: 6
  httpGet:
    httpHeaders:
    - name: "Authorization"
      value: "Basic dXNlcjpwNHNzdzByZA=="
    path: "/api/health/checks/virtual-hosts"
    port: 15672
  initialDelaySeconds: 120
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 20
customReadinessProbe:
  failureThreshold: 3
  httpGet:
    httpHeaders:
    - "name": "Authorization"
      value: "Basic dXNlcjpwNHNzdzByZA=="
    path: "/api/health/checks/local-alarms"
    port: 15672
  initialDelaySeconds: 10
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 20
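For reference, the Authorization value above can be generated with:

# base64-encode the example credentials for the Basic auth header
echo -n 'user:p4ssw0rd' | base64
# -> dXNlcjpwNHNzdzByZA==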

@orgads
Contributor Author

orgads commented Apr 16, 2023

And this is the same in Terraform:

resource "random_password" "root" {
  length      = 16
  min_lower   = 1
  min_upper   = 1
  min_numeric = 1
  special     = false
}

resource "random_password" "erlang_cookie" {
  length      = 16
  min_lower   = 1
  min_upper   = 1
  min_numeric = 1
  special     = false
}

locals {
  root-user = {
    username = "Admin"
    password = random_password.root.result
  }
}

resource "helm_release" "rabbitmq" {
  name       = "my-rabbitmq"
  namespace  = "namespace"
  repository = "https://charts.bitnami.com/bitnami"
  chart      = "rabbitmq"
  version    = "11.13.0"
  timeout    = 240

  values = [yamlencode({
    replicaCount = 1
    auth = {
      username     = local.root-user.username
      password     = local.root-user.password
      erlangCookie = random_password.erlang_cookie.result
    }
    customLivenessProbe = {
      httpGet = {
        path = "/api/health/checks/virtual-hosts"
        port = 15672
        httpHeaders = [{
          name  = "Authorization"
          value = "Basic ${base64encode("${local.root-user.username}:${local.root-user.password}")}"
        }]
      }
      initialDelaySeconds = 120
      periodSeconds       = 30
      timeoutSeconds      = 20
      failureThreshold    = 6
      successThreshold    = 1
    }
    customReadinessProbe = {
      httpGet = {
        path = "/api/health/checks/local-alarms"
        port = 15672
        httpHeaders = [{
          name  = "Authorization"
          value = "Basic ${base64encode("${local.root-user.username}:${local.root-user.password}")}"
        }]
      }
      initialDelaySeconds = 10
      periodSeconds       = 30
      timeoutSeconds      = 20
      failureThreshold    = 3
      successThreshold    = 1
    }
  })]
}

@pruthvi-itribe

pruthvi-itribe commented Apr 16, 2023

I actually tried this

customReadinessProbe:
  tcpSocket:
    port: amqp
customLivenessProbe:
  tcpSocket:
    port: amqp
And for the last 12 hours I haven't seen it spike to 100% CPU.

@javsalgar javsalgar reopened this Apr 17, 2023
@github-actions github-actions bot removed the triage Triage is needed label Apr 17, 2023
@carrodher
Member

Thanks for updating this issue and creating the associated PR. The team will review it and provide feedback. Once the PR is merged, this issue will be automatically closed.

orgads added a commit to orgads/charts that referenced this issue Apr 27, 2023
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Reapply of bitnami#11117 and bitnami#11180.

Fixes bitnami#11116.

Signed-off-by: Orgad Shaneh <[email protected]>
(cherry picked from commit 73966c6)
jotamartos added a commit that referenced this issue May 2, 2023
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Reapply of #11117 and #11180.

Fixes #11116.

Signed-off-by: Orgad Shaneh <[email protected]>
(cherry picked from commit 73966c6)

Signed-off-by: Juan José Martos <[email protected]>
Co-authored-by: Juan José Martos <[email protected]>
@github-actions github-actions bot added the solved label May 2, 2023
Yaytay pushed a commit to Yaytay/charts that referenced this issue May 5, 2023
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Reapply of bitnami#11117 and bitnami#11180.

Fixes bitnami#11116.

Signed-off-by: Orgad Shaneh <[email protected]>
(cherry picked from commit 73966c6)

Signed-off-by: Juan José Martos <[email protected]>
Co-authored-by: Juan José Martos <[email protected]>
Mauraza pushed a commit that referenced this issue May 9, 2023
Use REST APIs for liveness/readiness probes, instead of spawning
expensive erlang processes.

Reapply of #11117 and #11180.

Fixes #11116.

Signed-off-by: Orgad Shaneh <[email protected]>
(cherry picked from commit 73966c6)

Signed-off-by: Juan José Martos <[email protected]>
Co-authored-by: Juan José Martos <[email protected]>
@github-actions github-actions bot removed the solved label May 11, 2023
@orgads
Contributor Author

orgads commented May 12, 2023

Thanks for reporting. I'll try to look into this next week.

@github-actions github-actions bot removed the solved label May 12, 2023
@Jojoooo1

Jojoooo1 commented May 12, 2023

@orgads I think it was a mix of unexpected events. I went through many upgrades and never had any CPU peak or problem (which is why the upgrade was my first suspect), but downgrading did not actually solve the problem. I then updated customLivenessProbe and customReadinessProbe, since I was using a load definition, and the CPU is now very low even when idle. I will monitor it over the next few hours, but it seems this solved the problem.

@Igor-lkm

Igor-lkm commented May 12, 2023

So my summary:

What causes the high CPU usage is RabbitMQ's CLI tool rabbitmq-diagnostics being run in the liveness and readiness probes.

Example:
the livenessProbe runs rabbitmq-diagnostics -q ping
the readinessProbe runs rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms

RabbitMQ actually does not recommend using CLI tools for probes.

PR #16082 fixed this.

However, it only helps when loadDefinition.enabled is set to false. If it is true, the old CLI-based probes are still used, which causes the CPU usage. If you have loadDefinition.enabled set to true, you need to override the default probes with something that makes sense for you, for example:

customReadinessProbe:
  tcpSocket:
    port: amqp
customLivenessProbe:
  tcpSocket:
    port: amqp

@orgads
Contributor Author

orgads commented May 12, 2023

Great summary @Igor-lkm, thanks!

@ImTemporaryHere

Great summary @Igor-lkm, thanks!

@Dunge

Dunge commented Jul 3, 2024

I've updated from RabbitMQ 3.11.5 to 3.13.3 (chart 11.2.0 to 14.4.4) and noticed my CPU usage tripled. I have loadDefinition.enabled: true because that's the only way I managed to preset users/vhosts/permissions/policies.

Setting the probes to a TCP check on the amqp port seems to fix the CPU usage, but @Igor-lkm above mentions it also loses functionality? That doesn't seem good. I also see that with this snippet alone the probes lose all the other default timing values; should we put them back? Is @orgads' solution with httpGet and headers a better one?

@Dunge

Dunge commented Jul 3, 2024

In fact, just setting the amqp TCP check (which falls back to the default timings) caused my second and third nodes to keep getting killed. @orgads' solution worked.
