This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/rabbitmq] failing probes on disk or memory alarms #8635

Closed
f84anton opened this issue Oct 22, 2018 · 7 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@f84anton
Contributor

Is this a request for help?:


Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Version of Helm and Kubernetes:
kubernetes v1.11, helm v2.11.0

Which chart:
stable/rabbitmq

What happened:

  1. When memory usage reaches the high watermark, the response to the probes looks like this:
$ curl -f --user sas:$RABBITMQ_PASSWORD 127.0.0.1:15672/api/healthchecks/node
{"status":"failed","reason":"resource alarm(s) in effect:[memory]"}

so the probe fails. But RabbitMQ is still working at that point, and cutting off incoming connections does not fix anything.

  2. When the same happens with a disk alarm, RabbitMQ drops connections: consumers can't connect and fetch messages. The container is then killed when the liveness probe fails, but that doesn't help. Actually, after that RabbitMQ can't start at all, because the liveness probe fails again while RabbitMQ is still reading its mnesia data.

What you expected to happen:
RabbitMQ keeps accepting connections even when a disk or memory alarm fires.

How to reproduce it (as minimally and precisely as possible):
values:

rabbitmq:
  clustering:
    address_type: hostname
  configuration: |-
      ## Clustering
      cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
      cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
      cluster_formation.node_cleanup.interval = 10
      cluster_formation.node_cleanup.only_log_warning = true
      cluster_partition_handling = autoheal
      ## queue master locator
      queue_master_locator=min-masters
      ## enable guest user
      loopback_users.guest = false
      ## https://www.rabbitmq.com/memory.html#memsup-usage
      vm_memory_high_watermark.absolute = 3GB
      disk_free_limit.absolute = 1GB
replicas: 1
persistence:
  enabled: true
  storageClass: local-storage
  size: 3Gi
resources:
  limits:
    cpu: 5
    memory: 3Gi
  requests:
    cpu: 5
    memory: 3Gi

You need to fill the PV with durable queue data, or just create files in the PV, to trigger the disk alarm.
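
One quick way to do that is sketched below; the pod name and the data path are assumptions and depend on your release name and the chart's persistence settings:

# Write a ~2GB filler file into the data volume so that free space drops
# below disk_free_limit.absolute (1GB in the values above).
# "rabbitmq-0" and the target path are assumptions; adjust to your release.
kubectl exec rabbitmq-0 -- \
  dd if=/dev/zero of=/opt/bitnami/rabbitmq/var/lib/rabbitmq/filler bs=1M count=2048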

Anything else we need to know:

@tompizmor

@steven-sheehy
Collaborator

This is probably something that needs to be addressed upstream with RabbitMQ. We need a health check endpoint that does not take the disk and memory alarms into account: either a new endpoint, or the existing endpoint with the alarm check made optional via an HTTP parameter. Switching back to rabbitmqctl status is not an option, since it causes high CPU usage and doesn't properly check that RabbitMQ is functioning by listing channels and queues. I've seen scenarios where rabbitmqctl status showed the node as up but I couldn't connect to it until I restarted RabbitMQ.
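
For reference, the two checks being compared are roughly these (the curl line is the one from the report above):

# Management health check used by the current probes; it reports "failed"
# whenever any resource alarm (memory or disk) is in effect on the node:
curl -f --user sas:$RABBITMQ_PASSWORD 127.0.0.1:15672/api/healthchecks/node

# The older approach, rejected above: CPU-heavy, and it can report the node
# as running even when clients cannot actually connect:
rabbitmqctl status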

@stale

stale bot commented Nov 21, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2018
@stale

stale bot commented Dec 5, 2018

This issue is being automatically closed due to inactivity.

@stale stale bot closed this as completed Dec 5, 2018
@thomas-riccardi
Contributor

Has someone started a discussion with upstream rabbitmq about this issue?

@f84anton
Contributor Author

f84anton commented Dec 6, 2018

@thomas-riccardi, I don't think this is a RabbitMQ management plugin issue. It's just misconfigured probes.

@thomas-riccardi
Contributor

thomas-riccardi commented Dec 6, 2018

@f84anton how would you configure the probes then?

The readiness probe should fail when cutting off new incoming connections would help resolve the issue.
For memory and disk alarms this is tricky: the issue might be resolved by adding more consumers, which do need to connect to RabbitMQ. Or the new connections might only come from producers, which would be blocked by the alarm anyway: those connections may indeed be better off not reaching this node.

As for the liveness probe: it should fail when killing the container would help resolve the issue. The memory alarm could indeed be resolved by killing the container, but not the disk alarm (for persistent messages).

Conclusion

  • The memory and disk alarms may be useful as probe failures only in some cases. Whether they affect the probes should thus be a deployment setting.
  • I'm not sure it should stay the default as it is today. Personally I would change the default, given your initial bug report scenario: RabbitMQ already has higher-level mechanisms (producer throttling) to restore a working state; it doesn't need help from Kubernetes.
  • Documentation would be useful in any case.
  • To be able to selectively ignore memory and disk alarms in the probes, we do need new upstream features: an API endpoint like healthchecks/node where we can optionally ignore memory and disk (and more?) alarms.

More generally, what are the RabbitMQ failure modes for which a Kubernetes probe could help?

@thomas-riccardi
Contributor

@f84anton as for your second sub-issue:

When the same happens with a disk alarm, RabbitMQ drops connections: consumers can't connect and fetch messages. The container is then killed when the liveness probe fails, but that doesn't help. Actually, after that RabbitMQ can't start at all, because the liveness probe fails again while RabbitMQ is still reading its mnesia data.

It seems to be a different issue: your node takes a long time to start up, and the default liveness probe configuration is not calibrated for that, so it fails too early.
If that is the case, then you should change the liveness probe configuration (mainly the initialDelaySeconds, which defaults to 120):

initialDelaySeconds: 120
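
For example, with helm, something like the following (a sketch assuming the release is named rabbitmq and that the chart exposes the probe fields under a livenessProbe value, as described above; 300 is an arbitrary example):

# Give a node with a large mnesia directory more time to start before the
# first liveness check; the release name and the 300s value are examples.
helm upgrade rabbitmq stable/rabbitmq --reuse-values \
  --set livenessProbe.initialDelaySeconds=300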
