This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/rabbitmq] failing probes on disk or memory alarms #8635

Closed
f84anton opened this issue Oct 22, 2018 · 7 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@f84anton
Contributor

Is this a request for help?:


Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Version of Helm and Kubernetes:
kubernetes v1.11, helm v2.11.0

Which chart:
stable/rabbitmq

What happened:

  1. When memory usage reaches the high watermark, the response to the probes looks like this:
$ curl -f --user sas:$RABBITMQ_PASSWORD 127.0.0.1:15672/api/healthchecks/node
{"status":"failed","reason":"resource alarm(s) in effect:[memory]"}

so the probe fails. But RabbitMQ is still working at that point, and cutting off incoming connections does not fix anything.

  2. When the same happens with a disk alarm, RabbitMQ drops connections: consumers can't connect and fetch messages. The container is then killed when the liveness probe fails, but that doesn't help. Actually, after that RabbitMQ can't start at all, because the liveness probe fails again while RabbitMQ is still reading its mnesia data.

What you expected to happen:
RabbitMQ keeps accepting connections even when a disk or memory alarm fires.

How to reproduce it (as minimally and precisely as possible):
values:

rabbitmq:
  clustering:
    address_type: hostname
  configuration: |-
      ## Clustering
      cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
      cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
      cluster_formation.node_cleanup.interval = 10
      cluster_formation.node_cleanup.only_log_warning = true
      cluster_partition_handling = autoheal
      ## queue master locator
      queue_master_locator=min-masters
      ## enable guest user
      loopback_users.guest = false
      ## https://www.rabbitmq.com/memory.html#memsup-usage
      vm_memory_high_watermark.absolute = 3GB
      disk_free_limit.absolute = 1GB
replicas: 1
persistence:
  enabled: true
  storageClass: local-storage
  size: 3Gi
resources:
  limits:
    cpu: 5
    memory: 3Gi
  requests:
    cpu: 5
    memory: 3Gi

You need to fill the PV with durable queue data, or just create files in the PV, to trigger the disk alarm.
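
One quick way to do that is sketched below; the pod name and the data path are assumptions and depend on your release name and the chart's persistence settings:

# Write a ~2GB filler file into the data volume so that free space drops
# below disk_free_limit.absolute (1GB in the values above).
# "rabbitmq-0" and the target path are assumptions; adjust to your release.
kubectl exec rabbitmq-0 -- \
  dd if=/dev/zero of=/opt/bitnami/rabbitmq/var/lib/rabbitmq/filler bs=1M count=2048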

Anything else we need to know:

@tompizmor

@steven-sheehy
Collaborator

This is probably something that needs to be addressed upstream with RabbitMQ. We need a health check endpoint that does not take the disk and memory alarms into account: either a new endpoint, or the existing endpoint with the alarm check made optional via an HTTP parameter. Switching back to rabbitmqctl status is not an option, since it causes high CPU usage and doesn't properly check that RabbitMQ is functioning by listing channels and queues. I've seen scenarios where rabbitmqctl status showed the node as up but I couldn't connect to it until I restarted RabbitMQ.
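
For reference, the two checks being compared are roughly these (the curl line is the one from the report above):

# Management health check used by the current probes; it reports "failed"
# whenever any resource alarm (memory or disk) is in effect on the node:
curl -f --user sas:$RABBITMQ_PASSWORD 127.0.0.1:15672/api/healthchecks/node

# The older approach, rejected above: CPU-heavy, and it can report the node
# as running even when clients cannot actually connect:
rabbitmqctl status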

@stale

stale bot commented Nov 21, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2018
@stale

stale bot commented Dec 5, 2018

This issue is being automatically closed due to inactivity.

@stale stale bot closed this as completed Dec 5, 2018
@thomas-riccardi
Contributor

Has someone started a discussion with upstream rabbitmq about this issue?

@f84anton
Contributor Author

f84anton commented Dec 6, 2018

@thomas-riccardi, I don't think this is a RabbitMQ management plugin issue. It's just misconfigured probes.

@thomas-riccardi
Contributor

thomas-riccardi commented Dec 6, 2018

@f84anton how would you configure the probes then?

The readiness probe should fail when cutting off new incoming connections would help resolve the issue.
For memory and disk alarms this is tricky: the issue might be resolved by adding more consumers, which do need to connect to RabbitMQ. Or the new connections might only come from producers, which would be blocked by the alarm anyway: those connections may indeed be better off not reaching this node.

As for the liveness probe: it should fail when killing the container would help resolve the issue. The memory alarm could indeed be resolved by killing the container, but not the disk alarm (for persistent messages).

Conclusion

  • The memory and disk alarms may be useful as probe failures only in some cases. Whether they affect the probes should thus be a deployment setting.
  • I'm not sure it should stay the default as it is today. Personally I would change the default, given your initial bug report scenario: RabbitMQ already has higher-level mechanisms (producer throttling) to restore a working state; it doesn't need help from Kubernetes.
  • Documentation would be useful in any case.
  • To be able to selectively ignore memory and disk alarms in the probes, we do need new upstream features: an API endpoint like healthchecks/node where we can optionally ignore memory and disk (and more?) alarms.

More generally, what are the RabbitMQ failure modes for which a Kubernetes probe could help?

@thomas-riccardi
Contributor

@f84anton as for your second sub-issue:

When the same happens with a disk alarm, RabbitMQ drops connections: consumers can't connect and fetch messages. The container is then killed when the liveness probe fails, but that doesn't help. Actually, after that RabbitMQ can't start at all, because the liveness probe fails again while RabbitMQ is still reading its mnesia data.

It seems to be a different issue: your node takes a long time to start up, and the default liveness probe configuration is not calibrated for that, so it fails too early.
If that is the case, then you should change the liveness probe configuration (mainly the initialDelaySeconds, which defaults to 120):

initialDelaySeconds: 120
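
For example, with helm, something like the following (a sketch assuming the release is named rabbitmq and that the chart exposes the probe fields under a livenessProbe value, as described above; 300 is an arbitrary example):

# Give a node with a large mnesia directory more time to start before the
# first liveness check; the release name and the 300s value are examples.
helm upgrade rabbitmq stable/rabbitmq --reuse-values \
  --set livenessProbe.initialDelaySeconds=300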
