zookeeper ruok command check occasionally fails #475
Comments
@alfred-landrum thanks a lot for your report, I'm struggling with the same issue. @nishant-yt could you look at this PR?

We have the same problem in our K8s cluster. It would be great to get this PR reviewed and merged. @alfred-landrum thanks for your effort!

Is there any update on when this will be merged?

Same problem here. Is there any update on when this will be merged?
Description
The zookeeper liveness and readiness checks send the zookeeper `ruok` command over `nc`. However, if I just sit in a shell and try running that check manually, I'll very occasionally get a blank response instead of the expected `imok`, even though the service seems fine, and running the command again immediately returns the expected string.
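For reference, a minimal sketch of that probe pattern; the client port 2181 and the exact script shape are assumptions, not the literal probe from the image:

```sh
# Sketch of an nc-based ruok probe (port 2181 assumed).
# A healthy server answers the four-letter word "ruok" with "imok".
OK=$(echo ruok | nc 127.0.0.1 2181)
if [ "$OK" = "imok" ]; then
    exit 0
else
    exit 1
fi
```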
I believe the root issue is that we occasionally see a TCP reset from the zookeeper command socket. From looking at the sequence numbers, I think it receives an ack from the client side after it has already started shutting down the socket state. Even though the `imok` response is in flight, or already in the TCP receive buffers, `nc` doesn't read the socket data and write it to its standard out.
I replaced `nc` with `socat` and then ran a test on the zookeeper instance. I temporarily disabled the liveness and readiness checks and, with a tcpdump running, ran a script that tried the checks repeatedly until they failed. Using just `nc` would fail within a few seconds; using `nc -q 0` (close the write side as soon as possible) would fail within tens of seconds. Using plain `socat` (`echo ruok | socat stdio tcp4:127.0.0.1:2181`) would not fail, though via the tcpdump I could verify that it had seen TCP reset packets.

We're updating our local zookeeper image to install and use socat instead of nc.
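A hypothetical version of the test script described above (the loop shape and port are assumptions):

```sh
# Hypothetical repro loop: run the nc-based check repeatedly and stop
# on the first blank or unexpected response (port 2181 assumed).
i=0
while true; do
    i=$((i + 1))
    OK=$(echo ruok | nc 127.0.0.1 2181)
    if [ "$OK" != "imok" ]; then
        echo "check $i failed: got '$OK'"
        break
    fi
done
```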
Importance
Under a loaded system, this has sometimes caused the liveness probe to fail frequently enough that kubernetes kills the pod, so I think it's important that this be resolved.
Suggestions for an improvement
Update the zookeeper probes to install and use socat instead of nc.
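A sketch of what the updated probe could look like, using the `socat` invocation from the description (the surrounding conditional is an assumption):

```sh
# Sketch of a socat-based probe: in the test above, socat still read
# the in-flight "imok" reply even when a TCP reset followed it.
OK=$(echo ruok | socat stdio tcp4:127.0.0.1:2181)
if [ "$OK" = "imok" ]; then
    exit 0
fi
exit 1
```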