zookeeper ruok command check occasionally fails #475

Closed
alfred-landrum opened this issue May 12, 2022 · 4 comments · Fixed by #476

@alfred-landrum
Contributor

Description

The zookeeper liveness and readiness checks use the zookeeper ruok command like so:

OK=$(echo ruok | nc 127.0.0.1 $CLIENT_PORT)

# Check to see if zookeeper service answers
if [[ "$OK" == "imok" ]]; then
...
fi

However, if I sit in a shell and run that manually, I'll very occasionally get a blank response instead of the expected imok, even though the service seems fine; running the command again immediately returns the expected string.

I believe the root issue is that we occasionally see a TCP reset from the zookeeper command socket:
[screenshot: tcpdump capture showing the TCP reset from the zookeeper command socket]

From looking at the sequence numbers, I think the server receives an ACK from the client side after it has already started tearing down the socket state. Even though the imok response is in flight, or already sitting in the TCP receive buffer, nc doesn't read the socket data and write it to its standard out.

I replaced nc with socat and then ran a test on the zookeeper instance. I temporarily disabled the liveness and readiness checks and, with a tcpdump running, ran a script that repeated the check until it failed. Plain nc would fail within a few seconds; nc -q 0 (close the write side as soon as possible) would fail within tens of seconds. Plain socat (echo ruok | socat stdio tcp4:127.0.0.1:2181) never failed, even though the tcpdump showed that it too had seen TCP reset packets.
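
A minimal sketch of that kind of repro loop (the actual script isn't attached here; it assumes the default client port 2181, and you can swap in nc -q 0 or socat to compare):

#!/bin/bash
# Sketch only: repeat the ruok check until it fails, counting attempts.
# Assumes the ZooKeeper client port is 2181.
CLIENT_PORT=2181
attempt=0
while true; do
  attempt=$((attempt + 1))
  OK=$(echo ruok | nc 127.0.0.1 $CLIENT_PORT)
  if [[ "$OK" != "imok" ]]; then
    echo "check failed on attempt $attempt, got: '$OK'"
    break
  fi
done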

We're updating our local zookeeper image to install and use socat instead of nc.

Importance

Under a loaded system, this has sometimes caused the liveness probe to fail frequently enough for kubernetes to kill the pod, so I think it's important that this be resolved.

Suggestions for an improvement

Update the zookeeper probes to install and use socat instead of nc.
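
For illustration, a sketch of what the updated check could look like (the actual change is in #476; this assumes socat is installed in the image, that $CLIENT_PORT is set as before, and uses plain exit codes in place of the rest of the probe script):

# Sketch: same check as before, but using socat instead of nc.
OK=$(echo ruok | socat stdio tcp4:127.0.0.1:$CLIENT_PORT)

# Check to see if zookeeper service answers
if [[ "$OK" == "imok" ]]; then
  exit 0
else
  exit 1
fi

In the tests above, socat still saw occasional TCP resets but always delivered the imok response, so the check no longer comes back blank.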

Slach commented May 17, 2022

@alfred-landrum thanks a lot for your report, I'm struggling with the same issue.
I hope your PR will be merged and included in the next release.

@nishant-yt could you take a look at this PR?

@karlwowza

We have the same problem in our K8s cluster. It would be great to get this PR reviewed and merged. @alfred-landrum thanks for your effort!

@dariothornhill

Is there any update on when this will be merged?

@mailsonsantana

Same problem here; is there any update on when this will be merged?
