zookeeper ruok command check occasionally fails #475

Closed
alfred-landrum opened this issue May 12, 2022 · 4 comments · Fixed by #476

@alfred-landrum
Contributor

Description

The zookeeper liveness and readiness checks use the zookeeper ruok command like so:

OK=$(echo ruok | nc 127.0.0.1 $CLIENT_PORT)

# Check to see if zookeeper service answers
if [[ "$OK" == "imok" ]]; then
...
fi

However, if I sit in a shell and run that manually, I'll very occasionally get a blank response instead of the expected imok, even though the service seems fine; running the command again immediately returns the expected string.

I believe the root issue is that we occasionally see a TCP reset from the zookeeper command socket:
[screenshot: tcpdump capture showing the TCP reset from the zookeeper command socket]

From looking at the sequence numbers, I think the server receives an ACK from the client side after it has already started tearing down the socket state. Even though the imok response is in flight, or already sitting in the TCP receive buffer, nc doesn't read the socket data and write it to its standard out.

I replaced nc with socat and then ran a test on the zookeeper instance. I temporarily disabled the liveness and readiness checks and, with a tcpdump running, ran a script that repeated the check until it failed. Plain nc would fail within a few seconds; nc -q 0 (close the write side as soon as possible) would fail within tens of seconds. Plain socat (echo ruok | socat stdio tcp4:127.0.0.1:2181) never failed, even though the tcpdump showed that it too had seen TCP reset packets.
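
A minimal sketch of that kind of repro loop (the actual script isn't attached here; it assumes the default client port 2181, and you can swap in nc -q 0 or socat to compare):

#!/bin/bash
# Sketch only: repeat the ruok check until it fails, counting attempts.
# Assumes the ZooKeeper client port is 2181.
CLIENT_PORT=2181
attempt=0
while true; do
  attempt=$((attempt + 1))
  OK=$(echo ruok | nc 127.0.0.1 $CLIENT_PORT)
  if [[ "$OK" != "imok" ]]; then
    echo "check failed on attempt $attempt, got: '$OK'"
    break
  fi
done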

We're updating our local zookeeper image to install and use socat instead of nc.

Importance

Under a loaded system, this has sometimes caused the liveness probe to fail frequently enough for kubernetes to kill the pod, so I think it's important that this be resolved.

Suggestions for an improvement

Update the zookeeper probes to install and use socat instead of nc.
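
For illustration, a sketch of what the updated check could look like (the actual change is in #476; this assumes socat is installed in the image, that $CLIENT_PORT is set as before, and uses plain exit codes in place of the rest of the probe script):

# Sketch: same check as before, but using socat instead of nc.
OK=$(echo ruok | socat stdio tcp4:127.0.0.1:$CLIENT_PORT)

# Check to see if zookeeper service answers
if [[ "$OK" == "imok" ]]; then
  exit 0
else
  exit 1
fi

In the tests above, socat still saw occasional TCP resets but always delivered the imok response, so the check no longer comes back blank.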

Slach commented May 17, 2022

@alfred-landrum thanks a lot for your report, I'm struggling with the same issue.
I hope your PR will be merged and included in the next release.

@nishant-yt could you take a look at this PR?

@karlwowza

We have the same problem in our K8s cluster. It would be great to get this PR reviewed and merged. @alfred-landrum thanks for your effort!

@dariothornhill

Is there any update on when this will be merged?

@mailsonsantana

Same problem here; is there any update on when this will be merged?
