Add livez and readyz for etcd #16651

siyuanfoundation · 2023-09-25T23:02:21Z

This is a prototype for adding livez/readyz support to etcd (design doc).
This PR sets up the general structure for livez/readyz checks, with only two simple checks implemented.

siyuanfoundation · 2023-09-25T23:06:26Z

Tested against a local server:
curl '127.0.0.1:2379/readyz?verbose'
output:

[+]ping ok
[+]serializable_read ok
[+]data_corruption ok
[+]defrag_active ok
readyz check passed

curl '127.0.0.1:2379/livez?verbose'
output:

[+]ping ok
[+]serializable_read ok
livez check passed
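
For readers who want to script the same check, the curl calls above can be approximated from Go. This is a minimal sketch, not code from the PR: the `probe` helper and the stub server are hypothetical, and treating anything other than 200 OK as unhealthy is an assumption about how a caller might interpret the response.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// probe requests <baseURL>/<endpoint>?verbose and reports whether the server
// answered 200 OK, together with the response body.
func probe(client *http.Client, baseURL, endpoint string) (bool, string, error) {
	resp, err := client.Get(baseURL + "/" + endpoint + "?verbose")
	if err != nil {
		return false, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, "", err
	}
	return resp.StatusCode == http.StatusOK, string(body), nil
}

// stubServer stands in for an etcd member so the sketch runs without one.
func stubServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "[+]ping ok\n[+]serializable_read ok\nlivez check passed\n")
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		// Pretend readiness fails; this failure body is illustrative only.
		http.Error(w, "readyz check failed", http.StatusServiceUnavailable)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := stubServer()
	defer srv.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	for _, endpoint := range []string{"livez", "readyz"} {
		ok, body, err := probe(client, srv.URL, endpoint)
		if err != nil {
			fmt.Println(endpoint, "probe error:", err)
			continue
		}
		fmt.Printf("%s healthy=%v\n%s", endpoint, ok, body)
	}
}
```

Against a real member, `srv.URL` would be replaced with the client URL used above, for example http://127.0.0.1:2379.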

tjungblu · 2023-09-26T16:18:14Z

Cool stuff, and thanks for the PR. I was wondering why you're adding this to the gRPC API? We already have a health handler on the http server: https://github.com/etcd-io/etcd/blob/main/server/embed/etcd.go#L746

Just FYI, we had huge issues in the past with zombie processes when running etcdctl commands as health probes in OpenShift. We're much happier with just letting kubelet hit an http endpoint instead.

server/etcdserver/healthz_test.go

server/etcdserver/healthz.go

chaochn47 · 2023-09-26T16:47:27Z

We already have a health handler on the http server: https://github.com/etcd-io/etcd/blob/main/server/embed/etcd.go#L746

I think we can start from the http endpoint. I don't think backporting gRPC changes to release-3.5 is feasible or would be accepted. It would change the customer-facing API, even if it is just the maintenance service.

siyuanfoundation · 2023-09-26T23:32:13Z

Removed the gRPC change.

server/embed/etcd.go

wenjiaswe · 2023-09-27T21:38:52Z

cc @ahrtr @mitake @jmhbnz @ptabor @lavacat @fuweid

server/etcdserver/api/v3health/healthz.go

server/etcdserver/api/etcdhttp/readiness.go

fuweid

LGTM

Nice work!

server/etcdserver/api/etcdhttp/health.go

ahrtr · 2023-10-13T04:57:22Z

Please rebase this PR and resolve the remaining comment, thx

serathius · 2023-10-13T10:05:44Z

server/etcdserver/api/etcdhttp/health.go

+		if _, found := r.URL.Query()["verbose"]; found {
+			fmt.Fprint(w, h.Reason)
+		}
+		fmt.Fprint(w, "ok\n")


Any reason to implement verbose for the per-check endpoints?

The per-check endpoints still print the detailed error message in verbose mode, and no details otherwise. In theory, the user can get all the details from the root path with verbose, but I think it still makes sense to follow the same paradigm.

server/etcdserver/api/etcdhttp/health_test.go

siyuanfoundation · 2023-10-13T17:04:01Z

@ahrtr Rebased the PR. Please review. Thanks!

logicalhan

/lgtm

server/etcdserver/api/etcdhttp/health.go

Add two separate probes, one for liveness and one for readiness. The liveness probe checks that the local individual node is up and running, and the node is restarted if it fails, while the readiness probe checks that the cluster is ready to serve traffic. This would make etcd health checks fully Kubernetes API compliant.

Signed-off-by: Siyuan Zhang <[email protected]>
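
To make the liveness/readiness split concrete, here is a small sketch of two check sets where readiness is a superset of liveness. The check names come from the test output earlier in this thread. The `check`, `member`, and `evaluate` names are hypothetical, and the rule that `defrag_active` fails while a defragmentation is running is an assumption, not something confirmed by the merged code.

```go
package main

import (
	"errors"
	"fmt"
)

// check pairs a name with a function that returns nil when healthy.
type check struct {
	name string
	run  func() error
}

// evaluate runs every check and returns the names of those that failed.
func evaluate(checks []check) []string {
	var failed []string
	for _, c := range checks {
		if err := c.run(); err != nil {
			failed = append(failed, c.name)
		}
	}
	return failed
}

// member holds illustrative state for one node.
type member struct {
	defragActive bool
}

// livezChecks covers only what would justify restarting the local node.
func (m *member) livezChecks() []check {
	return []check{
		{"ping", func() error { return nil }},
		{"serializable_read", func() error { return nil }},
	}
}

// readyzChecks adds conditions under which the node should not take traffic,
// even though restarting it would not help.
func (m *member) readyzChecks() []check {
	return append(m.livezChecks(),
		check{"data_corruption", func() error { return nil }},
		check{"defrag_active", func() error {
			if m.defragActive {
				return errors.New("defragmentation in progress")
			}
			return nil
		}},
	)
}

func main() {
	m := &member{defragActive: true}
	fmt.Println("livez failed checks:", evaluate(m.livezChecks()))
	fmt.Println("readyz failed checks:", evaluate(m.readyzChecks()))
}
```

With `defragActive` set, the member stays live but is not ready, which is the situation where a single combined health endpoint would force a choice between an unnecessary restart and serving traffic too early.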

ahrtr

LGTM

Thanks for your first contribution, and great work!

Please also add a follow-up item to update the docs in etcd-io/website.

server/etcdserver/api/etcdhttp/health.go

Question about serializable read

serathius · 2023-10-17T11:07:13Z

Please note that this PR doesn't implement the full design as it was presented. Livez works correctly; however, we are missing checks for readyz, which makes it not very useful. Please see 80ab2ad

I think this is ok to merge; however, until we finish implementing readyz, we shouldn't backport or document the new endpoints.

siyuanfoundation marked this pull request as draft September 25, 2023 23:02

chaochn47 reviewed Sep 26, 2023

server/etcdserver/healthz_test.go (outdated)

server/etcdserver/healthz.go (outdated)

siyuanfoundation force-pushed the livez-pr branch 2 times, most recently from ae3789a to 0705f0e Compare September 26, 2023 23:22

serathius reviewed Sep 27, 2023

server/embed/etcd.go (outdated)

siyuanfoundation force-pushed the livez-pr branch 3 times, most recently from f5d4788 to 8abcec0 Compare September 27, 2023 21:08

siyuanfoundation marked this pull request as ready for review September 27, 2023 21:10

siyuanfoundation requested a review from chaochn47 September 27, 2023 22:01

serathius reviewed Sep 28, 2023