
[server] Add a liveness probe for event loop lag #13203

Merged: 5 commits into main from af/server-liveness-probe on Sep 22, 2022

Conversation

@andrew-farries (Contributor) commented Sep 22, 2022

Description

Add a liveness probe for server.

The probe fails iff the Node.js event loop lag (as reported by a Prometheus metric) exceeds the value set in the server config.

Context:

We have an alert and corresponding runbook which fires when the event loop lag exceeds a certain threshold. The runbook simply advises the operator to restart the affected pods, something that should not require manual intervention.
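
For orientation, here is a minimal sketch of what the Kubernetes side of such a probe looks like. This is not the PR's exact installer code: the Go type usage and the port are assumptions, while the /api/live path and the timing values are taken from elsewhere in this PR.

    // Illustrative sketch only, not the PR's actual installer code.
    // A Kubernetes liveness probe that polls the server's /api/live endpoint.
    package server

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    func livenessProbe() *corev1.Probe {
        return &corev1.Probe{
            ProbeHandler: corev1.ProbeHandler{ // named "Handler" in older k8s.io/api versions
                HTTPGet: &corev1.HTTPGetAction{
                    Path:   "/api/live",
                    Port:   intstr.FromInt(3000), // assumed container port, not taken from this PR
                    Scheme: corev1.URISchemeHTTP,
                },
            },
            // The timing values below match the diff discussed later in this thread.
            InitialDelaySeconds: 120,
            PeriodSeconds:       10,
            FailureThreshold:    6,
        }
    }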

Related Issue(s)

How to test

  1. Edit the server-config configmap to change the maximumEventLoopLag setting (see the sketch after this list).
  2. Restart the server pod after changing the configmap.
  3. Hit <preview-url>/api/live to see the current event loop lag, and watch the response code change according to the current value of maximumEventLoopLag.
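
For reference, a rough sketch of how that maximumEventLoopLag configmap key relates to the installer's Go config. The struct and field names below are assumptions for illustration; the actual change appears in the diff further down, where the installer sets the value to 0.35.

    // Illustrative sketch only; struct and field names are assumptions,
    // not the PR's actual code.
    package config

    type ConfigSerialized struct {
        // ... other server settings ...

        // Serializes to the `maximumEventLoopLag` key in the server-config
        // configmap; the liveness probe fails once the reported event loop
        // lag exceeds this value (set to 0.35 in this PR).
        MaximumEventLoopLag float64 `json:"maximumEventLoopLag"`
    }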

Release Notes

[server]: Add a liveness probe which fails when the Node.js event loop lag exceeds a certain threshold

Documentation

Werft options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-integration-tests=webapp
    Valid options are all, workspace, webapp, ide

@werft-gitpod-dev-com started the job as gitpod-build-af-server-liveness-probe.8 because the annotations in the pull request description changed (with .werft/ from main)

@github-actions bot added the team: webapp (Issue belongs to the WebApp team) label on Sep 22, 2022
@easyCZ (Member) left a comment:

Great! Like it.

It does have the potential to effectively take all instances out of the deployment pool, however. If on startup the value spikes and the liveness probe fails, it will not consider the instance ready. That can put more load on the other instances and cause a cascading failure.

@@ -236,6 +236,7 @@ func configmap(ctx *common.RenderContext) ([]runtime.Object, error) {
ContentServiceAddr: net.JoinHostPort(fmt.Sprintf("%s.%s.svc.cluster.local", contentservice.Component, ctx.Namespace), strconv.Itoa(contentservice.RPCPort)),
ImageBuilderAddr: net.JoinHostPort(fmt.Sprintf("%s.%s.svc.cluster.local", common.ImageBuilderComponent, ctx.Namespace), strconv.Itoa(common.ImageBuilderRPCPort)),
UsageServiceAddr: net.JoinHostPort(fmt.Sprintf("%s.%s.svc.cluster.local", usage.Component, ctx.Namespace), strconv.Itoa(usage.GRPCServicePort)),
MaximumEventLoopLag: 0.35,
Member commented:

This would cause even self-hosted instances to restart if it reached this value. Are we happy with that? Should we document that?

@andrew-farries (Contributor, Author) replied:

Yes, and I think it's desirable that this applies to SaaS and self-hosted.

The only point of contention is what this value should be set to, and whether it should be the same in self-hosted vs SaaS installations. This value is taken from the AlertManager, i.e. it's set to the value that currently triggers a page to the on-caller. I think it's appropriate to leave it the same for all installations right now.

I don't think any extra documentation beyond the changelog is necessary.

Member replied:

Thanks!

Andrew Farries added 4 commits on September 22, 2022 at 10:53:

  • The probe fails iff the nodejs event loop lag (as reported by a prometheus metric) exceeds the value set in server config.
  • Hard code the server setting added in the parent commit. If necessary, this could become configurable (via an experimental config setting).
  • Add a liveness probe for server that fails if the nodejs event loop lag exceeds a given threshold.
@andrew-farries force-pushed the af/server-liveness-probe branch from 73e59c8 to b2a4283 on September 22, 2022 11:50
@andrew-farries (Contributor, Author) commented:

> It does have the potential to effectively take all instances out of the deployment pool, however. If on startup the value spikes and the liveness probe fails, it will not consider the instance ready. That can put more load on the other instances and cause a cascading failure.

Looking at the dashboard for this metric, we do see regular spikes but none for longer than 15 seconds. By setting the FailureThreshold for the probe, we should be able to avoid probe failures due to these spikes.

I've bumped the FailureThreshold to 6 at a period of 10s, which means we'd need to see ~1 minute of sustained lag before the probe fails.

@andrew-farries force-pushed the af/server-liveness-probe branch from b2a4283 to 6c06a55 on September 22, 2022 11:53
},
InitialDelaySeconds: 120,
PeriodSeconds: 10,
FailureThreshold: 6,
Member commented:

👍

@easyCZ (Member) left a comment:

LGTM, this is a great improvement!

@easyCZ (Member) commented Sep 22, 2022:

/hold for failing CI (to not block the merge queue)

@andrew-farries (Contributor, Author) commented:

/unhold

@roboquat merged commit 11afa59 into main on Sep 22, 2022
@roboquat deleted the af/server-liveness-probe branch on September 22, 2022 14:10
@roboquat added the deployed: webapp (Meta team change is running in production) and deployed (Change is completely running in production) labels on Sep 23, 2022
Labels: deployed: webapp, deployed, release-note, size/M, team: webapp
Projects: none
3 participants