
[server] Add a liveness probe for event loop lag #13203

Merged: 5 commits into main from af/server-liveness-probe on Sep 22, 2022

Conversation

@andrew-farries (Contributor) commented Sep 22, 2022

Description

Add a liveness probe for server.

The probe fails iff the Node.js event loop lag (as reported by a Prometheus metric) exceeds the value set in the server config.

Context:

We have an alert and corresponding runbook which fires when the event loop lag exceeds a certain threshold. The runbook simply advises the operator to restart the affected pods, something that should not require manual intervention.
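
For orientation, here is a minimal sketch of what the Kubernetes side of such a probe looks like. This is not the PR's exact installer code: the Go type usage and the port are assumptions, while the /api/live path and the timing values are taken from elsewhere in this PR.

    // Illustrative sketch only, not the PR's actual installer code.
    // A Kubernetes liveness probe that polls the server's /api/live endpoint.
    package server

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    func livenessProbe() *corev1.Probe {
        return &corev1.Probe{
            ProbeHandler: corev1.ProbeHandler{ // named "Handler" in older k8s.io/api versions
                HTTPGet: &corev1.HTTPGetAction{
                    Path:   "/api/live",
                    Port:   intstr.FromInt(3000), // assumed container port, not taken from this PR
                    Scheme: corev1.URISchemeHTTP,
                },
            },
            // The timing values below match the diff discussed later in this thread.
            InitialDelaySeconds: 120,
            PeriodSeconds:       10,
            FailureThreshold:    6,
        }
    }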

Related Issue(s)

How to test

  1. Edit the server-config configmap to change the maximumEventLoopLag setting (see the sketch after this list).
  2. Restart the server pod after changing the configmap.
  3. Hit <preview-url>/api/live to see the current event loop lag, and watch the response code change according to the current value of maximumEventLoopLag.
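
For reference, a rough sketch of how that maximumEventLoopLag configmap key relates to the installer's Go config. The struct and field names below are assumptions for illustration; the actual change appears in the diff further down, where the installer sets the value to 0.35.

    // Illustrative sketch only; struct and field names are assumptions,
    // not the PR's actual code.
    package config

    type ConfigSerialized struct {
        // ... other server settings ...

        // Serializes to the `maximumEventLoopLag` key in the server-config
        // configmap; the liveness probe fails once the reported event loop
        // lag exceeds this value (set to 0.35 in this PR).
        MaximumEventLoopLag float64 `json:"maximumEventLoopLag"`
    }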

Release Notes

[server]: Add a liveness probe which fails when the Node.js event loop lag exceeds a certain threshold

Documentation

Werft options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-integration-tests=webapp
    Valid options are all, workspace, webapp, ide

@werft-gitpod-dev-com started the job as gitpod-build-af-server-liveness-probe.8 because the annotations in the pull request description changed (with .werft/ from main)

@github-actions bot added the team: webapp (Issue belongs to the WebApp team) label on Sep 22, 2022
@easyCZ (Member) left a comment:

Great! Like it.

It does have the potential to effectively take all instances out of the deployment pool, however. If on startup the value spikes and the liveness probe fails, it will not consider the instance ready. That can put more load on the other instances and cause a cascading failure.

@@ -236,6 +236,7 @@ func configmap(ctx *common.RenderContext) ([]runtime.Object, error) {
ContentServiceAddr: net.JoinHostPort(fmt.Sprintf("%s.%s.svc.cluster.local", contentservice.Component, ctx.Namespace), strconv.Itoa(contentservice.RPCPort)),
ImageBuilderAddr: net.JoinHostPort(fmt.Sprintf("%s.%s.svc.cluster.local", common.ImageBuilderComponent, ctx.Namespace), strconv.Itoa(common.ImageBuilderRPCPort)),
UsageServiceAddr: net.JoinHostPort(fmt.Sprintf("%s.%s.svc.cluster.local", usage.Component, ctx.Namespace), strconv.Itoa(usage.GRPCServicePort)),
MaximumEventLoopLag: 0.35,
Member commented:

This would cause even self-hosted instances to restart if it reached this value. Are we happy with that? Should we document that?

@andrew-farries (Contributor, Author) replied:

Yes, and I think it's desirable that this applies to SaaS and self-hosted.

The only point of contention is what this value should be set to, and whether it should be the same in self-hosted vs SaaS installations. This value is taken from the AlertManager, i.e. it's set to the value that currently triggers a page to the on-caller. I think it's appropriate to leave it the same for all installations right now.

I don't think any extra documentation beyond the changelog is necessary.

Member replied:

Thanks!

Andrew Farries added 4 commits on September 22, 2022 at 10:53:

  • The probe fails iff the nodejs event loop lag (as reported by a prometheus metric) exceeds the value set in server config.
  • Hard code the server setting added in the parent commit. If necessary, this could become configurable (via an experimental config setting).
  • Add a liveness probe for server that fails if the nodejs event loop lag exceeds a given threshold.
@andrew-farries force-pushed the af/server-liveness-probe branch from 73e59c8 to b2a4283 on September 22, 2022 11:50
@andrew-farries (Contributor, Author) commented:

> It does have the potential to effectively take all instances out of the deployment pool, however. If on startup the value spikes and the liveness probe fails, it will not consider the instance ready. That can put more load on the other instances and cause a cascading failure.

Looking at the dashboard for this metric, we do see regular spikes but none for longer than 15 seconds. By setting the FailureThreshold for the probe, we should be able to avoid probe failures due to these spikes.

I've bumped the FailureThreshold to 6 at a period of 10s, which means we'd need to see ~1 minute of sustained lag before the probe fails.

@andrew-farries force-pushed the af/server-liveness-probe branch from b2a4283 to 6c06a55 on September 22, 2022 11:53
},
InitialDelaySeconds: 120,
PeriodSeconds: 10,
FailureThreshold: 6,
Member commented:

👍

@easyCZ (Member) left a comment:

LGTM, this is a great improvement!

@easyCZ (Member) commented Sep 22, 2022:

/hold for failing CI (to not block the merge queue)

@andrew-farries (Contributor, Author) commented:

/unhold

@roboquat merged commit 11afa59 into main on Sep 22, 2022
@roboquat deleted the af/server-liveness-probe branch on September 22, 2022 14:10
@roboquat added the deployed: webapp (Meta team change is running in production) and deployed (Change is completely running in production) labels on Sep 23, 2022
Labels: deployed: webapp, deployed, release-note, size/M, team: webapp
Projects: none
3 participants