[server] Add a liveness probe for event loop lag #13203
Conversation
started the job as gitpod-build-af-server-liveness-probe.8 because the annotations in the pull request description changed
Great! Like it.
It does have the potential to effectively take all instances out of the deployment pool, however. If the value spikes on startup and the liveness probe fails, the instance will never be considered healthy. That puts more load on the other instances and can cause a cascading failure.
@@ -236,6 +236,7 @@ func configmap(ctx *common.RenderContext) ([]runtime.Object, error) {
 		ContentServiceAddr: net.JoinHostPort(fmt.Sprintf("%s.%s.svc.cluster.local", contentservice.Component, ctx.Namespace), strconv.Itoa(contentservice.RPCPort)),
 		ImageBuilderAddr: net.JoinHostPort(fmt.Sprintf("%s.%s.svc.cluster.local", common.ImageBuilderComponent, ctx.Namespace), strconv.Itoa(common.ImageBuilderRPCPort)),
 		UsageServiceAddr: net.JoinHostPort(fmt.Sprintf("%s.%s.svc.cluster.local", usage.Component, ctx.Namespace), strconv.Itoa(usage.GRPCServicePort)),
+		MaximumEventLoopLag: 0.35,
This would cause even self-hosted instances to restart if it reached this value. Are we happy with that? Should we document that?
Yes, and I think it's desirable that this applies to SaaS and self-hosted.
The only point of contention is what this value should be set to, and whether it should be the same in self-hosted vs SaaS installations. The value is taken from the AlertManager, i.e. it is set to the threshold that currently triggers a page to the on-call engineer. I think it's appropriate to leave it the same for all installations right now.
I don't think any extra documentation beyond the changelog is necessary.
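For reference, here is a minimal sketch (not the actual Gitpod source) of how this hard-coded value could be declared on the installer's server configuration struct, assuming it is serialized into the server-config configmap under the maximumEventLoopLag key referenced in the test steps below:

package server

// ConfigSerialized is a hypothetical excerpt of the installer's server
// configuration struct; the field name matches the diff above and the JSON
// tag matches the maximumEventLoopLag setting in the server-config configmap.
type ConfigSerialized struct {
	// ... other fields such as ContentServiceAddr and UsageServiceAddr ...

	// MaximumEventLoopLag is the Node.js event loop lag (presumably in
	// seconds, matching the Prometheus metric) above which the liveness
	// endpoint starts failing.
	MaximumEventLoopLag float64 `json:"maximumEventLoopLag"`
}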
Thanks!
The probe fails iff the Node.js event loop lag (as reported by a Prometheus metric) exceeds the value set in server config.
Hard-code the server setting added in the parent commit. If necessary, this could become configurable (via an experimental config setting).
Add a liveness probe for server that fails if the Node.js event loop lag exceeds a given threshold.
Force-pushed the branch from 73e59c8 to b2a4283.
Looking at the dashboard for this metric, we do see regular spikes, but none lasting longer than 15 seconds. I've bumped the probe settings so that the pod is only restarted after roughly a minute of consecutive failing checks, which should comfortably ride out those spikes.
Force-pushed the branch from b2a4283 to 6c06a55.
},
InitialDelaySeconds: 120,
PeriodSeconds:       10,
FailureThreshold:    6,
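For context, a minimal sketch of what the resulting probe definition could look like in the installer, using the timing values from the diff above. The endpoint path comes from the PR description; the port and the surrounding wiring are assumptions for illustration only:

package server

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// livenessProbe sketches the probe attached to the server container.
// InitialDelaySeconds skips the first two minutes so that startup spikes in
// event loop lag cannot take the instance out of rotation, and with
// PeriodSeconds 10 and FailureThreshold 6 the pod is only restarted after
// roughly 60 seconds of consecutive failing checks.
func livenessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/api/live",          // endpoint described in the PR description
				Port: intstr.FromInt(3000), // assumed server HTTP port
			},
		},
		InitialDelaySeconds: 120,
		PeriodSeconds:       10,
		FailureThreshold:    6,
	}
}

(In older k8s.io/api releases the embedded handler field is named Handler rather than ProbeHandler.)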
👍
LGTM, this is a great improvement!
/hold for failing CI (to not block the merge queue)
/unhold
Description
Add a liveness probe for server. The probe fails iff the Node.js event loop lag (as reported by a Prometheus metric) exceeds the value set in the server config.
Context:
We have an alert, with a corresponding runbook, that fires when the event loop lag exceeds a certain threshold. The runbook simply advises the operator to restart the affected pods, something that should not require manual intervention.
Related Issue(s)
How to test
- Edit the server-config configmap to change the maximumEventLoopLag setting.
- Restart the server pod after changing the configmap.
- Open <preview-url>/api/live to see the current event loop lag, and watch the response code change according to the current value of maximumEventLoopLag (a minimal sketch of this check follows below).
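For the last step, here is a minimal sketch of polling the liveness endpoint and printing the response. It assumes only that the endpoint returns a success status while the lag is below maximumEventLoopLag and a failure status once it exceeds it:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Pass the preview host as the first argument, e.g. "my-branch.preview.example.com".
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: livecheck <preview-url>")
		os.Exit(1)
	}

	resp, err := http.Get("https://" + os.Args[1] + "/api/live")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// A 2xx status is expected while the reported lag is below the configured
	// maximumEventLoopLag, and a non-2xx status once it exceeds it.
	fmt.Println(resp.StatusCode, string(body))
}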
.Release Notes
Documentation
Werft options:
If enabled this will build install/preview
Valid options are all, workspace, webapp, ide