Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add health probes to GRPC servers validating network access #8777

Merged
merged 6 commits into from
Mar 29, 2022
Merged

Conversation

aledbf
Copy link
Member

@aledbf aledbf commented Mar 13, 2022

Description

Running registry-facade without probes could lead to workspace issues during the start phase. In particular with requests downloading images.
The new probes ensure we can resolve DNS queries and at least we have access to the static layers defined in the configuration to avoid such issues.

How to test

Registry facade should start without any issues.

Release Notes

[registry-facade] Add health probes to validate network access

Documentation

@aledbf aledbf requested a review from a team March 13, 2022 15:27
@aledbf aledbf requested a review from csweichel as a code owner March 13, 2022 15:27
@github-actions github-actions bot added the team: workspace Issue belongs to the Workspace team label Mar 13, 2022
@codecov
Copy link

codecov bot commented Mar 13, 2022

Codecov Report

Merging #8777 (b96ab58) into main (55b3ec2) will increase coverage by 3.22%.
The diff coverage is n/a.

❗ Current head b96ab58 differs from pull request most recent head f1d31be. Consider uploading reports for the commit f1d31be to get more accurate results

@@            Coverage Diff             @@
##             main    #8777      +/-   ##
==========================================
+ Coverage   11.17%   14.40%   +3.22%     
==========================================
  Files          18       51      +33     
  Lines         993     4900    +3907     
==========================================
+ Hits          111      706     +595     
- Misses        880     4124    +3244     
- Partials        2       70      +68     
Flag Coverage Δ
components-gitpod-cli-app 11.17% <ø> (ø)
components-ws-daemon-app 20.33% <ø> (?)
components-ws-daemon-lib 20.33% <ø> (?)
install-installer-raw-app 4.27% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
components/ws-daemon/pkg/quota/size.go 87.30% <0.00%> (ø)
install/installer/pkg/common/display.go 0.00% <0.00%> (ø)
...l/installer/pkg/components/ws-manager/configmap.go 23.75% <0.00%> (ø)
...l/installer/pkg/components/ws-manager/tlssecret.go 0.00% <0.00%> (ø)
components/ws-daemon/pkg/daemon/markunmount.go 0.00% <0.00%> (ø)
...nstall/installer/pkg/components/ws-manager/role.go 0.00% <0.00%> (ø)
components/ws-daemon/pkg/content/service.go 0.00% <0.00%> (ø)
components/ws-daemon/pkg/cpulimit/cfs.go 30.52% <0.00%> (ø)
...staller/pkg/components/ws-manager/networkpolicy.go 0.00% <0.00%> (ø)
install/installer/pkg/common/ca.go 0.00% <0.00%> (ø)
... and 23 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@aledbf aledbf force-pushed the aledbf/reg branch 8 times, most recently from 04cef51 to 541d8dc Compare March 13, 2022 16:28
@aledbf
Copy link
Member Author

aledbf commented Mar 13, 2022

Logs:

{"level":"info","message":"creating DNS check for host eu.gcr.io","serviceContext":{"service":"registry-facade","version":"commit-541d8dcad0cf495d8bf946bcd24d27efc3be4685"},"severity":"INFO","time":"2022-03-13T16:30:17Z"}
{"level":"info","message":"creating network check using URL http://eu.gcr.io/gitpod-core-dev/build/supervisor:commit-4f1c3922b4ad71b6601382068b8a33d9b7cb2f2b","serviceContext":{"service":"registry-facade","version":"commit-541d8dcad0cf495d8bf946bcd24d27efc3be4685"},"severity":"INFO","time":"2022-03-13T16:30:17Z"}
{"level":"info","message":"creating network check using URL http://eu.gcr.io/gitpod-core-dev/build/supervisor:commit-4f1c3922b4ad71b6601382068b8a33d9b7cb2f2b","serviceContext":{"service":"registry-facade","version":"commit-541d8dcad0cf495d8bf946bcd24d27efc3be4685"},"severity":"INFO","time":"2022-03-13T16:30:17Z"}

@aledbf
Copy link
Member Author

aledbf commented Mar 13, 2022

/hold

we need to ensure the postStart waits for the readinessCheck

@aledbf aledbf force-pushed the aledbf/reg branch 2 times, most recently from b1fae11 to 13e5ff0 Compare March 13, 2022 19:08
@aledbf aledbf requested a review from a team March 13, 2022 19:08
@roboquat roboquat removed the size/L label Mar 13, 2022
@github-actions github-actions bot added size/L team: delivery Issue belongs to the self-hosted team labels Mar 13, 2022
@aledbf aledbf changed the title [registry-facade] Add health probes to validate network access Add health probes to validate network access to registry-facade and blobserve Mar 13, 2022
@roboquat roboquat removed the size/L label Mar 13, 2022
@aledbf aledbf force-pushed the aledbf/reg branch 3 times, most recently from 87bc8f4 to 502ed72 Compare March 13, 2022 19:25
@aledbf aledbf force-pushed the aledbf/reg branch 3 times, most recently from 33391df to 37f8a17 Compare March 16, 2022 19:04
kylos101
kylos101 previously approved these changes Mar 16, 2022
Copy link
Contributor

@kylos101 kylos101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Leaving the hold, as we still need to determine why we're having trouble with image pulls in core-dev

mrsimonemms
mrsimonemms previously approved these changes Mar 18, 2022
@aledbf
Copy link
Member Author

aledbf commented Mar 18, 2022

/werft run

@aledbf aledbf dismissed stale reviews from mrsimonemms and kylos101 via b8669cd March 18, 2022 19:16
kylos101
kylos101 previously approved these changes Mar 18, 2022
Copy link
Contributor

@kylos101 kylos101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM opens workspaces now, woot! Removing hold.

/unhold

Copy link
Contributor

@csweichel csweichel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change - it's surprising how long we got by without this.

components/blobserve/cmd/run.go Outdated Show resolved Hide resolved
components/blobserve/cmd/run.go Outdated Show resolved Hide resolved
Comment on lines 123 to 124
health.AddLivenessCheck("dns", kubernetes.DNSCanResolveProbe(staticLayerHost, 1*time.Second))
health.AddLivenessCheck("registry", kubernetes.NetworkIsReachableProbe(fmt.Sprintf("http://%v", repository)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want blobserve to go down if we can no longer reach those? Chances are things are cached and blobserve would just keep functioning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can remove those. Now, How do we know there's an error in blobserve without a restart?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would want to add metrics/alerts for the error cases we know.

components/blobserve/cmd/run.go Outdated Show resolved Hide resolved
components/blobserve/cmd/run.go Outdated Show resolved Hide resolved
components/common-go/kubernetes/probes.go Outdated Show resolved Hide resolved
components/common-go/kubernetes/probes.go Outdated Show resolved Hide resolved
components/registry-facade/cmd/run.go Outdated Show resolved Hide resolved
components/ws-daemon/cmd/run.go Outdated Show resolved Hide resolved
Copy link
Contributor

@csweichel csweichel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. Thank you for this change.

@roboquat roboquat merged commit 25c56de into main Mar 29, 2022
@roboquat roboquat deleted the aledbf/reg branch March 29, 2022 16:50
@kylos101 kylos101 mentioned this pull request Mar 11, 2024
15 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: workspace Workspace team change is running in production release-note size/XXL team: delivery Issue belongs to the self-hosted team team: workspace Issue belongs to the Workspace team
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants