Refactor deployment healthcheck #2750

Closed

Contributor

@tejal29 tejal29 commented Aug 28, 2019

In this PR:

  • Refactor the health check to add a Resource interface, whose implementations determine whether a resource is healthy.

  • Print a health check summary each time a resource check returns, indicating whether that resource is ready or in error and how many resources are still pending.

deployment/leeroy-app is ready. [1/2 deployment(s) still pending]

In case the deployment is in error, the message looks like:

deployment/leeroy-app failed [1/2 deployment(s) still pending]. Error: resource deployment/leeroy-app could not be fetched within 10s: context deadline exceeded.
deployment/leeroy-app is pending due to resource deployment/leeroy-app could not be fetched within 10s: context deadline exceeded

  • Every 150 milliseconds, print the state of the resource to give users more insight into the current status of a deployment. For example, for deployments with a large number of replicas,
    the user will see the current status of the deployment:
deployment.apps/leeroy-web configured
Waiting for deployments to stabilize
deployment/leeroy-app is pending due to Waiting for rollout to finish: 5 out of 10 new replicas have been updated...
deployment/leeroy-app is pending due to Waiting for rollout to finish: 6 out of 10 new replicas have been updated...
  • Finally, add special handling for errors where we can't fetch the deployment due to connectivity issues. In such cases, we should keep polling until statusCheckDeadlineSeconds is reached.
tejaldesai@@microservices (health_check_refactor)$ skaffold dev --default-repo=gcr.io/tejal-test --status-check
Generating tags...
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-web -> gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-web:v0.36.0-104-gdeba6031
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-app -> gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-app:v0.36.0-104-gdeba6031-dirty
Tags generated in 23.115371ms
Checking cache...
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-web: Found
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-app: Found
Cache check complete in 889.162173ms
Starting deploy...
kubectl client version: 1.11+
kubectl version 1.12.0 or greater is recommended for use with Skaffold
deployment.apps/leeroy-web created
service/leeroy-app configured
deployment.apps/leeroy-app created
Deploy complete in 1.265634429s
Waiting for deployments to stabilize
deployment/leeroy-web is ready. [1/2 deployment(s) still pending]
deployment/leeroy-app is pending due to Waiting for rollout to finish: 0 of 3 updated replicas are available...
deployment/leeroy-app is pending due to Waiting for rollout to finish: 0 of 3 updated replicas are available...
deployment/leeroy-app is pending due to Running [kubectl --context gke_tejal-test_us-central1-a_dump rollout status deployment leeroy-app --namespace default --watch=false]: stdout , stderr: Unable to connect to the server: dial tcp 35.238.80.213:443: connect: network is unreachable
, err: exit status 1: exit status 1
deployment/leeroy-app is ready. 
Watching for changes...
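
The Resource interface and the per-resource summary described above could look roughly like the sketch below. This is only an illustration: the method set (`String`, `Err`), the `Deployment` struct fields, and the `summaryLine` helper are assumptions made for this example, not the actual code in this PR. The message format follows the sample output in the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a hypothetical interface: each resource kind decides for
// itself whether it is healthy. The real method set in the PR may differ.
type Resource interface {
	String() string // e.g. "deployment/leeroy-app"
	Err() error     // nil once the resource is ready
}

// Deployment is an illustrative Resource implementation.
type Deployment struct {
	name string
	err  error
}

func (d Deployment) String() string { return "deployment/" + d.name }
func (d Deployment) Err() error     { return d.err }

// errDeadline is sample data for the failure case.
var errDeadline = errors.New("context deadline exceeded")

// summaryLine formats the message printed when a resource check returns.
// The pending suffix is omitted once nothing is left to wait for.
func summaryLine(r Resource, pending, total int) string {
	status := ""
	if pending > 0 {
		status = fmt.Sprintf(" [%d/%d deployment(s) still pending]", pending, total)
	}
	if err := r.Err(); err != nil {
		return fmt.Sprintf("%s failed%s. Error: %s.", r, status, err)
	}
	return fmt.Sprintf("%s is ready.%s", r, status)
}

func main() {
	resources := []Resource{
		Deployment{name: "leeroy-web"},
		Deployment{name: "leeroy-app", err: errDeadline},
	}
	pending := len(resources)
	for _, r := range resources {
		pending-- // this resource's check has returned
		fmt.Println(summaryLine(r, pending, len(resources)))
	}
}
```

Keeping the readiness decision behind an interface would let a later pod check reuse the same summary printing without touching the formatting code.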

Sample full output

tejaldesai@@microservices (health_check_refactor)$ skaffold dev --default-repo=gcr.io/tejal-test --status-check
Generating tags...
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-web -> gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-web:v0.36.0-104-gdeba6031
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-app -> gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-app:v0.36.0-104-gdeba6031-dirty
Tags generated in 29.447689ms
Checking cache...
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-web: Found
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-app: Found
Cache check complete in 739.122386ms
Starting deploy...
kubectl client version: 1.11+
kubectl version 1.12.0 or greater is recommended for use with Skaffold
deployment.apps/leeroy-web created
service/leeroy-app configured
deployment.apps/leeroy-app created
Deploy complete in 1.039965467s
Waiting for deployments to stabilize
deployment/leeroy-app is pending due to Waiting for rollout to finish: 0 of 3 updated replicas are available...
deployment/leeroy-web is pending due to Waiting for rollout to finish: 0 of 1 updated replicas are available...
deployment/leeroy-app is ready. [1/2 deployment(s) still pending]
deployment/leeroy-web is ready. 
Watching for changes...
[leeroy-app-54788f4dc4-dsdlx leeroy-app] 2019/08/28 23:43:16 leeroy app server ready
[leeroy-app-54788f4dc4-tvhqt leeroy-app] 2019/08/28 23:43:16 leeroy app server ready
[leeroy-app-54788f4dc4-g87km leeroy-app] 2019/08/28 23:43:16 leeroy app server ready
[leeroy-web-5b986bffc4-xj77k leeroy-web] 2019/08/28 23:43:16 leeroy web server ready
^C

@codecov

codecov bot commented Aug 28, 2019

Codecov Report

Merging #2750 into master will decrease coverage by 0.37%.
The diff coverage is 49.04%.

Impacted Files                                         Coverage Δ
pkg/skaffold/deploy/status_check.go                    18.09% <19.76%> (-45.15%) ⬇️
pkg/skaffold/runner/deploy.go                          73.68% <75%> (+1.46%) ⬆️
pkg/skaffold/deploy/resources/resource.go              82.35% <82.35%> (ø)
pkg/skaffold/deploy/resources/deployment.go            87.87% <87.87%> (ø)
...kg/skaffold/generate_pipeline/generate_pipeline.go  39.47% <0%> (-16.17%) ⬇️
pkg/skaffold/deploy/kubectl.go                         64.95% <0%> (ø) ⬆️
pkg/skaffold/build/util.go                             100% <0%> (ø) ⬆️
pkg/skaffold/schema/validation/validation.go           100% <0%> (ø) ⬆️
pkg/skaffold/deploy/kustomize.go                       71.42% <0%> (ø) ⬆️
pkg/skaffold/runner/generate_pipeline.go               0% <0%> (ø) ⬆️
... and 4 more

Contributor

@dgageot dgageot left a comment

@tejal29 I've only reviewed the UX, not the code.
It looks like it's doing the job which is awesome!
It would be even better if the output could be streamlined and refined.

Here's what I see:

Deploy complete in 378.581679ms
Waiting for deployments to stabilize
deployment/leeroy-app is pending due to Waiting for deployment "leeroy-app" rollout to finish: 1 old replicas are pending termination...

deployment/leeroy-web is pending due to Waiting for deployment "leeroy-web" rollout to finish: 1 old replicas are pending termination...

deployment/leeroy-app is ready. [1/2 deployment(s) still pending]
deployment/leeroy-web is ready.

Here's what I'd love to see:

Deploy complete in 378.581679ms
Waiting for deployments to stabilize...
 - Waiting for deployment "leeroy-app" rollout to finish: 1 old replicas are pending termination...
 - Waiting for deployment "leeroy-web" rollout to finish: 1 old replicas are pending termination...
 - deployment/leeroy-app is ready. [1/2 deployment(s) still pending]
 - deployment/leeroy-web is ready.
Deployments stabilized in Xs
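
A minimal sketch of a printer that would produce the streamlined layout requested above: a header, one indented line per status message, and a footer with the elapsed time. The function names `printStabilization` and `render` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
	"time"
)

// printStabilization writes the header, each status message prefixed
// with " - ", and a footer reporting how long stabilization took.
func printStabilization(out io.Writer, msgs []string, elapsed time.Duration) {
	fmt.Fprintln(out, "Waiting for deployments to stabilize...")
	for _, m := range msgs {
		fmt.Fprintf(out, " - %s\n", m)
	}
	fmt.Fprintf(out, "Deployments stabilized in %s\n", elapsed)
}

// render returns the same output as a string, which is handy in tests.
func render(msgs []string, elapsed time.Duration) string {
	var b strings.Builder
	printStabilization(&b, msgs, elapsed)
	return b.String()
}

func main() {
	msgs := []string{
		"deployment/leeroy-app is ready. [1/2 deployment(s) still pending]",
		"deployment/leeroy-web is ready.",
	}
	printStabilization(os.Stdout, msgs, 2*time.Second)
}
```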

Another thing that it doesn't display and that I think is useful to the users is pods that fail to start.
This is something I see with kubectl get -w pods.

@tejal29
Contributor Author

tejal29 commented Aug 29, 2019

@dgageot Thanks! The pod check is coming in another PR, but I will show you how it looks.

Regarding the "UI changes", let me fix it.

Thanks
Tejal

@tejal29 tejal29 force-pushed the health_check_refactor branch from 61c7f19 to a843292 Compare August 29, 2019 18:38
@tejal29 tejal29 force-pushed the health_check_refactor branch from a843292 to dccc68f Compare August 29, 2019 18:40
@tejal29
Contributor Author

tejal29 commented Aug 29, 2019

@dgageot This is how it looks on my branch

tejaldesai@@microservices (health_check_refactor)$ skaffold dev --default-repo=gcr.io/tejal-test --status-check
Generating tags...
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-web -> gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-web:v0.36.0-105-ga8432921
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-app -> gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-app:v0.36.0-105-ga8432921
Tags generated in 23.651552ms
Checking cache...
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-web: Found
 - gcr.io/tejal-test/gcr.io/k8s-skaffold/leeroy-app: Found
Cache check complete in 749.659109ms
Starting deploy...
kubectl client version: 1.11+
kubectl version 1.12.0 or greater is recommended for use with Skaffold
deployment.apps/leeroy-web created
service/leeroy-app created
deployment.apps/leeroy-app created
Deploy complete in 1.019251518s
Waiting for deployments to stabilize
 - deployment/leeroy-app is pending due to Waiting for rollout to finish: 0 of 1 updated replicas are available...
 - deployment/leeroy-web is pending due to Waiting for rollout to finish: 0 of 1 updated replicas are available...
 - deployment/leeroy-web is ready. [1/2 deployment(s) still pending]
 - deployment/leeroy-app is ready. 
Deployments stabilized in 2.178304751s
Watching for changes...

if err != nil {
	reason, details := parseKubectlError(err.Error())
	d.UpdateStatus(details, reason, err)
	if reason != KubectlConnection {
Contributor Author

We would want to keep polling for rollout status if there was a connectivity issue.
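
To illustrate the intent, here is a self-contained sketch of a polling loop that treats connectivity errors as transient and gives up only on other errors or when the deadline expires. It is not the code from this PR: `classify` is a hypothetical stand-in for `parseKubectlError`, and matching on the text "Unable to connect to the server" is an assumption based on the stderr shown in the description. `pollRollout` and `simulate` are likewise names made up for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

type errReason int

const (
	reasonUnknown errReason = iota
	reasonConnection
)

// classify is a hypothetical stand-in for parseKubectlError. The
// substring match is an assumption, not the PR's actual logic.
func classify(err error) errReason {
	if strings.Contains(err.Error(), "Unable to connect to the server") {
		return reasonConnection
	}
	return reasonUnknown
}

// pollRollout calls check every interval until it reports done, fails
// with a non-connection error, or ctx expires. Connection errors are
// treated as transient, so polling continues.
func pollRollout(ctx context.Context, interval time.Duration, check func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := check()
		switch {
		case err != nil && classify(err) != reasonConnection:
			return err // not a connectivity problem: stop polling
		case err == nil && done:
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // deadline reached
		case <-ticker.C:
		}
	}
}

// simulate runs pollRollout against a fake check that returns connErrs
// connection errors, then either a fatal error or success.
func simulate(connErrs int, fatal bool) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	calls := 0
	err := pollRollout(ctx, 5*time.Millisecond, func() (bool, error) {
		calls++
		if calls <= connErrs {
			return false, errors.New("Unable to connect to the server: network is unreachable")
		}
		if fatal {
			return false, errors.New("deployment exceeded its progress deadline")
		}
		return true, nil
	})
	return calls, err
}

func main() {
	calls, err := simulate(2, false)
	fmt.Println(calls, err)
}
```

In this sketch the context timeout plays the role of statusCheckDeadlineSeconds: a flaky connection delays the result but cannot extend the check past the deadline.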

Contributor

@balopat balopat left a comment

Another round of nits. This is looking really good, I just have to look again at it with fresh eyes.

Contributor

@dgageot dgageot left a comment

Is there a way to break this PR in two pieces? I don't have ideas on how to do that but I must admit I got lost during the review.

@tejal29
Contributor Author

tejal29 commented Sep 4, 2019

@dgageot Actually I can.
I am breaking it up into 3 parts.

  1. Just adding the final printSummary: Print status check summary when a status check is completed. #2811
  2. Reporting a summary every 0.5s: add polling status after every half second for deployment health check #2813
  3. Adding Resources and refactoring.

Once those 2 are merged, I will update this PR. This PR will then be relatively short.

@tejal29 tejal29 added the !! blocked !! this issue/PR is blocked by another issue label Sep 5, 2019
@tejal29 tejal29 closed this Sep 17, 2019
@tejal29 tejal29 mentioned this pull request Oct 4, 2019
4 tasks
@tejal29 tejal29 deleted the health_check_refactor branch April 15, 2021 07:33
Labels
!! blocked !! this issue/PR is blocked by another issue cla: yes needs-rebase
4 participants