[ws-rollout] Add prometheus init check #16000

Merged 1 commit on Jan 30, 2023

Conversation

Pothulapati
Contributor

@Pothulapati Pothulapati commented Jan 24, 2023

Description

When given an unusable --prometheus-url, we start the rollout without verifying that Prometheus is reachable. This is a problem because we cannot fetch metrics from Prometheus, so the rollout is reverted later and the time spent on it is wasted.

This can be prevented by performing a simple check to see whether Prometheus is reachable. The up query is used instead of the key metrics because we can't be sure that those metrics exist.

Signed-off-by: Tarun Pothulapati [email protected]

Related Issue(s)

Fixes #

How to test

Tested by passing an unreachable Prometheus URL:

gitpod /workspace/gitpod/components/workspace-rollout-job (tar/ws-rollout-prom-init) $ go run . --new-cluster g2c4ec3d5cb --old-cluster gfd9876cfe2 --prometheus-url http://localhos:9090
INFO[0000] Starting workspace-rollout-job               
FATA[0001] init: prometheus is not reachable             error="Post \"http://localhos:9090/api/v1/query\": dial tcp: lookup localhos on 1.1.1.1:53: no such host"
exit status 1

Release Notes

[ws-rollout] Add prometheus init check

Documentation

Build Options:

  • /werft with-github-actions
    Experimental feature to run the build with GitHub Actions (and not in Werft).
  • leeway-no-cache
    leeway-target=components:all-ci
  • /werft no-test
    Run Leeway with --dont-test
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh

@Pothulapati Pothulapati requested review from a team January 24, 2023 07:33
@github-actions github-actions bot added the team: SID and team: workspace labels Jan 24, 2023
@Pothulapati Pothulapati force-pushed the tar/ws-rollout-prom-init branch from b443912 to efe4e88 Compare January 24, 2023 07:34
Member

@WVerlaek WVerlaek left a comment

Nice - have some non-blocking suggestions

err = analysis.CheckPrometheusReachable(ctx, conf.prometheusURL)
if err != nil {
	log.WithError(err).Fatal("init: prometheus is not reachable")
	return err

could remove the error return here, log.Fatal will already exit the process
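
A small sketch of the shape this suggestion points at. It uses the standard library log package rather than the logger in the snippet above, and checkReachable and initOrDie are hypothetical names. The standard library documents log.Fatalf as printing the message and then calling os.Exit(1), so a return placed after it would never run.

```go
package main

import (
	"errors"
	"log"
)

// checkReachable is a hypothetical placeholder for the reachability check.
// It only rejects an empty URL so that the sketch stays self-contained.
func checkReachable(url string) error {
	if url == "" {
		return errors.New("empty prometheus URL")
	}
	return nil
}

// initOrDie logs and exits on failure. log.Fatalf terminates the process
// with exit status 1, so no `return err` is needed after it.
func initOrDie(url string) {
	if err := checkReachable(url); err != nil {
		log.Fatalf("init: prometheus is not reachable: %v", err)
	}
}
```

One consequence worth keeping in mind: because the process exits immediately, deferred functions do not run after a Fatal call, so this pattern fits best at the top level of the program.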

Comment on lines +30 to +32
if prometheusURL == "" {
	return nil
}

What about moving this function to WorkspaceKeyMetricsAnalyzer, e.g. WorkspaceKeyMetricsAnalyzer.CheckPrometheusReachable(), and then using the analyzer's API client to check whether Prometheus is reachable? You'd then only need the v1Client.Query(ctx, "up", time.Now()) part, which deduplicates the construction of the API clients.

@kylos101
Contributor

@gitpod-io/engineering-delivery-operations-experience could we ask for your help with a review? It'd be great to test drive ws-rollout next week during the traffic shift.

@ArthurSens
Contributor

ArthurSens commented Jan 26, 2023

@kylos101 since I reviewed Tarun's earlier work, I can say that the PR looks good. @WVerlaek's review makes sense, but I'm not sure @Pothulapati can push changes to the branch, since he is no longer part of the org (or whether he is willing to).

@corneliusludmann could you give the ✅ just so the change gets merged and you'll apply the fix in another PR?

@kylos101
Contributor

/unhold

We'll handle the suggestions in a follow-on PR, if needed, thank you @WVerlaek for the comprehensive review!

Also, @nandajavarma appreciate you unblocking this!

@roboquat roboquat merged commit f05f5bd into main Jan 30, 2023
@roboquat roboquat deleted the tar/ws-rollout-prom-init branch January 30, 2023 21:04
@roboquat roboquat added the deployed: workspace label Feb 7, 2023
Labels: deployed: workspace, release-note, size/M, team: SID, team: workspace
6 participants