Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip checking Vault health #8524

Merged
merged 4 commits into from
Aug 11, 2020
Merged

Skip checking Vault health #8524

merged 4 commits into from
Aug 11, 2020

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented Jul 23, 2020

To check if token is valid, we should simply check if the API valid, without needing to check if Vault is healthy/initialized. If it's not, we'll get a proper error.

Back in the day, we used to distinguish between errors related to Vault not initializing (where we'd retry) and errors due to token being invalid (where treated such errors as permanent). However, that changed in #3957 . In this world of retries, it doesn't really make sense to do the pre-check.

Fixes #8512

Mahmood Ali added 3 commits July 23, 2020 11:33
Avoid checking if API is accessible, just make the API call and handle
when it fails.
@notnoop notnoop requested review from schmichael and shoenig July 23, 2020 20:40
@notnoop notnoop self-assigned this Jul 23, 2020
@@ -1239,7 +1229,7 @@ func (v *vaultClient) revokeDaemon() {
case <-v.tomb.Dying():
return
case now := <-ticker.C:
if established, _ := v.ConnectionEstablished(); !established {
if established, err := v.ConnectionEstablished(); !established || err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code here was introduced in 42f7bc8#diff-87a4c76eca3de54716c55ba6b8c06475 - where established is true only when there is no error. So checking error wasn't strictly necessary. After many modifications, this file consistently checks for established && err == nil, but we missed this spot.

For some reason, not so clear to me, #3957 chose to represent invalid tokens as established with true but non-nil error. It seems like we missed this spot.

Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@notnoop notnoop changed the title B vault health checks Skip checking Vault health Jul 23, 2020
@adawalli
Copy link

Really appreciate the very fast work on this @notnoop !

@adawalli
Copy link

Also - I tested this change out this morning and it completely fixes my issues! I am now dynamically retrieving secrets from Vault!

@adawalli
Copy link

adawalli commented Aug 8, 2020

Will this be included in next version of nomad?

@notnoop
Copy link
Contributor Author

notnoop commented Aug 10, 2020

@adawalli yes, we are aiming to include this in 0.12.2 - but a bit slow in merging this PR. vault.go has been causing some pain lately and want to do a bit more manual/stress testing before merging.

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget the changelog entry. Probably make it about the linked issue more than the code change?

Nomad servers no longer rely on Vault's /sys/init API as hardened Vault installations may restrict access to that endpoint [GH-8524]

or something similar?

@notnoop notnoop force-pushed the b-vault-health-checks branch from 33604ab to a3b4f06 Compare August 10, 2020 20:46
continue OUTER
}
initStatus = true
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schmichael Can I get a second review please? I have restored the health check because of the way CreateToken handles connection errors:

nomad/nomad/vault.go

Lines 951 to 956 in a3b4f06

// Check if we have established a connection with Vault
if established, err := v.ConnectionEstablished(); !established && err == nil {
return nil, structs.NewRecoverableError(fmt.Errorf("Connection to Vault has not been established"), true)
} else if err != nil {
return nil, err
}

In the above code, we aim to detect if the server failed to make any API contact, vault will return a recoverable error, but if /sys/init (or health now) succeeds, then it marks the token failures as permenant.

I don't buy into this distinction much, but to be conservative in a point release, I brought back the health check. However, I made reversed the order of the check - only check health if parseToken failed. It should be functionally the same as before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we instead make errors from parseSelfToken recoverable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't buy into the distinction but I'm afraid of making of any unintended side-effects considering it seems it was explicitly designed this way and this file has bitten as before. I'm more inclined to keep the change conservative for 0.12.X and make more ambitious refactor in a major release.

@notnoop notnoop merged commit 104ab50 into master Aug 11, 2020
@notnoop notnoop deleted the b-vault-health-checks branch August 11, 2020 20:01
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vault "init" API used instead of "health" API - Causes Failures With Production Vault
5 participants