-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vault API Client Tweaks #4817
Vault API Client Tweaks #4817
Conversation
nomad/vault.go
Outdated
@@ -1228,8 +1239,18 @@ func (v *vaultClient) EmitStats(period time.Duration, stopCh chan struct{}) { | |||
case <-time.After(period): | |||
stats := v.Stats() | |||
metrics.SetGauge([]string{"nomad", "vault", "distributed_tokens_revoking"}, float32(stats.TrackedForRevoke)) | |||
metrics.SetGauge([]string{"nomad", "vault", "token_ttl"}, float32(tokenTTL(v.tokenData)/time.Millisecond)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric here better has a server label to identify problematic nomad server - but looks like none of the other metrics have server id. I went with consistency in this case, so we can address all metrics together later. I suspect it'd be better to add a global label at the metrics agent level than pollute the metric key.
Users currently can set additional tags in the metrics agent level (e.g. datadog agent), or by setting server specific keys in the Nomad config (e.g. statsd_tags
).
2ed23de
to
650d04f
Compare
e105d17
to
dfaea51
Compare
nomad/vault.go
Outdated
|
||
// Copy all the stats | ||
stats.TrackedForRevoke = v.stats.TrackedForRevoke | ||
if v.tokenData != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only set on establish connection, so after we pass the original expire time, the TTL we display will be negative. We need to update the tokenData's expire time on each renew. That is what lastRenewed was for: https://github.com/hashicorp/nomad/pull/4817/files#diff-87a4c76eca3de54716c55ba6b8c06475L558
nomad/vault.go
Outdated
@@ -474,13 +477,10 @@ func (v *vaultClient) renewalLoop() { | |||
case <-authRenewTimer.C: | |||
// Renew the token and determine the new expiration | |||
err := v.renew() | |||
currentExpiration := v.lastRenewed.Add(time.Duration(v.tokenData.CreationTTL) * time.Second) | |||
currentExpiration := v.tokenData.ExpireTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is never getting updated; renewal will break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch - thanks. I think we'll need to update the field to protect against leases TTLs being different from creation ttl and/or when we get closer to the token ttl. I'll rework this.
6213e56
to
7dcc554
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR to avoid relying on tokenData to be latest, but to simply track expiration time as a field. This is because the renewSelf
does not return the full token data and does not contain an expire_time
field.
@@ -528,6 +530,49 @@ func TestVaultClient_RenewalLoop(t *testing.T) { | |||
if ttl == 0 { | |||
t.Fatalf("token renewal failed; ttl %v", ttl) | |||
} | |||
|
|||
if client.currentExpiration.Before(time.Now()) { | |||
t.Fatalf("found current expiration to be in past %s", time.Until(client.currentExpiration)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This added check to ensure that after the loop, we track the right expiration. In my earlier attempt, currentExpiration was an past value that caused renewal loop to renew at every loop without any pausing.
7dcc554
to
ac85cee
Compare
nomad/vault.go
Outdated
@@ -203,16 +210,12 @@ type vaultClient struct { | |||
// childTTL is the TTL for child tokens. | |||
childTTL string | |||
|
|||
// lastRenewed is the time the token was last renewed | |||
lastRenewed time.Time | |||
// currentExpiration is the time the current tokean lease expires |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// currentExpiration is the time the current tokean lease expires | |
// currentExpiration is the time the current token lease expires |
func (v *vaultClient) renew() error { | ||
// returned. The boolean indicates whether it's safe to attempt to renew again. | ||
// This method updates the currentExpiration time | ||
func (v *vaultClient) renew() (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI we have a RecoverableError type we sometimes use in circumstances like this. Since the recoverable aspect is handled within this package and doesn't cross public API boundaries, I don't think there's a compelling reason to switch to it.
|
||
_, err = client.renew() | ||
require.NoError(t, err) | ||
exp1 := client.currentExpiration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does reading this variable without locks trigger the race detector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at race job in [1], I don't see reports of race issues - but I suspect technically there should be. However, given that in this test, we renew on the same goroutine we check the value, it's not an issue; even if there is a another goroutine that could update it later on.
Add a gauge to track remaining time-to-live, duration of renewal request API call.
Seems like the stats field is a micro-optimization that doesn't justify the complexity it introduces. Removing it and computing the stats from revoking field directly.
Return Vault TTL info to /agent/self API and `nomad agent-info` command.
Keep attempting to renew Vault token past locally recorded expiry, just in case the token was renewed out of band, e.g. on another Nomad server, until Vault returns an unrecoverable error.
7838eb0
to
3a57b9c
Compare
// TokenTTL is the time-to-live duration for the current token | ||
TokenTTL time.Duration | ||
|
||
// TokenExpiry Time is the recoreded expiry time of the current token |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TokenExpiry Time is the recoreded expiry time of the current token | |
// TokenExpiry Time is the recorded expiry time of the current token |
data.Root = root | ||
v.tokenData = &data | ||
v.currentExpiration = time.Now().Add(time.Duration(data.TTL) * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a read/write race between renew and stats(). Use a lock
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Some tweaks for Vault Client Token renewal: