-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revisit renewer API grace period logic #3414
Comments
Can the discussion be resumed? I was hit by this problem in consul template and only found out after some code diving. Using exec mode with a low token ttl (2m) creates a very surprising behavior: the renewer returns early, consul-template considers it an error, backoff kicks in and in the end, the token expired and did not get renewed. Setting the grace period lower than 1/3 of the token ttl fixes it but it is non-obvious. At the very least, I think this behaviour could be documented better. |
@bullno1 Seth isn't with HashiCorp anymore so half of the state on the issue got lost. I'll try to get some discussion started up again. |
OK, from a quick look, my feeling is: I hate grace periods. I stripped them all out of Vault proper a long while back (good riddance) and I didn't really like them coming in in the renewer but was convinced otherwise. But I still think it's a bad idea. IIUC the entire point of the grace period is to keep bad clients from doing stupid things (ignoring timers), except predictably it's causing good clients to accidentally do bad things, and it also allows you to write and keep using bad clients. Vault shouldn't be magic, it should do what you expect it to do. My vote right now is to just strip out the grace period entirely although I'd want to ensure that doesn't cause breaks for good clients. Feel free to convince me otherwise, or let me know if I'm misunderstanding something (I didn't do a deep dive back into the code right now). |
It is believed by myself and members of the Nomad team that this logic should be much more robust in terms of causing large numbers of new secret acquisitions caused by a static grace period. See comments in the code for details. Fixes #3414
* Add grace period calculation logic to renewer * Update lease renewer logic. It is believed by myself and members of the Nomad team that this logic should be much more robust in terms of causing large numbers of new secret acquisitions caused by a static grace period. See comments in the code for details. Fixes #3414 * Fix some commenting and fix tests * Add more time to test so that integ tests don't time out * Fix some review feedback
Context - some investigation of issues reported by nomad users who reported instability/constant calls to vault when they used a dynamic secret with ttl < ~ 18 minutes. Nomad uses Consul template which depends on the renewer API.
Root cause - Consul template sets the grace period for Vault to be 5 minutes. The renewer calculates a sleep duration based on the min TTL, but it returns early if the sleep duration is less than the grace period. The sleep duration is calculated as 1/3 of the lease duration plus jitter.
This implies that in order for the renewer to actually sleep before trying again, users would have to set their grace period to be 1/3 of their min TTL. Since most users leave this at default (5 minutes), any secret with a TTL less than ~17 minutes causes the renewer to constantly churn.
More context in this PR's discussion thread hashicorp/consul-template#1020
@sethvargo and I talked about this for a bit. Not sure what the right fix for this is. While we can always change the defaults in nomad/consul template to address this, and add more documentation, this is very nonintuitive. Also, the same API is also used in envconsul, and might confuse those users as well.
As a starting point for discussion, is it worth changing the "return early" logic to only consider lease durations and not sleep durations? That's a little easier to document and intuitively understand (always set the grace period to be smaller than your secret's TTL) vs (set the grace period to be smaller than 1/3 of your secret's TTL, which ties it to implementation details)
cc @sethvargo
The text was updated successfully, but these errors were encountered: