identity: Implement change_mode
#18943
Conversation
I've left a few notes but otherwise LGTM
```go
case <-deadlineTimer.C:
    h.logger.Warn("timed out waiting for initial identity tokens to be fetched",
        "num_fetched", i, "num_total", len(h.task.Identities))
    return nil
```
The group-level `identity_hook` will trigger the WIDMgr and get the identities synchronously before we ever run this hook. So this deadline is only waiting on the WIDMgr broadcasting the notification that it got the signed identities. If we're waiting for over a minute for a send on a channel, are we likely to ever get them? In which case, maybe we should return an error here so that the user sees this error rather than getting an error from downstream consumers?

This hook is early in the taskrunner, so returning an error early will also prevent us from doing much more expensive setup work only to throw it away (e.g. the `artifact` and `volume` hooks won't get a chance to run).
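Concretely, that suggestion would turn the warn-and-continue branch into a hard failure. A minimal sketch under assumed names (`waitForFirstRuns`, `numIdentities`, and the error text are mine, not the PR's actual code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForFirstRuns blocks until every identity watcher has ticked
// firstRunCh once, and fails on deadline or context cancellation
// instead of silently returning nil.
func waitForFirstRuns(ctx context.Context, firstRunCh <-chan struct{}, numIdentities int, deadline time.Duration) error {
	deadlineTimer := time.NewTimer(deadline)
	defer deadlineTimer.Stop()

	for i := 0; i < numIdentities; i++ {
		select {
		case <-firstRunCh:
			// Identity fetched; keep waiting for the rest.
		case <-deadlineTimer.C:
			// Fail Prestart so the user sees this error rather than an
			// error from a downstream consumer of the missing tokens.
			return fmt.Errorf("timed out waiting for initial identity tokens: fetched %d of %d",
				i, numIdentities)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	firstRunCh := make(chan struct{}, 2)
	firstRunCh <- struct{}{}
	firstRunCh <- struct{}{}

	err := waitForFirstRuns(context.Background(), firstRunCh, 2, time.Minute)
	fmt.Println("err:", err) // err: <nil>
}
```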
(That being said, I wouldn't block this PR on this.)
Ah, that makes a ton more sense. So the only threat here is a Client so busy it can't pluck some structs off a chan and throw them on disk (if file=true) in time. No racing the network or remote machines involved.

When nodes are rebooted they can be quite busy, so I am happy to have this race fixed. 1 minute does seem like an eternity though, so perhaps short-circuiting with an error is better than making a best effort by letting it trundle on. 🤔

It ought to be sufficiently unlikely not to matter, so I don't think I'm going to write up an issue until someone hits it? 🤔
```go
// Wait until every watcher ticks the first run chan
for i := range h.task.Identities {
    select {
    case <-firstRunCh:
        // Identity fetched, loop
    case <-deadlineTimer.C:
        h.logger.Warn("timed out waiting for initial identity tokens to be fetched",
            "num_fetched", i, "num_total", len(h.task.Identities))
        return nil
    }
}
```
kind of makes me think there should be a context-aware version of WaitGroup
100% what I needed (or a WaitGroup where Wait() returns a chan)
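That pattern is small enough to sketch generically: drain the WaitGroup in a goroutine and close a channel, so Wait becomes select-able against a context or timer (generic Go, not code from this PR):

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// waitCh makes sync.WaitGroup usable in a select by closing a channel
// once Wait returns.
func waitCh(wg *sync.WaitGroup) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	return done
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); time.Sleep(10 * time.Millisecond) }()
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	select {
	case <-waitCh(&wg):
		fmt.Println("all first runs complete")
	case <-ctx.Done():
		fmt.Println("gave up waiting:", ctx.Err())
	}
}
```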
Identity change mode was implemented in #18943 and handles the update at the task level, so the workload identity manager receives the update as expected.
This implements the jobspec and client internals for `identity.change_mode`: allowing users to signal or restart tasks when their identity tokens get rotated.
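For context, the jobspec side of this looks roughly like the following. This is a hedged sketch of the `identity` block syntax the PR enables; the surrounding fields are my assumptions, not copied from the diff:

```hcl
task "api" {
  identity {
    name = "example"
    aud  = ["example.io"]
    ttl  = "1h"

    env  = true
    file = true

    # New in this PR: react when the token is rotated.
    change_mode   = "signal"
    change_signal = "SIGHUP"
  }
}
```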
Identity Watcher Firstrun

As part of this I also blocked `taskrunner/identityHook.Prestart(...)` from completing until tokens are initially written (or a timeout/context cancellation is hit). I think before this we technically had a race between `taskrunner/identityHook.Prestart(...)` writing initial tokens and the things that rely on those initial tokens starting up. Not sure if it would realistically be hit, but it seemed like a good idea to copy our `template` implementation of waiting for a first run.

Let me know if you think the 1 minute timeout should be changed or removed.
`template` blocks indefinitely, but it has a lot more concerns than `identity`. I really couldn't think of a good reason to have the timer or not, and was just guided by my desire to avoid dead/live-locking tasks without a good reason. If a task starts and immediately (well... after the timeout) fails due to a missing `identity`... that seems preferable to blocking forever? At least the alloc could eventually be rescheduled elsewhere that might have a better chance to succeed?

Empty String Default
I treat `""` and `"noop"` the same without actually Canonicalizing `"" -> "noop"` like other `change_mode`s do. I did this for 2 not weak reasons:

1. `change_mode` …
2. If folks end up with >1 TTL'd renewing identities per task, it's going to be hard to manage `change_mode` well. If you have 2 identities with the same TTL they will cause 2 restarts! Even worse: if you noop one, there's no telling if the other one will signal/restart right before or right after this one due to the jitter added. Add in `change_mode`s from other blocks and this could be a big headache.

Due to #2 I'd almost rather folks just use `file = true; env = false` and reload the token file when they get a 403 from their upstream service. 🤷 Us barraging their task with signals is probably more error prone than that.
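In code, treating `""` and `"noop"` identically without canonicalizing can be a matter of matching both in the same switch case. A minimal sketch of the idea (the helper names are hypothetical, not this PR's actual functions):

```go
package main

import "fmt"

// Hypothetical helpers standing in for taskrunner plumbing.
func signalTask(sig string) error { fmt.Println("signaling task with", sig); return nil }
func restartTask() error          { fmt.Println("restarting task"); return nil }

// onTokenRotated reacts to a rotated identity token based on change_mode,
// treating "" and "noop" identically without rewriting the jobspec.
func onTokenRotated(changeMode, changeSignal string) error {
	switch changeMode {
	case "", "noop":
		// Nothing to do; consumers re-read the env var or token file.
		return nil
	case "signal":
		return signalTask(changeSignal)
	case "restart":
		return restartTask()
	default:
		return fmt.Errorf("invalid change_mode: %q", changeMode)
	}
}

func main() {
	_ = onTokenRotated("", "") // same as "noop"
	_ = onTokenRotated("signal", "SIGHUP")
}
```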