identity: Implement change_mode #18943

Merged
merged 10 commits into from
Nov 1, 2023

schmichael
Member

@schmichael schmichael commented Oct 31, 2023

This implements the jobspec and client internals for identity.change_mode, allowing users to signal or restart tasks when their identity tokens get rotated.

Identity Watcher Firstrun

As part of this I also blocked taskrunner/identityHook.Prestart(...) from completing until the tokens are initially written (or a timeout or context cancellation is hit). I think we previously had a technical race between taskrunner/identityHook.Prestart(...) writing the initial tokens and the things that rely on those tokens starting up. I'm not sure it would realistically be hit, but it seemed like a good idea to copy our template implementation of waiting for a first run.

Let me know if you think the 1 minute timeout should be changed or removed. template blocks indefinitely, but it has a lot more concerns than identity. I really couldn't think of a good reason to have the timer or not and was just guided by my desire to avoid dead/live-locking tasks without a good reason. If a task starts and immediately (well... after the timeout) fails due to a missing identity ... that seems preferable to blocking forever? At least the alloc could eventually be rescheduled elsewhere that might have a better chance to succeed?

Empty String Default

I treat "" and "noop" the same without actually Canonicalizing "" -> "noop" like other change_modes do. I did this for 2 not weak reasons:

  1. Why waste the bytes on the word "noop" when just omitting the entire field communicates the same thing?
  2. We need to default to a coordinated change_mode. If folks end up with >1 TTL'd renewing identities per task, it's going to be hard to manage change_mode well. If you have 2 identities with the same TTL they will cause 2 restarts! Even worse: if you noop one, there's no telling if the other one will signal/restart right before or right after this one due to the jitter added. Add in change_modes from other blocks and this could be a big headache.
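A minimal sketch of treating "" and "noop" as equivalent at the point of use, rather than canonicalizing the stored value. The constant and function names are assumptions for illustration and may not match the identifiers in the Nomad codebase.

```go
package main

// Hypothetical constants; the real Nomad identifiers may differ.
const (
	changeModeNoop    = "noop"
	changeModeSignal  = "signal"
	changeModeRestart = "restart"
)

// isNoop reports whether a change_mode value means "do nothing when the
// token rotates". The empty string is accepted as-is, so a jobspec that
// omits the field never has "noop" written back into it.
func isNoop(mode string) bool {
	return mode == "" || mode == changeModeNoop
}
```

Code that reacts to a rotation would check `isNoop(mode)` first and only then switch on the signal and restart cases.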

Due to reason 2 I'd almost rather folks just use file = true; env = false and reload the token file when they get a 403 from their upstream service. 🤷 Us barraging their task with signals is probably more error prone than that.

@schmichael schmichael added this to the 1.7.0 milestone Oct 31, 2023
Member

@tgross tgross left a comment

I've left a few notes but otherwise LGTM

Comment on lines +98 to +101
    case <-deadlineTimer.C:
        h.logger.Warn("timed out waiting for initial identity tokens to be fetched",
            "num_fetched", i, "num_total", len(h.task.Identities))
        return nil
Member

The group-level identity_hook will trigger the WIDMgr and get the identities synchronously before we ever run this hook. So this deadline is only waiting on the WIDMgr broadcasting the notification that it got the signed identities. If we're waiting for over a minute for a send on a channel, are we likely to ever get them? In which case, maybe we should return an error here so that the user sees this error rather than getting an error from downstream consumers?

This hook is early in the taskrunner so returning an error early will also prevent us from doing much more expensive setup work only to throw it away (ex. artifact and volume hooks won't get a chance to run).

Member

(That being said, I wouldn't block this PR on this.)

Member Author

Ah, that makes a ton more sense. So the only threat here is a Client so busy it can't pluck some structs off a chan and throw them on disk (if file=true) in time. No racing the network or remote machines involved.

When nodes are rebooted they can be quite busy so I am happy to have this race fixed. 1 minute does seem like an eternity though, so perhaps short-circuiting with an error is better than making a best-effort by letting it trundle on. 🤔

It ought to be sufficiently unlikely to not matter, so I don't think I'm going to write up an issue until someone hits it? 🤔

contributing/checklist-jobspec.md (outdated, resolved)
command/agent/job_endpoint.go (outdated, resolved)
Comment on lines +93 to +98
    // Wait until every watcher ticks the first run chan
    for i := range h.task.Identities {
        select {
        case <-firstRunCh:
            // Identity fetched, loop
        case <-deadlineTimer.C:
Member

kind of makes me think there should be a context-aware version of WaitGroup

Member Author

@schmichael schmichael Nov 1, 2023

100% what I needed (or a WaitGroup where Wait() returns a chan)

@shoenig shoenig merged commit e49ca3c into main Nov 1, 2023
27 of 28 checks passed
@shoenig shoenig deleted the f-wi-changemode branch November 1, 2023 14:41
lgfa29 added a commit that referenced this pull request Nov 9, 2023
Identity change mode was implemented in #18943 and handles the update at
the task level, so workload identity manager receives the update as
expected.
3 participants