identity: Implement change_mode
#18943
Conversation
I've left a few notes but otherwise LGTM
```go
case <-deadlineTimer.C:
    h.logger.Warn("timed out waiting for initial identity tokens to be fetched",
        "num_fetched", i, "num_total", len(h.task.Identities))
    return nil
```
The group-level `identity_hook` will trigger the WIDMgr and get the identities synchronously before we ever run this hook. So this deadline is only waiting on the WIDMgr broadcasting the notification that it got the signed identities. If we're waiting for over a minute for a send on a channel, are we likely to ever get them? In which case, maybe we should return an error here so that the user sees this error rather than getting an error from downstream consumers?

This hook is early in the taskrunner, so returning an error early will also prevent us from doing much more expensive setup work only to throw it away (e.g. the `artifact` and `volume` hooks won't get a chance to run).
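Concretely, that suggestion would turn the warn-and-continue branch into a hard failure. A minimal sketch under assumed names (`waitForFirstRuns`, `numIdentities`, and the error text are mine, not the PR's actual code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForFirstRuns blocks until every identity watcher has ticked
// firstRunCh once, and fails on deadline or context cancellation
// instead of silently returning nil.
func waitForFirstRuns(ctx context.Context, firstRunCh <-chan struct{}, numIdentities int, deadline time.Duration) error {
	deadlineTimer := time.NewTimer(deadline)
	defer deadlineTimer.Stop()

	for i := 0; i < numIdentities; i++ {
		select {
		case <-firstRunCh:
			// Identity fetched; keep waiting for the rest.
		case <-deadlineTimer.C:
			// Fail Prestart so the user sees this error rather than an
			// error from a downstream consumer of the missing tokens.
			return fmt.Errorf("timed out waiting for initial identity tokens: fetched %d of %d",
				i, numIdentities)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	firstRunCh := make(chan struct{}, 2)
	firstRunCh <- struct{}{}
	firstRunCh <- struct{}{}

	err := waitForFirstRuns(context.Background(), firstRunCh, 2, time.Minute)
	fmt.Println("err:", err) // err: <nil>
}
```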
(That being said, I wouldn't block this PR on this.)
Ah, that makes a ton more sense. So the only threat here is a Client so busy it can't pluck some structs off a chan and throw them on disk (if file=true) in time. No racing the network or remote machines involved.

When nodes are rebooted they can be quite busy, so I am happy to have this race fixed. 1 minute does seem like an eternity though, so perhaps short-circuiting with an error is better than making a best effort by letting it trundle on. 🤔

It ought to be sufficiently unlikely not to matter, so I don't think I'm going to write up an issue until someone hits it? 🤔
```go
// Wait until every watcher ticks the first run chan
for i := range h.task.Identities {
    select {
    case <-firstRunCh:
        // Identity fetched, loop
    case <-deadlineTimer.C:
        h.logger.Warn("timed out waiting for initial identity tokens to be fetched",
            "num_fetched", i, "num_total", len(h.task.Identities))
        return nil
    }
}
```
kind of makes me think there should be a context-aware version of WaitGroup
100% what I needed (or a WaitGroup where Wait() returns a chan)
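That pattern is small enough to sketch generically: drain the WaitGroup in a goroutine and close a channel, so Wait becomes select-able against a context or timer (generic Go, not code from this PR):

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// waitCh makes sync.WaitGroup usable in a select by closing a channel
// once Wait returns.
func waitCh(wg *sync.WaitGroup) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	return done
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); time.Sleep(10 * time.Millisecond) }()
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	select {
	case <-waitCh(&wg):
		fmt.Println("all first runs complete")
	case <-ctx.Done():
		fmt.Println("gave up waiting:", ctx.Err())
	}
}
```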
Identity change mode was implemented in #18943 and handles the update at the task level, so the workload identity manager receives the update as expected.
This implements the jobspec and client internals for `identity.change_mode`: allowing users to signal or restart tasks when their identity tokens get rotated.
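For context, the jobspec side of this looks roughly like the following. This is a hedged sketch of the `identity` block syntax the PR enables; the surrounding fields are my assumptions, not copied from the diff:

```hcl
task "api" {
  identity {
    name = "example"
    aud  = ["example.io"]
    ttl  = "1h"

    env  = true
    file = true

    # New in this PR: react when the token is rotated.
    change_mode   = "signal"
    change_signal = "SIGHUP"
  }
}
```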
Identity Watcher Firstrun

As part of this I also blocked `taskrunner/identityHook.Prestart(...)` from completing until tokens are initially written (or a timeout/context cancellation is hit). I think before this we technically had a race between `taskrunner/identityHook.Prestart(...)` writing initial tokens and the things that rely on those initial tokens starting up. Not sure if it would realistically be hit, but it seemed like a good idea to copy our `template` implementation of waiting for a first run.

Let me know if you think the 1 minute timeout should be changed or removed.
`template` blocks indefinitely, but it has a lot more concerns than `identity`. I really couldn't think of a good reason to have the timer or not, and was just guided by my desire to avoid dead/live-locking tasks without a good reason. If a task starts and immediately (well... after the timeout) fails due to a missing `identity`... that seems preferable to blocking forever? At least the alloc could eventually be rescheduled elsewhere that might have a better chance to succeed?

Empty String Default
I treat `""` and `"noop"` the same without actually Canonicalizing `"" -> "noop"` like other `change_mode`s do. I did this for 2 not weak reasons:

1. `change_mode` …
2. If folks end up with >1 TTL'd renewing identities per task, it's going to be hard to manage `change_mode` well. If you have 2 identities with the same TTL they will cause 2 restarts! Even worse: if you noop one, there's no telling if the other one will signal/restart right before or right after this one due to the jitter added. Add in `change_mode`s from other blocks and this could be a big headache.

Due to #2 I'd almost rather folks just use `file = true; env = false` and reload the token file when they get a 403 from their upstream service. 🤷 Us barraging their task with signals is probably more error prone than that.
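In code, treating `""` and `"noop"` identically without canonicalizing can be a matter of matching both in the same switch case. A minimal sketch of the idea (the helper names are hypothetical, not this PR's actual functions):

```go
package main

import "fmt"

// Hypothetical helpers standing in for taskrunner plumbing.
func signalTask(sig string) error { fmt.Println("signaling task with", sig); return nil }
func restartTask() error          { fmt.Println("restarting task"); return nil }

// onTokenRotated reacts to a rotated identity token based on change_mode,
// treating "" and "noop" identically without rewriting the jobspec.
func onTokenRotated(changeMode, changeSignal string) error {
	switch changeMode {
	case "", "noop":
		// Nothing to do; consumers re-read the env var or token file.
		return nil
	case "signal":
		return signalTask(changeSignal)
	case "restart":
		return restartTask()
	default:
		return fmt.Errorf("invalid change_mode: %q", changeMode)
	}
}

func main() {
	_ = onTokenRotated("", "") // same as "noop"
	_ = onTokenRotated("signal", "SIGHUP")
}
```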