keyring: fixes for keyring replication on cluster join #14987

tgross · 2022-10-20T16:00:09Z

Fixes #14981. Note to reviewers: these are definitely bugs but I don't have 100% confidence we've solved the problem because there's a log line missing in the reports from both the issue author and our internal user that I would expect to see. So I'm still trying to repro exactly, but it's worth getting some early eyes on this PR anyways.

Don't unblock early if rate limit burst exceeded. The rate limiter returns an error and unblocks early if its burst limit is exceeded (unless the rate limit is Inf). Ensure we're not unblocking early, otherwise we'll only slow down the cases where we're already pausing to make external RPC requests.
Set MinQueryIndex on stale queries. When keyring replication makes a stale query to non-leader peers to find a key the leader doesn't have, we need to make sure the peer we're querying has had a chance to catch up to the most current index for that key. Otherwise it's possible for newly-added servers to query another newly-added server and get a non-error nil response for that key ID.
Note that the "not found" case does not return an error, just an empty key. Update the handling of empty responses so that we don't break the loop early if we hit a server that doesn't have the key. (Peers aren't shuffled so we'd expect to hit the same server repeatedly.)
Move the keyring initialize step to wait until we're sure the FSM is current.
If a key is rotated immediately following a leader election, plans that are in-flight may get signed before the new leader has the key. Allow for a short timeout-and-retry to avoid rejecting plans.

tgross · 2022-10-20T17:52:44Z

Don't unblock early if rate limit burst exceeded. The rate limiter returns an error and unblocks early if its burst limit is exceeded (unless the rate limit is Inf).

Aside: there are a couple of other existing cases of this in leader.go which I want to fix as well, but I don't want to muddy up this PR with that.

tgross · 2022-10-20T17:55:39Z

https://app.circleci.com/pipelines/github/hashicorp/nomad/33354/workflows/dbbd75a4-25cc-4d3d-9205-a817f7d68159/jobs/376989 shows what might be a reproduction of the original problem. I'm investigating that. No, this is a bug in the new code 🤦 Fixed!

When keyring replication makes a stale query to non-leader peers to find a key the leader doesn't have, we need to make sure the peer we're querying has had a chance to catch up to the most current index for that key. Otherwise it's possible for newly-added servers to query another newly-added server and get a non-error nil response for that key ID. Ensure that we're setting the correct reply index in the blocking query. Note that the "not found" case does not return an error, just an empty key. So as a belt-and-suspenders measure, update the handling of empty responses so that we don't break the loop early if we hit a server that doesn't have the key.
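
A rough sketch of the peer loop described above, not the code in this PR. The `peer`, `rootKey`, `getKey`, and `fetchFromPeers` names are hypothetical stand-ins for the real RPC types, and a real blocking query would wait for the peer to reach the minimum index rather than return an error as this simulation does.

```go
package main

import (
	"errors"
	"fmt"
)

// rootKey and peer are hypothetical stand-ins for the real RPC types.
type rootKey struct{ ID string }

type peer struct {
	name  string
	index uint64 // highest index this peer has applied
	keys  map[string]*rootKey
}

var errNotCaughtUp = errors.New("peer has not reached the minimum index")

// getKey simulates a stale read that carries a minimum index. To stay short
// it returns an error when the peer is behind instead of blocking.
// "Not found" is a nil key with a nil error, as in the PR description.
func (p *peer) getKey(id string, minIndex uint64) (*rootKey, error) {
	if p.index < minIndex {
		return nil, errNotCaughtUp
	}
	return p.keys[id], nil
}

// fetchFromPeers tries each peer in order and returns the first non-nil key
// along with the name of the peer that served it.
func fetchFromPeers(peers []*peer, id string, minIndex uint64) (*rootKey, string) {
	for _, p := range peers {
		key, err := p.getKey(id, minIndex)
		if err != nil || key == nil {
			continue // an empty reply is not a final answer, try the next peer
		}
		return key, p.name
	}
	return nil, ""
}

func main() {
	peers := []*peer{
		{name: "new-1", index: 10, keys: map[string]*rootKey{}}, // caught up, lacks the key
		{name: "new-2", index: 3, keys: map[string]*rootKey{}},  // still behind
		{name: "old-1", index: 10, keys: map[string]*rootKey{"k1": {ID: "k1"}}},
	}
	key, from := fetchFromPeers(peers, "k1", 10)
	fmt.Println(key != nil, from)
}
```

The point of the `continue` is that a newly-added server answering first with an empty key no longer ends the search.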

Wait until we're sure the FSM is current before we try to initialize the keyring. Also, if a key is rotated immediately following a leader election, plans that are in-flight may get signed before the new leader has the key. Allow for a short timeout-and-retry to avoid rejecting plans.

shoenig

LGTM, just thinking about Limiters

shoenig · 2022-10-21T14:37:17Z

nomad/encrypter.go

 			// Rate limit how often we attempt replication
-			limiter.Wait(ctx)
+			err := limiter.Wait(ctx)
+			if err != nil {
+				goto ERR_WAIT // rate limit exceeded
+			}


Seems like if we have our own blocking mechanism, we should be using Allow instead? Otherwise we are kind of double-waiting, which is fine but awkward given the error. (no need to change anything now)

https://pkg.go.dev/golang.org/x/time/rate#Limiter

edit: actually no, Wait has the nice property of waiting a minimal amount of time, unblocking once a resource becomes free

Yeah this API is weirdly hard to use correctly and the error case is only for exceeding the burst. I first thought we probably wanted to use Reserve but that has the same double-waiting problem. Maybe we should just remove the burst (set it to Inf) so that we guarantee we'll wait and not have to handle this error so awkwardly?

I misread the API again 😊 . Only when the rate is set to rate.Inf is the burst-limit ignored. So this is good as-is, awkward as it seems.

tgross · 2022-10-21T16:32:12Z

I've got a clue that the core problem here is actually related to bugs in garbage collection, but I'm going to merge this in and push up a new PR with that fix once I've got that pinned down.

tgross · 2022-10-21T18:27:52Z

Oops, forgot the changelog entry. Will add that to #15009

zansity · 2022-10-24T15:05:37Z

Hello! Any idea when 1.4.2 will be released? One of my environments is currently unstable due to key rotations becoming "stuck", and would like to test the new fix. Thank you!

tgross · 2022-10-24T15:23:10Z

> Hello! Any idea when 1.4.2 will be released? One of my environments is currently unstable due to key rotations becoming "stuck", and would like to test the new fix. Thank you!

I can't commit to exact dates but I can tell you we're in the process of finalizing the release now, so it'll be "very soon."

schmichael · 2022-10-24T18:11:17Z

nomad/encrypter.go

+	getActiveKeyset := func() (*keyset, error) {
+		e.lock.RLock()
+		defer e.lock.RUnlock()
+		keyset, err := e.activeKeySetLocked()
+		return keyset, err
+	}


Sorry for the after-the-fact drive-by review:

Looks like we could replace activeKeySetLocked with this func as the only other place the key is read is Encrypt and it similarly only needs the lock for the duration of fetching the key.

The fact that neither reader of this field needs the lock other than for this field signals to me that the lock should be encapsulated in the getter and not the caller's concern.

Not a big deal though.

That's a good catch. activeKeySetLocked ends up holding onto the lock longer than it needs to as well -- it doesn't need to hold the lock while querying the state store. I'll open another quickie PR for that.

Follow-up PR: #15026
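
A small sketch of the suggestion in this thread, where the getter owns the lock and callers never touch it. The types and method names (`encrypter`, `keyset`, `activeKeyset`, `setActive`) are simplified and hypothetical. The real `activeKeySetLocked` also queries the state store, which this example leaves out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// keyset and encrypter are simplified, hypothetical versions of the real types.
type keyset struct{ id string }

type encrypter struct {
	lock   sync.RWMutex
	active *keyset
}

// activeKeyset holds the read lock only for the duration of the field read,
// so callers do not need to know the lock exists.
func (e *encrypter) activeKeyset() (*keyset, error) {
	e.lock.RLock()
	defer e.lock.RUnlock()
	if e.active == nil {
		return nil, errors.New("no active keyset")
	}
	return e.active, nil
}

// setActive swaps the active keyset under the write lock.
func (e *encrypter) setActive(ks *keyset) {
	e.lock.Lock()
	defer e.lock.Unlock()
	e.active = ks
}

func main() {
	e := &encrypter{}
	_, err := e.activeKeyset()
	fmt.Println(err)

	e.setActive(&keyset{id: "key-1"})
	ks, _ := e.activeKeyset()
	fmt.Println(ks.id)
}
```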

github-actions · 2023-02-22T02:16:58Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross mentioned this pull request Oct 20, 2022

failed to submit plan for evaluation: ... no such key \"<snip>\" in keyring error after moving cluster to 1.4.1 #14981

Closed

tgross added this to the 1.4.2 milestone Oct 20, 2022

tgross added theme/keyring type/bug labels Oct 20, 2022

tgross changed the title ~~keyring: don't unblock early if rate limit burst exceeded~~ keyring: fixes for keyring replication on cluster join Oct 20, 2022

tgross requested review from angrycub, schmichael and shoenig October 20, 2022 17:21

tgross force-pushed the b-keyring-replication-limit branch from 12902d4 to e8a0ec1 Compare October 20, 2022 19:00

test for adding new servers to keyring

1c31df7

tgross force-pushed the b-keyring-replication-limit branch from ed7bafb to 45f7352 Compare October 21, 2022 14:06

tgross marked this pull request as ready for review October 21, 2022 14:06

shoenig approved these changes Oct 21, 2022

tgross merged commit 5732eb2 into main Oct 21, 2022

tgross deleted the b-keyring-replication-limit branch October 21, 2022 16:33

tgross added the backport/1.4.x backport to 1.4.x release line label Oct 21, 2022

hc-github-team-nomad-core mentioned this pull request Oct 21, 2022

Backport of keyring: fixes for keyring replication on cluster join into release/1.4.x #15010

Merged

tgross mentioned this pull request Oct 21, 2022

keyring: fix missing GC config, don't rotate on manual GC #15009

Merged

schmichael reviewed Oct 24, 2022

github-actions bot locked as resolved and limited conversation to collaborators Feb 22, 2023
