keyring: safely handle missing keys and restore GC #15092
Conversation
When replication of a single key fails, the replication loop breaks early, so keys that fall later in the sort order never get replicated. Log the error and continue instead. Refactor the replication loop so that each key is replicated in a function call that returns an error, to make the workflow clearer and reduce nesting.
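For illustration, a minimal sketch of the log-and-continue, one-function-per-key shape described above. The names here (`replicator`, `replicateAll`, `replicateKey`, the `fetchKey` callback) are placeholders, not Nomad's actual keyring replicator API:

```go
package keyring

import (
	"fmt"

	"github.com/hashicorp/go-hclog"
)

// RootKeyMeta is a stand-in for Nomad's root key metadata; only the field the
// sketch needs is shown.
type RootKeyMeta struct {
	KeyID string
}

// replicator is a hypothetical stand-in for the server-side keyring replicator.
type replicator struct {
	logger   hclog.Logger
	fetchKey func(keyID string) error // placeholder for the RPC fetch + keystore write
}

// replicateAll visits every key and replicates it in its own call, so a single
// failure is logged and skipped rather than breaking the loop and starving the
// keys that sort after it.
func (r *replicator) replicateAll(keys []*RootKeyMeta) {
	for _, meta := range keys {
		if err := r.replicateKey(meta); err != nil {
			r.logger.Error("failed to replicate key", "key_id", meta.KeyID, "error", err)
			continue
		}
	}
}

// replicateKey handles exactly one key and returns an error, keeping the
// caller flat instead of nesting retry/skip logic inside the loop body.
func (r *replicator) replicateKey(meta *RootKeyMeta) error {
	if err := r.fetchKey(meta.KeyID); err != nil {
		return fmt.Errorf("failed to fetch key %q: %w", meta.KeyID, err)
	}
	return nil
}
```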
We no longer block leadership on initializing the keyring, so there's a race condition in the keyring tests where we can test for the existence of the root key before the keyring has been initialized. Change this to an "eventually" test.
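And a sketch of the "eventually" pattern for that test race, assuming testify's `require.Eventually` is available in the test suite; the accessor `activeRootKeyID` is hypothetical and the real test may use a different helper:

```go
package keyring_test

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// activeRootKeyID is a hypothetical accessor standing in for however the real
// test reads the active root key from the server's state store.
func activeRootKeyID() string { return "" }

func TestKeyring_ActiveKeyEventuallyExists(t *testing.T) {
	// Leadership no longer blocks on keyring initialization, so the root key
	// may not exist yet when the test starts; poll rather than asserting once.
	require.Eventually(t, func() bool {
		return activeRootKeyID() != ""
	}, 10*time.Second, 100*time.Millisecond,
		"expected the keyring to be initialized with an active root key")
}
```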
nomad/state/schema.go
Indexer: &memdb.UUIDFieldIndex{
	Field: "SigningKeyID",
},
Note: An alternative to storing this field in the Allocation struct would be to extract it from the signed identity claim in an indexer. I got that to work, but it brought all the JWT parsing down here into the state store and we'd be parsing JWTs every time we restore a snapshot, which seemed terrible. The key ID isn't a lot of data (just a UUID string), and having it on the Allocation will be convenient for later use anyways, I'm sure.
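For context, a sketch of how the stored key ID can be indexed with a plain field index in a go-memdb table schema; the index name and the `AllowMissing` setting are illustrative, not necessarily what the PR ships:

```go
package state

import memdb "github.com/hashicorp/go-memdb"

// allocSigningKeyIndex sketches an allocation-table index over the persisted
// SigningKeyID field. Because the key ID lives on the Allocation itself, the
// indexer is a plain UUID field index and no JWT parsing happens in the state
// store or during snapshot restore.
var allocSigningKeyIndex = &memdb.IndexSchema{
	Name:         "signing_key", // illustrative name
	AllowMissing: true,          // older allocations carry no signed identity
	Unique:       false,
	Indexer: &memdb.UUIDFieldIndex{
		Field: "SigningKeyID",
	},
}
```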
nomad/core_sched.go
// don't GC a key that's deprecated, that's encrypted a variable or
// signed a live workload identity
inUse := false
if !keyMeta.Deprecated() {
This looks like it will reach the delete case if keyMeta.Deprecated() returns true, which doesn't seem to jibe with the comment.
Dropping from our sidebar discussion on Slack:
I think I've figured out why it exists, and we don't need it anymore:
In the original design, we didn't have a way of determining if the key was in use by a WI except by guessing (incorrectly, as it turned out) based on the indexes. So we marked it as deprecated b/c the GC that actually removed it could happen much later and having a status that let the user know we've done a rekey on it was maybe helpful for diagnostics.
In the new design, we can just not touch the key at all in the rekey pass. It'll already be marked inactive because the key was rotated. The next periodic GC will delete the key if it's not in use. Much simpler.
I've done that and I think this looks pretty good?
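A sketch of the simplified GC pass described above, with hypothetical types and callbacks (`rootKeyMeta`, `keyInUse`, `deleteKey`) standing in for the real core scheduler and state store:

```go
package core

import "time"

// rootKeyMeta is a stand-in for Nomad's key metadata; only what the sketch
// needs is shown.
type rootKeyMeta struct {
	KeyID      string
	Active     bool
	CreateTime time.Time
}

// rootKeyGC sketches the periodic GC pass: the rekey pass no longer marks keys
// "deprecated"; rotation already leaves old keys inactive, and GC deletes an
// inactive key only once nothing references it.
func rootKeyGC(keys []*rootKeyMeta, threshold time.Duration,
	keyInUse func(keyID string) (bool, error),
	deleteKey func(keyID string) error,
) error {
	for _, key := range keys {
		if key.Active {
			continue // never GC the active key
		}
		if time.Since(key.CreateTime) < threshold {
			continue // respect the GC threshold
		}
		inUse, err := keyInUse(key.KeyID)
		if err != nil {
			return err
		}
		if inUse {
			// the key still signed a live workload identity or encrypted a
			// variable; leave it for a later pass
			continue
		}
		if err := deleteKey(key.KeyID); err != nil {
			return err
		}
	}
	return nil
}
```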
Can you remove CoreScheduler.getOldestAllocationIndex? It doesn't appear to be used anymore. FWIW I have seen clusters with running allocs whose last modify time/index was over a year ago (which is absolutely fine and not necessarily a bug/problem), but it means basing logic purely off of that may mean the index very, very rarely changes.
// IsRootKeyMetaInUse determines whether a key has been used to sign a workload
// identity for a live allocation or encrypt any variables
func (s *StateStore) IsRootKeyMetaInUse(keyID string) (bool, error) {
This made me a little nervous at first glance because it could trend toward quadratic over time if there's at-least-one alloc for every key that's ever existed...
...however at the default key rotation rate of 30 days even in the maximally pathological case of 1-alloc-for-every-key-ever, that's a whopping 120 iterations over all allocations after a decade.
If we ever become concerned with the performance here we can always add ref counting to keys and update it when allocs are created or updated (or even just wait to decrement when the allocs are gc'd if we want to be really lazy). I'm guessing we'll never need to do that.
Making a compound index that only includes non-terminal allocs would be an easier optimization if not quite as effective.
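A sketch of what the in-use check can look like against a signing-key index; the table and index names, the alloc stub, and the single `ClientTerminal` bool are simplifications of the real state store:

```go
package state

import memdb "github.com/hashicorp/go-memdb"

// liveAlloc is a minimal stand-in for structs.Allocation.
type liveAlloc struct {
	ID             string
	SigningKeyID   string
	ClientTerminal bool
}

// isRootKeyInUse walks the allocations indexed by signing key ID and reports
// whether any of them is still live. If the index only admits non-terminal
// allocs, the loop reduces to "is the iterator non-empty".
func isRootKeyInUse(txn *memdb.Txn, keyID string) (bool, error) {
	iter, err := txn.Get("allocs", "signing_key", keyID) // names are illustrative
	if err != nil {
		return false, err
	}
	for raw := iter.Next(); raw != nil; raw = iter.Next() {
		if alloc := raw.(*liveAlloc); !alloc.ClientTerminal {
			return true, nil
		}
	}
	// a full implementation would also check whether the key encrypted any
	// variables before reporting the key as unused
	return false, nil
}
```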
Yeah, there are a lot of cases where I want a cheap Count() method in memdb. 😀 I can do that compound index though: it reduces the memory usage for the index a bit for clusters with lots of short-lived allocs, and reduces the number of iterations we have to do here at a tiny cost on txn.Insert (which is already doing expensive reflection stuff over the fields anyways). Seems worth it.
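A sketch of how such a compound index could be expressed with go-memdb's CompoundIndex plus a ConditionalIndex term. Note that terminal allocs are still indexed (under the conditional's false value), so the win is on lookup, which can target only the live entries; type and field names are illustrative:

```go
package state

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// indexedAlloc stands in for structs.Allocation; Terminal stands in for
// whatever liveness predicate the real code uses.
type indexedAlloc struct {
	SigningKeyID string
	Terminal     bool
}

// signingKeyIndex pairs the signing key ID with a "still live?" term, so the
// in-use check can look up only the live entries for a given key instead of
// filtering terminal allocs in the loop.
var signingKeyIndex = &memdb.IndexSchema{
	Name:         "signing_key", // illustrative name
	AllowMissing: true,          // older allocations carry no signed identity
	Unique:       false,
	Indexer: &memdb.CompoundIndex{
		Indexes: []memdb.Indexer{
			&memdb.UUIDFieldIndex{Field: "SigningKeyID"},
			&memdb.ConditionalIndex{
				Conditional: func(obj interface{}) (bool, error) {
					alloc, ok := obj.(*indexedAlloc)
					if !ok {
						return false, fmt.Errorf("unexpected type %T", obj)
					}
					return !alloc.Terminal, nil
				},
			},
		},
	},
}
```

The in-use check would then query something like `txn.Get("allocs", "signing_key", keyID, true)` to iterate only the live allocations signed by that key.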
Done! (Struck out the note about memory b/c I'd forgotten how conditional indexes work)
Fixes #14981 #15088
When replication of a single key fails, the replication loop breaks early, so keys that fall later in the sort order never get replicated. This is particularly a problem for clusters that were impacted by the bug behind #14981 and later upgraded: the keys that were never replicated can now never be replicated, so we need to handle them safely.
Included in the replication fix: the loop now logs per-key failures and continues instead of breaking early, each key is replicated in its own function call that returns an error, and the keyring tests now poll ("eventually") for the root key instead of racing keyring initialization.
But these fixes aren't enough to fix #14981 because they'll end up seeing an error once a second complaining about the missing key, so we also need to fix keyring GC so the keys can be removed from the state store. Now we'll store the key ID used to sign a workload identity in the Allocation, and we'll index the Allocation table on that so we can track whether any live Allocation was signed with a particular key ID.
(Best reviewed commit-by-commit.)