Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keyring in raft #23977

Merged
merged 13 commits into from
Sep 19, 2024
Merged

keyring in raft #23977

merged 13 commits into from
Sep 19, 2024

Conversation

tgross
Copy link
Member

@tgross tgross commented Sep 17, 2024

In Nomad 1.4, we implemented a root keyring to support encrypting Variables and signing Workload Identities. The keyring was originally stored with the AEAD-wrapped DEKs and the KEK together in a JSON keystore file on disk. We recently added support for using an external KMS for the KEK to improve the security model for the keyring. But we've encountered multiple instances of the keystore files not getting backed up separately from the Raft snapshot, resulting in failure to restore clusters from backup.

Move Nomad's root keyring into Raft (encrypted with a KMS/Vault where available) in order to eliminate operational problems with the separate on-disk keystore.

Fixes: #23665
Ref: https://hashicorp.atlassian.net/browse/NET-10523
Ref: https://go.hashi.co/rfc/nmd-203


Notes for reviewers:

  • I wanted to split this out into a couple of PRs but because we're replacing the existing state store table rather than having it side-by-side, there wasn't a reasonable subset that actually compiled doing it that way.
  • Documentation updates will be in a separate PR
  • Tooling updates for redacting key material will be in a separate PR

@@ -127,10 +127,13 @@ const (
ACLBindingRulesDeleteRequestType MessageType = 58
NodePoolUpsertRequestType MessageType = 59
NodePoolDeleteRequestType MessageType = 60
JobVersionTagRequestType MessageType = 61
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phil has this message ID reserved in another PR, so I'm trying to avoid creating a conflict for him here 😁


// WithBackoffFunc is a helper that runs a function with geometric backoff + a
// small jitter to a maximum backoff. It returns once the context closes, with
// the error wrapping over the error from the function.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewers: this logic was lifted from the csi_hook, which is reasonably battle-tested at this point. Once we merge this I'll go back and update the csi_hook to use this function.

@tgross
Copy link
Member Author

tgross commented Sep 17, 2024

Looks like I've got a failing plan applier test... will look into that ASAP Should be fixed; we weren't waiting for the keyring initialization.

Landlock test failure is unrelated and fixed here: #23979 That's merged. Rebased on main.

schmichael

This comment was marked as resolved.

nomad/leader.go Outdated Show resolved Hide resolved
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few errors we need to handle, but otherwise all of my comments are ignorable hand wringing to try to make the keyring as understandable to users and developers as possible.

Very excited to see this happening!

nomad/encrypter.go Outdated Show resolved Hide resolved
nomad/encrypter.go Show resolved Hide resolved
nomad/structs/keyring.go Outdated Show resolved Hide resolved
nomad/structs/keyring.go Show resolved Hide resolved
nomad/leader.go Outdated Show resolved Hide resolved
nomad/leader.go Outdated Show resolved Hide resolved
nomad/plan_apply_test.go Show resolved Hide resolved
Comment on lines +556 to +557
// Block until keys are decrypted
s.encrypter.IsReady(s.shutdownCtx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so in the KMS case we block starting the RPC server until we've contacted at least 1 KMS? Will there be a log line to indicate this is what's happening? No RPCs means no observability like remote log streaming or goroutine dumps, so I want to make sure a KMS misconfiguration (like a wrong IP or port that a defensive firewall silently drops packets to) is observable from Nomad agent logs.

Copy link
Member Author

@tgross tgross Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so in the KMS case we block starting the RPC server until we've contacted at least 1 KMS?

The RPC server starts before we get here. So we can receive RPCs (which will fail if they need to VerifyClaim), but we won't return from NewServer and won't register the agent with Consul as a result. We could conceivably move this up higher, but I didn't think it made sense to have it land before we start the legacy keyring replication.

Will there be a log line to indicate this is what's happening?

Yeah it'll actually be extremely noisy in the logs with the line keyring is not ready - waiting for keys: <list of IDs> each time it polls. I didn't want to make it the sort of thing that someone can miss a single log message and then not realize it's going to blow up later.

@tgross tgross merged commit 44f4970 into main Sep 19, 2024
27 checks passed
@tgross tgross deleted the keyring-in-raft branch September 19, 2024 17:56
@tgross tgross mentioned this pull request Sep 19, 2024
tgross added a commit that referenced this pull request Sep 20, 2024
In #23977 we merged a change to how the keyring was stored. Because keyring
initialization takes slightly longer now, this uncovered existing timing bugs in
some of our tests where tests that require the keyring (ex. plan applier tests)
were waiting for the leader but not the keyring initialization. Fix some of the
examples we've seen cause test flakes.
tgross added a commit that referenced this pull request Sep 20, 2024
In #23977 we merged a change to how the keyring was stored. Because keyring
initialization takes slightly longer now, this uncovered existing timing bugs in
some of our tests where tests that require the keyring (ex. plan applier tests)
were waiting for the leader but not the keyring initialization. Fix some of the
examples we've seen cause test flakes.
tgross added a commit that referenced this pull request Sep 20, 2024
In #23977 we moved the keyring into Raft. This changeset documents the
operational changes and adds notes to the upgrade guide.
tgross added a commit that referenced this pull request Sep 20, 2024
In #23977 we moved the keyring into Raft, which can expose key material in Raft
snapshots when using the less-secure AEAD keyring instead of KMS. This changeset
adds tools for redacting this material from snapshots:

* The `operator snapshot state` command gains the ability to display key
  metadata (only), which respects the `-filter` option.
* The `operator snapshot save` command gains a `-redact` option that removes key
  material from the snapshot after it's downloaded.
* A new `operator snapshot redact` command allows removing key material from an
  existing snapshot.
tgross added a commit that referenced this pull request Sep 20, 2024
In #23977 we moved the keyring into Raft, which can expose key material in Raft
snapshots when using the less-secure AEAD keyring instead of KMS. This changeset
adds tools for redacting this material from snapshots:

* The `operator snapshot state` command gains the ability to display key
  metadata (only), which respects the `-filter` option.
* The `operator snapshot save` command gains a `-redact` option that removes key
  material from the snapshot after it's downloaded.
* A new `operator snapshot redact` command allows removing key material from an
  existing snapshot.
tgross added a commit that referenced this pull request Sep 25, 2024
In #23977 we moved the keyring into Raft. This changeset documents the
operational changes and adds notes to the upgrade guide.
tgross added a commit that referenced this pull request Oct 3, 2024
In #23977 we merged a change to how the keyring was stored. Because keyring
initialization takes slightly longer now, this uncovered existing timing bugs in
some of our tests where tests that require the keyring (ex. plan applier tests)
were waiting for the leader but not the keyring initialization. Fix another
example we've seen causing test flakes.
tgross added a commit that referenced this pull request Oct 3, 2024
In #23977 we merged a change to how the keyring was stored. Because keyring
initialization takes slightly longer now, this uncovered existing timing bugs in
some of our tests where tests that require the keyring (ex. plan applier tests)
were waiting for the leader but not the keyring initialization. Fix another
example we've seen causing test flakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Store keyring in Raft
3 participants