-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
keyring in raft #23977
keyring in raft #23977
Conversation
@@ -127,10 +127,13 @@ const ( | |||
ACLBindingRulesDeleteRequestType MessageType = 58 | |||
NodePoolUpsertRequestType MessageType = 59 | |||
NodePoolDeleteRequestType MessageType = 60 | |||
JobVersionTagRequestType MessageType = 61 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phil has this message ID reserved in another PR, so I'm trying to avoid creating a conflict for him here 😁
|
||
// WithBackoffFunc is a helper that runs a function with geometric backoff + a | ||
// small jitter to a maximum backoff. It returns once the context closes, with | ||
// the error wrapping over the error from the function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for reviewers: this logic was lifted from the csi_hook
, which is reasonably battle-tested at this point. Once we merge this I'll go back and update the csi_hook
to use this function.
428667f
to
0cac402
Compare
0cac402
to
94c16a3
Compare
|
94c16a3
to
bfab122
Compare
bfab122
to
8e1f37d
Compare
5c73c22
to
4959f20
Compare
4959f20
to
602ffbc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a few errors we need to handle, but otherwise all of my comments are ignorable hand wringing to try to make the keyring as understandable to users and developers as possible.
Very excited to see this happening!
// Block until keys are decrypted | ||
s.encrypter.IsReady(s.shutdownCtx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, so in the KMS case we block starting the RPC server until we've contacted at least 1 KMS? Will there be a log line to indicate this is what's happening? No RPCs means no observability like remote log streaming or goroutine dumps, so I want to make sure a KMS misconfiguration (like a wrong IP or port that a defensive firewall silently drops packets to) is observable from Nomad agent logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, so in the KMS case we block starting the RPC server until we've contacted at least 1 KMS?
The RPC server starts before we get here. So we can receive RPCs (which will fail if they need to VerifyClaim
), but we won't return from NewServer
and won't register the agent with Consul as a result. We could conceivably move this up higher, but I didn't think it made sense to have it land before we start the legacy keyring replication.
Will there be a log line to indicate this is what's happening?
Yeah it'll actually be extremely noisy in the logs with the line keyring is not ready - waiting for keys: <list of IDs>
each time it polls. I didn't want to make it the sort of thing that someone can miss a single log message and then not realize it's going to blow up later.
18fbc57
to
1e781c2
Compare
In #23977 we merged a change to how the keyring was stored. Because keyring initialization takes slightly longer now, this uncovered existing timing bugs in some of our tests where tests that require the keyring (ex. plan applier tests) were waiting for the leader but not the keyring initialization. Fix some of the examples we've seen cause test flakes.
In #23977 we merged a change to how the keyring was stored. Because keyring initialization takes slightly longer now, this uncovered existing timing bugs in some of our tests where tests that require the keyring (ex. plan applier tests) were waiting for the leader but not the keyring initialization. Fix some of the examples we've seen cause test flakes.
In #23977 we moved the keyring into Raft. This changeset documents the operational changes and adds notes to the upgrade guide.
In #23977 we moved the keyring into Raft, which can expose key material in Raft snapshots when using the less-secure AEAD keyring instead of KMS. This changeset adds tools for redacting this material from snapshots: * The `operator snapshot state` command gains the ability to display key metadata (only), which respects the `-filter` option. * The `operator snapshot save` command gains a `-redact` option that removes key material from the snapshot after it's downloaded. * A new `operator snapshot redact` command allows removing key material from an existing snapshot.
In #23977 we moved the keyring into Raft, which can expose key material in Raft snapshots when using the less-secure AEAD keyring instead of KMS. This changeset adds tools for redacting this material from snapshots: * The `operator snapshot state` command gains the ability to display key metadata (only), which respects the `-filter` option. * The `operator snapshot save` command gains a `-redact` option that removes key material from the snapshot after it's downloaded. * A new `operator snapshot redact` command allows removing key material from an existing snapshot.
In #23977 we moved the keyring into Raft. This changeset documents the operational changes and adds notes to the upgrade guide.
In #23977 we merged a change to how the keyring was stored. Because keyring initialization takes slightly longer now, this uncovered existing timing bugs in some of our tests where tests that require the keyring (ex. plan applier tests) were waiting for the leader but not the keyring initialization. Fix another example we've seen causing test flakes.
In #23977 we merged a change to how the keyring was stored. Because keyring initialization takes slightly longer now, this uncovered existing timing bugs in some of our tests where tests that require the keyring (ex. plan applier tests) were waiting for the leader but not the keyring initialization. Fix another example we've seen causing test flakes.
In Nomad 1.4, we implemented a root keyring to support encrypting Variables and signing Workload Identities. The keyring was originally stored with the AEAD-wrapped DEKs and the KEK together in a JSON keystore file on disk. We recently added support for using an external KMS for the KEK to improve the security model for the keyring. But we've encountered multiple instances of the keystore files not getting backed up separately from the Raft snapshot, resulting in failure to restore clusters from backup.
Move Nomad's root keyring into Raft (encrypted with a KMS/Vault where available) in order to eliminate operational problems with the separate on-disk keystore.
Fixes: #23665
Ref: https://hashicorp.atlassian.net/browse/NET-10523
Ref: https://go.hashi.co/rfc/nmd-203
Notes for reviewers: