Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store keyring in Raft #23665

Closed
schmichael opened this issue Jul 22, 2024 · 2 comments · Fixed by #23977
Closed

Store keyring in Raft #23665

schmichael opened this issue Jul 22, 2024 · 2 comments · Fixed by #23977

Comments

@schmichael
Copy link
Member

schmichael commented Jul 22, 2024

Background

With the introduction of Variables in Nomad 1.4, Nomad has had an internal keyring consisting of:

  1. A Data Encryption Key (DEK) for encrypting Variables in Raft and on disk
  2. A Key Encryption Key (KEK) for encrypting the DEK
  3. A Workload Identity signing key (see identity: default to RS256 for new workload ids #18882 for details)

The KEK is stored on disk distinctly from Raft logs and snapshots so that secret material may not be extracted from leaked snapshots.

As of #23580 Nomad will support using an external KMS for the KEK. However the wrapped (encrypted) DEK and WI keys will still be in the on disk keyring outside of snapshots.

The on disk keyring, even when using a KMS, poses an unfortunate challenge for Nomad operators: they must backup the keyring like snapshots, but distinctly. If a single script backs up both snapshots and keyring to the same location, the only remaining benefit to keeping the two distinct is in the case of sharing snapshots with a third party (eg HashiCorp) for debugging.

Nomad engineering has encountered multiple outages caused by users unaware that they need to backup and restore the keyring distinct but alongside the snapshot.

Proposal

The downsides of storing the keyring separate from snapshots outweighs the potential security benefits. The security benefits are likely rarely realized due to the difficulty of splitting backups across multiple processes and locations. This user experience should be fixed by doing the following:

  1. By default, without a KMS configured, Nomad should store the unencrypted KEK and other wrapped keys in Raft and snapshots. Server agents should log a warning instructing users their snapshots are functionally unencrypted and using a KMS is desirable (or linking to such docs).
  2. When using a KMS, wrapped key material should be stored in Raft and snapshots.
  3. The existing behavior of storing the KEK on disk distinctly from Raft and snapshots should be available for users who do not want to rely on an external KMS but also do not want to risk exposing their keys in Raft or snapshots.

In order to support storing key material separate from snapshots, the following features should be added:

  1. The Snapshot API and operator snapshot save CLI should gain a parameter that allows generating snapshots without the KEK included (for users without a KMS). Users with a KMS configured will never receive their KEK in a snapshot.
  2. A new operator snapshot redact command should remove the KEK from a snapshot if one is present. This command could optionally store the extracted key to a file for backing up distinctly from the snapshot.
  3. operator snapshot inspect should warn users if a snapshot contains a KEK.

Attempted Solutions

The Nomad team has attempted to document proper keyring handling, but has encountered multiple instances of operators being unaware of how to properly handle keyrings during backups and restores.

@pkazmierczak pkazmierczak self-assigned this Jul 30, 2024
@tgross tgross assigned tgross and unassigned pkazmierczak Sep 6, 2024
@tgross tgross added this to the 1.9.0 milestone Sep 6, 2024
tgross added a commit that referenced this issue Sep 11, 2024
For #23665 I'm about to make add a lot more code to the state store for the
keyring, so I'd like to pull these out to their own file. Also updates the test
to use `shoenig/test` and changes the name of one method to be a little more
accurate.
tgross added a commit that referenced this issue Sep 12, 2024
For #23665 I'm about to make add a lot more code to the state store for the
keyring, so I'd like to pull these out to their own file. Also updates the test
to use `shoenig/test` and changes the name of one method to be a little more
accurate.
@schmichael
Copy link
Member Author

Proposal
...
3. The existing behavior of storing the KEK on disk distinctly from Raft and snapshots should be available for users who do not want to rely on an external KMS but also do not want to risk exposing their keys in Raft or snapshots.

I don't think this is required. The benefits are just too hard to realize. Security sensitive users must use a KMS. Just offering this option adds a non-trivial amount of cognitive overhead to operators and security teams to find the right solution for their use case, and that effort is better put toward using a KMS.

tgross added a commit that referenced this issue Sep 17, 2024
While working on #23655 I found there were a few places in the encrypter/keyring
where we could make modest improvements to performance and reliability of the
existing code.

This changeset allows keyring replication to skip trying to replicate from
itself, switches some of the read-only keyring accesses to use the read lock
instead of a r/w lock, fixes the logging configuration to drop spurious "extra
value" warnings in the logs, drops an unused type, and makes a minor refactoring
to eliminate shadowing of the `keyset` type. Pulling this out to its own PR lets
us backport these changes to the LTS and reduces the size of the PR that
implements #23665.

Ref #23665
tgross added a commit that referenced this issue Sep 17, 2024
While working on #23655 I found there were a few places in the encrypter/keyring
where we could make modest improvements to performance and reliability of the
existing code.

This changeset allows keyring replication to skip trying to replicate from
itself, switches some of the read-only keyring accesses to use the read lock
instead of a r/w lock, fixes the logging configuration to drop spurious "extra
value" warnings in the logs, drops an unused type, and makes a minor refactoring
to eliminate shadowing of the `keyset` type. Pulling this out to its own PR lets
us backport these changes to the LTS and reduces the size of the PR that
implements #23665.

Ref #23665
tgross added a commit that referenced this issue Sep 19, 2024
In Nomad 1.4, we implemented a root keyring to support encrypting Variables and
signing Workload Identities. The keyring was originally stored with the
AEAD-wrapped DEKs and the KEK together in a JSON keystore file on disk. We
recently added support for using an external KMS for the KEK to improve the
security model for the keyring. But we've encountered multiple instances of the
keystore files not getting backed up separately from the Raft snapshot,
resulting in failure to restore clusters from backup.

Move Nomad's root keyring into Raft (encrypted with a KMS/Vault where available)
in order to eliminate operational problems with the separate on-disk keystore.

Fixes: #23665
Ref: https://hashicorp.atlassian.net/browse/NET-10523
@tgross tgross closed this as completed in 44f4970 Sep 19, 2024
@tgross
Copy link
Member

tgross commented Sep 19, 2024

#23977 has merged. I'll have a separate PR for docs and CLI tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants