parameter to job's vault
stanza
#13343
Hi @grembo, we weren't waiting on CI for a review here. I think there was just some confusion as to which of the several PRs we have open for the original issue was the one we want. I'm going to assign the PR review to @lgfa29, as it looks like he was the one who spent a bunch of time going back and forth with you on the design.
Apologies, I thought my comment on #11900 (and turning the previous review into a Draft) would make it clear. Let me abandon the previous review to reduce the noise.
@tgross Hi, any news on this? Thanks!
Hi @grembo 👋 Apologies for the delay in reviewing this, it unfortunately fell off my radar... I will give it another pass in the coming days!
Hi all! I'm looking forward to seeing this PR merged into Nomad.
This complements the `env` parameter, so that the operator can author tasks that don't share their Vault token with the payload. As a result, more powerful tokens can be used in a job definition, allowing it to use template stanzas to issue all kinds of secrets (database secrets, Vault tokens with very specific policies, etc.), without sharing that issuing power with the task itself as long as a driver with `image` isolation is used.

This is accomplished by creating a directory called `private` within the task's working directory, which shares many properties of the `secrets` directory (tmpfs where possible, not accessible by `nomad alloc fs` or Nomad's web UI), but isn't mounted into/bound to the container.

If the `file` parameter is set to `true` (its default), the Vault token is also written to the NOMAD_SECRETS_DIR, so the default behavior is backwards compatible. Even if the operator never changes the default, they will still benefit from the improved behavior of Nomad never reading the token back in from that - potentially altered - location.

See hashicorp#11900
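To make the token layout described above concrete, here is a minimal Go sketch. It is not the code from this PR: the function `writeVaultToken`, the helper `demoTokenLayout`, and the file name `vault_token` are assumptions made for illustration. The token always lands in the `private` directory and is copied to the `secrets` directory only when the `file` parameter is `true`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// tokenFileName is an assumed file name used only in this sketch.
const tokenFileName = "vault_token"

// writeVaultToken (hypothetical) always stores the token in privateDir and,
// when shareWithTask is true (mirroring `file = true`), also in secretsDir.
func writeVaultToken(privateDir, secretsDir, token string, shareWithTask bool) error {
	private := filepath.Join(privateDir, tokenFileName)
	if err := os.WriteFile(private, []byte(token), 0o600); err != nil {
		return fmt.Errorf("writing private token: %w", err)
	}
	if !shareWithTask {
		return nil
	}
	shared := filepath.Join(secretsDir, tokenFileName)
	if err := os.WriteFile(shared, []byte(token), 0o600); err != nil {
		return fmt.Errorf("writing shared token: %w", err)
	}
	return nil
}

// demoTokenLayout builds a throwaway task directory, writes a token, and
// reports which of the two directories ended up holding a token file.
func demoTokenLayout(shareWithTask bool) (bool, bool, error) {
	root, err := os.MkdirTemp("", "taskdir")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	privateDir := filepath.Join(root, "private")
	secretsDir := filepath.Join(root, "secrets")
	for _, dir := range []string{privateDir, secretsDir} {
		if mkErr := os.Mkdir(dir, 0o700); mkErr != nil {
			return false, false, mkErr
		}
	}
	if wErr := writeVaultToken(privateDir, secretsDir, "example-token", shareWithTask); wErr != nil {
		return false, false, wErr
	}
	_, errPrivate := os.Stat(filepath.Join(privateDir, tokenFileName))
	_, errSecrets := os.Stat(filepath.Join(secretsDir, tokenFileName))
	return errPrivate == nil, errSecrets == nil, nil
}

func main() {
	for _, share := range []bool{true, false} {
		inPrivate, inSecrets, err := demoTokenLayout(share)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("file=%v private=%v secrets=%v\n", share, inPrivate, inSecrets)
	}
}
```

The sketch only models where the file is written; the isolation itself comes from the `private` directory not being mounted into the container, which is outside what this code shows.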
Show multiple templates using different change modes and explain what they mean.
To maintain backwards compatibility, we cannot add a new field whose expected default differs from its zero value: during an upgrade, existing Raft log entries will not contain the field, so existing tasks would be decoded with the zero value, which is not backwards compatible.
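The zero-value problem can be reproduced in a few lines of Go. This is a sketch under stated assumptions: JSON stands in for the encoding actually used by the Raft log, and the struct names, the `policies` key, and `decodeOldEntry` are hypothetical. It decodes an entry written before the field existed into both designs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vaultWithFile models a design whose desired default (true) differs from
// the zero value of bool (false).
type vaultWithFile struct {
	File bool `json:"file"`
}

// vaultWithDisableFile inverts the flag so the zero value (false) matches
// the pre-upgrade behavior of writing the token to the secrets directory.
type vaultWithDisableFile struct {
	DisableFile bool `json:"disable_file"`
}

// decodeOldEntry decodes a stored entry that predates the new field.
func decodeOldEntry() (vaultWithFile, vaultWithDisableFile, error) {
	old := []byte(`{"policies": ["db-issuer"]}`) // no file / disable_file key

	var withFile vaultWithFile
	var withDisable vaultWithDisableFile
	if err := json.Unmarshal(old, &withFile); err != nil {
		return withFile, withDisable, err
	}
	if err := json.Unmarshal(old, &withDisable); err != nil {
		return withFile, withDisable, err
	}
	return withFile, withDisable, nil
}

func main() {
	withFile, withDisable, err := decodeOldEntry()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// With `file`, the old task silently stops getting its token file.
	fmt.Println("file design writes token to secrets dir:", withFile.File)
	// With `disable_file`, the old task keeps the original behavior.
	fmt.Println("disable_file design writes token to secrets dir:", !withDisable.DisableFile)
}
```

Because a missing key leaves the field at its zero value, only the inverted `disable_file` flag keeps old entries behaving as they did before the upgrade.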
Tasks that were created before a Nomad upgrade will not have the private directory in their filesystem, so the task restore process would fail. This commit adds upgrade path logic to handle cases where the private directory doesn't exist.
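One plausible shape for such upgrade handling is sketched below in Go. This is an assumption for illustration, not the logic from the commit: `recoverToken`, `demoRestoreOldTask`, and the file name `vault_token` are hypothetical. The token is read from the `private` directory first, and the `secrets` directory is consulted only when the private copy does not exist, as would be the case for a task created before the upgrade.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// tokenFileName is an assumed file name used only in this sketch.
const tokenFileName = "vault_token"

// recoverToken (hypothetical) prefers the private copy of the token and
// falls back to the secrets directory only if the private copy is missing.
func recoverToken(privateDir, secretsDir string) (string, error) {
	data, err := os.ReadFile(filepath.Join(privateDir, tokenFileName))
	if err == nil {
		return string(data), nil
	}
	if !errors.Is(err, fs.ErrNotExist) {
		// Any error other than "missing" is unexpected and is not masked.
		return "", err
	}
	data, err = os.ReadFile(filepath.Join(secretsDir, tokenFileName))
	if err != nil {
		return "", fmt.Errorf("no token in private or secrets dir: %w", err)
	}
	return string(data), nil
}

// demoRestoreOldTask simulates a pre-upgrade task: the secrets directory
// holds a token and the private directory was never created.
func demoRestoreOldTask() (string, error) {
	root, err := os.MkdirTemp("", "oldtask")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	secretsDir := filepath.Join(root, "secrets")
	if err := os.Mkdir(secretsDir, 0o700); err != nil {
		return "", err
	}
	tokenPath := filepath.Join(secretsDir, tokenFileName)
	if err := os.WriteFile(tokenPath, []byte("example-token"), 0o600); err != nil {
		return "", err
	}
	// The private directory is deliberately absent.
	return recoverToken(filepath.Join(root, "private"), secretsDir)
}

func main() {
	token, err := demoRestoreOldTask()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("recovered token:", token)
}
```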
If the task's Vault `disable_file` config changes, the task group must be recreated to make sure the allocation file system is properly set up. For example, in an upgrade path the private directory will not exist for previous allocations.
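As a rough illustration of that rule, the Go sketch below compares two versions of a task's Vault configuration and reports whether the change calls for recreating the allocation. The type `vaultBlock` and the function `requiresRecreate` are hypothetical names for this example and do not reflect how Nomad's scheduler actually detects destructive updates.

```go
package main

import "fmt"

// vaultBlock is a hypothetical, trimmed-down stand-in for a task's Vault
// configuration.
type vaultBlock struct {
	Policies    []string
	DisableFile bool
}

// requiresRecreate reports whether moving from old to updated should replace
// the allocation instead of updating it in place.
func requiresRecreate(old, updated *vaultBlock) bool {
	if old == nil || updated == nil {
		// Adding or removing the block entirely counts as a change;
		// two nil blocks do not.
		return old != updated
	}
	// A flipped flag changes which directories must hold the token.
	return old.DisableFile != updated.DisableFile
}

func main() {
	before := &vaultBlock{Policies: []string{"db-issuer"}}
	after := &vaultBlock{Policies: []string{"db-issuer"}, DisableFile: true}

	fmt.Println("flag flipped:", requiresRecreate(before, after))
	fmt.Println("unchanged:", requiresRecreate(before, before))
}
```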
Hi everyone 👋 Apologies for the delay on moving this forward. This is an area of the code we've been iterating over and it's also a sneakily complex part, so it took me a while to get the time to dedicate to a thorough review.

I have been able to give it another pass and found some critical issues on the upgrade path. The biggest one is that this line will cause an existing alloc to fail when the Nomad agent is upgraded because the

The other big problem is the lesson we learned the hard way in #17087, which is that we can't default to a non-zero value for a new attribute. What this means is that I had to change the attribute name from

Since this PR has been stale for a while because of us, I don't think it would be fair to ask @grembo to do all of these changes, so I implemented the fixes in https://github.com/hashicorp/nomad/compare/vault-token-file-rebased (ignore that

Due to the merge conflicts I also had to rebase the branch, which means I would need to force push to your branch @grembo to get the changes to this PR. Would that be OK?
Hi @lgfa29, Thanks for picking this up again.
That's a good find, how are you planning to work around it (create the directory in case it is missing)?
My intention was to have file set to "false" by default in the long run (following the philosophy of safe defaults and opt-in into less secure/more open ones). I understand the backwards compatibility argument though - I have some ideas how such a change could be managed (e.g., have a "use_defaults_" knob, that allows to opt-into a new set of secure defaults), but all of these add extra complications that would delay making progress. So if that's the way to move forward, so be it ;)
Yes, please go ahead
For reference, this is how I tested the upgrade path.
The fix was two-fold:
You can check the diff here: 2f881b7#diff-fdbf72fb40a559f0592c927e632f7f8a7178915920dd439de74e8798bc4acb88

I renamed the variables because I was getting confused which was which 😅

Another thing that was missing that I forgot to mention before is that we need a destructive update when this flag changes because the alloc runner may need to recreate the task dir to create
Yeah, we can start discussions about changing the default behaviour, but that is a bit more complicated. We can open an issue once this is released to discuss it further.
Awesome, will do, thanks!
LGTM code-wise, once comments are addressed.
- **«taskname»/private/**: This directory is used by Nomad to store private files
related to the allocation, such as Vault tokens, that are not always shared with tasks
when using `image` isolation. The contents of files in this directory cannot be read
by the `nomad alloc fs` command.
We should specifically document that this directory is not mounted to task drivers using `image` isolation but is visible to task drivers using `chroot` isolation, and that should also be documented in the jobspec flag, because it's very surprising. (So much so I'm not sure that's what we want at all.)
Yeah, it's kind of a tricky situation. I think that, ideally, when `disable_file` is `true` the token would be written to a path outside of the alloc dir, but within Nomad's `data_dir`, but that would require plumbing the token all the way to the template runner.
// Remove private dir and write the Vault token to the secrets dir to
// simulate an old task.
err := conf.TaskDir.Build(false, nil)
must.NoError(t, err)
err = os.Remove(conf.TaskDir.PrivateDir)
must.NoError(t, err)
As discussed in Slack, this is failing but isn't going to work because it's not doing the unlink we need (ref fs_linux.go#L81-L92). We could maybe just lift that function into this test, but the test is going to be pointless immediately after 1.6.0 ships, so I'd say let's just remove this test. We've seen it work E2E.
We came up with a better solution in 371105a: restrict the test to Linux environments and unmount the path before deleting it.
This test requires a task path to be deleted, but on Linux systems it is mounted in a tmpfs, resulting in an error if the path is deleted before unmounting. Since our CI runs on Linux machines, restrict the test to Linux environments and unmount path before deleting.
UI test failure is a known flake. I thought I had fixed it in #17676 but it seems like it's not 100% yet.
This should go out in Nomad 1.6.0. Thank you very much @grembo for the contribution!