Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

template: trigger change_mode for dynamic secrets on restore #9636

Merged
merged 8 commits into from
Dec 16, 2020

Conversation

tgross
Copy link
Member

@tgross tgross commented Dec 14, 2020

Fixes #9491

When a task is restored after a client restart, the template runner will
create a new lease for any dynamic secret (ex. Consul or PKI secrets
engines). But because this lease is being created in the prestart hook, we
don't trigger the change_mode.

This changeset uses the the existence of the task handle to detect a
previously running task that's been restored, so that we can trigger the
template change_mode if the template is changed, as it will be only with
dynamic secrets.


Note to reviewers: because this requires a bit of a refactor of the onTemplateRendered so that it can be used for both first-pass and re-render, this PR is probably best reviewed commit-by-commit. #9491 (comment) and the diagram in the comment that follow it can walk you thru the problem.

@tgross tgross marked this pull request as ready for review December 14, 2020 22:10
@tgross tgross requested review from notnoop and krishicks December 14, 2020 22:10
Copy link
Contributor

@notnoop notnoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I have one question mostly so I understand the low level mechanics of the consul-template library

// driver, which is useful for distinguishing restored tasks during
// prestart hooks. Note: prestart should be idempotent whenever possible
// to handle restored tasks safely; use this as an escape hatch.
HasHandle() bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true only if the task is running? If so, would IsRunning() be a more appropriate name? I also suspect we need to watch out for check-then-act races (e.g. process killed immediately after check) - I suspect in this context, it's fine. Might be worth noting the issue in the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out... it's probably worth me double-checking that if the driver handle goes away we can't NPE in the template runner. At that point we're in a non-working state for this task to be sure but it shouldn't blow up the client process.

I went back-and-forth on the name because we only know that there's a driver handle, not that the process is running and I didn't want future callers to be confused by that. But given the TOCTOU concern you've pointed out, I don't think the caller can assume all that much here except a best effort so I'll change the name and add some commentary about that.

Copy link
Member Author

@tgross tgross Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name changed in 8c66608


events := tm.runner.RenderEvents()
for id, event := range events {
func (tm *TaskTemplateManager) onTemplateRendered(handledRenders map[string]time.Time, allRenderedTime time.Time) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on extracting the function. This commit is best viewed with ?w=1: 815f01c?w=1 .

Comment on lines 836 to 837
select {
case <-harness.mockHooks.RestartCh:
t.Fatalf("should not have restarted: %+v", harness.mockHooks)
case <-harness.mockHooks.SignalCh:
t.Fatalf("should not have restarted: %+v", harness.mockHooks)
case <-time.After(time.Duration(1*testutil.TestMultiplier()) * time.Second):
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to testing that process doesn't get restarted if the template doesn't change. However, I'm unclear how consul-template state management handles that? I assumed consul-template library kept template rendition in memory to compare with re-renditions. On nomad agent restart, how does CT library refresh the latest state - does it read the template file and refresh its state?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On nomad agent restart, how does CT library refresh the latest state - does it read the template file and refresh its state?

Yeah, there are effectively 3 chunks of state that CT has:

  • The source which gets reloaded on every client restart.
  • The Consul/Vault contents, which gets refreshed on every client restart. That's actually the source of our problem here because for dynamic secrets that's a write which creates a new lease and not a read. Every time this triggers, the RenderEvent.WouldRender is true. And that includes a client restart.
  • The destination: the CT library then compares the results with the current contents of the destination and the RenderEvent.DidRender is set to false if it didn't actually change on-disk.

So what we're doing is differentiating between the RenderEvent.WouldRender and RenderEvent.DidRender cases to avoid doing extra restarts we don't need.

Also: this is making me realize we should clarify the behavior around changing the source out of band; if you do that while Nomad is running, CT doesn't see it. But then if you restart it will see the change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a CHANGELOG and upgrade guide docs PR to go with this (holding off until the giant docs replatform change lands). I'll update the docs on source in that PR.

client/allocrunner/taskrunner/template/template_test.go Outdated Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – nomad December 15, 2020 15:44 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 15, 2020 15:44 Inactive
@tgross tgross force-pushed the b-template-first-render branch from 2da0b1f to 8c66608 Compare December 15, 2020 15:48
@vercel vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 15, 2020 15:48 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad December 15, 2020 15:48 Inactive
@tgross
Copy link
Member Author

tgross commented Dec 15, 2020

Move this back into draft as per #9491 (comment)

After a long soak test it looks like the logmon error is a little misleading and the problem really is in the task restart, so I've got to rework the solution I've got in #9636. I haven't quite pinned down the problem yet.

@tgross tgross marked this pull request as draft December 15, 2020 20:12
@tgross tgross marked this pull request as ready for review December 16, 2020 16:06
@tgross tgross force-pushed the b-template-first-render branch from 8c66608 to b40ae67 Compare December 16, 2020 16:33
@tgross
Copy link
Member Author

tgross commented Dec 16, 2020

Added changelog and upgrade docs.

See #9491 (comment) for why this is now ready-to-go.
Broken UI test most likely blocked by #9649

@vercel vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 16, 2020 16:33 Inactive
@tgross tgross requested a review from notnoop December 16, 2020 16:33
Pull this function out so it can be used by the first-pass renderer when
we restore a task from a restarted client.
In the task hook implementation, `HasHandle` returns true if the task runner
has a handle to the task driver, so that we can distinguish restored tasks
during prestart hooks without passing around additional state.
When a task is restored after a client restart, the template runner will
create a new lease for any dynamic secret (ex. Consul or PKI secrets
engines). But because this lease is being created in the prestart hook, we
don't trigger the `change_mode`.

This changeset uses the the existence of the task handle to detect a
previously running task that's been restored, so that we can trigger the
template `change_mode` if the template is changed, as it will be only with
dynamic secrets.
@tgross tgross force-pushed the b-template-first-render branch from b40ae67 to f357837 Compare December 16, 2020 17:02
@vercel vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 16, 2020 17:02 Inactive
@tgross tgross added this to the 1.0.2 milestone Dec 16, 2020
Copy link
Contributor

@krishicks krishicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is a bit beyond my ability to review in-depth; there's too much that I'm unfamiliar with.

client/allocrunner/taskrunner/interfaces/lifecycle.go Outdated Show resolved Hide resolved
client/allocrunner/taskrunner/template/template.go Outdated Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 16, 2020 18:13 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad December 16, 2020 18:13 Inactive
@tgross tgross merged commit 004f1c9 into master Dec 16, 2020
@tgross tgross deleted the b-template-first-render branch December 16, 2020 18:36
backspace pushed a commit that referenced this pull request Jan 22, 2021
When a task is restored after a client restart, the template runner will
create a new lease for any dynamic secret (ex. Consul or PKI secrets
engines). But because this lease is being created in the prestart hook, we
don't trigger the `change_mode`.

This changeset uses the the existence of the task handle to detect a
previously running task that's been restored, so that we can trigger the
template `change_mode` if the template is changed, as it will be only with
dynamic secrets.
@github-actions
Copy link

github-actions bot commented Dec 5, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nomad doesn’t retain watcher state on restarts which causes secret engine lease issues
3 participants