connect: rewrite envoy bootstrap on every restart #19787

Merged 8 commits on Jan 24, 2024


schmichael (Member) commented:

Fixes #19781

Do not mark the envoy bootstrap hook as done after successfully running once. Since the bootstrap file is written to /secrets, which is a tmpfs on supported platforms, it is not persisted across reboots. This causes the task and allocation to fail on reboot (see #19781).

The fix is to always rewrite the envoy bootstrap file every time the Nomad agent starts. This means we may write a new bootstrap file for an already running Envoy task, but in my testing that had no impact.
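
As an illustration only, the effect of not marking the hook as done can be sketched as follows. `HookState`, `Hook`, and `RunPrestart` are hypothetical names invented for this sketch; they are not Nomad's actual hook interfaces.

```go
package main

import "fmt"

// HookState is a hypothetical stand-in for the per-hook state that the
// agent persists across restarts.
type HookState struct {
	Done bool
}

// Hook is a hypothetical prestart hook. It returns whether it should be
// treated as done, and therefore skipped on later agent starts.
type Hook func() (done bool, err error)

// RunPrestart runs hook unless the persisted state already marks it done.
// It reports whether the hook actually ran.
func RunPrestart(state *HookState, hook Hook) (bool, error) {
	if state.Done {
		return false, nil
	}
	done, err := hook()
	if err != nil {
		return true, err
	}
	state.Done = done
	return true, nil
}

func main() {
	writes := 0
	bootstrap := func() (bool, error) {
		writes++          // stands in for writing the bootstrap file to /secrets
		return false, nil // never report done, so the hook reruns on every start
	}

	state := &HookState{}
	for i := 0; i < 3; i++ { // three simulated agent starts
		if _, err := RunPrestart(state, bootstrap); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("bootstrap writes:", writes) // prints "bootstrap writes: 3"
}
```

A hook that returned `true` instead would run once and be skipped on later starts, which is the behavior that left the tmpfs empty after a reboot.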

Alternative 1: Use a regular file

An alternative approach would be to write the bootstrap file somewhere other than the tmpfs, but this is unsafe because, when Consul ACLs are enabled, the file will contain a secret token:
https://developer.hashicorp.com/consul/commands/connect/envoy#bootstrap

Alternative 2: Detect if file is already written

An alternative approach would be to detect if the bootstrap file exists, and only write it if it doesn't.

This is just a more complicated form of the current fix. I think that, in general and in the absence of other factors, task hooks should be idempotent and therefore able to rerun on any agent startup. This simplifies the code and makes it easier to reason about task restarts vs agent restarts vs node reboots, because they all take the same code path.
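
To make the "same code path" point concrete, here is a rough sketch of an unconditional, idempotent write. `writeBootstrap`, `simulate`, the file name, and the file content are all hypothetical, and the write-then-rename step is an assumption of this sketch rather than something this PR does.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeBootstrap unconditionally (re)writes the file at path. It is a
// hypothetical helper, not Nomad's code. Writing to a temporary file and
// renaming it is an assumption of this sketch, so that a reader never
// observes a partially written file.
func writeBootstrap(path string, content []byte) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, content, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// simulate runs the same write for a first start, an agent restart (file
// still present), and a node reboot (file wiped), then returns the final
// file content.
func simulate() (string, error) {
	dir, err := os.MkdirTemp("", "secrets") // stands in for the tmpfs
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "bootstrap.json") // illustrative file name
	content := []byte(`{"example": true}`)

	// First start: the file does not exist yet.
	if err := writeBootstrap(path, content); err != nil {
		return "", err
	}
	// Agent restart: the file exists and is simply overwritten.
	if err := writeBootstrap(path, content); err != nil {
		return "", err
	}
	// Node reboot: the tmpfs contents are gone, so the file is recreated.
	if err := os.Remove(path); err != nil {
		return "", err
	}
	if err := writeBootstrap(path, content); err != nil {
		return "", err
	}

	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := simulate()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

No existence check is needed in any of the three cases, which is why detecting the file first only adds a branch without changing the outcome.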

schmichael requested review from shoenig and tgross on January 20, 2024 00:45

tgross (Member) left a comment:

LGTM code-wise. Interestingly, this fix looks similar to what we did in the envoy_version hook: #16815. When we get some time it might be helpful to see where else we have this kind of logic in the hooks that has forgotten about host restarts.

The TestTaskRunner_EnvoyBootstrapHook_gateway_ok test is failing, which seems relevant to this change. (There's also a flaky test in the leadership transfer command tests which we know about in #19718).

schmichael added the backport/1.5.x, backport/1.6.x, and backport/1.7.x labels on Jan 22, 2024
It's no longer accurate and isn't really related to this test.
//
// Done is useful for expensive operations such as downloading artifacts, or
// for operations which might fail needlessly if rerun while a node is
// disconnected.
Done bool
Member

great comment, this actually clarifies a lot

tgross (Member) left a comment:

LGTM once that last test is passing

Successfully merging this pull request may close this issue: Envoy proxies always crash after a restart of the host