Envoy proxies always crash after a restart of the host #19781

Closed
avandierast opened this issue Jan 19, 2024 · 1 comment · Fixed by #19787

avandierast commented Jan 19, 2024

Hello :)

Issue

After an instance (host) restarts, the Envoy proxies crash.
The error in the log is:
error initializing configuration '/secrets/envoy_bootstrap.json': Invalid path: /secrets/envoy_bootstrap.json.
Our strong hypothesis is that /secrets is a tmpfs, so its contents are wiped when the host restarts.

envoy_bootstrap.json is created by Nomad directly in /secrets. It doesn't seem to be possible to change that:

bootstrapFilePath := filepath.Join(req.TaskDir.SecretsDir, "envoy_bootstrap.json")

And /secrets is always mounted as a tmpfs:
if err := syscall.Mount("tmpfs", dir, "tmpfs", flags, options); err != nil {

Is there any way to work around this problem?
We use Nomad in an environment where hosts do get restarted, and it would be great not to have to wait for all proxies to be rescheduled.

It also prevents us from using /secrets to store secret files, since they would be gone after a restart and we would have to reschedule to rebuild them.
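
To check the tmpfs hypothesis on a host, one option is to look for the task's secrets directory in /proc/mounts. The sketch below is only an illustration: the helper name `isTmpfsMount` and the sample path are made up for this example and are not part of Nomad. It parses text in the /proc/mounts format ("device mountpoint fstype options ...") and reports whether a given directory is listed as a tmpfs mount point.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// isTmpfsMount reports whether dir is listed as a tmpfs mount point in text
// formatted like /proc/mounts ("device mountpoint fstype options dump pass").
// Hypothetical helper for illustration; it does not handle escaped paths.
func isTmpfsMount(mounts, dir string) bool {
	sc := bufio.NewScanner(strings.NewReader(mounts))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 3 && fields[1] == dir && fields[2] == "tmpfs" {
			return true
		}
	}
	return false
}

func main() {
	// Made-up sample data; on a real host, read /proc/mounts instead.
	sample := "tmpfs /var/nomad/alloc/example/task/secrets tmpfs rw,noexec,relatime 0 0\n" +
		"/dev/sda1 / ext4 rw,relatime 0 0\n"

	fmt.Println(isTmpfsMount(sample, "/var/nomad/alloc/example/task/secrets"))
	fmt.Println(isTmpfsMount(sample, "/"))
}
```

If the secrets directory shows up as tmpfs, anything written there lives only in memory and will not survive a reboot.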

Reproduction steps

Start any job with a Consul Connect proxy on an instance, then restart the instance without draining the node.
This may not reproduce if Nomad is running in -dev mode, since it may drain the node when stopping.

Nomad version

Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af70885c02ab921dedbdf6bc406a1e886866f80

Operating system and Environment details

Linux 4.18.0-372.9.1.el8.x86_64

Thanks for your help :)

schmichael added a commit that referenced this issue Jan 20, 2024
Fixes #19781

Do not mark the envoy bootstrap hook as done after successfully running
once. Since the bootstrap file is written to /secrets, which is a tmpfs
on supported platforms, it is not persisted across reboots. This causes
the task and allocation to fail on reboot (see #19781).

This fixes it by *always* rewriting the envoy bootstrap file every time
the Nomad agent starts. This does mean we may write a new bootstrap file
to an already running Envoy task, but in my testing that doesn't have
any impact.

*Alternative 1: Use a regular file*

An alternative approach would be to write the bootstrap file somewhere
other than the tmpfs, but this is *unsafe* as when Consul ACLs are
enabled the file will contain a secret token:
https://developer.hashicorp.com/consul/commands/connect/envoy#bootstrap

*Alternative 2: Detect if file is already written*

An alternative approach would be to detect if the bootstrap file exists,
and only write it if it doesn't.

This is just a more complicated form of the current fix. I think in
general in the absence of other factors task hooks should be idempotent
and therefore able to rerun on any agent startup. This simplifies the
code and our ability to reason about task restarts vs agent restarts vs
node reboots by making them all take the same code path.
schmichael added a commit that referenced this issue Jan 24, 2024
Fixes #19781

Do not mark the envoy bootstrap hook as done after successfully running once.
Since the bootstrap file is written to /secrets, which is a tmpfs on supported
platforms, it is not persisted across reboots. This causes the task and
allocation to fail on reboot (see #19781).

This fixes it by *always* rewriting the envoy bootstrap file every time the
Nomad agent starts. This does mean we may write a new bootstrap file to an
already running Envoy task, but in my testing that doesn't have any impact.

This commit doesn't necessarily fix every use of Done by hooks, but hopefully
improves the situation. The comment on Done has been expanded to hopefully
avoid misuse in the future.

Done assertions were removed from tests as they add more noise than value.

@schmichael
Member

Thanks for the report @avandierast! Should be fixed in the next release.
