Envoy proxies always crash after a restart of the host #19781
avandierast changed the title from "Envoy proxies always crashes after a restart of the host" to "Envoy proxies always crashe after a restart of the host" (Jan 19, 2024)
avandierast changed the title from "Envoy proxies always crashe after a restart of the host" to "Envoy proxies always crash after a restart of the host" (Jan 19, 2024)
schmichael added the theme/client, theme/consul/connect (Consul Connect integration), and theme/client-restart labels (Jan 20, 2024)
schmichael added a commit that referenced this issue (Jan 20, 2024)

Fixes #19781

Do not mark the envoy bootstrap hook as done after successfully running once. Since the bootstrap file is written to /secrets, which is a tmpfs on supported platforms, it is not persisted across reboots. This causes the task and allocation to fail on reboot (see #19781).

This fixes it by *always* rewriting the envoy bootstrap file every time the Nomad agent starts. This does mean we may write a new bootstrap file to an already running Envoy task, but in my testing that doesn't have any impact.

*Alternative 1: Use a regular file*

An alternative approach would be to write the bootstrap file somewhere other than the tmpfs, but this is *unsafe* as when Consul ACLs are enabled the file will contain a secret token: https://developer.hashicorp.com/consul/commands/connect/envoy#bootstrap

*Alternative 2: Detect if file is already written*

An alternative approach would be to detect if the bootstrap file exists, and only write it if it doesn't. This is just a more complicated form of the current fix.

I think in general in the absence of other factors task hooks should be idempotent and therefore able to rerun on any agent startup. This simplifies the code and our ability to reason about task restarts vs agent restarts vs node reboots by making them all take the same code path.
schmichael added a commit that referenced this issue (Jan 24, 2024)

Fixes #19781

Do not mark the envoy bootstrap hook as done after successfully running once. Since the bootstrap file is written to /secrets, which is a tmpfs on supported platforms, it is not persisted across reboots. This causes the task and allocation to fail on reboot (see #19781).

This fixes it by *always* rewriting the envoy bootstrap file every time the Nomad agent starts. This does mean we may write a new bootstrap file to an already running Envoy task, but in my testing that doesn't have any impact.

This commit doesn't necessarily fix every use of Done by hooks, but hopefully improves the situation. The comment on Done has been expanded to hopefully avoid misuse in the future. Done assertions were removed from tests as they add more noise than value.

*Alternative 1: Use a regular file*

An alternative approach would be to write the bootstrap file somewhere other than the tmpfs, but this is *unsafe* as when Consul ACLs are enabled the file will contain a secret token: https://developer.hashicorp.com/consul/commands/connect/envoy#bootstrap

*Alternative 2: Detect if file is already written*

An alternative approach would be to detect if the bootstrap file exists, and only write it if it doesn't. This is just a more complicated form of the current fix.

I think in general in the absence of other factors task hooks should be idempotent and therefore able to rerun on any agent startup. This simplifies the code and our ability to reason about task restarts vs agent restarts vs node reboots by making them all take the same code path.
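The idea in the commit message (a hook that never marks itself done and simply rewrites the file on every run) can be illustrated with a minimal Go sketch. This is not Nomad's code: `bootstrapHook`, `Prestart`, `survivesReboot`, and the placeholder config are hypothetical names chosen for illustration, and a temporary directory stands in for the /secrets tmpfs.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// bootstrapName matches the file name from the error in the issue.
const bootstrapName = "envoy_bootstrap.json"

// bootstrapHook is a hypothetical stand-in for a task prestart hook.
// It is not Nomad's actual hook type.
type bootstrapHook struct {
	secretsDir string // directory standing in for the task's /secrets tmpfs
}

// Prestart always rewrites the bootstrap file and keeps no "done" state,
// so rerunning it after an agent restart or node reboot is safe.
func (h *bootstrapHook) Prestart(config []byte) error {
	path := filepath.Join(h.secretsDir, bootstrapName)
	// Mode 0600 because the real file may hold a Consul token when ACLs are on.
	if err := os.WriteFile(path, config, 0o600); err != nil {
		return fmt.Errorf("writing %s: %w", path, err)
	}
	return nil
}

// survivesReboot runs the hook, deletes the file to imitate a tmpfs being
// wiped by a reboot, reruns the hook, and reports whether the file is back.
func survivesReboot() (bool, error) {
	dir, err := os.MkdirTemp("", "secrets")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	h := &bootstrapHook{secretsDir: dir}
	cfg := []byte(`{"placeholder": true}`) // assumed content, not a real Envoy config
	if err := h.Prestart(cfg); err != nil {
		return false, err
	}

	path := filepath.Join(dir, bootstrapName)
	if err := os.Remove(path); err != nil { // simulated reboot
		return false, err
	}
	if err := h.Prestart(cfg); err != nil { // hook reruns on agent start
		return false, err
	}

	_, err = os.Stat(path)
	if errors.Is(err, os.ErrNotExist) {
		return false, nil
	}
	return err == nil, err
}

func main() {
	ok, err := survivesReboot()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("bootstrap file present after simulated reboot:", ok)
}
```

Because the hook holds no state about earlier runs, task restarts, agent restarts, and node reboots all go through the same write, which is the simplification the commit message argues for.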
This was referenced Jan 24, 2024
Thanks for the report @avandierast! Should be fixed in the next release.
Hello :)
Issue
After an instance restarts, the Envoy proxies crash.
The error in the log is:
error initializing configuration '/secrets/envoy_bootstrap.json': Invalid path: /secrets/envoy_bootstrap.json
A strong hypothesis is that /secrets is a tmpfs and is therefore wiped when the host restarts.
The envoy_bootstrap.json file is created by Nomad directly in /secrets. It doesn't seem possible to change that:
nomad/client/allocrunner/taskrunner/envoy_bootstrap_hook.go
Line 272 in fce30f3
And /secrets always uses tmpfs:
nomad/client/allocdir/fs_linux.go
Line 63 in fce30f3
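One way to check the tmpfs hypothesis on a host is to look for the task's secrets directory in /proc/mounts. The Go sketch below only parses text in that format; the function name `isTmpfsMount` and the sample mount path are assumptions made for illustration, not values taken from Nomad.

```go
package main

import (
	"fmt"
	"strings"
)

// isTmpfsMount reports whether mountPoint appears as a tmpfs mount in text
// formatted like /proc/mounts ("device mountpoint fstype options dump pass").
// Mount points containing spaces are escaped in that file and are not
// handled by this sketch.
func isTmpfsMount(mounts, mountPoint string) bool {
	for _, line := range strings.Split(mounts, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue // blank or malformed line
		}
		if fields[1] == mountPoint && fields[2] == "tmpfs" {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical sample; on a real host, read /proc/mounts with os.ReadFile
	// and pass the actual secrets directory of the allocation.
	sample := "/dev/sda1 / ext4 rw,relatime 0 0\n" +
		"tmpfs /example/alloc/task/secrets tmpfs rw,noexec,relatime 0 0\n"
	fmt.Println(isTmpfsMount(sample, "/example/alloc/task/secrets"))
}
```

If the directory is listed as tmpfs, its contents live in memory and will not survive a reboot, which would match the missing envoy_bootstrap.json.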
Is there any way we can bypass this problem?
We use Nomad in a context where hosts are restarted, and it would be great not to have to wait for all proxies to be rescheduled.
It also prevents us from using /secrets to store secret files, since they would be gone after a restart and the tasks would have to be rescheduled to rebuild them.
Reproduction steps
Start any job with a Consul Connect proxy on an instance, then restart the instance without draining the node.
This may not reproduce if Nomad is running in -dev mode, since it may drain the node when stopping.
Nomad version
Operating system and Environment details
Linux 4.18.0-372.9.1.el8.x86_64
Thanks for your help :)