services: fix data integrity errors for Nomad native services #20590
Merged
Conversation
tgross added the theme/service-discovery, theme/service-discovery/nomad, type/bug, backport/1.6.x (backport to 1.6.x release line), and backport/1.7.x (backport to 1.7.x release line) labels on May 15, 2024
tgross force-pushed the ensure-nomad-services-deregistered branch from 2fcae6f to 0ef189b on May 15, 2024 15:39
This changeset fixes three potential data integrity issues between allocations and their Nomad native service registrations.

* When a node is marked down because it missed heartbeats, we remove Vault and Consul tokens (for the pre-Workload Identity workflows) after we've written the node update to Raft. This is unavoidably non-transactional because the Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But we've unnecessarily mirrored this same behavior to deregister Nomad services, which makes it possible for the leader to successfully write the node update to Raft without removing services. To address this, move the delete into the same Raft transaction. One minor caveat with this approach is the upgrade path: if the leader is upgraded first and a node is marked down during this window, older followers will have stale information until they are also upgraded. This is unavoidable without requiring the leader to unconditionally make an extra Raft write for every down node until 2 LTS versions after Nomad 1.8.0. This temporary reduction in data integrity for stale reads seems like a reasonable tradeoff.
* When an allocation is marked client-terminal from the client in `UpdateAllocsFromClient`, we have an opportunity to ensure data integrity by deregistering services for that allocation.
* When an allocation is deleted during eval garbage collection, we have an opportunity to ensure data integrity by deregistering services for that allocation. This is a cheap no-op if the allocation has been previously marked client-terminal.

This changeset does not address client-side retries for the originally reported issue, which will be done in a separate PR.

Ref: #16616
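As an illustration of the first point, here is a minimal, self-contained sketch using go-memdb (the in-memory database underneath Nomad's state store) of what "move the delete into the same transaction" means. The `Node` and `ServiceRegistration` structs, table names, and the `markNodeDown` helper are simplified stand-ins chosen for this example, not Nomad's actual state store code.

```go
package main

import (
	"fmt"

	"github.com/hashicorp/go-memdb"
)

// Node and ServiceRegistration are simplified stand-ins for the real Nomad
// structs; the schema below is illustrative only.
type Node struct {
	ID     string
	Status string
}

type ServiceRegistration struct {
	ID     string
	NodeID string
}

func schema() *memdb.DBSchema {
	return &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"nodes": {
				Name: "nodes",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "ID"}},
				},
			},
			"services": {
				Name: "services",
				Indexes: map[string]*memdb.IndexSchema{
					"id":      {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "ID"}},
					"node_id": {Name: "node_id", Indexer: &memdb.StringFieldIndex{Field: "NodeID"}},
				},
			},
		},
	}
}

// markNodeDown applies the node-status update and the service deregistration
// in a single write transaction: both changes commit together or not at all.
func markNodeDown(db *memdb.MemDB, nodeID string) error {
	txn := db.Txn(true)
	defer txn.Abort()

	if err := txn.Insert("nodes", &Node{ID: nodeID, Status: "down"}); err != nil {
		return err
	}
	// Previously the equivalent cleanup ran after the Raft write and could
	// be skipped on error, leaving registrations behind for a down node.
	if _, err := txn.DeleteAll("services", "node_id", nodeID); err != nil {
		return err
	}

	txn.Commit()
	return nil
}

func main() {
	db, err := memdb.NewMemDB(schema())
	if err != nil {
		panic(err)
	}

	setup := db.Txn(true)
	setup.Insert("nodes", &Node{ID: "node-1", Status: "ready"})
	setup.Insert("services", &ServiceRegistration{ID: "svc-1", NodeID: "node-1"})
	setup.Commit()

	if err := markNodeDown(db, "node-1"); err != nil {
		panic(err)
	}

	read := db.Txn(false)
	remaining, _ := read.Get("services", "node_id", "node-1")
	fmt.Println(remaining.Next()) // <nil>: no registrations left for the down node
}
```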
tgross force-pushed the ensure-nomad-services-deregistered branch from 0ef189b to f5761e9 on May 15, 2024 15:39
shoenig approved these changes on May 15, 2024
LGTM!
jrasell approved these changes on May 15, 2024
LGTM, thanks @tgross!
tgross added a commit that referenced this pull request on May 15, 2024
When the allocation is stopped, we deregister the service in the alloc runner's `PreKill` hook. This ensures we delete the service registration and wait for the shutdown delay before shutting down the tasks, so that workloads can drain their connections. However, the call to remove the workload only logs errors and never retries them. Add a short retry loop to the `RemoveWorkload` method for Nomad services, so that transient errors give us an extra opportunity to deregister the service before the tasks are stopped, before we need to fall back to the data integrity improvements implemented in #20590. Ref: #16616
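As a rough illustration of that retry behavior, a short retry loop might look like the sketch below. This is not the actual `RemoveWorkload` implementation; the attempt count, backoff, and the `deregister` callback are assumptions made for the example.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// removeWithRetry is a minimal, self-contained sketch of a short retry loop
// for service deregistration; it is not Nomad's client code.
func removeWithRetry(deregister func() error, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = deregister(); err == nil {
			return nil
		}
		// Transient failures (e.g. a brief RPC error) get another chance
		// before we give up and rely on the server-side cleanup in #20590.
		time.Sleep(backoff)
	}
	return err
}

func main() {
	// Hypothetical usage: try the deregistration call up to three times,
	// pausing one second between attempts.
	err := removeWithRetry(func() error {
		fmt.Println("calling the service deregistration RPC ...")
		return nil
	}, 3, time.Second)
	if err != nil {
		log.Printf("failed to deregister service: %v", err)
	}
}
```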
tgross added a commit that referenced this pull request on May 16, 2024