disconnected clients: Feature branch merge #12476

Merged: 17 commits, Apr 6, 2022

DerekStrickland
Contributor

This PR contains the final feature branch merge of the new Disconnected Clients feature. All code has been reviewed in previous PRs to the feature branch.

Prior to Nomad 1.3, when a client failed to heartbeat, the node transitioned to down and all of its allocations transitioned to lost. This PR adds a new duration setting for task groups, max_client_disconnect. If a client fails to heartbeat while it is running allocations configured with this setting, the node transitions to a new status, disconnected, and its allocations transition to unknown.
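
As a rough illustration of the transition described above (not the code in this PR), the following sketch models what happens on a missed heartbeat. The `Alloc` type, the `OnMissedHeartbeat` function, and the `minutes` helper are hypothetical simplifications, and the handling of allocations that do not set max_client_disconnect on a disconnected node is an assumption, since the description does not spell it out.

```go
package main

import (
	"fmt"
	"time"
)

// Status names taken from the PR description.
const (
	NodeDown         = "down"
	NodeDisconnected = "disconnected"
	AllocRunning     = "running"
	AllocLost        = "lost"
	AllocUnknown     = "unknown"
)

// Alloc is a hypothetical, simplified allocation, not Nomad's real struct.
type Alloc struct {
	ID                  string
	ClientStatus        string
	MaxClientDisconnect *time.Duration // nil when the task group does not set it
}

// minutes is a small helper for building optional durations.
func minutes(n int) *time.Duration {
	d := time.Duration(n) * time.Minute
	return &d
}

// OnMissedHeartbeat updates running allocations in place and returns the
// node's new status.
func OnMissedHeartbeat(allocs []*Alloc) string {
	nodeStatus := NodeDown
	for _, a := range allocs {
		if a.ClientStatus != AllocRunning {
			continue
		}
		if a.MaxClientDisconnect != nil {
			// At least one configured allocation keeps the node "disconnected".
			nodeStatus = NodeDisconnected
			a.ClientStatus = AllocUnknown
		} else {
			// Assumption: allocations without the setting keep the old behavior.
			a.ClientStatus = AllocLost
		}
	}
	return nodeStatus
}

func main() {
	allocs := []*Alloc{
		{ID: "edge-app", ClientStatus: AllocRunning, MaxClientDisconnect: minutes(5)},
		{ID: "batch-job", ClientStatus: AllocRunning},
	}
	fmt.Println("node:", OnMissedHeartbeat(allocs))
	for _, a := range allocs {
		fmt.Printf("%s: %s\n", a.ID, a.ClientStatus)
	}
}
```

With no configured allocations, the sketch returns down and marks every running allocation lost, which matches the pre-1.3 behavior described above.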

If the node reconnects before the duration expires, its allocations attempt to reconnect. Allocations that are still running are compared against any replacement allocations that were scheduled in the meantime. If the original allocation still beats the replacement's node score, it resumes without a restart and the replacement is stopped; if it loses the comparison, the original is stopped. If the node fails to reconnect before the longest configured duration expires, the node transitions to down and the unknown allocations transition to lost. If no task group sets max_client_disconnect, Nomad's existing behavior is preserved.
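
The reconnect rules above can be sketched as a single decision function. This is a simplified model, not the reconciler code from this PR: the `Alloc` type, `Outcome` values, `ResolveReconnect`, and `minutes` are hypothetical names. Three details are assumptions: expiry is checked per allocation rather than against the node's longest configured duration, a tie in node score goes to the replacement, and an original that is no longer running is stopped.

```go
package main

import (
	"fmt"
	"time"
)

// Alloc is a hypothetical, simplified allocation, not Nomad's real struct.
type Alloc struct {
	ID        string
	Running   bool
	NodeScore float64
}

// Outcome names the result of a reconnect attempt.
type Outcome string

const (
	ResumeOriginal Outcome = "resume original, stop replacement"
	StopOriginal   Outcome = "stop original, keep replacement"
	MarkLost       Outcome = "mark original lost"
)

// minutes is a small helper for building durations.
func minutes(n int) time.Duration {
	return time.Duration(n) * time.Minute
}

// ResolveReconnect decides what happens to an unknown allocation.
// replacement is nil when no replacement was scheduled.
func ResolveReconnect(orig Alloc, replacement *Alloc, disconnectedFor, maxClientDisconnect time.Duration) Outcome {
	// The window expired before the client came back.
	if disconnectedFor >= maxClientDisconnect {
		return MarkLost
	}
	// Assumption: an original that stopped running cannot resume.
	if !orig.Running {
		return StopOriginal
	}
	// The original must strictly beat the replacement's node score to resume.
	if replacement == nil || orig.NodeScore > replacement.NodeScore {
		return ResumeOriginal
	}
	return StopOriginal
}

func main() {
	orig := Alloc{ID: "web-original", Running: true, NodeScore: 0.9}
	repl := &Alloc{ID: "web-replacement", Running: true, NodeScore: 0.6}
	fmt.Println(ResolveReconnect(orig, repl, minutes(2), minutes(10)))
	fmt.Println(ResolveReconnect(orig, repl, minutes(15), minutes(10)))
}
```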

This feature is aimed at edge workload scenarios where network connectivity may be unstable or sporadic. It allows workloads running in edge deployments to reconnect with zero downtime. Previously, tasks were restarted even if they had continued to run while disconnected.

Closes #10953

DerekStrickland and others added 17 commits April 5, 2022 17:10
* Add merge helper for string maps
* structs: add statuses, MaxClientDisconnect, and helper funcs
* taintedNodes: Include disconnected nodes
* upsertAllocsImpl: don't use existing ClientStatus when upserting unknown
* allocSet: update filterByTainted and add delayByMaxClientDisconnect
* allocReconciler: support disconnecting and reconnecting allocs
* GenericScheduler: upsert unknown and queue reconnecting

Co-authored-by: Tim Gross <[email protected]>
* Add TaskClientReconnectedEvent constant
* Add allocRunner.Reconnect function to manage task state manually
* Removes server-side push
* Update reconnect test to new algorithm and interface; remove guard test
* Add disconnects/reconnect to log output and emit reschedule metrics

* TaskGroupSummary: Add Unknown, update StateStore logic, add to metrics
* api: Add struct, conversion function, and tests
* TaskGroup: Add field, validation, and tests
* diff: Add diff handler and test
* docs: Update docs
…12202)

* planner: expose ServerMeetsMinimumVersion via Planner interface
* filterByTainted: add flag indicating disconnect support
* allocReconciler: accept and pass disconnect support flag
* tests: update dependent tests
…me (#12271)

* comments: update some stale comments referencing deprecated config name
* structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by.

* node_endpoint: Emit new eval for reconnecting unknown allocs.

* filterByTainted: handle 2 phase commit filtering rules.

* reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects.

* allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.
* TaskGroup: Validate that max_client_disconnect and stop_after_client_disconnect are mutually exclusive.
* allocrunner: Remove Shutdown call in Reconnect
* Node.UpdateAlloc: Stop orphaned allocs.
* reconciler: Stop failed reconnects.
* Apply feedback from code review. Handle rebase conflict.
* Apply suggestions from code review

Co-authored-by: Tim Gross <[email protected]>
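
One of the commits above validates that max_client_disconnect and stop_after_client_disconnect are mutually exclusive on a task group. A minimal sketch of such a check might look like the following. The `TaskGroup` type here is a hypothetical, trimmed-down stand-in, and the error text and `minutes` helper are illustrative rather than taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// TaskGroup is a hypothetical stand-in holding only the two settings involved.
type TaskGroup struct {
	Name                      string
	MaxClientDisconnect       *time.Duration
	StopAfterClientDisconnect *time.Duration
}

// ErrMutuallyExclusive is an illustrative sentinel error.
var ErrMutuallyExclusive = errors.New(
	"max_client_disconnect and stop_after_client_disconnect cannot both be set")

// minutes is a small helper for building optional durations.
func minutes(n int) *time.Duration {
	d := time.Duration(n) * time.Minute
	return &d
}

// Validate rejects a task group that sets both disconnect settings.
func (tg *TaskGroup) Validate() error {
	if tg.MaxClientDisconnect != nil && tg.StopAfterClientDisconnect != nil {
		return fmt.Errorf("task group %q: %w", tg.Name, ErrMutuallyExclusive)
	}
	return nil
}

func main() {
	ok := &TaskGroup{Name: "edge", MaxClientDisconnect: minutes(10)}
	bad := &TaskGroup{
		Name:                      "conflict",
		MaxClientDisconnect:       minutes(10),
		StopAfterClientDisconnect: minutes(1),
	}
	fmt.Println("edge:", ok.Validate())
	fmt.Println("conflict:", bad.Validate())
}
```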
Member

tgross left a comment

👍

@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 19, 2022
Development

Successfully merging this pull request may close these issues.

Improved Allocation Handling on Lost Clients