disconnected clients: Feature branch merge #12476
* Add merge helper for string maps
* structs: add statuses, MaxClientDisconnect, and helper funcs
* taintedNodes: Include disconnected nodes
* upsertAllocsImpl: don't use existing ClientStatus when upserting unknown
* allocSet: update filterByTainted and add delayByMaxClientDisconnect
* allocReconciler: support disconnecting and reconnecting allocs
* GenericScheduler: upsert unknown and queue reconnecting

Co-authored-by: Tim Gross <[email protected]>

* Add TaskClientReconnectedEvent constant
* Add allocRunner.Reconnect function to manage task state manually
* Removes server-side push

* Update reconnect test to new algorithm and interface; remove guard test

* Add disconnects/reconnect to log output and emit reschedule metrics
* TaskGroupSummary: Add Unknown, update StateStore logic, add to metrics

* api: Add struct, conversion function, and tests
* TaskGroup: Add field, validation, and tests
* diff: Add diff handler and test
* docs: Update docs

…12202)
* planner: expose ServerMeetsMinimumVersion via Planner interface
* filterByTainted: add flag indicating disconnect support
* allocReconciler: accept and pass disconnect support flag
* tests: update dependent tests

…me (#12271)
* comments: update some stale comments referencing deprecated config name

* structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by.
* node_endpoint: Emit new eval for reconnecting unknown allocs.
* filterByTainted: handle 2 phase commit filtering rules.
* reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects.
* allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.

* TaskGroup: Validate that max_client_disconnect and stop_after_client_disconnect are mutually exclusive.

Co-authored-by: Derek Strickland <[email protected]>

* allocrunner: Remove Shutdown call in Reconnect
* Node.UpdateAlloc: Stop orphaned allocs.
* reconciler: Stop failed reconnects.
* Apply feedback from code review. Handle rebase conflict.
* Apply suggestions from code review

Co-authored-by: Tim Gross <[email protected]>
👍
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR contains the final feature branch merge of the new Disconnected Clients feature. All code has been reviewed in previous PRs to the feature branch.
Prior to Nomad 1.3, when clients failed to heartbeat, they would transition to `down`, and all their allocations to `lost`. This PR adds support for a new duration configuration named `max_client_disconnect` that can be set on task groups. If a client fails to heartbeat, and has allocations running that are configured with this setting, the node will transition to a new status, `disconnected`, and its allocations will transition to `unknown`.

If the node reconnects before the duration expires, the allocations will attempt to reconnect. If they are still running, they will be compared to any replacement allocations that were scheduled, and if they still beat the node score of the replacement, they will resume without a restart, and the replacement will be stopped. If they lose the node score comparison, they will be stopped. If the node fails to reconnect before the longest configured duration expires, the node will transition to `down` and the `unknown` allocations will transition to `lost`. If no task groups are configured with `max_client_disconnect`, Nomad will run with the current behavior preserved.

This feature is aimed at edge workload scenarios where network stability/connectivity may be sporadic. This feature allows workloads running in edge deployments to reconnect with zero downtime. Previously, tasks would be restarted even if they had continued to run while disconnected.
Closes #10953