disconnected clients: Feature branch merge #12476

DerekStrickland · 2022-04-06T13:50:46Z

This PR contains the final feature branch merge of the new Disconnected Clients feature. All code has been reviewed in previous PRs to the feature branch.

Prior to Nomad 1.3, when a client failed to heartbeat, it transitioned to down and all of its allocations transitioned to lost. This PR adds a new duration setting, max_client_disconnect, that can be configured on task groups. If a client fails to heartbeat while running allocations that are configured with this setting, the node transitions to a new status, disconnected, and its allocations transition to unknown.
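To make the transition rules above concrete, here is a minimal Go sketch. It is not the code from this PR: the `Alloc` type, the status constants, `minutes`, and `OnMissedHeartbeat` are hypothetical names used only for illustration, and the handling of allocations that do not set max_client_disconnect on a disconnected node is an assumption.

```go
package main

import "time"

// Hypothetical status names for illustration. They mirror the statuses named
// in the description above, not Nomad's actual constants.
const (
	NodeDown         = "down"
	NodeDisconnected = "disconnected"
	AllocLost        = "lost"
	AllocUnknown     = "unknown"
)

// Alloc is a simplified stand-in for an allocation. MaxClientDisconnect is nil
// when the task group does not set max_client_disconnect.
type Alloc struct {
	ID                  string
	MaxClientDisconnect *time.Duration
}

// minutes is a small helper that returns a pointer to a duration of n minutes.
func minutes(n int) *time.Duration {
	d := time.Duration(n) * time.Minute
	return &d
}

// OnMissedHeartbeat returns the node status and the client status of each
// allocation after the client misses its heartbeat.
func OnMissedHeartbeat(allocs []Alloc) (string, map[string]string) {
	nodeStatus := NodeDown
	statuses := make(map[string]string, len(allocs))
	for _, a := range allocs {
		if a.MaxClientDisconnect != nil {
			// At least one allocation opts in, so the node is disconnected
			// rather than down, and this allocation becomes unknown.
			nodeStatus = NodeDisconnected
			statuses[a.ID] = AllocUnknown
			continue
		}
		// Assumption: allocations without the setting keep the pre-1.3
		// behavior and are marked lost.
		statuses[a.ID] = AllocLost
	}
	return nodeStatus, statuses
}
```

For example, calling `OnMissedHeartbeat` with one allocation built with `minutes(5)` and one without the setting yields a disconnected node, with the first allocation unknown and the second lost under the assumption stated above.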

If the node reconnects before the duration expires, its allocations attempt to reconnect. Allocations that are still running are compared against any replacement allocations that were scheduled. If the original still beats the node score of the replacement, it resumes without a restart and the replacement is stopped. If it loses the node score comparison, the original is stopped. If the node fails to reconnect before the longest configured duration expires, the node transitions to down and the unknown allocations transition to lost. If no task groups are configured with max_client_disconnect, Nomad's current behavior is preserved.
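The reconnect comparison described above could be sketched roughly as follows. This is an illustration, not the reconciler code from this PR: `Candidate`, `Decision`, and `ResolveReconnect` are hypothetical names. Two behaviors are assumptions: "beat" is read as a strictly higher score, so a tie stops the original, and an original that is no longer running is stopped.

```go
package main

// Decision is a hypothetical outcome type for a reconnecting allocation.
type Decision string

const (
	// ResumeOriginal: the original resumes without a restart and any
	// replacement is stopped.
	ResumeOriginal Decision = "resume-original"
	// StopOriginal: the original is stopped and any replacement keeps running.
	StopOriginal Decision = "stop-original"
)

// Candidate is a simplified stand-in for an allocation and its node score.
type Candidate struct {
	Running   bool
	NodeScore float64
}

// ResolveReconnect decides what happens to a reconnecting allocation.
// replacement is nil when no replacement was scheduled.
func ResolveReconnect(original Candidate, replacement *Candidate) Decision {
	if !original.Running {
		// Assumption: a reconnecting allocation that is no longer running
		// is treated as a failed reconnect and stopped.
		return StopOriginal
	}
	if replacement == nil {
		// Nothing to compare against, so the original simply resumes.
		return ResumeOriginal
	}
	// Assumption: the original must score strictly higher to win.
	if original.NodeScore > replacement.NodeScore {
		return ResumeOriginal
	}
	return StopOriginal
}
```

Under these assumptions, a running original with score 0.9 wins against a replacement with score 0.5, while an original with score 0.4 is stopped.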

This feature is aimed at edge workload scenarios where network connectivity may be unstable or sporadic. It allows workloads running in edge deployments to reconnect with zero downtime. Previously, tasks would be restarted even if they had continued to run while disconnected.

Closes #10953

* Add merge helper for string maps
* structs: add statuses, MaxClientDisconnect, and helper funcs
* taintedNodes: Include disconnected nodes
* upsertAllocsImpl: don't use existing ClientStatus when upserting unknown
* allocSet: update filterByTainted and add delayByMaxClientDisconnect
* allocReconciler: support disconnecting and reconnecting allocs
* GenericScheduler: upsert unknown and queue reconnecting

Co-authored-by: Tim Gross <[email protected]>

* Add TaskClientReconnectedEvent constant
* Add allocRunner.Reconnect function to manage task state manually
* Removes server-side push

* Update reconnect test to new algorithm and interface; remove guard test

* Add disconnects/reconnect to log output and emit reschedule metrics
* TaskGroupSummary: Add Unknown, update StateStore logic, add to metrics

* api: Add struct, conversion function, and tests
* TaskGroup: Add field, validation, and tests
* diff: Add diff handler and test
* docs: Update docs

…12202)

* planner: expose ServerMeetsMinimumVersion via Planner interface
* filterByTainted: add flag indicating disconnect support
* allocReconciler: accept and pass disconnect support flag
* tests: update dependent tests

…me (#12271)

* comments: update some stale comments referencing deprecated config name

* structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by.
* node_endpoint: Emit new eval for reconnecting unknown allocs.
* filterByTainted: handle 2 phase commit filtering rules.
* reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects.
* allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.

* TaskGroup: Validate that max_client_disconnect and stop_after_client_disconnect are mutually exclusive.

Co-authored-by: Derek Strickland <[email protected]>

* allocrunner: Remove Shutdown call in Reconnect
* Node.UpdateAlloc: Stop orphaned allocs.
* reconciler: Stop failed reconnects.
* Apply feedback from code review. Handle rebase conflict.
* Apply suggestions from code review

Co-authored-by: Tim Gross <[email protected]>

tgross

👍

github-actions · 2022-10-19T02:45:51Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

DerekStrickland and others added 17 commits April 5, 2022 17:10

client: reconnect unknown allocations and sync state

042a07b

NodeStatusDisconnected: support state transitions for new node status

d06155e

evaluateNodePlan: validate plans for disconnected nodes

2cea992

reconciler: fix loop control bug

97ce949

disconnected clients: Add reconnect task event (#12133)

3575265

* Add TaskClientReconnectedEvent constant
* Add allocRunner.Reconnect function to manage task state manually
* Removes server-side push

Fix client test reconnect test; Remove guard test (#12173)

b3fb943

* Update reconnect test to new algorithm and interface; remove guard test

disconnected clients: Observability plumbing (#12141)

5b5c853

* Add disconnects/reconnect to log output and emit reschedule metrics
* TaskGroupSummary: Add Unknown, update StateStore logic, add to metrics

MaxClientDisconnect Jobspec checklist (#12177)

83dd636

* api: Add struct, conversion function, and tests
* TaskGroup: Add field, validation, and tests
* diff: Add diff handler and test
* docs: Update docs

Add unknown to TaskGroupSummary (#12269)

b317aaa

Add description for allocs stopped due to reconnect (#12270)

bab3173

comments: update some stale comments referencing deprecated config name (#12271)

9a82b63

* comments: update some stale comments referencing deprecated config name

disconnected clients: TaskGroup validation (#12418)

6791147

* TaskGroup: Validate that max_client_disconnect and stop_after_client_disconnect are mutually exclusive.
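A mutual-exclusion check of this kind might look like the sketch below. It is illustrative only and not the validation code from this commit: the `TaskGroup` struct is a simplified stand-in, and the `minutes` helper and the error wording are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// TaskGroup is a simplified stand-in for a task group. A nil pointer means
// the corresponding setting is not configured.
type TaskGroup struct {
	Name                      string
	MaxClientDisconnect       *time.Duration
	StopAfterClientDisconnect *time.Duration
}

// minutes is a small helper that returns a pointer to a duration of n minutes.
func minutes(n int) *time.Duration {
	d := time.Duration(n) * time.Minute
	return &d
}

// Validate rejects task groups that set both max_client_disconnect and
// stop_after_client_disconnect, since the two settings are mutually exclusive.
func (tg TaskGroup) Validate() error {
	if tg.MaxClientDisconnect != nil && tg.StopAfterClientDisconnect != nil {
		// The error text here is hypothetical.
		return fmt.Errorf("task group %q: max_client_disconnect and stop_after_client_disconnect cannot both be set", tg.Name)
	}
	return nil
}
```

A task group that sets only one of the two settings, or neither, passes this check.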

Add max client disconnect docs (#12467)

8493730

Co-authored-by: Derek Strickland <[email protected]>

DerekStrickland added the theme/edge label Apr 6, 2022

DerekStrickland added this to the 1.3.0 milestone Apr 6, 2022

DerekStrickland self-assigned this Apr 6, 2022

DerekStrickland requested review from tgross, schmichael and jrasell April 6, 2022 13:50

vercel bot deployed to Preview – nomad-storybook-and-ui April 6, 2022 13:55

tgross approved these changes Apr 6, 2022

DerekStrickland merged commit 12b7647 into main Apr 6, 2022

DerekStrickland deleted the f-disconnected-client-allocation-handling branch April 6, 2022 14:11

DerekStrickland mentioned this pull request Apr 7, 2022

disconnected clients: Unknown alloc gets marked for migrate on node drain #12469

Closed

lgfa29 added a commit that referenced this pull request Apr 8, 2022

changelog: update #12476 entry to highlight the feature

76f1134

lgfa29 added a commit that referenced this pull request Apr 8, 2022

changelog: update #12476 entry to highlight the feature (#12528)

d4f8263

github-actions bot locked as resolved and limited conversation to collaborators Oct 19, 2022
