disconnected clients: Support operator manual interventions #12436

DerekStrickland · 2022-04-01T17:56:27Z

This PR ensures that the operator interventions node drain and stop -purge are handled correctly for disconnected clients.

allocSet
- A utility function called filterByFailedReconnect has been added. It filters allocations into a set of those that have failed on the client but do not have a terminal status at the server, so that they can be used by the reconciler.
- The filtering logic in filterByTainted has been updated to consider DesiredStatus now as well as ClientStatus.
  - Failed reconnects that have already been marked stop at the server are ignored.
  - Unknown allocs are only ignored if their DesiredStatus is run.
  - Otherwise, they are added to reconnecting so that the reconciler can handle purges and drains.
- The computation of reconnected and expired has been consolidated in one place at the beginning of the filter
  to make the expression and enforcement of the business rules clearer.
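The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the code in scheduler/reconcile_util.go: the Alloc struct, the status constants, the serverTerminal helper, and classifyTainted are simplified stand-ins invented for the example, and the sketch assumes that "terminal at the server" means a DesiredStatus of stop or evict.

```go
package main

import "fmt"

// Simplified stand-ins for the status strings named in the description.
const (
	clientFailed  = "failed"
	clientUnknown = "unknown"
	clientRunning = "running"
	desiredRun    = "run"
	desiredStop   = "stop"
	desiredEvict  = "evict"
)

// Alloc is a hypothetical, pared-down allocation record.
type Alloc struct {
	ID            string
	ClientStatus  string
	DesiredStatus string
}

// serverTerminal assumes "terminal at the server" means stop or evict.
func serverTerminal(a Alloc) bool {
	return a.DesiredStatus == desiredStop || a.DesiredStatus == desiredEvict
}

// filterByFailedReconnect keeps allocs that failed on the client but are
// not yet terminal at the server.
func filterByFailedReconnect(allocs []Alloc) []Alloc {
	var failed []Alloc
	for _, a := range allocs {
		if a.ClientStatus == clientFailed && !serverTerminal(a) {
			failed = append(failed, a)
		}
	}
	return failed
}

// classifyTainted applies only the three rules listed in the description;
// every other case is reported as "unchanged" in this sketch.
func classifyTainted(a Alloc) string {
	switch {
	case a.ClientStatus == clientFailed && a.DesiredStatus == desiredStop:
		return "ignore" // failed reconnect already marked stop
	case a.ClientStatus == clientUnknown && a.DesiredStatus == desiredRun:
		return "ignore" // unknown and still meant to run
	case a.ClientStatus == clientUnknown:
		return "reconnecting" // purged or drained while unknown
	default:
		return "unchanged"
	}
}

func main() {
	allocs := []Alloc{
		{ID: "a1", ClientStatus: clientFailed, DesiredStatus: desiredRun},
		{ID: "a2", ClientStatus: clientFailed, DesiredStatus: desiredStop},
		{ID: "a3", ClientStatus: clientUnknown, DesiredStatus: desiredStop},
		{ID: "a4", ClientStatus: clientRunning, DesiredStatus: desiredRun},
	}
	for _, a := range filterByFailedReconnect(allocs) {
		fmt.Println("failed reconnect:", a.ID)
	}
	for _, a := range allocs {
		fmt.Println(a.ID, classifyTainted(a))
	}
}
```

Checking DesiredStatus alongside ClientStatus is what lets an alloc that was stopped by a purge or drain while unknown reach the reconciler instead of being skipped.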
reconciler
- computeStop has been updated to always mark failed reconnects as stop even if the calculated number to remove is <= 0.
  - computePlacements has been updated to discount failed reconnects when calculating the existing allocs.
  - computeStopByReconnecting has been updated to add failed reconnects to the stop set if the number to remove calculated by computeStop is > 0.
  - computeReconnecting has been updated to not add failed reconnects to the reconcilerResult.
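As a rough illustration of the computeStop and computePlacements changes, the sketch below stops every failed reconnect regardless of the computed removal count and leaves failed reconnects out of the existing count when sizing placements. The Alloc struct, the FailedReconnect flag, and both function signatures are hypothetical simplifications and do not match the signatures in scheduler/reconcile.go.

```go
package main

import "fmt"

// Alloc is a hypothetical record; FailedReconnect marks an alloc that
// failed on the client while it was disconnected.
type Alloc struct {
	ID              string
	FailedReconnect bool
}

// computeStop returns the IDs to stop. Failed reconnects are always stopped,
// even when the number to remove is <= 0.
func computeStop(existing []Alloc, desiredCount int) []string {
	var stop []string
	var healthy []Alloc
	for _, a := range existing {
		if a.FailedReconnect {
			stop = append(stop, a.ID)
			continue
		}
		healthy = append(healthy, a)
	}
	// Only healthy allocs count toward the surplus.
	remove := len(healthy) - desiredCount
	for i := 0; i < remove; i++ {
		stop = append(stop, healthy[i].ID)
	}
	return stop
}

// computePlacements returns how many new allocs are needed, discounting
// failed reconnects from the existing set.
func computePlacements(existing []Alloc, desiredCount int) int {
	live := 0
	for _, a := range existing {
		if !a.FailedReconnect {
			live++
		}
	}
	if desiredCount > live {
		return desiredCount - live
	}
	return 0
}

func main() {
	existing := []Alloc{
		{ID: "a1", FailedReconnect: true},
		{ID: "a2"},
	}
	fmt.Println("stop:", computeStop(existing, 2))
	fmt.Println("place:", computePlacements(existing, 2))
}
```

With a desired count of 2, the failed reconnect is stopped even though nothing is surplus, and one replacement is placed because only the healthy alloc counts as existing.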
allocWatcher
- The call to Shutdown was removed from the Reconnect function. Now that the reconciler is
  setting the DesiredStatus to stop, it is necessary for the runner to continue to run so that the existing syncing
  process can proceed as the rest of the logic expects.
Node.UpdateAlloc
- Has been updated to handle orphaned allocs after job purge/drain.
- Has been updated to set DesiredTransition.Migrate = false for drains and purges that happened while the alloc was unknown.
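A minimal sketch of the Node.UpdateAlloc behavior described above, under stated assumptions: the Alloc and DesiredTransition types are pared-down stand-ins, reconcileOnUpdate is a hypothetical helper name, and the jobPurged and nodeDraining flags stand in for whatever state lookups the endpoint actually performs.

```go
package main

import "fmt"

// DesiredTransition is a stand-in; Migrate is modeled as an optional bool.
type DesiredTransition struct {
	Migrate *bool
}

// Alloc is a hypothetical, pared-down allocation record.
type Alloc struct {
	ID                string
	ClientStatus      string
	DesiredStatus     string
	DesiredTransition DesiredTransition
}

// reconcileOnUpdate stops an orphaned alloc (its job was purged) and makes
// sure an alloc that was unknown during a purge or drain is not migrated.
func reconcileOnUpdate(stored *Alloc, jobPurged, nodeDraining bool) {
	if jobPurged {
		stored.DesiredStatus = "stop"
	}
	if stored.ClientStatus == "unknown" && (jobPurged || nodeDraining) {
		migrate := false
		stored.DesiredTransition.Migrate = &migrate
	}
}

func main() {
	a := &Alloc{ID: "a1", ClientStatus: "unknown", DesiredStatus: "run"}
	reconcileOnUpdate(a, true, false)
	fmt.Println(a.DesiredStatus, *a.DesiredTransition.Migrate)
}
```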

nomad/node_endpoint.go

nomad/node_endpoint_test.go

scheduler/reconcile.go

scheduler/reconcile_util.go

tgross

LGTM. I've left some comments but nothing that should be a blocker.

* Add merge helper for string maps
* structs: add statuses, MaxClientDisconnect, and helper funcs
* taintedNodes: Include disconnected nodes
* upsertAllocsImpl: don't use existing ClientStatus when upserting unknown
* allocSet: update filterByTainted and add delayByMaxClientDisconnect
* allocReconciler: support disconnecting and reconnecting allocs
* GenericScheduler: upsert unknown and queue reconnecting

Co-authored-by: Tim Gross <[email protected]>

* structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by.
* node_endpoint: Emit new eval for reconnecting unknown allocs.
* filterByTainted: handle 2 phase commit filtering rules.
* reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects.
* allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.

tgross

👍

github-actions · 2022-10-22T02:43:19Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

DerekStrickland added the theme/edge label Apr 1, 2022

DerekStrickland added this to the 1.3.0 milestone Apr 1, 2022

DerekStrickland self-assigned this Apr 1, 2022

DerekStrickland force-pushed the f-manual-interventions-disconnected-clients branch from e123452 to a2e2a3c Compare April 4, 2022 20:29

DerekStrickland force-pushed the f-manual-interventions-disconnected-clients branch from a2e2a3c to 38ebd95 Compare April 5, 2022 11:29

DerekStrickland changed the title from "[WIP] cli: Support operator manual interventions" to "disconnected clients: Support operator manual interventions" Apr 5, 2022

DerekStrickland marked this pull request as ready for review April 5, 2022 12:21

DerekStrickland requested a review from tgross April 5, 2022 12:21

tgross reviewed Apr 5, 2022

DerekStrickland force-pushed the f-manual-interventions-disconnected-clients branch from dc7576a to b9d6e83 Compare April 5, 2022 20:11

tgross approved these changes Apr 5, 2022

DerekStrickland force-pushed the f-disconnected-client-allocation-handling branch from ff9ed55 to 6791147 Compare April 5, 2022 21:29

DerekStrickland and others added 8 commits April 5, 2022 17:57

MaxClientDisconnect Jobspec checklist (#12177)

05093dd

* api: Add struct, conversion function, and tests
* TaskGroup: Add field, validation, and tests
* diff: Add diff handler and test
* docs: Update docs

disconnected clients: TaskGroup validation (#12418)

f37e92f

* TaskGroup: Validate that max_client_disconnect and stop_after_client_disconnect are mutually exclusive.

allocrunner: Remove Shutdown call in Reconnect

da9c307

Node.UpdateAlloc: Stop orphaned allocs. Don't migrate orphaned or drained allocs.

77ec79d

reconciler: Stop failed reconnects.

a7fe3fa

Apply feedback from code review. Handle rebase conflict.

480b973

DerekStrickland force-pushed the f-manual-interventions-disconnected-clients branch from 85b815f to 480b973 Compare April 6, 2022 11:56

Apply suggestions from code review

f68587a

tgross approved these changes Apr 6, 2022

DerekStrickland merged commit 8863d1e into f-disconnected-client-allocation-handling Apr 6, 2022

DerekStrickland deleted the f-manual-interventions-disconnected-clients branch April 6, 2022 13:33

tgross mentioned this pull request Apr 11, 2022

EventStream index not respected #12538

Closed

github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022
