Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disconnected clients: Support operator manual interventions #12436

Conversation

DerekStrickland
Copy link
Contributor

@DerekStrickland DerekStrickland commented Apr 1, 2022

This PR ensures that operator interventions node drain and stop -purge are correctly handled for disconnected clients.

  • allocSet
    • A utility function has been added called filterByFailedReconnect. It filters allocations into a set that have failed on the client but do not have a terminal status at the server so that they can used by the reconciler.
    • The filtering logic in filterByTainted has been updated to consider DesiredStatus now as well as ClientStatus.
      • Failed reconnects that have already been marked stop at the server are ignored.
      • Unknown allocs are only ignored if their DesiredStatus is run.
      • Otherwise, they are added to reconnecting so that the reconciler can handle purges and drains.
    • The computation of reconnected and expired has been consolidated to in one place at the beginning of the filter
      to make expression and enforcement of the business rules more clear.
  • reconciler
    • computeStop has been updated to always mark failed reconnects as stop even if the calculated number to remove is <= 0.
      • computePlacements has been updated to discount failed reconnects when calculating the existing allocs.
      • computeStopByReconnecting has been updated to add failed reconnects to the stop set if the number to remove calculated by computeStop is > 0.
      • computeReconnecting has been updated to not add failed reconnects to the reconcilerResult.
  • allocWatcher
    • The call to Shutdown was removed from the Reconnect function. Now that the reconciler is
      setting the DesiredStatus to stop, it is necessary for the runner to continue to run so that the existing syncing
      process can proceed as the rest of the logic expects.
  • Node.UpdateAlloc
    • Has been updated to handle orphaned allocs after job purge/drain.
    • Has been updated to update DesiredTransition.Migrate = false in the case of drains and purges while the alloc was unknown.

@DerekStrickland DerekStrickland added this to the 1.3.0 milestone Apr 1, 2022
@DerekStrickland DerekStrickland self-assigned this Apr 1, 2022
@vercel vercel bot temporarily deployed to Preview – nomad April 1, 2022 21:47 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad April 2, 2022 15:28 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad April 4, 2022 10:59 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad April 4, 2022 14:47 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad April 4, 2022 18:36 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad April 4, 2022 18:42 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad April 4, 2022 20:14 Inactive
@DerekStrickland DerekStrickland force-pushed the f-manual-interventions-disconnected-clients branch from e123452 to a2e2a3c Compare April 4, 2022 20:29
@vercel vercel bot temporarily deployed to Preview – nomad April 4, 2022 20:29 Inactive
@DerekStrickland DerekStrickland force-pushed the f-manual-interventions-disconnected-clients branch from a2e2a3c to 38ebd95 Compare April 5, 2022 11:29
@vercel vercel bot temporarily deployed to Preview – nomad April 5, 2022 11:29 Inactive
@DerekStrickland DerekStrickland changed the title [WIP] cli: Support operator manual interventions disconnected clients: Support operator manual interventions Apr 5, 2022
@DerekStrickland DerekStrickland marked this pull request as ready for review April 5, 2022 12:21
@DerekStrickland DerekStrickland requested a review from tgross April 5, 2022 12:21
nomad/node_endpoint.go Outdated Show resolved Hide resolved
nomad/node_endpoint_test.go Outdated Show resolved Hide resolved
nomad/node_endpoint_test.go Outdated Show resolved Hide resolved
nomad/node_endpoint_test.go Outdated Show resolved Hide resolved
nomad/node_endpoint_test.go Outdated Show resolved Hide resolved
nomad/node_endpoint_test.go Outdated Show resolved Hide resolved
scheduler/reconcile.go Outdated Show resolved Hide resolved
scheduler/reconcile_util.go Outdated Show resolved Hide resolved
scheduler/reconcile_util.go Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I've left some comments but nothing that should be a blocker.

nomad/node_endpoint.go Outdated Show resolved Hide resolved
nomad/node_endpoint_test.go Show resolved Hide resolved
scheduler/reconcile_util.go Outdated Show resolved Hide resolved
nomad/node_endpoint.go Show resolved Hide resolved
nomad/node_endpoint.go Outdated Show resolved Hide resolved
DerekStrickland and others added 8 commits April 5, 2022 17:57
* Add merge helper for string maps
* structs: add statuses, MaxClientDisconnect, and helper funcs
* taintedNodes: Include disconnected nodes
* upsertAllocsImpl: don't use existing ClientStatus when upserting unknown
* allocSet: update filterByTainted and add delayByMaxClientDisconnect
* allocReconciler: support disconnecting and reconnecting allocs
* GenericScheduler: upsert unknown and queue reconnecting

Co-authored-by: Tim Gross <[email protected]>
* api: Add struct, conversion function, and tests
* TaskGroup: Add field, validation, and tests
* diff: Add diff handler and test
* docs: Update docs
* structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by.

* node_endpoint: Emit new eval for reconnecting unknown allocs.

* filterByTainted: handle 2 phase commit filtering rules.

* reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects.

* allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.
* TaskGroup: Validate that max_client_disconnect and stop_after_client_disconnect are mutually exclusive.
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@DerekStrickland DerekStrickland merged commit 8863d1e into f-disconnected-client-allocation-handling Apr 6, 2022
@DerekStrickland DerekStrickland deleted the f-manual-interventions-disconnected-clients branch April 6, 2022 13:33
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants