disconnected clients: Support operator manual interventions #12436
LGTM. I've left some comments but nothing that should be a blocker.
* Add merge helper for string maps
* structs: add statuses, MaxClientDisconnect, and helper funcs
* taintedNodes: Include disconnected nodes
* upsertAllocsImpl: don't use existing ClientStatus when upserting unknown
* allocSet: update filterByTainted and add delayByMaxClientDisconnect
* allocReconciler: support disconnecting and reconnecting allocs
* GenericScheduler: upsert unknown and queue reconnecting
Co-authored-by: Tim Gross

* api: Add struct, conversion function, and tests
* TaskGroup: Add field, validation, and tests
* diff: Add diff handler and test
* docs: Update docs

* structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by.
* node_endpoint: Emit new eval for reconnecting unknown allocs.
* filterByTainted: handle 2 phase commit filtering rules.
* reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects.
* allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.

* TaskGroup: Validate that max_client_disconnect and stop_after_client_disconnect are mutually exclusive.
👍
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR ensures that the operator interventions node drain and stop -purge are correctly handled for disconnected clients.

allocSet
* filterByFailedReconnect collects the allocations that have failed on the client but do not have a terminal status at the server into a set, so that they can be used by the reconciler.
* filterByTainted has been updated to consider DesiredStatus as well as ClientStatus.
* ... DesiredStatus is run.
* ... reconnecting so that the reconciler can handle purges and drains.
* reconnected and expired have been consolidated in one place at the beginning of the filter, which makes the business rules clearer to express and enforce.
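
The failed-reconnect filtering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: Alloc, serverTerminal, and splitFailedReconnect are hypothetical stand-ins, and the status strings ("failed", "stop", "evict") are assumed values for the client and desired statuses.

```go
package main

// Alloc is a simplified, hypothetical stand-in for an allocation record.
type Alloc struct {
	ID            string
	ClientStatus  string // status reported by the client, e.g. "running" or "failed"
	DesiredStatus string // status the server wants, e.g. "run", "stop", or "evict"
}

// serverTerminal reports whether the server already considers the alloc
// finished. Treating "stop" and "evict" as terminal is an assumption.
func serverTerminal(a Alloc) bool {
	return a.DesiredStatus == "stop" || a.DesiredStatus == "evict"
}

// splitFailedReconnect partitions allocs into those that failed on the
// client but are not yet terminal at the server, and everything else.
func splitFailedReconnect(allocs map[string]Alloc) (failed, rest map[string]Alloc) {
	failed = make(map[string]Alloc)
	rest = make(map[string]Alloc)
	for id, a := range allocs {
		if a.ClientStatus == "failed" && !serverTerminal(a) {
			failed[id] = a
			continue
		}
		rest[id] = a
	}
	return failed, rest
}
```

Returning both halves of the partition is a choice made for the sketch; it lets a caller hand the failed set to later steps without filtering the input a second time.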
reconciler
* computeStop has been updated to always mark failed reconnects as stop even if the calculated number to remove is <= 0.
* computePlacements has been updated to discount failed reconnects when calculating the existing allocs.
* computeStopByReconnecting has been updated to add failed reconnects to the stop set if the number to remove calculated by computeStop is > 0.
* computeReconnecting has been updated to not add failed reconnects to the reconcilerResult.
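
The counting rules above can be illustrated with a small sketch. It is a simplified model, not the reconciler's implementation: placementsNeeded and allocsToStop are hypothetical helpers, allocations are reduced to ID strings, and stopping surplus allocations from the end of the slice is an arbitrary choice made only for the example.

```go
package main

// placementsNeeded returns how many new allocs are required. Failed
// reconnects are discounted from the existing count, so each one leaves
// a slot that has to be filled by a replacement.
func placementsNeeded(desired, existing, failedReconnects int) int {
	n := desired - (existing - failedReconnects)
	if n < 0 {
		return 0
	}
	return n
}

// allocsToStop returns the IDs to stop. Failed reconnects are always
// included, even when the group is at or below its desired count.
func allocsToStop(healthy, failedReconnects []string, desired int) []string {
	if desired < 0 {
		desired = 0
	}
	stop := make([]string, 0, len(failedReconnects))
	stop = append(stop, failedReconnects...)

	// Only healthy allocs are compared against the desired count.
	remove := len(healthy) - desired
	for i := 0; i < remove; i++ {
		stop = append(stop, healthy[len(healthy)-1-i])
	}
	return stop
}
```

For example, with a desired count of 3, two healthy allocs, and one failed reconnect, nothing is surplus, yet the failed reconnect is still stopped and placementsNeeded(3, 3, 1) asks for one replacement.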

allocWatcher
* Shutdown was removed from the Reconnect function. Now that the reconciler is setting the DesiredStatus to stop, the runner must continue to run so that the existing syncing process can proceed as the rest of the logic expects.
Node.UpdateAlloc
* DesiredTransition.Migrate is set to false in the case of drains and purges that happened while the alloc was unknown.