-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: enforce strict steps for clients reconnect #15808
Conversation
e1df7a0
to
2180a10
Compare
When a Nomad client that is running an allocation with `max_client_disconnect` set misses a heartbeat the Nomad server will update its status to `disconnected`. Upon reconnecting, the client will make three main RPC calls: - `Node.UpdateStatus` is used to set the client status to `ready`. - `Node.UpdateAlloc` is used to update the client-side information about allocations, such as their `ClientStatus`, task states etc. - `Node.Register` is used to upsert the entire node information, including its status. These calls are made concurrently and are also running in parallel with the scheduler. Depending on the order they run the scheduler may end up with incomplete data when reconciling allocations. For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources. When this client comes back the order of events may be: 1. Client calls `Node.UpdateStatus` and is now `ready`. 2. Scheduler reconciles allocations and places the replacement alloc to the client. The client is now assigned two allocations: the original alloc that is still `unknown` and the replacement that is `pending`. 3. Client calls `Node.UpdateAlloc` and updates the original alloc to `running`. 4. Scheduler notices too many allocs and stops the replacement. This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled. To avoid problems like this clients must update _all_ of its relevant information before they can be considered `ready` and available for scheduling. To achieve this goal the RPC endpoints mentioned above have been modified to enforce strict steps for nodes reconnecting: - `Node.Register` does not set the client status anymore. - `Node.UpdateStatus` sets the reconnecting client to the `initializing` status until it successfully calls `Node.UpdateAlloc`. These changes are done server-side to avoid the need of additional coordination between clients and servers. Clients are kept oblivious of these changes and will keep making these calls as they normally would. The verification of whether allocations have been updates is done by storing and comparing the Raft index of the last time the client missed a heartbeat and the last time it updated its allocations.
2180a10
to
4331b7a
Compare
nomad/node_endpoint_test.go
Outdated
require.ErrorContains(t, err, "not ready") | ||
require.ErrorContains(t, err, "not allow to update allocs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nomad/structs/structs.go
Outdated
// LastMissedHeartbeatIndex stores the Raft index when the node | ||
// last missed a heartbeat. | ||
LastMissedHeartbeatIndex uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how accurate this name is. It's more like the first time the node last transition to an unresponsive status, but I couldn't think of a good name for this 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this name really implies to me that it's going to keep getting updated, whereas it's the index the node became unresponsive. It might be nice if we could clear the value once we're certain the node is live again, and that'd make things a little less confusing at the cost of a little extra logic to check for 0
in the UpdateStatus
RPC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good point. Resetting it zero may help to indicate that this field is sort edge triggered. I pushed a commit do just that. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good @lgfa29. I think you boiled down the complex problem into a really nice targeted change here. I've mostly left suggestions but also one potentially nasty bug.
Co-authored-by: Tim Gross <[email protected]>
Update allocs and the LastAllocUpdateIndex in the same Raft transaction to avoid data inconsistency in case the UpdateAlloc request fails midway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @lgfa29!
nomad/structs/structs.go
Outdated
// LastMissedHeartbeatIndex stores the Raft index when the node | ||
// last missed a heartbeat. | ||
LastMissedHeartbeatIndex uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this name really implies to me that it's going to keep getting updated, whereas it's the index the node became unresponsive. It might be nice if we could clear the value once we're certain the node is live again, and that'd make things a little less confusing at the cost of a little extra logic to check for 0
in the UpdateStatus
RPC.
When a Nomad client that is running an allocation with
max_client_disconnect
set misses a heartbeat the Nomad server will update its status todisconnected
.Upon reconnecting, the client will make three main RPC calls:
Node.UpdateStatus
is used to set the client status toready
.Node.UpdateAlloc
is used to update the client-side information aboutallocations, such as their
ClientStatus
, task states etc.Node.Register
is used to upsert the entire node information,including its status.
These calls are made concurrently and are also running in parallel with the scheduler. Depending on the order they run the scheduler may end up with incomplete data when reconciling allocations.
#15068 already enforced clients to heartbeat before updating their allocation data, but there are still scenarios that can generate wrong results.
For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources.
When this client comes back the order of events may be:
Node.UpdateStatus
and is nowready
.the client. The client is now assigned two allocations: the original
alloc that is still
unknown
and the replacement that ispending
.Node.UpdateAlloc
and updates the original alloc torunning
.This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled.
To avoid problems like this clients must update all of its relevant information before they can be considered
ready
and available for scheduling.To achieve this goal the RPC endpoints mentioned above have been modified to enforce strict steps for nodes reconnecting:
Node.Register
does not set the client status anymore.Node.UpdateStatus
sets the reconnecting client to theinitializing
status until it successfully calls
Node.UpdateAlloc
.These changes are done server-side to avoid the need of additional coordination between clients and servers. Clients are kept oblivious of these changes and will keep making these calls as they normally would.
The verification of whether allocations have been updates is done by storing and comparing the Raft index of the last time the client missed a heartbeat and the last time it updated its allocations.
Closes #15483