core: enforce strict steps for clients reconnect #15808

Merged
merged 13 commits on Jan 25, 2023

Conversation

@lgfa29 lgfa29 commented Jan 17, 2023

When a Nomad client that is running an allocation with `max_client_disconnect` set misses a heartbeat, the Nomad server will update its status to `disconnected`.

Upon reconnecting, the client will make three main RPC calls:

- `Node.UpdateStatus` is used to set the client status to `ready`.
- `Node.UpdateAlloc` is used to update the client-side information about allocations, such as their `ClientStatus`, task states, etc.
- `Node.Register` is used to upsert the entire node information, including its status.

These calls are made concurrently and also run in parallel with the scheduler. Depending on the order in which they run, the scheduler may end up with incomplete data when reconciling allocations.
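As a rough sketch of why ordering is not guaranteed (illustrative Go only, not the actual Nomad client code; `rpcCaller` and `reconnect` are made-up names), the three calls are effectively fired independently:

```go
package sketch

import (
	"fmt"
	"sync"
)

// rpcCaller is a hypothetical stand-in for the client's RPC plumbing.
type rpcCaller interface {
	Call(method string, args, reply any) error
}

// reconnect fires the three RPCs independently, so the server may observe
// them in any order, interleaved with scheduler evaluations.
func reconnect(c rpcCaller) {
	methods := []string{"Node.Register", "Node.UpdateStatus", "Node.UpdateAlloc"}

	var wg sync.WaitGroup
	for _, m := range methods {
		wg.Add(1)
		go func(method string) {
			defer wg.Done()
			if err := c.Call(method, nil, nil); err != nil {
				fmt.Printf("%s failed: %v\n", method, err)
			}
		}(m)
	}
	wg.Wait()
}
```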

#15068 already required clients to heartbeat before updating their allocation data, but there are still scenarios that can produce wrong results.

For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources.

When this client comes back, the order of events may be:

1. Client calls `Node.UpdateStatus` and is now `ready`.
2. Scheduler reconciles allocations and places the replacement alloc on the client. The client is now assigned two allocations: the original alloc that is still `unknown` and the replacement that is `pending`.
3. Client calls `Node.UpdateAlloc` and updates the original alloc to `running`.
4. Scheduler notices too many allocs and stops the replacement.

(diagram: node-disconnect.drawio)

This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled.

To avoid problems like this, clients must update _all_ of their relevant information before they can be considered `ready` and available for scheduling.

To achieve this goal, the RPC endpoints mentioned above have been modified to enforce strict steps for reconnecting nodes:

- `Node.Register` does not set the client status anymore.
- `Node.UpdateStatus` sets the reconnecting client to the `initializing` status until it successfully calls `Node.UpdateAlloc`.

(diagram: node-disconnect.drawio)

These changes are done server-side to avoid the need for additional coordination between clients and servers. Clients are kept oblivious to these changes and will keep making these calls as they normally would.

Whether the allocations have been updated is verified by storing and comparing the Raft index of the last time the client missed a heartbeat with the Raft index of the last time it updated its allocations.
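As a minimal sketch of that gate (simplified stand-ins, not the actual `Node.UpdateStatus` handler; only the field names `LastMissedHeartbeatIndex` and `LastAllocUpdateIndex` come from this change):

```go
// Sketch only: simplified stand-ins for the node fields this PR uses.
const (
	statusInit  = "initializing"
	statusReady = "ready"
)

type nodeIndexes struct {
	LastMissedHeartbeatIndex uint64 // Raft index when the node missed a heartbeat
	LastAllocUpdateIndex     uint64 // Raft index of the node's last alloc update
}

// nextStatus decides which status to store when a client asks to be ready.
func nextStatus(requested string, n nodeIndexes) string {
	// Nodes that never missed a heartbeat are not gated.
	if requested != statusReady || n.LastMissedHeartbeatIndex == 0 {
		return requested
	}
	// Hold the node in initializing until it has pushed alloc updates
	// newer than the missed heartbeat.
	if n.LastAllocUpdateIndex <= n.LastMissedHeartbeatIndex {
		return statusInit
	}
	return statusReady
}
```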

Closes #15483

@lgfa29 lgfa29 force-pushed the b-node-status-fsm branch from 2180a10 to 4331b7a on January 19, 2023 01:31
@lgfa29 lgfa29 changed the title from "[wip] core: enforce strict steps for clients reconnect" to "core: enforce strict steps for clients reconnect" Jan 19, 2023
@lgfa29 lgfa29 added the backport/1.3.x (backport to 1.3.x release line) and backport/1.4.x (backport to 1.4.x release line) labels Jan 19, 2023
@lgfa29 lgfa29 added this to the 1.4.4 milestone Jan 19, 2023
@lgfa29 lgfa29 marked this pull request as ready for review January 19, 2023 01:41
Comment on lines 2584 to 2746
require.ErrorContains(t, err, "not ready")
require.ErrorContains(t, err, "not allow to update allocs")

Comment on lines 2093 to 2095
// LastMissedHeartbeatIndex stores the Raft index when the node
// last missed a heartbeat.
LastMissedHeartbeatIndex uint64

@lgfa29 lgfa29 Jan 19, 2023

I'm not sure how accurate this name is. It's more like the first time the node transitioned to an unresponsive status, but I couldn't think of a good name for this 😅

@tgross (Member) replied:

Yeah, this name really implies to me that it's going to keep getting updated, whereas it's the index at which the node became unresponsive. It might be nice if we could clear the value once we're certain the node is live again, and that'd make things a little less confusing at the cost of a little extra logic to check for 0 in the UpdateStatus RPC.

@lgfa29 (Contributor, author) replied:

Ah good point. Resetting it to zero may help to indicate that this field is sort of edge triggered. I pushed a commit to do just that. Thanks!
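For illustration, continuing the simplified sketch from the description above (not the committed code), the reset amounts to clearing the marker once the node is allowed back to `ready`:

```go
// Sketch: clear the marker once the node may go ready again, so the field
// is only non-zero while a reconnect is still being settled.
if status := nextStatus(requested, n); status == statusReady {
	n.LastMissedHeartbeatIndex = 0
}
```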

@tgross tgross left a comment

This is looking really good @lgfa29. I think you boiled down the complex problem into a really nice targeted change here. I've mostly left suggestions but also one potentially nasty bug.

Update allocs and the LastAllocUpdateIndex in the same Raft transaction
to avoid data inconsistency in case the UpdateAlloc request fails
midway.
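A sketch of that idea (the `allocTxn` interface and helper names here are hypothetical stand-ins, not the real state store API; the point is only that both writes share one transaction):

```go
// Sketch only: hypothetical transaction interface standing in for the
// state store internals.
type allocTxn interface {
	UpsertAlloc(index uint64, allocID string) error
	SetLastAllocUpdateIndex(index uint64, nodeID string) error
	Commit() error
	Abort() // treated as a no-op after a successful Commit in this sketch
}

func applyAllocClientUpdate(begin func(index uint64) allocTxn, index uint64, nodeID string, allocIDs []string) error {
	txn := begin(index)
	defer txn.Abort()

	for _, id := range allocIDs {
		if err := txn.UpsertAlloc(index, id); err != nil {
			// Abort discards everything: neither the allocs nor the node's
			// LastAllocUpdateIndex are written, so they cannot diverge.
			return err
		}
	}
	if err := txn.SetLastAllocUpdateIndex(index, nodeID); err != nil {
		return err
	}
	return txn.Commit()
}
```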
@tgross tgross left a comment


This looks great @lgfa29!
