Backport of core: enforce strict steps for clients reconnect into release/1.4.x #15879

hc-github-team-nomad-core · 2023-01-25T20:55:06Z

Backport

This PR is auto-generated from #15808 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

The below text is copied from the body of the original PR.

When a Nomad client that is running an allocation with max_client_disconnect set misses a heartbeat, the Nomad server updates its status to disconnected.

Upon reconnecting, the client will make three main RPC calls:

Node.UpdateStatus is used to set the client status to ready.
Node.UpdateAlloc is used to update the client-side information about allocations, such as their ClientStatus, task states, etc.
Node.Register is used to upsert the entire node information, including its status.

These calls are made concurrently and also run in parallel with the scheduler. Depending on the order in which they run, the scheduler may end up with incomplete data when reconciling allocations.

#15068 already required clients to heartbeat before updating their allocation data, but there are still scenarios that can generate wrong results.

For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources.

When this client comes back the order of events may be:

Client calls Node.UpdateStatus and is now ready.
Scheduler reconciles allocations and places the replacement alloc on the client. The client is now assigned two allocations: the original alloc that is still unknown and the replacement that is pending.
Client calls Node.UpdateAlloc and updates the original alloc to running.
Scheduler notices too many allocs and stops the replacement.
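
The order of events above can be replayed with a toy model. This is only an illustrative sketch: the alloc and cluster types and the reconcile and replay functions are hypothetical and far simpler than Nomad's real scheduler, and the pending eval is reduced to "place a replacement once the node is ready and nothing is running".

```go
package main

// alloc is a hypothetical, minimal allocation record.
type alloc struct {
	Name         string
	ClientStatus string // "unknown", "pending", or "running"
	Stopped      bool   // true once the scheduler asks the alloc to stop
}

// cluster is a hypothetical view of one job with a desired count of one
// and a single client node.
type cluster struct {
	NodeReady bool
	Allocs    []*alloc
}

// reconcile wants exactly one running alloc. It places a replacement when
// the node is ready and nothing is running or pending, and stops pending
// replacements once another alloc is reported as running.
func (c *cluster) reconcile() {
	if !c.NodeReady {
		return // nowhere to place the replacement yet
	}
	running, pending := 0, 0
	for _, a := range c.Allocs {
		if a.Stopped {
			continue
		}
		switch a.ClientStatus {
		case "running":
			running++
		case "pending":
			pending++
		}
	}
	switch {
	case running == 0 && pending == 0:
		c.Allocs = append(c.Allocs, &alloc{Name: "replacement", ClientStatus: "pending"})
	case running > 0 && pending > 0:
		for _, a := range c.Allocs {
			if a.ClientStatus == "pending" {
				a.Stopped = true
			}
		}
	}
}

// replay runs the four steps in the problematic order and reports whether
// a replacement was placed and whether it was stopped afterwards.
func replay() (placed, stopped bool) {
	original := &alloc{Name: "original", ClientStatus: "unknown"}
	c := &cluster{Allocs: []*alloc{original}}

	c.NodeReady = true // 1. Node.UpdateStatus marks the client ready
	c.reconcile()      // 2. scheduler runs while the original is still unknown
	placed = len(c.Allocs) == 2

	original.ClientStatus = "running" // 3. Node.UpdateAlloc reports the original
	c.reconcile()                     // 4. scheduler sees too many allocs
	for _, a := range c.Allocs {
		if a.Name == "replacement" && a.Stopped {
			stopped = true
		}
	}
	return placed, stopped
}
```

In this model, swapping steps 2 and 3 so that the allocation update lands before the scheduler runs means no replacement is ever placed.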

This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled.

To avoid problems like this, clients must update all of their relevant information before they can be considered ready and available for scheduling.

To achieve this goal, the RPC endpoints mentioned above have been modified to enforce strict steps for reconnecting nodes:

Node.Register does not set the client status anymore.
Node.UpdateStatus sets the reconnecting client to the initializing status until it successfully calls Node.UpdateAlloc.
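
The two rules above can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the code from this PR: the node type, its allocsUpdated flag, and the method names are hypothetical. It also assumes that a later Node.UpdateStatus call is what promotes the client to ready once its allocations have been updated.

```go
package main

// Status values named in the description. The constant names are local to
// this sketch.
const (
	statusDisconnected = "disconnected"
	statusInitializing = "initializing"
	statusReady        = "ready"
)

// node is a hypothetical, trimmed-down server-side record of a client.
type node struct {
	Status string
	// allocsUpdated is a hypothetical flag: true once the client has
	// successfully called Node.UpdateAlloc since its last disconnect.
	allocsUpdated bool
}

// markDisconnected models the server reacting to a missed heartbeat.
func (n *node) markDisconnected() {
	n.Status = statusDisconnected
	n.allocsUpdated = false
}

// register models Node.Register: node information is upserted, but the
// status is deliberately left untouched.
func (n *node) register() {}

// updateStatus models Node.UpdateStatus. A reconnecting client that asks
// for ready is held in initializing until its allocations are updated.
func (n *node) updateStatus(requested string) {
	reconnecting := n.Status == statusDisconnected || n.Status == statusInitializing
	if requested == statusReady && reconnecting && !n.allocsUpdated {
		n.Status = statusInitializing
		return
	}
	n.Status = requested
}

// updateAlloc models a successful Node.UpdateAlloc call.
func (n *node) updateAlloc() {
	n.allocsUpdated = true
}
```

With this model, a reconnecting client that calls register, updateStatus, updateAlloc, and updateStatus again moves from disconnected to initializing and only then to ready, so the scheduler never sees it as ready with stale allocation data.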

These changes are made server-side to avoid the need for additional coordination between clients and servers. Clients are unaware of these changes and keep making these calls as they normally would.

Whether allocations have been updated is verified by storing and comparing two Raft indexes: the index of the last time the client missed a heartbeat and the index of the last time it updated its allocations.
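
The comparison could look roughly like the sketch below. The reconnectIndexes type and its field names are hypothetical and chosen for illustration; only the idea of comparing the two stored Raft indexes comes from the description.

```go
package main

// reconnectIndexes holds the two Raft indexes being compared.
// The type and field names are hypothetical.
type reconnectIndexes struct {
	// LastMissedHeartbeat is the Raft index at which the server recorded
	// the client's most recent missed heartbeat.
	LastMissedHeartbeat uint64
	// LastAllocUpdate is the Raft index at which the client most recently
	// updated its allocations.
	LastAllocUpdate uint64
}

// allocsUpdatedSinceDisconnect reports whether the client has updated its
// allocations after its most recent missed heartbeat. Each Raft log entry
// has its own index, so a strict comparison is enough.
func (r reconnectIndexes) allocsUpdatedSinceDisconnect() bool {
	return r.LastAllocUpdate > r.LastMissedHeartbeat
}
```

For example, a client that missed a heartbeat at index 120 and last updated its allocations at index 100 is not yet up to date, while an allocation update at index 135 satisfies the check.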

Closes #15483

lgfa29 added 13 commits January 19, 2023 01:11

backport of commit 4331b7a

214bd54

backport of commit bb7cab4

037504d

backport of commit 8ae8144

d5224de

backport of commit 0b03f92

b80ad89

backport of commit 0040e17

6f145e3

backport of commit 67411b9

28adbb4

backport of commit 952b971

3f64590

backport of commit 6e1f1d8

b815c4a

backport of commit 8202f0b

1cf447e

backport of commit e52e15f

7906159

backport of commit bb0aa13

4a76645

backport of commit 5cb8b31

d3eea8f

backport of commit f479a6a

d2f234e

hc-github-team-nomad-core force-pushed the backport/b-node-status-fsm/rapidly-dominant-gnat branch from c75e596 to d2f234e January 25, 2023 20:55

hc-github-team-nomad-core merged commit d5d20f6 into release/1.4.x Jan 25, 2023

hc-github-team-nomad-core deleted the backport/b-node-status-fsm/rapidly-dominant-gnat branch January 25, 2023 20:55

vercel bot deployed to Preview – nomad-storybook-and-ui January 25, 2023 21:05
