Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of core: enforce strict steps for clients reconnect into release/1.4.x #15879

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #15808 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

The below text is copied from the body of the original PR.


When a Nomad client that is running an allocation with max_client_disconnect set misses a heartbeat the Nomad server will update its status to disconnected.

Upon reconnecting, the client will make three main RPC calls:

  • Node.UpdateStatus is used to set the client status to ready.
  • Node.UpdateAlloc is used to update the client-side information about
    allocations, such as their ClientStatus, task states etc.
  • Node.Register is used to upsert the entire node information,
    including its status.

These calls are made concurrently and are also running in parallel with the scheduler. Depending on the order they run the scheduler may end up with incomplete data when reconciling allocations.

#15068 already enforced clients to heartbeat before updating their allocation data, but there are still scenarios that can generate wrong results.

For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources.

When this client comes back the order of events may be:

  1. Client calls Node.UpdateStatus and is now ready.
  2. Scheduler reconciles allocations and places the replacement alloc to
    the client. The client is now assigned two allocations: the original
    alloc that is still unknown and the replacement that is pending.
  3. Client calls Node.UpdateAlloc and updates the original alloc to
    running.
  4. Scheduler notices too many allocs and stops the replacement.

node-disconnect drawio

This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled.

To avoid problems like this clients must update all of its relevant information before they can be considered ready and available for scheduling.

To achieve this goal the RPC endpoints mentioned above have been modified to enforce strict steps for nodes reconnecting:

  • Node.Register does not set the client status anymore.
  • Node.UpdateStatus sets the reconnecting client to the initializing
    status until it successfully calls Node.UpdateAlloc.

node-disconnect drawio

These changes are done server-side to avoid the need of additional coordination between clients and servers. Clients are kept oblivious of these changes and will keep making these calls as they normally would.

The verification of whether allocations have been updates is done by storing and comparing the Raft index of the last time the client missed a heartbeat and the last time it updated its allocations.

Closes #15483

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/b-node-status-fsm/rapidly-dominant-gnat branch from c75e596 to d2f234e Compare January 25, 2023 20:55
@hc-github-team-nomad-core hc-github-team-nomad-core merged commit d5d20f6 into release/1.4.x Jan 25, 2023
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/b-node-status-fsm/rapidly-dominant-gnat branch January 25, 2023 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants