Node restart causes modify index of all allocs on that node to be bumped #16381
Comments
@lgfa29 sorry to bother - could I just check if the description above is sufficient? Or is there any additional information I can provide to help with investigating the issue?
Hi @marvinchin! From a high level, Nomad needs to track separate
The 2nd and 3rd states are why the update happens. The server tells the client the desired state and the client reports back the actual (client) state of the allocation. So when a client comes back up, the server says "hey, here are the desired states I know about" and the client has to acknowledge or reject that state. As a result, there's no way around quite a bit of load when clients are restarted. It's simply an expensive operation to resync the state of all the allocations assigned to that node (plus the node's fingerprint!). Draining a node before restarting is a good way to avoid this, but obviously that's not always what you want, which is why we support in-place client agent restarts.

There's probably potential for optimization of the 1st state. If the server already knows that the client status is terminal, there's maybe no reason for the client to update the server. I'd probably want the client to update the server anyways and let the server decide whether to discard the update, just for consistency, so that if something else changes the terminal alloc we can just check that it hasn't changed. The client could have been down for quite a while, after all. That'd also help the case you've described here where the client has already GC'd the local state.

All that being said, the pattern you're hitting here seems really weird to me. I'd expect allocations to be GC'd in a reasonable amount of time such that this isn't going to be a problem unless you're restarting client agents very frequently relative to the GC value.
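To make the split described above a bit more concrete, here is a minimal sketch in Go of a server-owned desired state versus a client-owned actual state, and the resync decision a reconnecting client has to make. The types and field names are purely illustrative stand-ins, not Nomad's actual structs.

```go
package main

import "fmt"

// Illustrative only: simplified stand-ins for the two halves of an
// allocation's state discussed above. These are not Nomad's real structs.
type serverView struct {
	DesiredStatus string // what the server wants, e.g. "run" or "stop"
	ClientStatus  string // the last client status the server has recorded
}

type clientView struct {
	ClientStatus string // what the client actually knows about the alloc
}

// needsResync sketches the acknowledgement step: after a restart the client
// must report any allocation whose actual status differs from what the
// server last recorded.
func needsResync(s serverView, c clientView) bool {
	return s.ClientStatus != c.ClientStatus
}

func main() {
	s := serverView{DesiredStatus: "stop", ClientStatus: "running"}
	c := clientView{ClientStatus: "complete"}
	fmt.Println(needsResync(s, c)) // true: the server still thinks it is running
}
```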
Thank you for responding! The explanation about the restart behavior was helpful. I agree that allocations in the 1st state are the ones that lead to redundant work.
Do you mean that the server would have no reason to update the client? If so, I believe that would be sufficient to solve the problem. The client sending the server updates of terminal allocations that are still in its local state sounds fine, since the number of such allocations can be bounded by
Regarding extending the lifetime of allocations - I think it's unfortunate that client restarts could lead to the GC configuration not being upheld. It makes it hard to reason about the memory utilisation on the cluster. Moreover, frequent client restarts (which to my understanding are not explicitly unsupported behavior) could lead to unbounded growth of the Raft state and cause OOM of the servers.

Separately, I think that the issue of high load on the client upon restart scales with the number of allocations that run on a client within the GC interval. If that number is large (e.g. if the GC interval is big or if you simply run many short-lived jobs) then the number of allocations in the 1st state as described above is large, and the client needs to do a lot of redundant work processing these allocations (as a datapoint, I observed a client take ~10min to process ~30k terminal allocations upon restart).
I didn't, because I was thinking mostly of Raft writes as the problematic load. But you're right, ideally we wouldn't pull such a large set from the server at all. We could filter out allocs that are both client-and-server terminal, but knowing how features sprawl I suspect we'd eventually end up having a regression with that. What we really want is to ensure that the server and client have identical states, but to shed messages that don't need to be sent anymore. To do that, we'd need to have the client get allocs that:
The client had 30k terminal but not yet server-GC'd allocations between restarts? If you're pushing that kind of volume of allocations, I would expect you'd need to GC fairly frequently to avoid problems with server memory. It's true that this isn't an explicitly unsupported behavior, but it definitely feels like an outlier. I think it's definitely worth fixing this case, but with this kind of load, is draining the node before restarting not an option for some reason?
Hi! Sorry for barging in uninvited -- I have been following the discussion primarily because I've referenced this report in #16283, which seems related in its repercussions.
Why not drain? Draining a node affects the workloads which are running on said node. Hence, it is a very intrusive operation. The Nomad client has been explicitly designed with the idea of being able to restart it without affecting workloads (that is the underpinning of the design in which Nomad Client, Task Driver and Executors are all separate processes with separate lifetimes). Are we then saying that restarts of the Nomad client are expected to be destructive and/or highly intrusive?

Drains aren't always an option. Drains are not an option in many scenarios, including:
The worrisome part of this issue is that it amplifies with the number of restarts in the fleet. If one performs a rolling restart of the Nomad clients, it will double the amount of memory that the server requires right after (the GC period is reset). Moreover, a restart of one node can actually impact the entire fleet because of the load which is exerted on the scheduler as a result.

Alright, but what if we managed to drain on every restart? Now, let's assume that all of the above is not a problem. Does draining the node actually make a difference? I believe that the answer is: no. On the first re-registration of the node (due to the eligibility change to "available"), the code paths mentioned in the original bug report will all be hit exactly as described, triggering the exact same bug. It seems that the only way to not hit the bug is to purge the terminal allocations from the scheduler.
I read the example above as a caricature meant to show an extreme case. However, what stands out to me is that even with a much smaller number of allocations (say, 1k or 500), the Nomad client will affect the workloads running on the machine (due to CPU, RAM and disk pressure) OR will thrash if it is put within a proper cgroup with constrained resources. If 30k allocations take 10 minutes, 500 allocations take ~10 seconds. That's 10 seconds in which the Nomad client causes spikes in CPU and disk pressure. In other words: the workloads will notice the Nomad client restarting and may be affected.
Ah, that's a good point. The allocations that have been moved off are just terminal, not GC'd. So yeah that doesn't really help at all unless the allocations get a chance to be GC'd in the meanwhile. It occurs to me that using the previously-seen index doesn't help here either (at least in the obvious implementation), because the restarting node won't have a previously-seen index. We'd need to persist that last-seen index in the client state store.
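A minimal sketch of what persisting that last-seen index could look like, assuming a simple key/value view over the client's local state store. The bucket/key name and helper functions here are hypothetical, not Nomad's actual state-DB schema.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// stateDB is a stand-in for the client's local state store; Nomad uses
// BoltDB, but a map is enough to illustrate the idea.
type stateDB map[string][]byte

const lastSeenKey = "alloc_sync/last_seen_index" // hypothetical key name

func putLastSeenIndex(db stateDB, index uint64) {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, index)
	db[lastSeenKey] = buf
}

func getLastSeenIndex(db stateDB) uint64 {
	buf, ok := db[lastSeenKey]
	if !ok {
		return 0 // no persisted index: fall back to a full resync
	}
	return binary.BigEndian.Uint64(buf)
}

func main() {
	db := stateDB{}
	putLastSeenIndex(db, 6123)
	// On restart, the pull from the servers could start from this index
	// instead of from zero, so unchanged terminal allocs are not re-sent.
	fmt.Println(getLastSeenIndex(db)) // 6123
}
```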
In which case, that'd be unhelpful. It's obvious that performance will suffer at outlier conditions (if for no other reason than Raft writes are single-threaded), but you can't linearly down-scale performance problems like that because the extreme windows almost always come from contention. We've recognized this issue is a real problem, so it's getting triaged. But when we see problems we also try to provide mitigating workarounds, and those workarounds have to live in a real-world context.
This makes sense! Sorry if I came across strongly -- I wasn't clear on whether we agreed that this is a problem in the not-so-pathological case (my hope was to clarify that point). I now see that my statement wasn't contextualized properly!
I'm going to mark this for roadmapping. I've got a window of time planned to look at the client-to-server communication coming up in the next few weeks. I'll tackle this then.
That sounds great, thanks for looking into this!
Leaving a note for myself that apparently we do filter allocs that haven't had their modify index incremented (ref
Just a heads up that I'm actively working this issue. I've worked up a failing integration test that demonstrates the problem. It spins up 4 jobs to make a matrix of behaviors across the boundary of the client restart.
On top of the spurious updates, because we're not getting the
Now that I've got this integration test set up, I'm going to look into gating the initial updates by the 1st
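One way to picture that kind of gating, purely as a sketch: hold client-side updates until the first successful pull from the servers has established the desired state, then flush. The channel and field names below are invented for illustration and are not the actual client code.

```go
package main

import "fmt"

// update is a stand-in for an allocation client-status update.
type update struct {
	AllocID      string
	ClientStatus string
}

// gatedSync buffers updates until the gate (first server contact) opens.
type gatedSync struct {
	firstContact chan struct{} // closed once the first pull has completed
	pending      []update
}

func (g *gatedSync) push(u update) {
	select {
	case <-g.firstContact:
		fmt.Println("send immediately:", u.AllocID, u.ClientStatus)
	default:
		// Gate still closed: remember the update instead of sending it,
		// so restored-but-stale state is not pushed before the first pull.
		g.pending = append(g.pending, u)
	}
}

func (g *gatedSync) open() {
	close(g.firstContact)
	for _, u := range g.pending {
		fmt.Println("flush after first contact:", u.AllocID, u.ClientStatus)
	}
	g.pending = nil
}

func main() {
	g := &gatedSync{firstContact: make(chan struct{})}
	g.push(update{AllocID: "a1", ClientStatus: "complete"}) // buffered
	g.open()                                                // flushed once
}
```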
This sounds great and tracks my understanding of what is happening (and what we would ideally want to happen). Thank you for digging into it!
When client nodes are restarted, all allocations that have been scheduled on the node have their modify index updated, including terminal allocations. There are several contributing factors:

* The `allocSync` method that updates the servers isn't gated on first contact with the servers. This means that if a server updates the desired state while the client is down, the `allocSync` races with the `Node.ClientGetAlloc` RPC. This will typically result in the client updating the server with "running" and then immediately thereafter "complete".
* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even if it's possible to assert that the server has definitely seen the client state. The allocrunner may queue up updates even if we gate sending them. So then we end up with a race between the allocrunner updating its internal state to overwrite the previous update and `allocSync` sending the bogus or duplicate update.

This changeset adds tracking of server-acknowledged state to the allocrunner. This state gets checked in the `allocSync` before adding the update to the batch, and updated when `Node.UpdateAlloc` returns successfully. To implement this we need to be able to equality-check the updates against the last acknowledged state. We also need to add the last acknowledged state to the client state DB, otherwise we'd drop unacknowledged updates across restarts.

The client restart test has been expanded to cover a variety of allocation states, including allocs stopped before shutdown, allocs stopped by the server while the client is down, and allocs that have been completely GC'd on the server while the client is down. I've also bench tested scenarios where the task workload is killed while the client is down, resulting in a failed restore.

Fixes #16381
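Roughly what "tracking of server-acknowledged state" can look like, as a sketch only: compare a candidate update against the last state the server acked, skip it if identical, and record the ack only after `Node.UpdateAlloc` succeeds. The types, fields, and method names below are illustrative and not the actual change.

```go
package main

import "fmt"

// ackedState is an illustrative stand-in for the client-side allocation
// state that gets reported to the server.
type ackedState struct {
	ClientStatus     string
	TaskStateSummary string // stand-in for the real per-task state map
}

type allocRunner struct {
	lastAcked *ackedState // nil until the server has acknowledged something
}

// needsUpdate is the check allocSync would make before adding an update to
// the batch: only send if it differs from what the server already acked.
func (ar *allocRunner) needsUpdate(cur ackedState) bool {
	return ar.lastAcked == nil || *ar.lastAcked != cur
}

// acknowledge records the state after Node.UpdateAlloc returns successfully.
// In a real implementation this would also be persisted to the client state
// DB so unacknowledged updates aren't dropped across restarts.
func (ar *allocRunner) acknowledge(cur ackedState) {
	ar.lastAcked = &cur
}

func main() {
	ar := &allocRunner{}
	s := ackedState{ClientStatus: "complete"}
	fmt.Println(ar.needsUpdate(s)) // true: nothing acked yet
	ar.acknowledge(s)
	fmt.Println(ar.needsUpdate(s)) // false: identical to acked state, skip the RPC
}
```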
I've just merged #17074, which will ship in Nomad 1.6.0. I'm currently working on another effort to provide some backpressure on the clients' allocation updates as well... I will post at least an issue with all the testing done on that so far in the next day or two.
Hey! Sorry for bumping this old issue. We're still experiencing some symptoms of this issue, where clients spend a long time on startup re-processing terminal allocations.
Looking at the filtering logic you linked, it seems like it does not filter out allocations which are:
For such allocations, I believe there is actually no need to do anything with them. @tgross does that sound right to you? If so, I can submit a PR to filter them out.
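One plausible shape for such a filter, assuming the criteria are allocations that are terminal on both the server side (desired status) and the client side (client status). The status strings mirror Nomad's documented values, but the helper itself is hypothetical and not a reference to any existing filtering code.

```go
package main

import "fmt"

// noActionNeeded sketches the suggested filter: an allocation that is
// terminal from both the server's and the client's point of view has nothing
// left to do on the client, so it could be skipped entirely on restart.
func noActionNeeded(desiredStatus, clientStatus string) bool {
	serverTerminal := desiredStatus == "stop" || desiredStatus == "evict"
	clientTerminal := clientStatus == "complete" ||
		clientStatus == "failed" ||
		clientStatus == "lost"
	return serverTerminal && clientTerminal
}

func main() {
	fmt.Println(noActionNeeded("stop", "complete")) // true: skip on restart
	fmt.Println(noActionNeeded("run", "running"))   // false: still needs a runner
}
```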
Nomad version
Nomad v1.4.2
Operating system and Environment details
Unix
Issue
When client nodes are restarted, all allocations that have been scheduled on the node have their modify index changed to the index at the time of restart, including allocations that are in terminal state. This problem affects both allocations that have and have not been GC-ed from the client's local state as long as they are in the scheduler's raft state.
System Impact
Root Cause
The rough flow is:
More details:
An alloc runner is created for a terminal allocation
Case 1: The terminal allocation has been GC-ed from the client's local state
When the client node starts back up, its `watchAllocations` routine calls the `Node.GetClientAllocs` RPC to request the allocations it should know about from the server (nomad/client/client.go line 2219 in 6f52a91).
The handler for `GetClientAllocs` calls `AllocsByNode` to get all the allocations that have been scheduled on the node, including those that are already terminal (nomad/nomad/node_endpoint.go line 1152 in 6f52a91).
`watchAllocations` then enriches the allocations returned by the server and puts them into the `allocUpdates` channel (nomad/client/client.go line 2102 in 6f52a91).
These are then read and handled by `runAllocs` (nomad/client/client.go lines 1854 to 1864 in 6f52a91).
Since these allocations have already been GC-ed from the client's local state, they are not in the client's `allocs` and will therefore be considered allocations to be added (see client logs from the repro below) (nomad/client/client.go line 2441 in 6f52a91).
Adding them does a write to the client's local state DB (nomad/client/client.go line 2583 in 6f52a91). (Note that this write is actually redundant, since it will immediately be marked for GC once the client has breached its max allocs.)
It also creates an alloc runner for the allocation (nomad/client/client.go line 2633 in 6f52a91).
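A condensed sketch of the diff step in Case 1, with made-up types rather than the actual `runAllocs` code: anything the server returns that the client no longer has locally falls into the "add" bucket, even if it is already terminal, which is what triggers the redundant state-DB write and alloc runner.

```go
package main

import "fmt"

// serverAlloc is a stand-in for an allocation stub returned by
// Node.GetClientAllocs: just an ID and whether it is already terminal.
type serverAlloc struct {
	ID       string
	Terminal bool
}

// diff mimics the behavior described above: allocs unknown to the client are
// classified as "added" regardless of whether they are terminal.
func diff(local map[string]bool, fromServer []serverAlloc) (added, known []string) {
	for _, a := range fromServer {
		if _, ok := local[a.ID]; ok {
			known = append(known, a.ID)
			continue
		}
		// Case 1: already GC-ed locally, so it looks brand new to the client
		// and gets a state-DB write plus an alloc runner it doesn't need.
		added = append(added, a.ID)
	}
	return added, known
}

func main() {
	local := map[string]bool{"alloc-running": true}
	fromServer := []serverAlloc{
		{ID: "alloc-running", Terminal: false},
		{ID: "alloc-gced-terminal", Terminal: true},
	}
	added, known := diff(local, fromServer)
	fmt.Println("added:", added, "already known:", known)
}
```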
Case 2: The terminal allocation has not been GC-ed from the client's local state
When the client node starts back up, it calls `restoreState`, which finds the alloc in its local state DB and creates an alloc runner for it (nomad/client/client.go line 1250 in 6f52a91).
The alloc runner realises that the allocation is terminal and schedules an update to the server about the allocation before terminating (nomad/client/allocrunner/alloc_runner.go line 627 in 6f52a91).
The server receives the update about the allocation and writes it to the Raft state, bumping the modify index in the process (nomad/nomad/state/state_store.go line 3681 in 5d5740b).
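A sketch of the Case 2 path, again with invented names rather than the real alloc runner code: a runner restored for an already-terminal allocation still queues one final client-state update, and every such update the server applies is a Raft write that bumps the allocation's modify index.

```go
package main

import "fmt"

type restoredAlloc struct {
	ID           string
	ClientStatus string // restored from the client's local state DB
}

// onRestore mimics the behavior described above: even when the restored
// allocation is already terminal, an update is queued for the server before
// the runner shuts down.
func onRestore(a restoredAlloc, updates chan<- restoredAlloc) {
	terminal := a.ClientStatus == "complete" ||
		a.ClientStatus == "failed" ||
		a.ClientStatus == "lost"
	if terminal {
		updates <- a // one more Node.UpdateAlloc, one more Raft write
		return
	}
	// non-terminal allocs would continue running their tasks here
}

func main() {
	updates := make(chan restoredAlloc, 1)
	onRestore(restoredAlloc{ID: "old-batch-alloc", ClientStatus: "complete"}, updates)
	fmt.Println("queued update for:", (<-updates).ID)
}
```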
If the number of terminal allocations for the node is large, then the client does a large number of redundant writes (both additions and deletions) to the state DB, and the server has to send a large number of Raft messages to update allocations in the Raft state.
Reproduction steps
Some interesting/relevant logs from the Nomad client on the second startup, which show that it thinks it needs to add the terminal alloc:
Expected Result
Modify index of terminal allocations should not be updated
Actual Result
The modify index of terminal allocations was updated