Node restart causes modify index of all allocs on that node to be bumped #16381
Comments
@lgfa29 sorry to bother - could I just check if the description above is sufficient? Or is there any additional information I can provide to help with investigating the issue?
Hi @marvinchin! From a high level, Nomad needs to track separate
The 2nd and 3rd states are why the update happens. The server tells the client the desired state and the client reports back the actual (client) state of the allocation. So when a client comes back up, the server says "hey, here are the desired states I know about" and the client has to acknowledge or reject that state. As a result, there's no way around quite a bit of load when clients are restarted. It's simply an expensive operation to resync the state of all the allocations assigned to that node (plus the node's fingerprint!). Draining a node before restarting is a good way to avoid this, but obviously that's not always what you want, which is why we support in-place client agent restarts.

There's probably potential for optimization of the 1st state. If the server already knows that the client status is terminal, there's maybe no reason for the client to update the server. I'd probably want the client to update the server anyways and let the server decide whether to discard the update, just for consistency, so that if something else changes the terminal alloc we can just check that it hasn't changed. The client could have been down for quite a while, after all. That'd also help the case you've described here where the client has already GC'd the local state.

All that being said, the pattern you're hitting here seems really weird to me. I'd expect allocations to be GC'd in a reasonable amount of time such that this isn't going to be a problem unless you're restarting client agents very frequently relative to the GC value.
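To make the split described above a bit more concrete, here is a minimal sketch in Go of a server-owned desired state versus a client-owned actual state, and the resync decision a reconnecting client has to make. The types and field names are purely illustrative stand-ins, not Nomad's actual structs.

```go
package main

import "fmt"

// Illustrative only: simplified stand-ins for the two halves of an
// allocation's state discussed above. These are not Nomad's real structs.
type serverView struct {
	DesiredStatus string // what the server wants, e.g. "run" or "stop"
	ClientStatus  string // the last client status the server has recorded
}

type clientView struct {
	ClientStatus string // what the client actually knows about the alloc
}

// needsResync sketches the acknowledgement step: after a restart the client
// must report any allocation whose actual status differs from what the
// server last recorded.
func needsResync(s serverView, c clientView) bool {
	return s.ClientStatus != c.ClientStatus
}

func main() {
	s := serverView{DesiredStatus: "stop", ClientStatus: "running"}
	c := clientView{ClientStatus: "complete"}
	fmt.Println(needsResync(s, c)) // true: the server still thinks it is running
}
```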
Thank you for responding! The explanation about the restart behavior was helpful. I agree that allocations in the 1st state are the ones that lead to redundant work.
Do you mean that the server would have no reason to update the client? If so, I believe that would be sufficient to solve the problem. The client sending the server updates of terminal allocations that are still in its local state sounds fine, since the number of such allocations can be bounded by
Regarding extending the lifetime of allocations - I think it's unfortunate that client restarts could lead to the GC configuration not being upheld. It makes it hard to reason about the memory utilisation on the cluster. Moreover, frequent client restarts (which to my understanding are not explicitly unsupported behavior) could lead to unbounded growth of the Raft state and cause OOM of the servers.

Separately, I think that the issue of high load on the client upon restart scales with the number of allocations that run on a client within the GC interval. If that number is large (e.g. if the GC interval is big or if you simply run many short-lived jobs) then the number of allocations in the 1st state as described above is large, and the client needs to do a lot of redundant work processing these allocations (as a datapoint, I observed a client take ~10min to process ~30k terminal allocations upon restart).
I didn't, because I was thinking mostly of Raft writes as the problematic load. But you're right, ideally we wouldn't pull such a large set from the server at all. We could filter out allocs that are both client-and-server terminal, but knowing how features sprawl I suspect we'd eventually end up having a regression with that. What we really want is to ensure that the server and client have identical states, but to shed messages that don't need to be sent anymore. To do that, we'd need to have the client get allocs that:
The client had 30k terminal but not yet server-GC'd allocations between restarts? If you're pushing that kind of volume of allocations, I would expect you'd need to GC fairly frequently to avoid problems with server memory. It's true that this isn't an explicitly unsupported behavior, but it definitely feels like an outlier. I think it's definitely worth fixing this case, but with this kind of load, is draining the node before restarting not an option for some reason?
Hi! Sorry for barging in uninvited -- I have been following the discussion primarily because I've referenced this report in #16283, which seems related in its repercussions.
Why not drain? Draining a node affects the workloads which are running on said node. Hence, it is a very intrusive operation. The Nomad client has been explicitly designed with the idea of being able to restart it without affecting workloads (that is the underpinning of the design in which Nomad Client, Task Driver and Executors are all separate processes with separate lifetimes). Are we then saying that restarts of the Nomad client are expected to be destructive and/or highly intrusive?

Drains aren't always an option. Drains are not an option in many scenarios, including:
The worrisome part of this issue is that it amplifies with the number of restarts in the fleet. If one performs a rolling restart of the Nomad clients, it will double the amount of memory that the server requires right after (the GC period is reset). Moreover, a restart of one node can actually impact the entire fleet because of the load which is exerted on the scheduler as a result.

Alright, but what if we managed to drain on every restart? Now, let's assume that all of the above is not a problem. Does draining the node actually make a difference? I believe that the answer is: no. On the first re-registration of the node (due to the eligibility change to "available"), the code paths mentioned in the original bug report will all be hit exactly as described, triggering the exact same bug. It seems that the only way to not hit the bug is to purge the terminal allocations from the scheduler.
I read the example above as a caricature meant to show an extreme case. However, what stands out to me is that even with a much smaller number of allocations (say, 1k or 500), the Nomad client will affect the workloads running on the machine (due to CPU, RAM and disk pressure) OR will thrash if it is put within a proper cgroup with constrained resources. If 30k allocations take 10 minutes, 500 allocations take ~10 seconds. That's 10 seconds in which the Nomad client causes spikes in CPU and disk pressure. In other words: the workloads will notice the Nomad client restarting and may be affected.
Ah, that's a good point. The allocations that have been moved off are just terminal, not GC'd. So yeah that doesn't really help at all unless the allocations get a chance to be GC'd in the meanwhile. It occurs to me that using the previously-seen index doesn't help here either (at least in the obvious implementation), because the restarting node won't have a previously-seen index. We'd need to persist that last-seen index in the client state store.
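A minimal sketch of what persisting that last-seen index could look like, assuming a simple key/value view over the client's local state store. The bucket/key name and helper functions here are hypothetical, not Nomad's actual state-DB schema.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// stateDB is a stand-in for the client's local state store; Nomad uses
// BoltDB, but a map is enough to illustrate the idea.
type stateDB map[string][]byte

const lastSeenKey = "alloc_sync/last_seen_index" // hypothetical key name

func putLastSeenIndex(db stateDB, index uint64) {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, index)
	db[lastSeenKey] = buf
}

func getLastSeenIndex(db stateDB) uint64 {
	buf, ok := db[lastSeenKey]
	if !ok {
		return 0 // no persisted index: fall back to a full resync
	}
	return binary.BigEndian.Uint64(buf)
}

func main() {
	db := stateDB{}
	putLastSeenIndex(db, 6123)
	// On restart, the pull from the servers could start from this index
	// instead of from zero, so unchanged terminal allocs are not re-sent.
	fmt.Println(getLastSeenIndex(db)) // 6123
}
```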
In which case, that'd be unhelpful. It's obvious that performance will suffer at outlier conditions (if for no other reason than Raft writes are single-threaded), but you can't linearly down-scale performance problems like that because the extreme windows almost always come from contention. We've recognized this issue is a real problem, so it's getting triaged. But when we see problems we also try to provide mitigating workarounds, and those workarounds have to live in a real-world context.
This makes sense! Sorry if I came across strongly -- I wasn't clear on whether we agreed that this is a problem in the not-so-pathological case (my hope was to clarify that point). I now see that my statement wasn't contextualized properly!
I'm going to mark this for roadmapping. I've got a window of time planned to look at the client-to-server communication coming up in the next few weeks. I'll tackle this then.
That sounds great, thanks for looking into this!
Leaving a note for myself that apparently we do filter allocs that haven't had their modify index incremented (ref
Just a heads up that I'm actively working this issue. I've worked up a failing integration test that demonstrates the problem. It spins up 4 jobs to make a matrix of behaviors across the boundary of the client restart.
On top of the spurious updates, because we're not getting the
Now that I've got this integration test set up, I'm going to look into gating the initial updates by the 1st
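One way to picture that kind of gating, purely as a sketch: hold client-side updates until the first successful pull from the servers has established the desired state, then flush. The channel and field names below are invented for illustration and are not the actual client code.

```go
package main

import "fmt"

// update is a stand-in for an allocation client-status update.
type update struct {
	AllocID      string
	ClientStatus string
}

// gatedSync buffers updates until the gate (first server contact) opens.
type gatedSync struct {
	firstContact chan struct{} // closed once the first pull has completed
	pending      []update
}

func (g *gatedSync) push(u update) {
	select {
	case <-g.firstContact:
		fmt.Println("send immediately:", u.AllocID, u.ClientStatus)
	default:
		// Gate still closed: remember the update instead of sending it,
		// so restored-but-stale state is not pushed before the first pull.
		g.pending = append(g.pending, u)
	}
}

func (g *gatedSync) open() {
	close(g.firstContact)
	for _, u := range g.pending {
		fmt.Println("flush after first contact:", u.AllocID, u.ClientStatus)
	}
	g.pending = nil
}

func main() {
	g := &gatedSync{firstContact: make(chan struct{})}
	g.push(update{AllocID: "a1", ClientStatus: "complete"}) // buffered
	g.open()                                                // flushed once
}
```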
This sounds great and tracks my understanding of what is happening (and what we would ideally want to happen). Thank you for digging into it!
When client nodes are restarted, all allocations that have been scheduled on the node have their modify index updated, including terminal allocations. There are several contributing factors:

* The `allocSync` method that updates the servers isn't gated on first contact with the servers. This means that if a server updates the desired state while the client is down, the `allocSync` races with the `Node.ClientGetAlloc` RPC. This will typically result in the client updating the server with "running" and then immediately thereafter "complete".
* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even if it's possible to assert that the server has definitely seen the client state. The allocrunner may queue up updates even if we gate sending them. So then we end up with a race between the allocrunner updating its internal state to overwrite the previous update and `allocSync` sending the bogus or duplicate update.

This changeset adds tracking of server-acknowledged state to the allocrunner. This state gets checked in the `allocSync` before adding the update to the batch, and updated when `Node.UpdateAlloc` returns successfully. To implement this we need to be able to equality-check the updates against the last acknowledged state. We also need to add the last acknowledged state to the client state DB, otherwise we'd drop unacknowledged updates across restarts.

The client restart test has been expanded to cover a variety of allocation states, including allocs stopped before shutdown, allocs stopped by the server while the client is down, and allocs that have been completely GC'd on the server while the client is down. I've also bench tested scenarios where the task workload is killed while the client is down, resulting in a failed restore.

Fixes #16381
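Roughly what "tracking of server-acknowledged state" can look like, as a sketch only: compare a candidate update against the last state the server acked, skip it if identical, and record the ack only after `Node.UpdateAlloc` succeeds. The types, fields, and method names below are illustrative and not the actual change.

```go
package main

import "fmt"

// ackedState is an illustrative stand-in for the client-side allocation
// state that gets reported to the server.
type ackedState struct {
	ClientStatus     string
	TaskStateSummary string // stand-in for the real per-task state map
}

type allocRunner struct {
	lastAcked *ackedState // nil until the server has acknowledged something
}

// needsUpdate is the check allocSync would make before adding an update to
// the batch: only send if it differs from what the server already acked.
func (ar *allocRunner) needsUpdate(cur ackedState) bool {
	return ar.lastAcked == nil || *ar.lastAcked != cur
}

// acknowledge records the state after Node.UpdateAlloc returns successfully.
// In a real implementation this would also be persisted to the client state
// DB so unacknowledged updates aren't dropped across restarts.
func (ar *allocRunner) acknowledge(cur ackedState) {
	ar.lastAcked = &cur
}

func main() {
	ar := &allocRunner{}
	s := ackedState{ClientStatus: "complete"}
	fmt.Println(ar.needsUpdate(s)) // true: nothing acked yet
	ar.acknowledge(s)
	fmt.Println(ar.needsUpdate(s)) // false: identical to acked state, skip the RPC
}
```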
I've just merged #17074, which will ship in Nomad 1.6.0. I'm currently working on another effort to provide some backpressure on the clients' allocation updates as well... I will post at least an issue with all the testing done on that so far in the next day or two.
Hey! Sorry for bumping this old issue. We're still experiencing some symptoms of this issue, where clients spend a long time on startup re-processing terminal allocations.
Looking at the filtering logic you linked, it seems like it does not filter out allocations which are:
For such allocations, I believe there is actually no need to do anything with them. @tgross does that sound right to you? If so, I can submit a PR to filter them out.
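One plausible shape for such a filter, assuming the criteria are allocations that are terminal on both the server side (desired status) and the client side (client status). The status strings mirror Nomad's documented values, but the helper itself is hypothetical and not a reference to any existing filtering code.

```go
package main

import "fmt"

// noActionNeeded sketches the suggested filter: an allocation that is
// terminal from both the server's and the client's point of view has nothing
// left to do on the client, so it could be skipped entirely on restart.
func noActionNeeded(desiredStatus, clientStatus string) bool {
	serverTerminal := desiredStatus == "stop" || desiredStatus == "evict"
	clientTerminal := clientStatus == "complete" ||
		clientStatus == "failed" ||
		clientStatus == "lost"
	return serverTerminal && clientTerminal
}

func main() {
	fmt.Println(noActionNeeded("stop", "complete")) // true: skip on restart
	fmt.Println(noActionNeeded("run", "running"))   // false: still needs a runner
}
```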
Nomad version
Nomad v1.4.2
Operating system and Environment details
Unix
Issue
When client nodes are restarted, all allocations that have been scheduled on the node have their modify index changed to the index at the time of restart, including allocations that are in terminal state. This problem affects both allocations that have and have not been GC-ed from the client's local state as long as they are in the scheduler's raft state.
System Impact
Root Cause
The rough flow is:
More details:
An alloc runner is created for a terminal allocation
Case 1: The terminal allocation has been GC-ed from the client's local state
When the client node starts back up, its `watchAllocations` routine calls the `Node.GetClientAllocs` RPC to request the allocations it should know about from the server (nomad/client/client.go line 2219 in 6f52a91).
The handler for `GetClientAllocs` calls `AllocsByNode` to get all the allocations that have been scheduled on the node, including those that are already terminal (nomad/nomad/node_endpoint.go line 1152 in 6f52a91).
`watchAllocations` then enriches the allocations returned by the server and puts them into the `allocUpdates` channel (nomad/client/client.go line 2102 in 6f52a91).
These are then read and handled by `runAllocs` (nomad/client/client.go lines 1854 to 1864 in 6f52a91).
Since these allocations have already been GC-ed from the client's local state, they are not in the client's `allocs` and will therefore be considered allocations to be added (see client logs from the repro below) (nomad/client/client.go line 2441 in 6f52a91).
Adding them does a write to the client's local state DB (nomad/client/client.go line 2583 in 6f52a91). (Note that this write is actually redundant, since it will immediately be marked for GC once the client has breached its max allocs.)
It also creates an alloc runner for the allocation (nomad/client/client.go line 2633 in 6f52a91).
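A condensed sketch of the diff step in Case 1, with made-up types rather than the actual `runAllocs` code: anything the server returns that the client no longer has locally falls into the "add" bucket, even if it is already terminal, which is what triggers the redundant state-DB write and alloc runner.

```go
package main

import "fmt"

// serverAlloc is a stand-in for an allocation stub returned by
// Node.GetClientAllocs: just an ID and whether it is already terminal.
type serverAlloc struct {
	ID       string
	Terminal bool
}

// diff mimics the behavior described above: allocs unknown to the client are
// classified as "added" regardless of whether they are terminal.
func diff(local map[string]bool, fromServer []serverAlloc) (added, known []string) {
	for _, a := range fromServer {
		if _, ok := local[a.ID]; ok {
			known = append(known, a.ID)
			continue
		}
		// Case 1: already GC-ed locally, so it looks brand new to the client
		// and gets a state-DB write plus an alloc runner it doesn't need.
		added = append(added, a.ID)
	}
	return added, known
}

func main() {
	local := map[string]bool{"alloc-running": true}
	fromServer := []serverAlloc{
		{ID: "alloc-running", Terminal: false},
		{ID: "alloc-gced-terminal", Terminal: true},
	}
	added, known := diff(local, fromServer)
	fmt.Println("added:", added, "already known:", known)
}
```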
Case 2: The terminal allocation has not been GC-ed from the client's local state
When the client node starts back up, it calls `restoreState`, which finds the alloc in its local state DB and creates an alloc runner for it (nomad/client/client.go line 1250 in 6f52a91).
The alloc runner realises that the allocation is terminal and schedules an update to the server about the allocation before terminating (nomad/client/allocrunner/alloc_runner.go line 627 in 6f52a91).
The server receives the update about the allocation and writes it to the Raft state, bumping the modify index in the process (nomad/nomad/state/state_store.go line 3681 in 5d5740b).
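A sketch of the Case 2 path, again with invented names rather than the real alloc runner code: a runner restored for an already-terminal allocation still queues one final client-state update, and every such update the server applies is a Raft write that bumps the allocation's modify index.

```go
package main

import "fmt"

type restoredAlloc struct {
	ID           string
	ClientStatus string // restored from the client's local state DB
}

// onRestore mimics the behavior described above: even when the restored
// allocation is already terminal, an update is queued for the server before
// the runner shuts down.
func onRestore(a restoredAlloc, updates chan<- restoredAlloc) {
	terminal := a.ClientStatus == "complete" ||
		a.ClientStatus == "failed" ||
		a.ClientStatus == "lost"
	if terminal {
		updates <- a // one more Node.UpdateAlloc, one more Raft write
		return
	}
	// non-terminal allocs would continue running their tasks here
}

func main() {
	updates := make(chan restoredAlloc, 1)
	onRestore(restoredAlloc{ID: "old-batch-alloc", ClientStatus: "complete"}, updates)
	fmt.Println("queued update for:", (<-updates).ID)
}
```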
If the number of terminal allocations for the node is large, then the client does a large number of redundant writes (both additions and deletions) to the state DB, and the server has to send a large number of Raft messages to update allocations in the Raft state.
Reproduction steps
Some interesting/relevant logs from the Nomad client on the second startup, which show that it thinks it needs to add the terminal alloc:
Expected Result
Modify index of terminal allocations should not be updated
Actual Result
The modify index of terminal allocations was updated