
Restarted node takes a long time to become useful due to re-processing already terminal allocations #20354

Open
marvinchin opened this issue Apr 11, 2024 · 5 comments

Comments

@marvinchin
Contributor

marvinchin commented Apr 11, 2024

Nomad version

1.6.x

Operating system and Environment details

Unix

Issue

When a client is restarted, it spends a long time on startup re-processing allocations that are already known to be terminal and had previously been GC-ed from its state.

As a result, the client node takes a long time to start new allocations assigned to it, because it does not poll the server for new allocations until it has finished processing the initial set it receives. In our environment, we have observed this delay to be on the order of minutes for busy client nodes.

This is similar to #16381, but focuses on the impact on the client. #17074 fixed the server-side issue that caused the modify index of the allocations to be bumped on client restart, but the impact on the client remains.

Relevant code

Looking at the client's logic for filtering out allocations to process:

nomad/client/client.go

Lines 2374 to 2384 in 594fedb

if (!ok || modifyIndex > currentAR.Alloc().AllocModifyIndex) && !isInvalid {
    // Only pull allocs that are required. Filtered
    // allocs might be at a higher index, so ignore
    // it.
    if modifyIndex > pullIndex {
        pullIndex = modifyIndex
    }
    pull = append(pull, allocID)
} else {
    filtered[allocID] = struct{}{}
}

It appears that the client will pull allocations which:

  • Are terminal
  • Do not have an alloc runner

My understanding is that such allocations need not be pulled, since they are already known to be terminal and there is no local state for them. If this is true, then we can simply update the filtering logic to filter out such allocations, and that should fix the issue.
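
Roughly, the idea would look something like the following sketch (serverReportsTerminal is a hypothetical signal; as the discussion below notes, the client does not currently receive terminal status from the server in this response):

if (!ok || modifyIndex > currentAR.Alloc().AllocModifyIndex) && !isInvalid {
    if !ok && serverReportsTerminal[allocID] {
        // Hypothetical: no local alloc runner and the alloc is already
        // known to be terminal, so there is nothing to run or stop
        // locally and no need to pull the full allocation.
        filtered[allocID] = struct{}{}
        continue
    }
    // Only pull allocs that are required. Filtered
    // allocs might be at a higher index, so ignore it.
    if modifyIndex > pullIndex {
        pullIndex = modifyIndex
    }
    pull = append(pull, allocID)
} else {
    filtered[allocID] = struct{}{}
}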

Reproduction steps

Same as in #16381, in particular the part where the client logs show that it pulls an allocation even though both of the allocations it receives are already terminal.

Expected Result

Clients should not re-process allocations which are terminal on restart.

Actual Result

Clients re-process terminal allocations on restart, causing delays in starting new allocations.

@tgross
Member

tgross commented Apr 11, 2024

Hi @marvinchin, nice to hear from you again!

My understanding is that such allocations need not be pulled, since they are already known to be terminal and there is no local state for them. If this is true, then we can simply update the filtering logic to filter out such allocations, and that should fix the issue.

The catch here is that the client doesn't know the allocations are terminal unless it's already seen them, and it doesn't know whether it's seen the allocation before on startup if it doesn't have an alloc runner. And the server doesn't necessarily know the client doesn't have an alloc runner for them, because the client could have restarted and come back up without missing a heartbeat.

If we zoom out a bit from your code snippet to client.go#L2361-L2385:

for allocID, modifyIndex := range resp.Allocs {
    // Pull the allocation if we don't have an alloc runner for the
    // allocation or if the alloc runner requires an updated allocation.
    //XXX Part of Client alloc index tracking exp
    c.allocLock.RLock()
    currentAR, ok := c.allocs[allocID]
    c.allocLock.RUnlock()

    // Ignore alloc updates for allocs that are invalid because of initialization errors
    c.invalidAllocsLock.Lock()
    _, isInvalid := c.invalidAllocs[allocID]
    c.invalidAllocsLock.Unlock()

    if (!ok || modifyIndex > currentAR.Alloc().AllocModifyIndex) && !isInvalid {
        // Only pull allocs that are required. Filtered
        // allocs might be at a higher index, so ignore
        // it.
        if modifyIndex > pullIndex {
            pullIndex = modifyIndex
        }
        pull = append(pull, allocID)
    } else {
        filtered[allocID] = struct{}{}
    }
}

The resp variable at the top is a NodeClientAllocsResponse which contains a map of alloc IDs to modify indexes, but not the desired status of the allocation. We could probably add a new field to the Node.GetClientAllocs RPC response with the list of stopped allocs. Then watchAllocations can skip pulling those if it doesn't have an allocrunner for that ID.

It might even be possible to leave those stopped allocs out of the .Allocs field entirely, but I suspect that we're going to want to keep those to get updated allocations for shutdown... I'd need to go spelunking a bit to check for that.
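
A minimal sketch of that first approach, with hypothetical names (the real NodeClientAllocsResponse lives in nomad/structs and has other fields omitted here):

// Hypothetical new field on the Node.GetClientAllocs response so the
// server can tell the client which allocs are already terminal.
type NodeClientAllocsResponse struct {
    // Allocs maps alloc IDs to their modify indexes (existing field).
    Allocs map[string]uint64

    // TerminalAllocs would list allocs the server already considers
    // terminal (hypothetical field name).
    TerminalAllocs map[string]struct{}
}

// Then, inside watchAllocations, roughly:
_, isTerminal := resp.TerminalAllocs[allocID]
if !ok && isTerminal && !isInvalid {
    // Terminal on the server and no local alloc runner: skip pulling.
    filtered[allocID] = struct{}{}
    continue
}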

@marvinchin
Contributor Author

Likewise :)

The resp variable at the top is a NodeClientAllocsResponse which contains a map of alloc IDs to modify indexes, but not the desired status of the allocation.

Ah, I didn't notice that! Thanks for pointing that out.

It might even be possible to leave those stopped allocs out of the .Allocs field entirely, but I suspect that we're going to want to keep those to get updated allocations for shutdown...

Hm, that's an interesting idea. My immediate thought is that we wouldn't be able to distinguish between when an allocation has been GC-ed from the server, vs. when an allocation has reached a terminal client state. But maybe it doesn't matter because they should be handled the same way anyway? I'm not confident in this at all though!

@marvinchin
Contributor Author

Hi @tgross, we might be able to work on a PR for this issue 😄

Have you had the chance to see whether or not leaving out the stopped allocs in the .Allocs field is a workable approach? Otherwise, I'm tempted to go with the more straightforward approach of adding the desired status of the allocation to the response and filtering based on that. Let me know which you'd prefer!

@tgross
Member

tgross commented May 2, 2024

Hi @marvinchin, that'd be great!

Have you had the chance to see whether or not leaving out the stopped allocs in the .Allocs field is a workable approach?

Yeah, we almost certainly don't want to do that. If the alloc is simply missing from the list, the client uses that as a signal not only that it should stop the alloc but that the alloc should be client-side garbage collected (because the server doesn't know about it anymore... this is typically going to happen for clients that have been offline for a while and have reconnected). That would cause allocation logs to be cleaned up immediately after an allocation gets rescheduled, and we don't want that.
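
To illustrate the behavior being described, here is a simplified sketch (not the actual client code) of the diff the client effectively performs: allocs the server stops reporting are not merely stopped, their local state is destroyed.

// Simplified illustration: allocs present on the server but newer than
// local state get pulled; allocs the server no longer reports get
// removed, which destroys their local state, including logs.
type allocRunner struct {
    allocModifyIndex uint64
}

func diffAllocs(serverAllocs map[string]uint64, local map[string]*allocRunner) (pull, remove []string) {
    for allocID, modifyIndex := range serverAllocs {
        ar, ok := local[allocID]
        if !ok || modifyIndex > ar.allocModifyIndex {
            pull = append(pull, allocID)
        }
    }
    for allocID := range local {
        if _, ok := serverAllocs[allocID]; !ok {
            // The server doesn't know about this alloc anymore, so the
            // client stops it and garbage collects its local state.
            remove = append(remove, allocID)
        }
    }
    return pull, remove
}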

@tgross tgross removed their assignment May 10, 2024
@tgross tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024
@marvinchin
Contributor Author

marvinchin commented Jul 14, 2024

Sorry for the delay, just wanted to mention that I haven't managed to find the time to work on this yet, in case anyone else is interested in picking this up.

In the meantime we've been working around this by reducing the GC interval to avoid accumulating too many allocations in the client state.
