Restarted node takes a long time to become useful due to re-processing already terminal allocations #20354
Hi @marvinchin, nice to hear from you again!
The catch here is that the client doesn't know the allocations are terminal unless it's already seen them, and it doesn't know whether it has seen an allocation before on startup if it doesn't have an alloc runner for it. And the server doesn't necessarily know the client doesn't have an alloc runner for them, because the client could have restarted and come back up without missing a heartbeat. If we zoom out a bit from your code snippet:

```go
for allocID, modifyIndex := range resp.Allocs {
	// Pull the allocation if we don't have an alloc runner for the
	// allocation or if the alloc runner requires an updated allocation.
	//XXX Part of Client alloc index tracking exp
	c.allocLock.RLock()
	currentAR, ok := c.allocs[allocID]
	c.allocLock.RUnlock()

	// Ignore alloc updates for allocs that are invalid because of
	// initialization errors
	c.invalidAllocsLock.Lock()
	_, isInvalid := c.invalidAllocs[allocID]
	c.invalidAllocsLock.Unlock()

	if (!ok || modifyIndex > currentAR.Alloc().AllocModifyIndex) && !isInvalid {
		// Only pull allocs that are required. Filtered allocs might be
		// at a higher index, so ignore it.
		if modifyIndex > pullIndex {
			pullIndex = modifyIndex
		}
		pull = append(pull, allocID)
	} else {
		filtered[allocID] = struct{}{}
	}
}
```

It might even be possible to leave those stopped allocs out of the `.Allocs` field entirely.
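For reference, the `resp` being ranged over here comes from the `Node.GetClientAllocs` RPC; to the best of my understanding its relevant shape is roughly the following (struct and field names recalled from the codebase, so treat this as an approximation rather than a definitive reference):

```go
// Approximate shape of the reply the client loop above consumes: the server
// sends only alloc ID -> modify index, not the allocation itself or its
// status. The client compares each index against its local alloc runner's
// copy to decide whether to pull the full allocation.
type NodeClientAllocsResponse struct {
	Allocs map[string]uint64 // alloc ID -> AllocModifyIndex

	// (other fields such as migrate tokens and query metadata omitted)
}
```

Because the map value is only an index, the client can't tell from this response alone that an allocation is terminal; it has to pull the full allocation to find out, which is exactly what makes the restart slow.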
Likewise :)
Ah, I didn't notice that! Thanks for pointing that out.
Hm, that's an interesting idea. My immediate thought is that we wouldn't be able to distinguish between an allocation that has been GC-ed from the server and one that has reached a terminal client state. But maybe it doesn't matter, because they should be handled the same way anyway? I'm not confident in this at all, though!
Hi @tgross, we might be able to work on a PR for this issue 😄 Have you had the chance to see whether leaving the stopped allocs out of the `.Allocs` field is a workable approach? Otherwise, I'm tempted to go with the more straightforward approach of adding the desired status of the allocation to the response and filtering based on that. Let me know which you'd prefer!
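For concreteness, a minimal sketch of what that more straightforward approach might look like, assuming we extend the response with an invented `DesiredStatuses` field (this is illustrative, not the actual Nomad API):

```go
// Hypothetical extension of NodeClientAllocsResponse: alongside the modify
// indexes, carry each allocation's desired status so the client can filter
// terminal allocs without pulling them. DesiredStatuses is an invented name.
type NodeClientAllocsResponse struct {
	Allocs map[string]uint64 // alloc ID -> AllocModifyIndex (existing)

	// DesiredStatuses maps alloc ID -> desired status, e.g. "run" or "stop".
	DesiredStatuses map[string]string
}
```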
Hi @marvinchin, that'd be great!
Yeah, we almost certainly don't want to do that. If the alloc is simply missing from the list, the client takes that as a signal not only that it should stop the alloc but that the alloc should be client-side garbage collected (because the server no longer knows about it; this typically happens for clients that have been offline for a while and have reconnected). That would cause allocation logs to be cleaned up immediately after an allocation gets rescheduled, and we don't want that.
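To paraphrase why that's dangerous (a rough restatement of the behavior described above, not the literal Nomad source):

```go
// Rough paraphrase, not the literal Nomad code: allocs the client knows
// about but the server's response omits are treated as "the server no
// longer knows this alloc" and are destroyed AND garbage collected locally,
// which removes their logs and data directories.
for allocID := range c.allocs {
	if _, known := resp.Allocs[allocID]; !known {
		c.destroyAndGC(allocID) // illustrative helper name, not a real method
	}
}
```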
Sorry for the delay; just wanted to mention that I haven't managed to find the time to work on this yet, in case anyone else is interested in picking it up. In the meantime we've been working around this by reducing the GC interval, to avoid accumulating too many allocations in the state.
Nomad version
1.6.x
Operating system and Environment details
Unix
Issue
When a client is restarted, it spends a long time on startup re-processing allocations that are already known to be terminal and had previously been GC-ed from its state.
As a result, the client node takes a long time to start new allocations assigned to it, because it does not poll the server for new allocations until it has finished processing the initial set of allocations it receives. In our environment, we have observed this delay to be on the order of minutes for busy client nodes.
This is similar to #16381, but focuses on the impact on the client. #17074 fixed the server-side issue that caused the modify index of allocations to be bumped on client restart, but the impact on the client remains.
Relevant code
Looking at the client's logic for filtering out allocations to process, in `nomad/client/client.go` (lines 2374 to 2384 at commit `594fedb`):
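For readers without the repo handy, those lines are the tail of the filtering loop quoted earlier in this thread; the decisive part is the pull condition, where `!ok` (no local alloc runner) always triggers a pull, even for terminal allocs:

```go
if (!ok || modifyIndex > currentAR.Alloc().AllocModifyIndex) && !isInvalid {
	// Only pull allocs that are required. Filtered allocs might be at a
	// higher index, so ignore it.
	if modifyIndex > pullIndex {
		pullIndex = modifyIndex
	}
	pull = append(pull, allocID)
} else {
	filtered[allocID] = struct{}{}
}
```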
It appears that it will pull allocations which are both:

- in a terminal state (according to the server), and
- not tracked by a local alloc runner (because they were previously GC-ed from the client's state).

My understanding is that such allocations need not be pulled, since they are already known to be terminal and there is no local state for them. If this is true, then we can simply update the filtering logic to filter out such allocations (see the sketch below), and that should fix the issue.
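If desired-status information were made available to this loop (for example via the response extension discussed in the comments above), the filtering change could look roughly like the following; `resp.DesiredStatuses` is the same hypothetical field as above, not an existing API, and this is a sketch rather than a finished fix:

```go
// Sketch of the proposed filter, under the assumption that the response now
// carries per-alloc desired statuses (hypothetical DesiredStatuses field).
// An alloc with no local runner that the server already wants stopped has
// nothing to run and no local state to clean up, so skip pulling it.
if !ok && resp.DesiredStatuses[allocID] == structs.AllocDesiredStatusStop {
	filtered[allocID] = struct{}{}
	continue
}
// ...existing pull/filter logic continues unchanged...
```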
Reproduction steps
Same as in #16381, in particular the part where we observe in the client logs that the client pulls an allocation even though both of the allocations it receives are already terminal.
Expected Result
Clients should not re-process allocations which are terminal on restart.
Actual Result
Clients re-process terminal allocations on restart, causing delays in starting new allocations.