client: do not restart restored tasks until server is contacted #5669
Conversation
}

return h.TaskStatus(), nil
Ended up not using this code during testing, but it seems generally safe and useful.
		return
	case <-tr.serversContactedCh:
		tr.logger.Trace("server contacted; unblocking waiting task")
	}
Should there be a timeout after which we allow the task to restart? Some people may be relying on the fact that disconnected Nomad nodes will continue to run their tasks.
Alternatively we could have a client config and/or jobspec parameter for controlling this behavior.
I'm not sure - but I'd probably avoid adding more knobs to maintain until we hear clear demand.
To me the problem is that some tasks are not restarted, leaving a partially running alloc while the Nomad agent isn't running, potentially for hours. I find it easier to reason about recovery if a client that starts without being able to connect to a server behaves the same as a client that never started. Adding a timeout seems to complicate the story for such a narrow case without really addressing the bigger problem of a partially failing/running alloc while the client is gone.
My one big concern is that when this is released it will break the ability for disconnected nodes to restart.
Right now completely disconnected nodes can fully reboot and will restart anything they were running before regardless of whether or not contact is ever made with servers.
It's difficult to know if anyone is relying on this behavior because it is much more of an emergent property of the client's code than an intentional and documented design choice.
Discussed offline and decided to block indefinitely. If a client can't contact servers in a timely fashion (seconds), it's likely to have its non-system allocs marked as lost and rescheduled elsewhere anyway, so there's even less reason to restart at that point.
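To make the agreed behavior concrete, here is a minimal, self-contained sketch of the blocking wait. The select cases mirror the snippet above; the surrounding type, the killCtx field, and the waitForServers helper are assumptions for illustration, not the actual TaskRunner code.

```go
package main

import (
	"context"
	"log"
	"time"
)

// taskRunner is a stand-in for Nomad's TaskRunner; only the fields relevant
// to the restore-wait behavior discussed above are included (assumed names).
type taskRunner struct {
	killCtx            context.Context
	serversContactedCh <-chan struct{}
}

// waitForServers blocks a task that failed to reattach on restore until the
// client contacts its servers (the channel is closed) or the task is killed.
// There is deliberately no timeout case, matching the decision to block
// indefinitely.
func (tr *taskRunner) waitForServers() bool {
	select {
	case <-tr.killCtx.Done():
		return false // killed while waiting; never restart
	case <-tr.serversContactedCh:
		log.Println("server contacted; unblocking waiting task")
		return true
	}
}

func main() {
	contacted := make(chan struct{})
	tr := &taskRunner{
		killCtx:            context.Background(),
		serversContactedCh: contacted,
	}

	// Simulate the client pulling allocs from the servers shortly after startup.
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(contacted)
	}()

	if tr.waitForServers() {
		log.Println("restarting restored task")
	}
}
```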
Force-pushed from 9707a9d to e272d0b
The code and approach lgtm.
// serversContactedCh is passed to TaskRunners so they can detect when
// servers have been contacted for the first time in case of a failed
// restore.
serversContactedCh <-chan struct{}
This is more precisely "fetched the expected running allocations from the server", not heartbeat/registration/etc. Not sure if it's worth making the distinction, or where to add it.
Yeah, we've never had good nomenclature for this. The code base often talks about "syncing" allocs, but since client->server and server->client updates are completely independent code paths it's always ambiguous to say "syncing".

- syncedFromServersCh
- allocsPulledCh
- allocsRetrievedCh
- unblockedRestoredCh (we could go with a name related to its use instead of the event/state it represents, but that doesn't seem as good)

Any thoughts/preferences? allocsPulledCh might be my preference as we use the term "pulled" in related code/logging.
I'd be OK with clarifying in the comment while keeping the field name the same.
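For illustration, one possible clarified doc comment that keeps the field name, per the suggestion above; the exact wording here is an assumption, not taken from the PR.

```go
// serversContactedCh is closed the first time the client pulls its expected
// allocations from the servers (not merely on registration or heartbeat).
// TaskRunners use it to decide when tasks that failed to restore may restart.
serversContactedCh <-chan struct{}
```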
Refactoring of 104067b: replace the MarkLive method with a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call; the new approach does a once.Do call in runAllocs to accomplish the same thing with less work. This also made it possible to remove the gate abstraction, which did much more than was needed.
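A rough sketch of the once.Do pattern described above; the client type and the runAllocs body here are simplified stand-ins, not the actual Nomad client code.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a stand-in for the Nomad client, reduced to the pieces needed to
// show the once.Do pattern (field names are assumptions).
type client struct {
	serversContactedCh   chan struct{}
	serversContactedOnce sync.Once
}

// runAllocs runs whenever allocations are pulled from the servers. The first
// call closes serversContactedCh, unblocking any TaskRunners waiting after a
// failed restore; the sync.Once makes later calls no-ops for the channel.
func (c *client) runAllocs() {
	c.serversContactedOnce.Do(func() {
		close(c.serversContactedCh)
	})
	// ... reconcile the pulled allocations with what is currently running ...
}

func main() {
	c := &client{serversContactedCh: make(chan struct{})}
	c.runAllocs()
	c.runAllocs() // safe: the channel is closed exactly once

	select {
	case <-c.serversContactedCh:
		fmt.Println("serversContactedCh closed; restored tasks may restart")
	default:
		fmt.Println("still waiting for server contact")
	}
}
```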
Registration and restoring allocs don't share state or depend on each other in any way (syncing allocs with servers is done outside of registration). Since restoring is synchronous, start the registration goroutine first. For nodes with lots of allocs to restore or close to their heartbeat deadline, this could be the difference between becoming "lost" or not.
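A hypothetical sketch of that startup ordering; the function names are placeholders for the client's registration/heartbeat loop and its synchronous restore, not the real method names.

```go
package main

import (
	"fmt"
	"time"
)

// registerAndHeartbeat stands in for the client's registration and heartbeat
// loop, which shares no state with restore.
func registerAndHeartbeat() {
	fmt.Println("registering with servers and heartbeating")
}

// restoreState stands in for the synchronous restore of local allocations,
// which can be slow when there are many allocs.
func restoreState() {
	fmt.Println("restoring allocations from local state")
	time.Sleep(200 * time.Millisecond)
}

func main() {
	// Kick off registration first so a slow restore cannot push the node
	// past its heartbeat deadline...
	go registerAndHeartbeat()

	// ...then run the synchronous restore.
	restoreState()

	fmt.Println("client started")
}
```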
Force-pushed from ce5b474 to abd809d
Fixes #1795
Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.
This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
The PR also includes a small mock_driver fix and a systemd unit file improvement from failed testing efforts.
I gave up on e2e testing reboots: there were many non-deterministic aspects, and even if the correct assertions were made for each possible series of events, I couldn't figure out a way to prevent false positives (the old, buggy code would still often pass the test due to the specific setup of the test environment).