-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allocator: Avoid allocating tasks that are no longer running #2017
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2017 +/- ##
==========================================
+ Coverage 53.76% 53.83% +0.06%
==========================================
Files 109 109
Lines 19037 19033 -4
==========================================
+ Hits 10236 10247 +11
+ Misses 7563 7557 -6
+ Partials 1238 1229 -9 Continue to review full report at Codecov.
|
The issue would potentially be exposed only during leader re-election, because at swarm init there can't be any network attachment task in store. Correct ? |
That's right. It wouldn't happen when the task first fails, because the state would be |
Looks good to me |
Having a hard time mentally mapping this. Current state:
Proposed state:
So we are completely reversing the behavior? |
@aluzzardi: This shouldn't be reversing the behavior. I think maybe you're missing the Current state:
Proposed state:
|
Sorry, I don't get it :/ Aren't those two the same? Also, "Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below." -> why would we want to allocate tasks that are running? |
Basically this is changing an
Pseudocode for after:
The motivation for this is that network attachment tasks never change their desired state. The tasks I saw in a dump from a cluster getting lots of log spam from repeated allocations were network attachment tasks with desired state
I had a comment in the PR description about this:
Basically trying to make the least invasive change that will fix the problem, but I think tightening the check to skip running tasks as well would also be correct. |
Why not:
If we don't desire a task to be running, but it is actually running, should we really skip it? |
I think you're right. Paying any attention at all to desired state here seems bogus. Suppose the orchestrator tries to start a task, but soon after, the service is updated, and it ends up changing that task to Should we just switch to observed state? I think we should, but I'm wondering if I'm missing anything here. The only potential risk I see in making this change is that we won't deallocate failed tasks until the observed state changes. So if a node is down, we might not be able to free the resources until the task becomes orphaned. But I thought that was how it was supposed to work anyway! The fact that it's freeing resources based on desired state seems like a bug. |
81be482
to
aad8422
Compare
@aluzzardi: I've updated this to only check observed state. I think this is the correct thing to do. It also makes the code simpler. It's a little bit more of a change from what was there before, though. I removed the @aboch: Can you please take a second look? |
The allocator currently uses taskDead to check if a task should not be allocated. However, this checks that both the desired state and actual state are past "running". In the case of attachment tasks, there is no orchestration, and the desired state is always "running". This means allocation can be reattempted on a failed attachment task indefinitely. Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below. Signed-off-by: Aaron Lehmann <[email protected]>
aad8422
to
9de6ccd
Compare
The PR that added desired state checking is #435. I reviewed it, but right now I don't understand why it's necessary or beneficial. I see that the conditions inside Still, I'm a little concerned that we don't understand this code very well. I'm tempted to go back to the original fix to be safer, but that's just going to delay addressing the technical debt. What about removing the desired state checking in master, and just doing the tweak in the original PR for the stable release? Not perfect either because it introduces a weird difference, but it seems worth considering. |
FWIW, I vendored this PR (along with all the others) into docker, and all integration tests passed. |
@@ -281,7 +281,7 @@ func (a *Allocator) doNetworkInit(ctx context.Context) (err error) { | |||
} | |||
|
|||
for _, t := range tasks { | |||
if taskDead(t) { | |||
if t.Status.State > api.TaskStateRunning { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine, because if the task's state is beyond running, we should not bother attempting an allocation for it, regardless of the desired state. Once a task moves beyond running, it cannot be brought back in the pipeline right ?
A new task will be created in place of it, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once a task moves beyond running, it cannot be brought back in the pipeline right ?
That's correct.
if taskDead(t) || isDelete { | ||
// If the task has stopped running then we should free the network | ||
// resources associated with the task right away. | ||
if t.Status.State > api.TaskStateRunning || isDelete { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only concern here is, as you guys already discussed here and also in the pr which introduced the current check, is that we release the resources associated to this task, while the task may still be running on an unresponsive node, and a new task coming up will be given those resources.
So, we could end up with two containers in the cluster having same IP address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I don't think this is a problem. This is looking at observed state, not desired state. So we err on the side of keeping the resources tied up if the node is unresponsive. There is a mechanism that marks tasks as orphaned after 24 hours of the node being down, so the allocator can finally free these resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for the clarification.
Looking at https://github.com/docker/swarmkit/blob/master/manager/allocator/network.go#L295-L300, I think |
Agreed. I realized that after making that comment. |
Looks good to me still. |
LGTM |
1 similar comment
LGTM |
The allocator currently uses
taskDead
to check if a task should not be allocated. However, this checks that both the desired state and actual state are past "running". In the case of network attachment tasks, there is no orchestration, and the desired state is always "running". This means allocation can be reattempted on a failed attachment task indefinitely.Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.
Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.
cc @alexmavr @yongtang @aboch @dongluochen