Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: avoid the deadlock when failed to remove invalid sandbox #2073

Merged
merged 1 commit into from
Aug 15, 2018

Conversation

YaoZengzeng
Copy link
Contributor

@YaoZengzeng YaoZengzeng commented Aug 10, 2018

Signed-off-by: YaoZengzeng [email protected]

Ⅰ. Describe what this PR did

It has the possibility that we failed to run the sandbox and it is not being cleaned up.

Kubelet will use list to get the sandboxes, but will not get the status of the failed pod whose meta data has not been put into the Sandbox Store.

And Kubelet will keep trying to get the status of the failed pod and won't create a new one to replace it.

It's a DEAD LOCK.

Actually Kubelet should not know the existent of invalid pod whose meta data won't be in the Sandbox Store.

So we could avoid the DEAD LOCK mentioned above.

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how you did it

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

@pouchrobot pouchrobot added kind/bug This is bug report for project size/L labels Aug 10, 2018
// Kubelet will use list to get the sandboxes, but will not get the status of the failed pod
// whose meta data has not been put into the Sandbox Store. And Kubelet will keep trying to
// get the status of the failed pod and won't create a new one to replace it. It's a DEAD LOCK.
// Actually Kubelet should not know the existent of invalid pod whose meta data won't be in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%s/existent/existence ?

// it is still running?
if status != apitypes.StatusRunning && status != apitypes.StatusCreated {
// Remove invalid sandbox.
c.ContainerMgr.Remove(ctx, sandbox.ID, &apitypes.ContainerRemoveOptions{Volumes: true, Force: true})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove is a dangerous action, should we just print a warning log and just keep the invalid containers.

Maybe some containers are invalid for cri, but valid for pouchd, WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The containers belong to pouchd have been filtered already.

The containers here are CRI sandbox.

If we don't remove the invalid sandbox container here, they will never be removed :)

@pouchrobot
Copy link
Collaborator

ping @YaoZengzeng
We found that this PR is 20 commits, which is more than 10 commits, behind master.
Please rebase the branch against master and push it back again. Thanks a lot.

@rudyfly
Copy link
Collaborator

rudyfly commented Aug 11, 2018

I think add checking invalid sandboxes is necessary, but we should try to know why remove container is failed?

@YaoZengzeng
Copy link
Contributor Author

@rudyfly Yes, maybe we should print the error message when failed to remove the invalid sandbox.

@codecov-io
Copy link

codecov-io commented Aug 13, 2018

Codecov Report

Merging #2073 into master will decrease coverage by 0.13%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2073      +/-   ##
==========================================
- Coverage      65%   64.87%   -0.14%     
==========================================
  Files         209      209              
  Lines       16227    16275      +48     
==========================================
+ Hits        10548    10558      +10     
- Misses       4370     4399      +29     
- Partials     1309     1318       +9
Flag Coverage Δ
#criv1alpha1test 33.41% <33.33%> (+0.04%) ⬆️
#criv1alpha2test 33.94% <33.33%> (-0.06%) ⬇️
#integrationtest 39.48% <0%> (-0.41%) ⬇️
#unittest 23.74% <0%> (-0.08%) ⬇️
Impacted Files Coverage Δ
cri/v1alpha2/cri.go 64.51% <60%> (-0.56%) ⬇️
cri/v1alpha1/cri.go 63.84% <60%> (-0.39%) ⬇️
cri/v1alpha1/cri_utils.go 83.64% <68.42%> (-0.52%) ⬇️
cri/v1alpha2/cri_utils.go 82.33% <68.42%> (-0.42%) ⬇️
daemon/containerio/hijack_conn.go 77.14% <0%> (-11.43%) ⬇️
daemon/containerio/options.go 84.12% <0%> (-4.77%) ⬇️
daemon/mgr/container_utils.go 82.22% <0%> (-2.97%) ⬇️
pkg/meta/store.go 57.14% <0%> (-2.39%) ⬇️
daemon/containerio/container_io.go 73.48% <0%> (-1.11%) ⬇️
daemon/mgr/container.go 55.43% <0%> (-0.41%) ⬇️
... and 3 more

if err != nil {
return nil, fmt.Errorf("failed to filter invalid sandboxes: %v", err)
}

sandboxes := make([]*runtime.PodSandbox, 0, len(sandboxList))
for _, s := range sandboxList {
sandbox, err := toCriSandbox(s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about put the outermost loop of filterInvalidSandboxes here, WDYT?

Expect that, LGTM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... It's a code organization issue, I want to encapsulate all the logic into this function.

Now, I think it's OK.

As the code evolve, maybe it will be better to put the loop outside :)

@YaoZengzeng YaoZengzeng force-pushed the meta branch 2 times, most recently from 74b449b to 0fa378e Compare August 14, 2018 08:33
Copy link
Contributor

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @rudyfly please check it again. Thanks

@rudyfly rudyfly merged commit ccea89e into AliyunContainerService:master Aug 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is bug report for project size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants