
Use backoff to tolerate race condition. #1894

Merged: 1 commit merged into google:master from avoid-containerd-race on Feb 23, 2018

Conversation

@Random-Liu (Member) commented on Feb 23, 2018

I keep seeing this error in the kubelet log:

```
Feb 23 09:31:05 workstation kubelet[11956]: I0223 09:31:05.336135   11956 factory.go:105] Error trying to work out if we can handle /kubepods/burstable/pod445bc55c-187c-11e8-bb75-42010af00002/de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373: error inspecting container: Error: No such container: de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373
Feb 23 09:31:05 workstation kubelet[11956]: I0223 09:31:05.336156   11956 factory.go:116] Factory "docker" was unable to handle container "/kubepods/burstable/pod445bc55c-187c-11e8-bb75-42010af00002/de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373"
Feb 23 09:31:05 workstation kubelet[11956]: I0223 09:31:05.336733   11956 factory.go:112] Using factory "containerd" for container "/kubepods/burstable/pod445bc55c-187c-11e8-bb75-42010af00002/de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373"
Feb 23 09:31:05 workstation kubelet[11956]: W0223 09:31:05.337858   11956 manager.go:1178] Failed to process watch event {EventType:0 Name:/kubepods/burstable/pod445bc55c-187c-11e8-bb75-42010af00002/de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373 WatchSource:0}: task de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373 not found: not found
```

The reason is that the container cgroup is created in the middle of task creation, so there is a race condition: cadvisor sees the cgroup, but the corresponding task hasn't been fully created yet in containerd.

There is no such problem for docker, because docker holds an internal lock that makes sure Inspect only returns after container start has finished. CRI-Containerd, and I believe cri-o, do the same thing. In effect, cadvisor is relying on container runtime internal implementation details here.

However, here we are talking to containerd directly, which still has this race condition.

This PR adds a retry with backoff to avoid the problem for now; we should come up with a better fix in the next release.
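
The idea, as a minimal self-contained sketch (hypothetical names throughout: `taskPidWithBackoff` and the injected `getTaskPid` stand in for cadvisor's containerd task lookup, and `errNotFound` stands in for containerd's not-found error; this is not the PR's exact code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound is a stand-in for containerd's "not found" error, which the
// real code detects with errdefs.IsNotFound.
var errNotFound = errors.New("not found")

// taskPidWithBackoff retries getTaskPid with exponential backoff while the
// error is "not found", i.e. while the task may still be mid-creation.
func taskPidWithBackoff(getTaskPid func() (uint32, error)) (uint32, error) {
	backoff := 100 * time.Millisecond
	for retry := 5; ; retry-- {
		pid, err := getTaskPid()
		if err == nil {
			return pid, nil
		}
		// Only "not found" is worth retrying; any other error, or an
		// exhausted retry budget, is returned to the caller.
		if !errors.Is(err, errNotFound) || retry <= 1 {
			return 0, err
		}
		time.Sleep(backoff)
		backoff *= 2 // 100ms, 200ms, 400ms, ...
	}
}

func main() {
	attempts := 0
	pid, err := taskPidWithBackoff(func() (uint32, error) {
		attempts++
		if attempts < 3 {
			return 0, errNotFound // task not fully created yet
		}
		return 11956, nil
	})
	fmt.Println(pid, err) // 11956 <nil>, after two backoff sleeps
}
```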

I've validated this PR, and it works for me.

/cc @abhi @dashpole

Signed-off-by: Lantao Liu [email protected]

```go
// `ContainerStatus` only returns result after `StartContainer` finishes.
var taskPid uint32
backoff := 100 * time.Millisecond
for retry := 5; retry > 0; retry-- {
```
Collaborator commented:

so after 5 retries we continue on? Will we not get any metrics in this case? Would it be better just to return an error so it doesn't fail silently?

@Random-Liu (Member, Author) replied:

... my bad

@Random-Liu (Member, Author) replied:

Done

```go
	}
	retry--
	// err is non-nil here: the err == nil case breaks out of the loop above.
	if !errdefs.IsNotFound(err) || retry == 0 {
		return nil, err
```
Collaborator commented:

If the retry == 0 case is hit, we will return err = nil. This will probably result in a nil pointer somewhere down the road. Can we return a new error for this case?

@Random-Liu (Member, Author) replied:

No, we check err == nil before this, right?

Collaborator replied:

ah, right
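
To make that ordering concrete, here is a hypothetical, self-contained reduction of the loop shape from the diff fragments above (backoff sleeps omitted); `find` stands in for the task lookup and always fails so the exhaustion path runs:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// find stands in for the containerd task lookup and always fails here.
func find() (int, error) { return 0, errNotFound }

func demo() (int, error) {
	var (
		pid int
		err error
	)
	for retry := 5; retry > 0; retry-- {
		pid, err = find()
		if err == nil {
			break // the only way to reach the final return with a nil err
		}
		retry--
		if !errors.Is(err, errNotFound) || retry == 0 {
			return 0, err // err was already checked non-nil above
		}
	}
	return pid, err
}

func main() {
	_, err := demo()
	fmt.Println(err) // prints "not found", never <nil>
}
```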

@dashpole (Collaborator) left a comment:

lgtm

@dashpole merged commit b817801 into google:master on Feb 23, 2018.
@Random-Liu deleted the avoid-containerd-race branch on February 23, 2018 at 22:32.
@abhi (Contributor) left a comment:

LGTM

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request on Mar 8, 2018:
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Update cadvisor to v0.29.1

Update cadvisor to v0.29.1 to include a bug fix for containerd integration. google/cadvisor#1894

**Release note**:

```release-note
none
```