reload cni network config if has fs change events #1405
Conversation
No idea what it is. 😢
/retest
Forgot one thing: an rlock will wait when a wlock is pending, so looping to reload the CNI network conf can still cause that NotReady issue. We need inotify to load the conf on demand.
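In other words, something along these lines: a goroutine drains fsnotify events from the conf dir and triggers a reload only on relevant changes, instead of reloading inside every Status call. A rough sketch, assuming github.com/fsnotify/fsnotify; `reloadOnChange` and `syncFn` are placeholder names, not the PR's actual code:

```go
package cniwatch

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// reloadOnChange blocks, reloading the CNI config (via syncFn) only when
// the watched conf dir reports a relevant filesystem event, rather than
// on every Status CRI call.
func reloadOnChange(watcher *fsnotify.Watcher, syncFn func() error) {
	for {
		select {
		case ev, ok := <-watcher.Events:
			if !ok {
				return
			}
			// Create/Write/Remove/Rename can all change the effective config;
			// a bare Chmod is ignored.
			if ev.Op&(fsnotify.Create|fsnotify.Write|fsnotify.Remove|fsnotify.Rename) != 0 {
				if err := syncFn(); err != nil {
					log.Printf("failed to reload cni config after %s: %v", ev, err)
				}
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Printf("cni conf watcher error: %v", err)
		}
	}
}
```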
@yujuhong @Random-Liu still having problems running infra on gcloud. Is this a known issue? Thoughts?
I know this is wip :) but thought I'd give some early comments.
@mikebrow updated. PTAL. Thanks! It seems that
/retest
/test pull-cri-containerd-node-e2e
Sorry, I don't. @Random-Liu how would one get a containerd log dump? Would we have to insert a print of the log at failure?
looking good! see nits
ping @Random-Liu, and how does one check the containerd log in e2e? 😂
@mikebrow I checked kubernetes with extra-logs and upload logs into artifacts :) finally got it
return nil, errors.Wrap(err, "failed to create fsnotify watcher")
}

if err := os.MkdirAll(confDir, 0700); err != nil {
the previous test failed because the conf dir doesn't exist.
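fsnotify can only watch a path that already exists, so the directory has to be created before it is handed to the watcher. A hedged sketch of that ordering; the helper name `newWatcher` is illustrative and this approximates rather than quotes the PR's code:

```go
package cniwatch

import (
	"os"

	"github.com/fsnotify/fsnotify"
	"github.com/pkg/errors"
)

// newWatcher creates the conf dir before watching it: fsnotify's
// Watcher.Add returns an error for a path that does not exist yet,
// which is why the earlier test run failed.
func newWatcher(confDir string) (*fsnotify.Watcher, error) {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return nil, errors.Wrap(err, "failed to create fsnotify watcher")
	}
	if err := os.MkdirAll(confDir, 0700); err != nil {
		return nil, errors.Wrapf(err, "failed to create cni conf dir %s", confDir)
	}
	if err := watcher.Add(confDir); err != nil {
		return nil, errors.Wrapf(err, "failed to watch cni conf dir %s", confDir)
	}
	return watcher, nil
}
```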
ah there it is in the artifacts/tmp* directory :-)
LGTM
ping @mikebrow is it ok to move on?
@fuweid yes we have a fix for the windows cri test ... it has been moved to github actions!
@mikebrow great. Thanks!
This unbreaks bbolt (as part of containerd) on 1.14+ (see etcd-io/bbolt#201 and etcd-io/bbolt#220), pulls in my patch to ignore image-defined volumes (containerd/cri#1504) and gets us some robustness fixes in the containerd CNI/CRI integration (containerd/cri#1405).

This also updates K8s at the same time since they share a lot of dependencies and only updating one is very annoying.

On the K8s side we mostly get the standard stream of fixes plus some patches that are no longer necessary. One annoying thing on the K8s side (but with no impact on functionality) is these messages in the logs of various components:

```
W0714 11:51:26.323590 1 warnings.go:67] policy/v1beta1 PodSecurityPolicy is deprecated in v1.22+, unavailable in v1.25+
```

They are caused by KEP-1635, but there's no explanation of why this gets logged so aggressively, considering that operators cannot do anything about it. There's no newer version of PodSecurityPolicy and you are pretty much required to use it if you use RBAC.

Test Plan: Covered by existing tests

Bug: T753
X-Origin-Diff: phab/D597
GitOrigin-RevId: f6c447da1de037c27646f9ec9f45ebd5d6660ab0
With the Go RWMutex design, once one goroutine has called Lock, no
goroutine should expect to be able to acquire a read lock until the
existing read lock has been released.

The original design reloads the cni network config on every single
Status CRI gRPC call. If one RunPodSandbox request holds the read lock
too long while allocating an IP, all other RunPodSandbox/StopPodSandbox
requests wait for that RunPodSandbox request to release the read lock.
The Status CRI call then fails and the kubelet becomes NotReady.

Reloading the cni network config on every single Status CRI call is not
necessary and also brings about the NotReady situation. To lower the
possibility of NotReady, CRI now reloads the cni network config only
when there are valid fs change events from the cni network config dir.

Signed-off-by: Wei Fu [email protected]
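To make the locking problem in the message above concrete, here is a small, self-contained illustration (not CRI code) of the Go sync.RWMutex behavior it relies on: once a writer is pending, even new readers wait, so one slow reader plus a frequent writer stalls everyone.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// RunPodSandbox analogue: holds the read lock for a long time (IP allocation).
	mu.RLock()
	go func() {
		time.Sleep(2 * time.Second)
		mu.RUnlock()
	}()

	// Status analogue: wants the write lock to reload the CNI config.
	go func() {
		mu.Lock() // blocks until the long-running reader releases its RLock
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

	// Another RunPodSandbox/StopPodSandbox analogue: a new reader now waits
	// behind the pending writer, so every later request piles up.
	start := time.Now()
	mu.RLock()
	fmt.Printf("second reader waited %v\n", time.Since(start))
	mu.RUnlock()
}
```

Run it and the second reader reports waiting roughly the full two seconds even though it only wanted a read lock, which is exactly how a slow RunPodSandbox plus the per-Status reload starves the Status call.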