reload cni network config if has fs change events #1405
Conversation
No idea what it is. 😢
/retest
Forgot one thing: an rlock will wait when a wlock is pending, so looping to reload the CNI network conf can still cause that NotReady issue. We need inotify to load the conf on demand.
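In other words, something along these lines: a goroutine drains fsnotify events from the conf dir and triggers a reload only on relevant changes, instead of reloading inside every Status call. A rough sketch, assuming github.com/fsnotify/fsnotify; `reloadOnChange` and `syncFn` are placeholder names, not the PR's actual code:

```go
package cniwatch

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// reloadOnChange blocks, reloading the CNI config (via syncFn) only when
// the watched conf dir reports a relevant filesystem event, rather than
// on every Status CRI call.
func reloadOnChange(watcher *fsnotify.Watcher, syncFn func() error) {
	for {
		select {
		case ev, ok := <-watcher.Events:
			if !ok {
				return
			}
			// Create/Write/Remove/Rename can all change the effective config;
			// a bare Chmod is ignored.
			if ev.Op&(fsnotify.Create|fsnotify.Write|fsnotify.Remove|fsnotify.Rename) != 0 {
				if err := syncFn(); err != nil {
					log.Printf("failed to reload cni config after %s: %v", ev, err)
				}
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Printf("cni conf watcher error: %v", err)
		}
	}
}
```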
@yujuhong @Random-Liu still having problems running infra on gcloud. Is this a known issue? Thoughts?
I know this is wip :) but thought I'd give some early comments.
@mikebrow updated. PTAL. Thanks! It seems that
/retest
/test pull-cri-containerd-node-e2e
Sorry, I don't. @Random-Liu how would one get a containerd log dump? Would we have to insert a print of the log at failure?
looking good! see nits
ping @Random-Liu, and how does one check the containerd log in e2e? 😂
@mikebrow I checked kubernetes with extra-logs and upload logs into artifacts :) finally got it
return nil, errors.Wrap(err, "failed to create fsnotify watcher")
}

if err := os.MkdirAll(confDir, 0700); err != nil {
the previous test failed because the conf dir doesn't exist.
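fsnotify can only watch a path that already exists, so the directory has to be created before it is handed to the watcher. A hedged sketch of that ordering; the helper name `newWatcher` is illustrative and this approximates rather than quotes the PR's code:

```go
package cniwatch

import (
	"os"

	"github.com/fsnotify/fsnotify"
	"github.com/pkg/errors"
)

// newWatcher creates the conf dir before watching it: fsnotify's
// Watcher.Add returns an error for a path that does not exist yet,
// which is why the earlier test run failed.
func newWatcher(confDir string) (*fsnotify.Watcher, error) {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return nil, errors.Wrap(err, "failed to create fsnotify watcher")
	}
	if err := os.MkdirAll(confDir, 0700); err != nil {
		return nil, errors.Wrapf(err, "failed to create cni conf dir %s", confDir)
	}
	if err := watcher.Add(confDir); err != nil {
		return nil, errors.Wrapf(err, "failed to watch cni conf dir %s", confDir)
	}
	return watcher, nil
}
```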
ah there it is in the artifacts/tmp* directory :-)
LGTM
ping @mikebrow is it ok to move on?
@fuweid yes we have a fix for the windows cri test ... it has been moved to github actions!
@mikebrow great. Thanks!
This unbreaks bbolt (as part of containerd) on 1.14+ (see etcd-io/bbolt#201 and etcd-io/bbolt#220), pulls in my patch to ignore image-defined volumes (containerd/cri#1504) and gets us some robustness fixes in the containerd CNI/CRI integration (containerd/cri#1405).

This also updates K8s at the same time since they share a lot of dependencies and only updating one is very annoying.

On the K8s side we mostly get the standard stream of fixes plus some patches that are no longer necessary. One annoying thing on the K8s side (but with no impact on functionality) is these messages in the logs of various components:

```
W0714 11:51:26.323590 1 warnings.go:67] policy/v1beta1 PodSecurityPolicy is deprecated in v1.22+, unavailable in v1.25+
```

They are caused by KEP-1635, but there's no explanation of why this gets logged so aggressively, considering that operators cannot do anything about it. There's no newer version of PodSecurityPolicy and you are pretty much required to use it if you use RBAC.

Test Plan: Covered by existing tests

Bug: T753
X-Origin-Diff: phab/D597
GitOrigin-RevId: f6c447da1de037c27646f9ec9f45ebd5d6660ab0
With the Go RWMutex design, once one goroutine has called Lock, no
goroutine should expect to be able to acquire a read lock until the
existing read lock has been released.

The original design reloads the cni network config on every single
Status CRI gRPC call. If one RunPodSandbox request holds the read lock
too long while allocating an IP, all other RunPodSandbox/StopPodSandbox
requests wait for that RunPodSandbox request to release the read lock.
The Status CRI call then fails and the kubelet becomes NotReady.

Reloading the cni network config on every single Status CRI call is not
necessary and also brings about the NotReady situation. To lower the
possibility of NotReady, CRI now reloads the cni network config only
when there are valid fs change events from the cni network config dir.

Signed-off-by: Wei Fu [email protected]
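To make the locking problem in the message above concrete, here is a small, self-contained illustration (not CRI code) of the Go sync.RWMutex behavior it relies on: once a writer is pending, even new readers wait, so one slow reader plus a frequent writer stalls everyone.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// RunPodSandbox analogue: holds the read lock for a long time (IP allocation).
	mu.RLock()
	go func() {
		time.Sleep(2 * time.Second)
		mu.RUnlock()
	}()

	// Status analogue: wants the write lock to reload the CNI config.
	go func() {
		mu.Lock() // blocks until the long-running reader releases its RLock
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

	// Another RunPodSandbox/StopPodSandbox analogue: a new reader now waits
	// behind the pending writer, so every later request piles up.
	start := time.Now()
	mu.RLock()
	fmt.Printf("second reader waited %v\n", time.Since(start))
	mu.RUnlock()
}
```

Run it and the second reader reports waiting roughly the full two seconds even though it only wanted a read lock, which is exactly how a slow RunPodSandbox plus the per-Status reload starves the Status call.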