Handle potential timeouts when attempting to get the CNI lock #4666

lbernail · 2020-10-28T11:08:09Z

With two different CNI plugins (and for two very different problems with them) we have seen situations where a CNI operation never returns and hangs, never releasing the the CNI lock. I think it would be helpful to add a timeout to the tentatives to get the lock and surface meaningful errors to the kubelet.

Today when this happens the CRI status call from the kubelet will time out with this log:

 Status from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The kubelet will then mark the node as NotReady with Reason container runtime is down which is misleading because crictl pods and crictl ps continues to work perfectly (but crictl info doesn't because it also calls the CRI status method).

The containerd call in the Status method is here: https://github.com/containerd/cri/blob/master/pkg/server/status.go#L44-L49
And the libcni.Status method is here: https://github.com/containerd/go-cni/blob/master/cni.go#L124-L132

I think adding a timeout to https://github.com/containerd/go-cni/blob/master/cni.go#L126 and returning a different meaningful error would be helpful and make debugging easier when this happens.

The default kubelet timeout for this call is 2mn: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/config/v1beta1/types.go#L447-L453

So if we decide to this, we should probably use a shorter timeout (90s?).

I don't think sync/rwmutex supports timeouts so we'd need to use a different implementation. I'd be more than happy to help with a PR if you think it makes sense.

The text was updated successfully, but these errors were encountered:

AkihiroSuda · 2020-10-28T11:31:17Z

Transferred the issue from the old CRI repo github.com/containerd/cri

fuweid · 2020-11-09T01:54:50Z

@lbernail which version of cri plugin are you using right now? I changed the logic that the CRI will not load the CNI plugin during Status call. containerd/cri#1405

lbernail · 2020-11-09T13:28:13Z

@fuweid I saw this on 1.2 but I'm pretty sure it also impacts master because we still have this call:

if err := c.netPlugin.Status(); err != nil {

which tries to acquire the CNI lock in the CRI status handler

I'll reproduce with 1.4 to be sure

fuweid · 2020-11-11T11:36:29Z

Any update? @lbernail

Both the c.netPlugin.Status() and c.netPlugin.Setup requires RLock. If there is no c.netPlugin.Load call, the RLock will not be impacted.

lbernail · 2020-11-12T17:05:53Z

@fuweid Sorry for the delay. I just test with 1.4.1 and it works a lot better!

So the only potential problem now is that we can't reload the configuration if a RunPodSandbox call is holding a Rlock which is not a big deal I think.

Is this only available since 1.4?

Many thanks and sorry for the issue, I had misunderstood the locking logic here

fuweid · 2020-11-18T02:30:05Z

@lbernail Yes. It is available since major release 1.4. :)

It is good to upgrade your containerd version ~

lbernail · 2020-11-18T11:18:00Z

We will soon!
Quick follow-up question: do you think we should add timeouts to Add/Del calls? It will avoid the situation where the kubelet CRI call times out with a misleading message.
We could surface a clearer error message from containerd such as "CNI invocation timed out")

fuweid · 2020-11-18T23:17:56Z

@lbernail the timeout can be propagated by kubelet RunPodSandbox context actually, like https://github.com/kubernetes/kubernetes/blob/v1.19.0/pkg/kubelet/cri/remote/remote_runtime.go#L98.

And containerd 1.4 release has added the context into https://github.com/containerd/cri/blob/release/1.4/pkg/server/sandbox_run.go#L147 (cni call uses that ctx https://github.com/containerd/go-cni/blob/8fbf3637eb5f67bb16eaacd4ec23a82e4d33b3ec/vendor/github.com/containernetworking/cni/pkg/invoke/raw_exec.go#L34)

fuweid · 2020-11-20T00:49:28Z

@lbernail I am going to close this issue. Please create new one if you have other questions. And thanks for reporting.

AkihiroSuda transferred this issue from containerd/cri Oct 28, 2020

AkihiroSuda added the area/cri Container Runtime Interface (CRI) label Oct 28, 2020

fuweid closed this as completed Nov 20, 2020

ykulazhenkov mentioned this issue Nov 4, 2022

Add capability to set VLANs on uplink interfaces k8snetworkplumbingwg/accelerated-bridge-cni#56

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle potential timeouts when attempting to get the CNI lock #4666

Handle potential timeouts when attempting to get the CNI lock #4666

lbernail commented Oct 28, 2020

AkihiroSuda commented Oct 28, 2020

fuweid commented Nov 9, 2020

lbernail commented Nov 9, 2020 •

edited

Loading

fuweid commented Nov 11, 2020

lbernail commented Nov 12, 2020

fuweid commented Nov 18, 2020

lbernail commented Nov 18, 2020

fuweid commented Nov 18, 2020

fuweid commented Nov 20, 2020

Handle potential timeouts when attempting to get the CNI lock #4666

Handle potential timeouts when attempting to get the CNI lock #4666

Comments

lbernail commented Oct 28, 2020

AkihiroSuda commented Oct 28, 2020

fuweid commented Nov 9, 2020

lbernail commented Nov 9, 2020 • edited Loading

fuweid commented Nov 11, 2020

lbernail commented Nov 12, 2020

fuweid commented Nov 18, 2020

lbernail commented Nov 18, 2020

fuweid commented Nov 18, 2020

fuweid commented Nov 20, 2020

lbernail commented Nov 9, 2020 •

edited

Loading