CrashLoopBackOff Error in kube-proxy with kernel versions 5.12.2.arch1-1 and 5.10.35-1-lts #2240

Closed
ghost opened this issue May 11, 2021 · 20 comments · Fixed by #2241
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ghost

ghost commented May 11, 2021

What happened: After creating the cluster with kind create cluster, the kube-proxy pod goes into CrashLoopBackOff. This happens on kernel versions 5.12.2.arch1-1 and 5.10.35-1-lts. With kernel versions 5.12.1.arch1-1 and 5.10.34-1-lts I didn't have the issue.

What you expected to happen: All pods in the cluster should start without problems.

How to reproduce it (as minimally and precisely as possible): On an Arch Linux install with kernel version 5.12.2.arch1-1 or 5.10.35-1-lts and Docker installed, download the latest version of kind and run kind create cluster.

Anything else we need to know?:

  • Log of kube-proxy pod:
I0511 11:47:28.906526       1 node.go:172] Successfully retrieved node IP: 172.18.0.2                                                                                
I0511 11:47:28.906613       1 server_others.go:142] kube-proxy node IP is an IPv4 address (172.18.0.2), assume IPv4 operation
I0511 11:47:28.953210       1 server_others.go:185] Using iptables Proxier.                                                                                          
I0511 11:47:28.953346       1 server_others.go:192] creating dualStackProxier for iptables.
W0511 11:47:28.960804       1 server_others.go:492] detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for I
I0511 11:47:28.962804       1 server.go:650] Version: v1.20.2                                                                                                        
I0511 11:47:28.965997       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072                                                                
F0511 11:47:28.966114       1 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied
  • Events from pod:
Events:                                                                                                                                                              
Type     Reason     Age               From               Message                                                                                                   
----     ------     ----              ----               -------                                                                                                   
Normal   Scheduled  48s               default-scheduler  Successfully assigned kube-system/kube-proxy-s7w5w to kind-control-plane
Normal   Pulled     2s (x4 over 48s)  kubelet            Container image "k8s.gcr.io/kube-proxy:v1.20.2" already present on machine
Normal   Created    2s (x4 over 45s)  kubelet            Created container kube-proxy                                                                              
Normal   Started    2s (x4 over 45s)  kubelet            Started container kube-proxy                                                                              
Warning  BackOff    1s (x5 over 42s)  kubelet            Back-off restarting failed container
  • Tried it with both iptables and nftables; same result with both.

Environment:

  • kind version: (use kind version): Tested both:

    • v0.11.0-alpha+1d4788dd7461b3 go1.16.4
    • v0.10.0 go1.16.4
  • Kubernetes version: (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"archive", BuildDate:"2021-04-09T16:47:30Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-03-11T06:23:38Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-tp-docker)

Server:
 Containers: 12
  Running: 1
  Paused: 0
  Stopped: 11
 Images: 8
 Server Version: 20.10.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8c906ff108ac28da23f69cc7b74f8e7a470d1df0.m
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.35-1-lts
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.666GiB
 Name: avocado
 ID: ZNGF:FTZV:6BK6:VPE3:ZGAR:A5A2:VYEI:LUQE:AEU6:6MHN:ZGTZ:WR2V
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling

Kernel: 5.10.35-1-lts
CPU: Intel i5-7200U (4) @ 3.100GHz
  • iptables version: v1.8.7 (legacy)
  • nftables version: v0.9.8 (E.D.S.)
@ghost ghost added the kind/bug Categorizes issue or PR as related to a bug. label May 11, 2021
@cubic3d

cubic3d commented May 11, 2021

I'm getting the same results with 5.12.2-arch1-1.

Quick workaround if a cluster is needed fast: Manually set the parameter with sudo sysctl net/netfilter/nf_conntrack_max=131072 before creating the Kind cluster.
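
The workaround above could be wrapped in a small host-side helper. This is only a sketch, not part of kind: the function names conntrack_needs_raise and raise_conntrack_max are made up for illustration, and it assumes the nf_conntrack module is loaded so that the sysctl file exists.

```shell
# Hypothetical helper: exit status 0 when the current limit is below the target.
# $1 = current nf_conntrack_max, $2 = value kube-proxy wants to set
conntrack_needs_raise() {
  [ "$1" -lt "$2" ]
}

# Sketch of the workaround; run on the host, not inside a pod.
# $1 = target value, e.g. the one from the kube-proxy log line
raise_conntrack_max() {
  target=$1
  current=$(cat /proc/sys/net/netfilter/nf_conntrack_max) || return 1
  if conntrack_needs_raise "$current" "$target"; then
    # Same command as in the workaround above; needs root.
    sudo sysctl "net/netfilter/nf_conntrack_max=$target"
  fi
}
```

Usage would be along the lines of: raise_conntrack_max 131072 && kind create cluster. The value is not persistent and is lost on reboot.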

@BenTheElder
Member

This workaround works because of kubernetes/kubernetes#44919: kube-proxy will not try to write the value if the existing one is already high enough, despite the logs suggesting that it set it.

We could explicitly configure the max to 0 or some very small value in kind's kube-proxy possibly, but I think you would still want to increase the actual value for things to work well.
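
To make the numbers in this thread easier to follow, here is a rough shell sketch of that behaviour. It is an illustration, not kube-proxy's code (which is written in Go): the function names are invented, and the defaults assumed are kube-proxy's documented conntrack-max-per-core of 32768 and conntrack-min of 131072. The exact comparison may differ between kube-proxy versions, so setting the host to exactly the logged value, as the workaround does, is the safe choice.

```shell
# Illustrative only: compute the nf_conntrack_max kube-proxy would aim for.
# $1 = CPU cores, $2 = maxPerCore (default 32768), $3 = min (default 131072)
desired_conntrack_max() {
  cores=$1
  per_core=${2:-32768}
  floor=${3:-131072}
  if [ "$per_core" -le 0 ]; then
    echo 0   # 0 means "leave the sysctl alone"
    return
  fi
  scaled=$((cores * per_core))
  if [ "$scaled" -gt "$floor" ]; then
    echo "$scaled"
  else
    echo "$floor"
  fi
}

# Per the description above, a write is only attempted when the host value is too low.
# $1 = current host value, $2 = desired value
would_write_sysctl() {
  [ "$2" -gt 0 ] && [ "$1" -lt "$2" ]
}
```

With 4 cores, desired_conntrack_max 4 prints 131072, which matches the value in the log of the original report (a 4-CPU machine).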

In normal usage kind does not set this and relies on the host kernel to have a suitable value, which has been the case so far. I'm guessing Arch reduced the default in their latest kernels?

cc @aojea

@Juneezee

@hyutota @BenTheElder I don't think this is an Arch Linux-only issue.

According to the changelog of Linux 5.12.2, this commit (torvalds/linux@671c54e) has changed the behaviour of netfilter conntrack. I believe this is the commit that has caused this issue after upgrading to Linux 5.12.2.

@aojea
Contributor

aojea commented May 11, 2021

@hyutota @BenTheElder I don't think this is an Arch Linux-only issue.

According to the changelog of Linux 5.12.2, this commit (torvalds/linux@671c54e) has changed the behaviour of netfilter conntrack. I believe this is the commit that has caused this issue after upgrading to Linux 5.12.2.

Wow, so it seems that we can't set nf_conntrack_max in kind; it will fail for kernels 5.12.2+ 🤔

The good thing is that the fix seems simple: just enable the following by default, not only for the rootless provider:

{{if .RootlessProvider}}conntrack:
# Skip setting sysctl value "net.netfilter.nf_conntrack_max"
  maxPerCore: 0
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
  tcpEstablishedTimeout: 0s
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close"
  tcpCloseWaitTimeout: 0s
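
Until a release with that default is available, the same settings could in principle be applied per cluster through a kind config file. The sketch below assumes that kind's kubeadmConfigPatches field accepts a KubeProxyConfiguration patch; the helper name write_kind_config and the output file name are arbitrary choices for illustration.

```shell
# Write a kind cluster config that asks kube-proxy to skip the conntrack sysctls.
# Assumption: kubeadmConfigPatches accepts a KubeProxyConfiguration patch.
# $1 = output path (hypothetical file name, pick any)
write_kind_config() {
  cat > "$1" <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  apiVersion: kubeproxy.config.k8s.io/v1alpha1
  kind: KubeProxyConfiguration
  conntrack:
    maxPerCore: 0
    tcpEstablishedTimeout: 0s
    tcpCloseWaitTimeout: 0s
EOF
}
```

It would be used as: write_kind_config kind-config.yaml && kind create cluster --config kind-config.yaml.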

@ghost
Author

ghost commented May 12, 2021

I can confirm that #2241 fixes the issue for me on kernel 5.12.2-arch1-1.

@BenTheElder
Member

thanks all, #2241 should be in shortly, and since we're quite overdue for a release it should be released soon.

@tikessler

tikessler commented May 24, 2021

sudo sysctl net/netfilter/nf_conntrack_max=131072

Hello, thanks for sharing. Can you elaborate on what exactly to do?

I deleted all cluster configs (.minikube, .kube).
After deleting them, I ran the above command on the host system, but the problem still exists. Should it be executed in a pod instead?

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE
kube-system   coredns-74ff55c5b-j67mh            0/1     Running   0          5m48s
kube-system   etcd-minikube                      1/1     Running   0          5m57s
kube-system   kube-apiserver-minikube            1/1     Running   0          5m57s
kube-system   kube-controller-manager-minikube   1/1     Running   0          5m57s
kube-system   kube-proxy-d5zbf                   0/1     Error     6          5m48s
kube-system   kube-scheduler-minikube            1/1     Running   0          5m57s
kube-system   storage-provisioner                1/1     Running   5          6m2s
$ kubectl -n kube-system logs kube-proxy-d5zbf     
I0524 13:25:56.346577       1 node.go:172] Successfully retrieved node IP: 192.168.49.2
I0524 13:25:56.346621       1 server_others.go:142] kube-proxy node IP is an IPv4 address (192.168.49.2), assume IPv4 operation
W0524 13:25:56.362150       1 server_others.go:578] Unknown proxy mode "", assuming iptables proxy
I0524 13:25:56.362217       1 server_others.go:185] Using iptables Proxier.
I0524 13:25:56.362371       1 server.go:650] Version: v1.20.2
I0524 13:25:56.362572       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 524288
F0524 13:25:56.362584       1 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

I assume there is no other solution atm? I am using Minikube with Kernel 5.10.36-2-MANJARO.

@Juneezee

@muellerti Could you try the following steps and see if it works?

  1. Delete your local cluster first
  2. Run sudo sysctl net/netfilter/nf_conntrack_max=524288
  3. Start a new local cluster again
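
The three steps above can be sketched as a script for minikube. The run and recreate_minikube helpers and the DRY_RUN variable are hypothetical names introduced here; with DRY_RUN=1 the commands are only printed, which is a convenient way to review them first.

```shell
# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
  echo "+ $*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# $1 = value from the kube-proxy log line "Set sysctl ... to <value>"
recreate_minikube() {
  run minikube delete &&
  run sudo sysctl "net/netfilter/nf_conntrack_max=$1" &&
  run minikube start
}
```

For the log shown above this would be called as recreate_minikube 524288. The order matters: the sysctl is set on the host before the new cluster starts.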

@tikessler

Man! Thanks! That worked, I should have thought of that myself.

lukaszo added a commit to capactio/capact that referenced this issue Jul 1, 2021
It contains a fix for kubernetes-sigs/kind#2240

We've hit it when running GitHub Actions: actions/runner-images#3673
matt-mazzucato added a commit to matt-mazzucato/astarte-kubernetes-operator that referenced this issue Jul 1, 2021
- use a ubuntu 20.04
- set nf_conntrack_max to avoid CrashLoopBackOff for kube proxy (see
kubernetes-sigs/kind#2240 (comment))
- print cluster info as soon as KinD is up

Signed-off-by: Mattia Mazzucato <[email protected]>
matt-mazzucato added a commit to matt-mazzucato/astarte-kubernetes-operator that referenced this issue Jul 1, 2021
- use ubuntu 20.04
- set nf_conntrack_max to avoid CrashLoopBackOff for kube proxy (see
kubernetes-sigs/kind#2240 (comment))
- print cluster info as soon as KinD is up

Signed-off-by: Mattia Mazzucato <[email protected]>
matt-mazzucato added a commit to matt-mazzucato/astarte-kubernetes-operator that referenced this issue Jul 1, 2021
- use a ubuntu 20.04
- set nf_conntrack_max to avoid CrashLoopBackOff for kube proxy (see
kubernetes-sigs/kind#2240 (comment))
- print cluster info as soon as KinD is up
- use kind-action v1.4.0
- bump KinD to v0.10.0
- use kube-tools v1.5.0

Signed-off-by: Mattia Mazzucato <[email protected]>
matt-mazzucato added a commit to matt-mazzucato/astarte-kubernetes-operator that referenced this issue Jul 1, 2021
- use ubuntu 20.04
- set nf_conntrack_max to avoid CrashLoopBackOff for kube proxy (see
kubernetes-sigs/kind#2240 (comment))
- print cluster info as soon as KinD is up
- use kind-action v1.4.0
- bump KinD to v0.10.0
- use kube-tools v1.5.0

Signed-off-by: Mattia Mazzucato <[email protected]>
matt-mazzucato added a commit to matt-mazzucato/astarte-kubernetes-operator that referenced this issue Jul 1, 2021
- use ubuntu 20.04
- set nf_conntrack_max to avoid CrashLoopBackOff for kube proxy (see
kubernetes-sigs/kind#2240 (comment))
- print cluster info as soon as KinD is up
- use kind-action v1.4.0
- bump KinD to v0.10.0

Signed-off-by: Mattia Mazzucato <[email protected]>
Brian-McM pushed a commit to projectcalico/node that referenced this issue Aug 5, 2021
For kube-proxy not becoming ready, like this:

    semaphore@semaphore-vm:~$ kubectl logs kube-proxy-42v55 -n kube-system
    I0727 19:55:26.230888       1 node.go:135] Successfully retrieved node IP: 172.17.0.2
    I0727 19:55:26.230923       1 server_others.go:172] Using ipvs Proxier.
    I0727 19:55:26.230930       1 server_others.go:174] creating dualStackProxier for ipvs.
    W0727 19:55:26.232364       1 proxier.go:420] IPVS scheduler not specified, use rr by default
    W0727 19:55:26.232522       1 proxier.go:420] IPVS scheduler not specified, use rr by default
    W0727 19:55:26.232538       1 ipset.go:107] ipset name truncated; [KUBE-6-LOAD-BALANCER-SOURCE-CIDR] -> [KUBE-6-LOAD-BALANCER-SOURCE-CID]
    W0727 19:55:26.232546       1 ipset.go:107] ipset name truncated; [KUBE-6-NODE-PORT-LOCAL-SCTP-HASH] -> [KUBE-6-NODE-PORT-LOCAL-SCTP-HAS]
    I0727 19:55:26.232648       1 server.go:571] Version: v1.17.0
    I0727 19:55:26.232963       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
    F0727 19:55:26.232982       1 server.go:485] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

See kubernetes-sigs/kind#2240 and
kubernetes-sigs/kind#2241.
kensipe pushed a commit to kudobuilder/kuttl that referenced this issue Aug 9, 2021
Resolves a crash issue on Linux machines noted here: kubernetes-sigs/kind#2240
Co-authored-by: Ken Sipe <[email protected]>
@wkjun

wkjun commented Aug 11, 2021

I'm getting the same results with 5.12.2-arch1-1.

Quick workaround if a cluster is needed fast: Manually set the parameter with sudo sysctl net/netfilter/nf_conntrack_max=131072 before creating the Kind cluster.

I'm using Ubuntu. If you don't want to set the parameter manually after every reboot of your PC or notebook, adding the following line to the sysctl config file /etc/sysctl.d/99-sysctl.conf also works fine:

net.netfilter.nf_conntrack_max=131072

Then apply the parameter immediately and restart minikube:

sudo sysctl -p
minikube stop 
minikube start
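
A reboot-safe variant might also need the nf_conntrack kernel module to be loaded at boot, since the sysctl only exists once the module is present (a commit referenced in this thread makes the same point). The sketch below assumes a systemd-based distribution that reads /etc/modules-load.d and /etc/sysctl.d; the function name, the file names, and the root-prefix parameter are illustrative choices, not an established convention.

```shell
# Write reboot-safe settings under a root prefix ("" for the real system).
# $1 = nf_conntrack_max value, $2 = root prefix (hypothetical, for previews)
write_conntrack_persistence() {
  value=$1
  root=${2:-}
  mkdir -p "$root/etc/modules-load.d" "$root/etc/sysctl.d" &&
  # Load the module at boot so the sysctl key exists when it is applied.
  echo "nf_conntrack" > "$root/etc/modules-load.d/nf_conntrack.conf" &&
  echo "net.netfilter.nf_conntrack_max=$value" > "$root/etc/sysctl.d/99-nf-conntrack.conf"
}
```

Running write_conntrack_persistence 131072 /tmp/preview first shows the files without touching the system. Writing to the real /etc needs root, and sudo sysctl --system then applies the new file without a reboot.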

garloff added a commit to SovereignCloudStack/k8s-cluster-api-provider that referenced this issue Aug 18, 2021
* Inject sysctl changing nf_conntrack_max to 131072.

This addresses
#18
kubernetes-sigs/kind#2240

* Need to load nf_conntrack kmod for the sysctl setting.

* Add nf_conntrack to modules-load.d to ensure sysctl works.

This is required to be reboot safe.

Signed-off-by: Kurt Garloff <[email protected]>
@manchinagarjuna

manchinagarjuna commented Aug 27, 2021

Hi All,

With the latest Kind binary and Kubernetes images I am no longer seeing this issue.
However, on one machine we were able to see this issue on multi-node kind setup.

Bumping the max number to the same value we observed in the kube-proxy logs solved the issue, and we are able to create the cluster fine.
sudo sysctl net/netfilter/nf_conntrack_max=393216

Kind version: 0.11.1
Kubernetes node images: 1.20.7
Host OS: Debian 10 buster

I'm wondering if there is a long-term solution that avoids the need for this?

Thanks in advance!

@BenTheElder
Member

You should not see this issue with any number of nodes in the latest release. Can you confirm that this is minimally reproducible with the latest release and file a new issue if not?

@manchinagarjuna

Thanks for your response Ben.

On further investigation, the old Kind executable was taking precedence in the PATH in that particular environment. After removing it, there were no issues; the cluster is up and running as expected.
Kind 0.11.1 with node images 1.20.7 works without the additional settings.
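
A quick way to spot this kind of shadowing is to list every matching executable on PATH in lookup order. The helper below is a generic sketch (list_on_path is a made-up name) and is not specific to kind.

```shell
# Print every executable called $1 found on PATH, in lookup order.
# The first line printed is the one the shell will actually run.
list_on_path() {
  name=$1
  old_ifs=$IFS
  IFS=:
  for dir in $PATH; do
    # An empty PATH entry means the current directory.
    [ -n "$dir" ] || dir=.
    if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
      echo "$dir/$name"
    fi
  done
  IFS=$old_ifs
}
```

If list_on_path kind prints more than one line, running each listed binary with the version subcommand shows which release comes first.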

thehajime added a commit to ukontainer/runu that referenced this issue Sep 6, 2021
We also update the base image version to v1.21.1.

kubernetes-sigs/kind#2240

Signed-off-by: Hajime Tazaki <[email protected]>
thehajime added a commit to ukontainer/runu that referenced this issue Sep 8, 2021
We also update the base image version to v1.21.1.

kubernetes-sigs/kind#2240

Signed-off-by: Hajime Tazaki <[email protected]>
@yharish991

How do I fix this issue on macOS?

@deepak7093

Change maxPerCore to 0 in the kube-proxy ConfigMap to leave the limit as-is and ignore conntrack-min.

https://serverfault.com/questions/1063166/kube-proxy-wont-start-in-minikube-because-of-permission-denied-issue-with-proc#
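
As a rough sketch, the ConfigMap change could be scripted like this. It assumes a kubeadm-style cluster where the ConfigMap is named kube-proxy in the kube-system namespace, the pods carry the label k8s-app=kube-proxy, and kubectl prints the embedded config as a block scalar so that maxPerCore sits on its own line. The function names are invented for illustration.

```shell
# Filter: rewrite any "maxPerCore: <value>" line to "maxPerCore: 0",
# keeping the original indentation. Reads stdin, writes stdout.
set_max_per_core_zero() {
  sed 's/^\([[:space:]]*maxPerCore:\).*/\1 0/'
}

# Sketch: rewrite the live ConfigMap, then delete the kube-proxy pods
# so that the DaemonSet recreates them with the new configuration.
patch_kube_proxy() {
  kubectl -n kube-system get configmap kube-proxy -o yaml |
    set_max_per_core_zero |
    kubectl replace -f - &&
  kubectl -n kube-system delete pod -l k8s-app=kube-proxy
}
```

Piping the ConfigMap through set_max_per_core_zero alone, without the replace step, is a safe way to preview the change.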

pregnor added a commit to banzaicloud/banzai-cli that referenced this issue Oct 18, 2021
kubernetes-sigs/kind#2240

Kind v0.11.1 is going to fix this issue, but
upgrading breaks bank-vaults, which takes a lot
of work to upgrade, so that is postponed.

Instead I added a post-kind-create-clusters step
to fix the issue for macOS context where a
ConfigMap change is enough.
@arkodg

arkodg commented Oct 26, 2021

@yharish991 run brew upgrade kind which will upgrade your kind version to 0.11.1 and fix the issue.

airshipbot pushed a commit to airshipit/airshipctl that referenced this issue Dec 4, 2021
As mentioned in kubernetes-sigs/kind#2240 there was a change
in the linux kernel after 5.12.2 that makes nf_conntrack_max read-only in non-init network
namespaces, which prevents kind's kube-proxy container from working correctly on kind versions
older than v0.11.1. This PS updates the script to download v0.11.1 to avoid this issue.
If older versions are needed, the kind url can be set as an environment variable as shown in
airshipctl/tools/deployment.provider_common/01_install_kind.sh.

Relates-To: #583
Change-Id: Icd9e649fa112e9f6307034ec69dde5d4a7ad613d
theothertomelliott added a commit to telliott-io/platform that referenced this issue Oct 15, 2022
Kind was failing to come up silently likely due to kubernetes-sigs/kind#2240
Bumping versions appears to have fixed the issue.