New containers unable to reach the network for a while when using Calico #2308

Closed
bandesz opened this issue Jun 14, 2021 · 14 comments

@bandesz

bandesz commented Jun 14, 2021

First of all, I know that Calico is not officially supported, but I'm leaving this question here for others to find, in case others have or had the same issue.

When I disable the default CNI and install Calico, new containers are unable to reach the network for at least ~30 seconds.

In my tests I was trying to reach a service on a cluster IP to rule out any DNS resolution issues.

When I remove Calico and use the built-in CNI, new containers can reach the network instantly.
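For reference, a check along these lines is the sort of cluster-IP test meant above; the service name, namespace and port are just placeholders for whatever is deployed, and looking the ClusterIP up on the host keeps DNS out of the picture:

# Look up the service's ClusterIP on the host, then probe it from a fresh pod (no DNS inside the pod).
CLUSTER_IP=$(kubectl get svc nginx -n policy-demo -o jsonpath='{.spec.clusterIP}')
kubectl run probe --rm -i --restart=Never --image=busybox -n policy-demo -- \
  nc -z -v -w5 "$CLUSTER_IP" 80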

OS: macOS 11.4
Kind version: v0.11.1
Calico versions tested: v3.16.10, v3.19.1
Docker Desktop: 3.3.3 (6 CPUs, 6 GB memory)

Kind config:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  metadata:
    name: config
  apiServer:
    extraArgs:
      "enable-admission-plugins": NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,PersistentVolumeClaimResize,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota
nodes:
  - role: control-plane
    image: kindest/node:v1.20.7@sha256:cbeaf907fc78ac97ce7b625e4bf0de16e3ea725daf6b04f930bd14c67c671ff9
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
        listenAddress: 127.0.0.1
      - containerPort: 443
        hostPort: 443
        protocol: TCP
        listenAddress: 127.0.0.1
networking:
  disableDefaultCNI: true
  podSubnet: 192.168.0.0/16

Calico manifests used: https://docs.projectcalico.org/[v3.16|v3.19]/manifests/calico.yaml

bandesz added the kind/support label (Categorizes issue or PR as a support question) on Jun 14, 2021
@BenTheElder
Member

Is it only when first bringing up the cluster or an ongoing issue with new containers?

The former would probably just be the time it takes to spin up Calico; the latter is probably something more interesting and might need help from the Calico folks (realistically, as long as the default is working fine I probably won't be able to dig into this anytime soon; not sure about the other maintainers).
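To take the initial spin-up out of the equation, one option (just a plain kubectl sketch, not something kind does for you) is to wait for the calico-node DaemonSet and the nodes to be ready before running any connectivity test:

# Wait (up to 5 minutes) for every calico-node pod and every node to become Ready before testing.
kubectl rollout status daemonset/calico-node -n kube-system --timeout=5m
kubectl wait --for=condition=Ready nodes --all --timeout=5m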

bandesz changed the title from "Containers unable to reach the network for a while when using Calico" to "New containers unable to reach the network for a while when using Calico" on Jun 15, 2021
@bandesz
Author

bandesz commented Jun 15, 2021

@BenTheElder it's an ongoing issue with new containers.

@lwr20
Contributor

lwr20 commented Jun 15, 2021

I don't see this behaviour on my machine at all (Dell laptop running Ubuntu 20.04). Here's my session output:

lance@lwr20:~/scratch/kind-test$ kind version
kind v0.10.0 go1.15.7 linux/amd64

lance@lwr20:~/scratch/kind-test$ kind create cluster --config kind-config.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.20.7) 🖼 
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing StorageClass 💾 
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Thanks for using kind! 😊
lance@lwr20:~/scratch/kind-test$ kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:42637
KubeDNS is running at https://127.0.0.1:42637/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
lance@lwr20:~/scratch/kind-test$ kubectl get no
NAME                 STATUS     ROLES                  AGE   VERSION
kind-control-plane   NotReady   control-plane,master   21s   v1.20.7

lance@lwr20:~/scratch/kind-test$ kubectl apply -f https://docs.projectcalico.org/v3.19/manifests/calico.yaml
configmap/calico-config created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/kubecontrollersconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created
poddisruptionbudget.policy/calico-kube-controllers created
lance@lwr20:~/scratch/kind-test$ kubectl get po -A
NAMESPACE            NAME                                         READY   STATUS     RESTARTS   AGE
kube-system          calico-kube-controllers-7f4f5bf95d-g9mts     0/1     Pending    0          4s
kube-system          calico-node-bshgn                            0/1     Init:0/3   0          4s
kube-system          coredns-74ff55c5b-bg9mc                      0/1     Pending    0          25s
kube-system          coredns-74ff55c5b-fkqkb                      0/1     Pending    0          25s
kube-system          etcd-kind-control-plane                      0/1     Running    0          30s
kube-system          kube-apiserver-kind-control-plane            1/1     Running    0          30s
kube-system          kube-controller-manager-kind-control-plane   0/1     Running    0          30s
kube-system          kube-proxy-4hv4l                             1/1     Running    0          26s
kube-system          kube-scheduler-kind-control-plane            0/1     Running    0          30s
local-path-storage   local-path-provisioner-547f784dff-mp2wh      0/1     Pending    0          25s

<snipped out a bunch of waiting here for calico pods to start>

lance@lwr20:~/scratch/kind-test$ kubectl get po -A
NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE
kube-system          calico-kube-controllers-7f4f5bf95d-g9mts     1/1     Running   0          67s
kube-system          calico-node-bshgn                            1/1     Running   0          67s
kube-system          coredns-74ff55c5b-bg9mc                      1/1     Running   0          88s
kube-system          coredns-74ff55c5b-fkqkb                      1/1     Running   0          88s
kube-system          etcd-kind-control-plane                      1/1     Running   0          93s
kube-system          kube-apiserver-kind-control-plane            1/1     Running   0          93s
kube-system          kube-controller-manager-kind-control-plane   1/1     Running   0          93s
kube-system          kube-proxy-4hv4l                             1/1     Running   0          89s
kube-system          kube-scheduler-kind-control-plane            1/1     Running   0          93s
local-path-storage   local-path-provisioner-547f784dff-mp2wh      1/1     Running   0          88s
lance@lwr20:~/scratch/kind-test$ kubectl create ns policy-demo
namespace/policy-demo created
lance@lwr20:~/scratch/kind-test$ kubectl create deployment --namespace=policy-demo nginx --image=nginx
deployment.apps/nginx created
lance@lwr20:~/scratch/kind-test$ kubectl expose --namespace=policy-demo deployment nginx --port=80
service/nginx exposed
lance@lwr20:~/scratch/kind-test$ kubectl run --namespace=policy-demo access --rm -ti --image busybox /bin/sh
If you don't see a command prompt, try pressing enter.
/ # wget -q nginx -O -
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
/ # 

Ah - my kind version is wrong. I'll try again with the same version as OP.

@lwr20
Contributor

lwr20 commented Jun 15, 2021

Same result with kind v0.11.1, though as before the big difference is that I'm on Linux, not macOS.

Is there something wrong with the way I'm reproing?

@BenTheElder
Member

It could be something with how the Docker for Mac Linux VM is set up; e.g. there was an issue with a bad proxy recently. #2208 (comment)

Thanks for looking into this and @aojea for reaching out in #sig-network.

@rikatz
Contributor

rikatz commented Jun 16, 2021

I've tested on a Mac here with the same script from @lwr20 and the response was almost instant.

I used @bandesz's kind config to create the cluster and deployed Calico v3.19 from the docs.

@bandesz is there anything that could direct me to a specific point of failure? Can you take a look at the docker bridges that exist (docker network ls, docker network inspect kind) to check whether there's any difference from my env?

Thanks

@BenTheElder
Member

Perhaps it's the amount of resources allocated to the docker VM, or the version of Docker Desktop?

@bandesz
Author

bandesz commented Jun 16, 2021

I did some additional testing, following the exact steps by @lwr20.

I get different results. I tried both alpine and busybox to see if the image makes any difference; I mostly see delays of around ~10 seconds, but the worst I saw was ~40 seconds.

$ kubectl run --namespace=policy-demo access --rm -ti --image busybox /bin/sh
If you don't see a command prompt, try pressing enter.
/ # date; while ! nc -z -v -w5 nginx 80; do date; done; date
Wed Jun 16 19:16:10 UTC 2021
nginx (10.96.213.175:80) open
Wed Jun 16 19:16:20 UTC 2021
kubectl run --namespace=policy-demo access --rm -ti --image alpine:latest /bin/sh
If you don't see a command prompt, try pressing enter.
/ # date; while ! nc -z -v -w5 nginx 80; do date; done; date
Wed Jun 16 19:17:01 UTC 2021
nc: bad address 'nginx'
Wed Jun 16 19:17:06 UTC 2021
nc: bad address 'nginx'
Wed Jun 16 19:17:11 UTC 2021
nginx (10.96.213.175:80) open
Wed Jun 16 19:17:13 UTC 2021
kubectl run --namespace=policy-demo access2 --rm -ti --image busybox /bin/sh
If you don't see a command prompt, try pressing enter.
/ # date; while ! nc -z -v -w5 nginx 80; do date; done; date
Wed Jun 16 19:30:18 UTC 2021
nginx (10.96.213.175:80) open
Wed Jun 16 19:30:38 UTC 2021

Docker Desktop version: 3.3.3 (6 CPUs, 6 GB memory)

$ docker network ls
NETWORK ID     NAME       DRIVER    SCOPE
a5204fe333c4   bridge     bridge    local
dd2f1fb07e96   host       host      local
09084d3de320   k3d-kore   bridge    local
639cf925ab1b   kind       bridge    local
ed95a679a094   none       null      local
$ docker network inspect kind
[
    {
        "Name": "kind",
        "Id": "639cf925ab1b7cde9e44049ae4afb04f5d834b69f532407f9acf3900c79fb869",
        "Created": "2021-04-12T16:20:25.221050178Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": true,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                },
                {
                    "Subnet": "fc00:f853:ccd:e793::/64",
                    "Gateway": "fc00:f853:ccd:e793::1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "da83f47296036051bf573c8c558c944cf1fdbcd37a50b15b4c7cd7cffc6c505f": {
                "Name": "kind-control-plane",
                "EndpointID": "2c5c3e5e628dc592168611902679bb97e03424ac01cb812980b424b9451ddc61",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": "fc00:f853:ccd:e793::2/64"
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_ip_masquerade": "true"
        },
        "Labels": {}
    }
]

@bandesz
Author

bandesz commented Jun 16, 2021

Just upgraded Docker Desktop to 3.4.0 (latest), but no change.

@bandesz
Author

bandesz commented Jun 16, 2021

Additional things I've tried, but no change:

  • using kind v0.9.0 with a matching 1.19 image
  • deleting the kind docker network and letting kind recreate it

@bandesz
Author

bandesz commented Jun 17, 2021

I tried on another MacBook and get the same result, with the same kind and Docker for Mac versions.

Interestingly, so far the very first test (in a fresh kind cluster, on the first run) gets an immediate response from wget, so if anyone is testing this, repeat the test with multiple new containers to make sure you really don't see any significant latency.
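A rough loop along these lines (pod names and count are arbitrary; the nginx/policy-demo target is the repro from earlier in the thread) repeats the check with several fresh pods and shows how long each one waits:

# Spin up several fresh busybox pods in sequence; each prints the time before and after
# the service becomes reachable, so the per-pod delay is visible.
for i in 1 2 3 4 5; do
  kubectl run "probe-$i" --rm -i --restart=Never --image=busybox -n policy-demo -- \
    sh -c 'date; while ! nc -z -w5 nginx 80; do sleep 1; done; date'
done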

@aojea
Contributor

aojea commented Jun 20, 2021

Reported today in the Slack channel: https://kubernetes.slack.com/archives/CEKK1KTN2/p1624167563435000

It seems very related:

Hey folks !
A couple of weeks ago we hit an issue when running e2e tests for Cluster API on our local Mac. It always failed on the first run with a webhook connection timeout and passed in subsequent runs when you pass the USE_EXISTING_CLUSTER flag.
After much digging, we found that the problem is with Docker Desktop for Mac - see issue (a container with many published ports takes 10x longer to start).
From the dockerd logs in the issue, it also looks like the docker daemon spends a lot of time in iptables invocations - owing to an upgraded Linux kernel version in Docker Desktop 3.3.0 - see Release Notes.
So on a local Mac, downgrading Docker seemed to resolve the issue. Since we use kind for local clusters, we were wondering if any of you have faced similar issues? Here kindnet is using the iptables-based portmap plugin; could it have something to do with converting these iptables rules to nftables?
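If the iptables/nftables theory is worth chasing, two quick checks (just a sketch, nothing Calico- or kind-specific) show which backend the node's userspace uses and which kernel the Docker Desktop VM is running:

# Which iptables backend (legacy vs nf_tables) is the kind node using?
docker exec kind-control-plane iptables --version
# Which kernel is the Docker Desktop VM running? (The release notes above point at a kernel bump in 3.3.0.)
docker info --format '{{.KernelVersion}}'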

@danquah

danquah commented Jun 24, 2021

I've had the exact same issue, and had to introduce some waits and extra checks in our setup to compensate. I'm 99% sure I did not upgrade kind in the period when this started happening, and 100% sure I've updated Docker for Mac multiple times, so for what it's worth it could well be explained by an issue with Docker for Mac!
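A sketch of the kind of extra check that works around it (image, names and the ~60 second budget are illustrative, not the actual setup): retry from a throwaway pod until the API server's ClusterIP answers, and only then let the rest of the setup proceed.

# Retry until the API server's ClusterIP answers, giving up after ~60s.
# KUBERNETES_SERVICE_HOST/PORT are injected into every pod, so no DNS is involved.
kubectl run netcheck --rm -i --restart=Never --image=busybox -- \
  sh -c 'for i in $(seq 1 12); do nc -z -w5 "$KUBERNETES_SERVICE_HOST" "$KUBERNETES_SERVICE_PORT" && exit 0; sleep 5; done; exit 1'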

@aojea
Contributor

aojea commented Jul 27, 2021

Can we blame Docker for Mac and close it then 😄
