
Master broken? #228

Closed
CecileRobertMichon opened this issue Jun 17, 2019 · 11 comments · Fixed by #229

Comments

@CecileRobertMichon (Contributor) commented Jun 17, 2019

I have a pretty strong signal that #223 broke cluster deployments.

Repro:

I've reproduced this multiple times (it seems to repro 100% of the time). The same steps from the previous commit (4b9a297) work as expected.

The logs from azure-provider-controller-manager-0 show the following in a loop:

I0617 23:29:51.114998       1 cluster_controller.go:90] Running reconcile Cluster for "cluster-7"
I0617 23:29:51.115282       1 cluster_controller.go:139] reconciling cluster object cluster-7 triggers idempotent reconcile.
I0617 23:29:51.115468       1 actuator.go:54] Reconciling cluster cluster-7
I0617 23:29:51.117244       1 reconciler.go:67] reconciling cluster cluster-7
I0617 23:29:51.117475       1 certificates.go:59] generating certificates
I0617 23:29:51.117619       1 certificates.go:107] CreatePKIAssets
I0617 23:29:51.117762       1 certs.go:37] creating PKI assets
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [fakenodecluster-7 localhost] and IPs [10.0.0.1 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [fakenodecluster-7 localhost] and IPs [10.0.0.1 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [fakenodecluster-7 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local cluster-7-62f2d3a.southcentralus.cloudapp.azure.com] and IPs [10.96.0.1 10.0.0.1 10.0.0.100]
[certs] valid certificates and keys now exist in "/tmp/cluster-api/cluster-7/certs"
I0617 23:30:05.512747       1 certs.go:72] creating a new public/private key files for signing service account users
[certs] Generating "sa" key and public key
I0617 23:30:06.717182       1 certificates.go:111] CreatePKIAssets success
I0617 23:30:06.717233       1 certificates.go:117] CreateSACertificates
I0617 23:30:06.717257       1 certs.go:72] creating a new public/private key files for signing service account users
[certs] Using the existing "sa" key
I0617 23:30:07.709928       1 certificates.go:121] CreateSACertificates success
I0617 23:30:07.709955       1 certificates.go:212] CreateKubeconfigs admin kubeconfig
I0617 23:30:07.709977       1 kubeconfig.go:92] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I0617 23:30:08.523837       1 certificates.go:225] CreateKubeconfigs admin kubeconfig success
I0617 23:30:08.523894       1 certificates.go:100] successfully created certificates
I0617 23:30:08.524782       1 groups.go:36] creating resource group cluster-7
I0617 23:30:10.213936       1 groups.go:38] successfully created resource group cluster-7
I0617 23:30:10.411246       1 securitygroups.go:61] using additional rules for control plane cluster-7-controlplane-nsg
I0617 23:30:10.411311       1 securitygroups.go:92] creating security group cluster-7-controlplane-nsg
I0617 23:30:10.732161       1 securitygroups.go:117] created security group cluster-7-controlplane-nsg
I0617 23:30:10.732200       1 securitygroups.go:92] creating security group cluster-7-node-nsg
I0617 23:30:11.014724       1 securitygroups.go:117] created security group cluster-7-node-nsg
I0617 23:30:11.014794       1 routetables.go:56] creating route table cluster-7-node-routetable
I0617 23:30:11.308892       1 routetables.go:79] successfully created route table cluster-7-node-routetable
I0617 23:30:11.308951       1 subnets.go:79] getting nsg cluster-7-controlplane-nsg
I0617 23:30:11.589547       1 subnets.go:88] got nsg cluster-7-controlplane-nsg
I0617 23:30:11.589618       1 subnets.go:91] creating subnet cluster-7-controlplane-subnet in vnet cluster-7-vnet
I0617 23:30:12.284609       1 subnets.go:115] successfully created subnet cluster-7-controlplane-subnet in vnet cluster-7-vnet
I0617 23:30:12.284689       1 subnets.go:66] getting route table cluster-7-node-routetable
I0617 23:30:12.659209       1 subnets.go:75] sucessfully got route table cluster-7-node-routetable
I0617 23:30:12.659291       1 subnets.go:79] getting nsg cluster-7-node-nsg
I0617 23:30:13.008885       1 subnets.go:88] got nsg cluster-7-node-nsg
I0617 23:30:13.008944       1 subnets.go:91] creating subnet cluster-7-node-subnet in vnet cluster-7-vnet
I0617 23:30:13.732351       1 subnets.go:115] successfully created subnet cluster-7-node-subnet in vnet cluster-7-vnet
I0617 23:30:13.732433       1 internalloadbalancers.go:62] creating internal load balancer cluster-7-internal-lb
I0617 23:30:13.732471       1 internalloadbalancers.go:69] getting subnet cluster-7-controlplane-subnet
I0617 23:30:13.993738       1 internalloadbalancers.go:79] successfully got subnet cluster-7-controlplane-subnet
I0617 23:30:14.842700       1 internalloadbalancers.go:150] successfully created internal load balancer cluster-7-internal-lb
I0617 23:30:14.842847       1 publicips.go:58] creating public ip cluster-7-62f2d3a
I0617 23:30:15.411200       1 publicips.go:93] successfully created public ip cluster-7-62f2d3a
I0617 23:30:15.411305       1 publicloadbalancers.go:64] creating public load balancer cluster-7-public-lb
I0617 23:30:15.411335       1 publicloadbalancers.go:66] getting public ip cluster-7-62f2d3a
I0617 23:30:15.690319       1 publicloadbalancers.go:76] successfully got public ip cluster-7-62f2d3a
I0617 23:30:16.532441       1 publicloadbalancers.go:187] successfully created public load balancer cluster-7-public-lb
I0617 23:30:16.532510       1 reconciler.go:155] successfully reconciled cluster cluster-7
I0617 23:30:40.677935       1 machine_controller.go:113] Reconciling Machine "cluster-7-controlplane-0"
I0617 23:30:40.678032       1 actuator.go:153] Checking if machine cluster-7-controlplane-0 for cluster cluster-7 exists
I0617 23:30:41.541450       1 machine_controller.go:222] Reconciling machine object cluster-7-controlplane-0 triggers idempotent create.
I0617 23:30:41.541504       1 actuator.go:74] Creating machine cluster-7-controlplane-0 for cluster cluster-7
I0617 23:30:42.061369       1 reconciler.go:260] Machine cluster-7-controlplane-0 should join the controlplane: false
I0617 23:30:43.911954       1 networkinterfaces.go:152] successfully created network interface cluster-7-controlplane-0-nic

All the pods in the kind cluster are running:

kubectl get pods --all-namespaces -o wide
NAMESPACE               NAME                                               READY   STATUS    RESTARTS   AGE   IP           NODE                       NOMINATED NODE   READINESS GATES
azure-provider-system   azure-provider-controller-manager-0                1/1     Running   0          13m   10.244.0.4   clusterapi-control-plane   <none>           <none>
cluster-api-system      cluster-api-controller-manager-0                   1/1     Running   0          13m   10.244.0.3   clusterapi-control-plane   <none>           <none>
kube-system             coredns-fb8b8dccf-pxj9b                            1/1     Running   0          13m   10.244.0.5   clusterapi-control-plane   <none>           <none>
kube-system             coredns-fb8b8dccf-qfcdb                            1/1     Running   0          13m   10.244.0.2   clusterapi-control-plane   <none>           <none>
kube-system             etcd-clusterapi-control-plane                      1/1     Running   0          12m   172.17.0.2   clusterapi-control-plane   <none>           <none>
kube-system             ip-masq-agent-qh9cl                                1/1     Running   0          13m   172.17.0.2   clusterapi-control-plane   <none>           <none>
kube-system             kindnet-8zfb9                                      1/1     Running   1          13m   172.17.0.2   clusterapi-control-plane   <none>           <none>
kube-system             kube-apiserver-clusterapi-control-plane            1/1     Running   0          13m   172.17.0.2   clusterapi-control-plane   <none>           <none>
kube-system             kube-controller-manager-clusterapi-control-plane   1/1     Running   0          12m   172.17.0.2   clusterapi-control-plane   <none>           <none>
kube-system             kube-proxy-4zs5j                                   1/1     Running   0          13m   172.17.0.2   clusterapi-control-plane   <none>           <none>
kube-system             kube-scheduler-clusterapi-control-plane            1/1     Running   0          13m   172.17.0.2   clusterapi-control-plane   <none>           <none>

In the RG (about 15 minutes after the create-cluster operation started):
[Screenshot of the resource group contents]

@justaugustus @alexeldeib

@alexeldeib (Contributor)

Looking. ☹️ Sorry about that...

@alexeldeib (Contributor)

Can you share the logs from cluster-api-controller-manager-0? On initial investigation I'm not able to repro this.


@alexeldeib (Contributor)

Might've messed up my versions while trying to repro; trying a few more times.

@alexeldeib (Contributor)

I am seeing that the VMs created after the control plane take a while...

@alexeldeib (Contributor) commented Jun 18, 2019

Hmm, maybe I can see the same behavior? Still debugging. It almost seems like the machine controller isn't requeueing the first request to create the machine.

@alexeldeib (Contributor)

Oh. I think I see the issue. This is a tad embarrassing... the dangers of trying to selectively add patches to my commits.

@justaugustus (Contributor)

What's the prognosis, doc?

@alexeldeib (Contributor) commented Jun 18, 2019

Let me confirm and I'll PR the fix. I didn't commit the part that advances the iterator (== infinite looping once it tries to fetch the SKUs, so machine reconciliation never finishes).
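
(For context, a minimal Go sketch of the pattern in question, assuming the Azure SDK's paged-iterator style with NotDone/Value/NextWithContext; the interface, fake iterator, and helper names below are illustrative only and not the provider's actual code.)

package main

import (
	"context"
	"fmt"
)

// skuIterator mirrors the shape of the Azure SDK track-1 paged iterators.
// The concrete type in the provider would be the SDK's resource-SKU iterator;
// this local interface exists only so the sketch is self-contained.
type skuIterator interface {
	NotDone() bool
	Value() string
	NextWithContext(ctx context.Context) error
}

// fakeIterator serves a fixed slice of SKU names, standing in for paged API results.
type fakeIterator struct {
	skus []string
	i    int
}

func (f *fakeIterator) NotDone() bool { return f.i < len(f.skus) }
func (f *fakeIterator) Value() string { return f.skus[f.i] }
func (f *fakeIterator) NextWithContext(ctx context.Context) error {
	f.i++ // advancing the cursor is the step the dropped hunk performed
	return nil
}

// collectSKUs shows the correct loop: without the NextWithContext call,
// NotDone() stays true forever, the loop spins, and reconciliation never
// returns, which matches the symptom in the logs above.
func collectSKUs(ctx context.Context, iter skuIterator) ([]string, error) {
	var skus []string
	for iter.NotDone() {
		skus = append(skus, iter.Value())
		if err := iter.NextWithContext(ctx); err != nil {
			return nil, err
		}
	}
	return skus, nil
}

func main() {
	skus, _ := collectSKUs(context.Background(), &fakeIterator{skus: []string{"Standard_D2s_v3", "Standard_B2ms"}})
	fmt.Println(skus)
}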

@justaugustus (Contributor)

Gotcha. Thanks for investigating!

@alexeldeib (Contributor)

Note to self to improve unit/e2e tests to avoid the same...

@CecileRobertMichon (Contributor, Author)

Thanks for investigating and fixing, @alexeldeib!
