dns controller rs fails to run pod after upgrade from 1.5 to 1.6.2 #2594

Closed
ashic opened this issue May 18, 2017 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ashic

ashic commented May 18, 2017

I ran kops upgrade cluster with a new kops binary, and at first things didn't work as expected. I replaced the weave-net stuff, created the kube-dns ConfigMap (based on other reported issues), and nearly everything works: the kube UI is available, but not all my apps are running. However, digging deeper, I see that the dns-controller ReplicaSet can't find a node to run its single pod on. Running kops edit ig [ig-name-for-a-master] shows me this:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-04-20T13:18:41Z
  labels:
    kops.k8s.io/cluster: udpmarkit.com
  name: master-eu-west-1a
spec:
  image: kope.io/k8s-1.6-debian-jessie-amd64-hvm-ebs-2017-05-02
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-west-1a

Running describe for a master node shows me the following:

Labels:		beta.kubernetes.io/arch=amd64
			beta.kubernetes.io/instance-type=t2.medium
			beta.kubernetes.io/os=linux
			failure-domain.beta.kubernetes.io/region=eu-west-1
			failure-domain.beta.kubernetes.io/zone=eu-west-1a
			kubernetes.io/hostname=ip-10-4-144-212.eu-west-1.compute.internal
			kubernetes.io/role=master
			node-role.kubernetes.io/master=
Annotations:		node.alpha.kubernetes.io/ttl=0
			volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:			node-role.kubernetes.io/master:NoSchedule
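
(For context: under Kubernetes 1.6 a pod will only be scheduled onto a node carrying this NoSchedule taint if its pod spec has a matching entry in the tolerations field, rather than the old 1.5-style annotation. A minimal sketch of what that field looks like for the master taint above; the surrounding Deployment fields are assumed, not copied from this cluster:)

# hypothetical excerpt of a 1.6-style pod template; only the tolerations field matters here
spec:
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule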

The dns-controller ReplicaSets have the following descriptions:

Name:		dns-controller-832541570
Namespace:	kube-system
Selector:	k8s-app=dns-controller,pod-template-hash=832541570
Labels:		k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=832541570
		version=v1.6.1
Annotations:	deployment.kubernetes.io/desired-replicas=1
		deployment.kubernetes.io/max-replicas=2
		deployment.kubernetes.io/revision=3
Replicas:	1 current / 1 desired
Pods Status:	0 Running / 1 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:		k8s-addon=dns-controller.addons.k8s.io
			k8s-app=dns-controller
			pod-template-hash=832541570
			version=v1.6.1
  Annotations:		scheduler.alpha.kubernetes.io/critical-pod=
			scheduler.alpha.kubernetes.io/tolerations=[{"key": "dedicated", "value": "master"}]
  Service Account:	dns-controller
  Containers:
   dns-controller:
    Image:	kope/dns-controller:1.6.1
    Port:	
    Command:
      /usr/bin/dns-controller
      --watch-ingress=false
      --dns=aws-route53
      --zone=*/Z1GY2OSJNZX79E
      --zone=*/*
      -v=2
    Requests:
      cpu:		50m
      memory:		50Mi
    Environment:	<none>
    Mounts:		<none>
  Volumes:		<none>
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----			-------------	--------	------			-------
  45m		45m		1	replicaset-controller			Normal		SuccessfulCreate	Created pod: dns-controller-832541570-nwtp5

Name:		dns-controller-945720267
Namespace:	kube-system
Selector:	k8s-app=dns-controller,pod-template-hash=945720267
Labels:		k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=945720267
		version=v1.5.2
Annotations:	deployment.kubernetes.io/desired-replicas=1
		deployment.kubernetes.io/max-replicas=2
		deployment.kubernetes.io/revision=1
Replicas:	0 current / 0 desired
Pods Status:	0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:	k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=945720267
		version=v1.5.2
  Annotations:	scheduler.alpha.kubernetes.io/critical-pod=
		scheduler.alpha.kubernetes.io/tolerations=[{"key": "dedicated", "value": "master"}]
  Containers:
   dns-controller:
    Image:	kope/dns-controller:1.5.2
    Port:	
    Command:
      /usr/bin/dns-controller
      --watch-ingress=false
      --dns=aws-route53
      --zone=*/Z1GY2OSJNZX79E
      --zone=*/*
      -v=2
    Requests:
      cpu:		50m
      memory:		50Mi
    Environment:	<none>
    Mounts:		<none>
  Volumes:		<none>
Events:			<none>

Name:		dns-controller-945982411
Namespace:	kube-system
Selector:	k8s-app=dns-controller,pod-template-hash=945982411
Labels:		k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=945982411
		version=v1.6.1
Annotations:	deployment.kubernetes.io/desired-replicas=1
		deployment.kubernetes.io/max-replicas=2
		deployment.kubernetes.io/revision=2
Replicas:	0 current / 0 desired
Pods Status:	0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:	k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=945982411
		version=v1.6.1
  Annotations:	scheduler.alpha.kubernetes.io/critical-pod=
		scheduler.alpha.kubernetes.io/tolerations=[{"key": "dedicated", "value": "master"}]
  Containers:
   dns-controller:
    Image:	kope/dns-controller:1.6.1
    Port:	
    Command:
      /usr/bin/dns-controller
      --watch-ingress=false
      --dns=aws-route53
      --zone=*/Z1GY2OSJNZX79E
      --zone=*/*
      -v=2
    Requests:
      cpu:		50m
      memory:		50Mi
    Environment:	<none>
    Mounts:		<none>
  Volumes:		<none>
Events:			<none>

I tried to kubectl replace with the file https://raw.githubusercontent.com/kubernetes/kops/release-1.6/upup/models/cloudup/resources/addons/dns-controller.addons.k8s.io/k8s-1.6.yaml.template, but it fails, stating:

error: error converting YAML to JSON: yaml: line 37: found unexpected ':'
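
That .template file presumably still contains kops templating directives, so it isn't plain YAML until kops renders it. Purely as a sketch (not what ultimately fixed this issue), the 1.6-style toleration could also be grafted onto the running Deployment with a JSON patch along these lines:

# hypothetical workaround, not taken from the thread; adds the tolerations field directly to the pod template
kubectl -n kube-system patch deployment dns-controller --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}]}]'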

@chrislovecnm
Contributor

I think this is a duplicate

@ashic
Author

ashic commented May 18, 2017

It starts off similar... two other tickets seem to be about creating the config map and redeploying weave. I've done that, and the cluster validates. However, the dns-controller ReplicaSet can't deploy its pod, due to taints (I think). kops get ig returns this, by the way:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-04-20T13:18:42Z
  labels:
    kops.k8s.io/cluster: udpmarkit.com
  name: master-eu-west-1c
spec:
  image: kope.io/k8s-1.6-debian-jessie-amd64-hvm-ebs-2017-05-02
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-west-1c

Is there a taint I need to apply?

@ashic
Author

ashic commented May 18, 2017

Thanks to @justinsb's deep diagnostic session, we found a solution. It looks like the 1.6 toleration got written, but then got overwritten by something (possibly an HA master still on 1.5).

kubectl get deployment dns-controller -oyaml -n kube-system

will show a missing tolerations section in the pod spec (not the 1.5 annotation - that'll still be there).
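
A quick way to confirm that the field really is missing (plain kubectl jsonpath output; an empty result means no tolerations are set on the pod template):

kubectl -n kube-system get deployment dns-controller \
  -o jsonpath='{.spec.template.spec.tolerations}'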

The solution was to run

kubectl edit namespace kube-system

and simply remove the line that has the dns-controller annotation. Once that's done, it self-heals, and dns-controller starts running as expected.
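
For anyone hunting for that line: it is the addon-manager annotation recording the applied dns-controller addon. The exact key and value below are an assumption from memory of how kops records addons, not a copy from a real cluster:

# hypothetical excerpt of `kubectl edit namespace kube-system`; deleting this
# annotation lets the kops addon manager re-apply the dns-controller manifest
metadata:
  annotations:
    addons.k8s.io/dns-controller.addons.k8s.io: '{"version":"1.6.1"}'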

@ashic ashic closed this as completed May 18, 2017
@chrislovecnm
Contributor

I am going to reopen this to see if we can recreate.

@chrislovecnm chrislovecnm reopened this May 18, 2017
@chrislovecnm
Contributor

This issue is being tracked upstream kubernetes/kubernetes#46073

@bcorijn
Contributor

bcorijn commented May 31, 2017

Just ran into this as well!

I rolled my public-topology cluster from 1.5.4 to 1.6.4 on Friday, and all went quite smoothly. Yesterday I noticed two of my three masters had regular pods scheduled on them and didn't have their taints properly set up. Stopping them caused the ASG to kick in with a new instance, which seemed fine. As a side effect, however, the dns-controller got into the state described above. Removing the dns-controller add-on annotation indeed caused it to self-heal.

Edit: spoke too soon, something fishy is still going on. dns-controller spun up a new ReplicaSet, but the dashboard has not picked up on it, and only shows the old, empty RS as "New Replica Set" on the deployment page...

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 25, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 24, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
