dns controller rs fails to run pod after upgrade from 1.5 to 1.6.2 #2594

Closed
ashic opened this issue May 18, 2017 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ashic

ashic commented May 18, 2017

I ran kops upgrade cluster with a new kops binary, and at first things didn't work as expected. I replaced the weave-net stuff, created the kube-dns ConfigMap (based on other reported issues), and nearly everything works: the kube UI is available, but not all my apps are running. However, digging deeper, I see that the dns-controller ReplicaSet can't find a node to run its single pod on. Running kops edit ig [ig-name-for-a-master] shows me this:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-04-20T13:18:41Z
  labels:
    kops.k8s.io/cluster: udpmarkit.com
  name: master-eu-west-1a
spec:
  image: kope.io/k8s-1.6-debian-jessie-amd64-hvm-ebs-2017-05-02
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-west-1a

Running describe for a master node shows me the following:

Labels:		beta.kubernetes.io/arch=amd64
			beta.kubernetes.io/instance-type=t2.medium
			beta.kubernetes.io/os=linux
			failure-domain.beta.kubernetes.io/region=eu-west-1
			failure-domain.beta.kubernetes.io/zone=eu-west-1a
			kubernetes.io/hostname=ip-10-4-144-212.eu-west-1.compute.internal
			kubernetes.io/role=master
			node-role.kubernetes.io/master=
Annotations:		node.alpha.kubernetes.io/ttl=0
			volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:			node-role.kubernetes.io/master:NoSchedule
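
(For context: under Kubernetes 1.6 a pod will only be scheduled onto a node carrying this NoSchedule taint if its pod spec has a matching entry in the tolerations field, rather than the old 1.5-style annotation. A minimal sketch of what that field looks like for the master taint above; the surrounding Deployment fields are assumed, not copied from this cluster:)

# hypothetical excerpt of a 1.6-style pod template; only the tolerations field matters here
spec:
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule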

The dns-controller ReplicaSets have the following descriptions:

Name:		dns-controller-832541570
Namespace:	kube-system
Selector:	k8s-app=dns-controller,pod-template-hash=832541570
Labels:		k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=832541570
		version=v1.6.1
Annotations:	deployment.kubernetes.io/desired-replicas=1
		deployment.kubernetes.io/max-replicas=2
		deployment.kubernetes.io/revision=3
Replicas:	1 current / 1 desired
Pods Status:	0 Running / 1 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:		k8s-addon=dns-controller.addons.k8s.io
			k8s-app=dns-controller
			pod-template-hash=832541570
			version=v1.6.1
  Annotations:		scheduler.alpha.kubernetes.io/critical-pod=
			scheduler.alpha.kubernetes.io/tolerations=[{"key": "dedicated", "value": "master"}]
  Service Account:	dns-controller
  Containers:
   dns-controller:
    Image:	kope/dns-controller:1.6.1
    Port:	
    Command:
      /usr/bin/dns-controller
      --watch-ingress=false
      --dns=aws-route53
      --zone=*/Z1GY2OSJNZX79E
      --zone=*/*
      -v=2
    Requests:
      cpu:		50m
      memory:		50Mi
    Environment:	<none>
    Mounts:		<none>
  Volumes:		<none>
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----			-------------	--------	------			-------
  45m		45m		1	replicaset-controller			Normal		SuccessfulCreate	Created pod: dns-controller-832541570-nwtp5

Name:		dns-controller-945720267
Namespace:	kube-system
Selector:	k8s-app=dns-controller,pod-template-hash=945720267
Labels:		k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=945720267
		version=v1.5.2
Annotations:	deployment.kubernetes.io/desired-replicas=1
		deployment.kubernetes.io/max-replicas=2
		deployment.kubernetes.io/revision=1
Replicas:	0 current / 0 desired
Pods Status:	0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:	k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=945720267
		version=v1.5.2
  Annotations:	scheduler.alpha.kubernetes.io/critical-pod=
		scheduler.alpha.kubernetes.io/tolerations=[{"key": "dedicated", "value": "master"}]
  Containers:
   dns-controller:
    Image:	kope/dns-controller:1.5.2
    Port:	
    Command:
      /usr/bin/dns-controller
      --watch-ingress=false
      --dns=aws-route53
      --zone=*/Z1GY2OSJNZX79E
      --zone=*/*
      -v=2
    Requests:
      cpu:		50m
      memory:		50Mi
    Environment:	<none>
    Mounts:		<none>
  Volumes:		<none>
Events:			<none>

Name:		dns-controller-945982411
Namespace:	kube-system
Selector:	k8s-app=dns-controller,pod-template-hash=945982411
Labels:		k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=945982411
		version=v1.6.1
Annotations:	deployment.kubernetes.io/desired-replicas=1
		deployment.kubernetes.io/max-replicas=2
		deployment.kubernetes.io/revision=2
Replicas:	0 current / 0 desired
Pods Status:	0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:	k8s-addon=dns-controller.addons.k8s.io
		k8s-app=dns-controller
		pod-template-hash=945982411
		version=v1.6.1
  Annotations:	scheduler.alpha.kubernetes.io/critical-pod=
		scheduler.alpha.kubernetes.io/tolerations=[{"key": "dedicated", "value": "master"}]
  Containers:
   dns-controller:
    Image:	kope/dns-controller:1.6.1
    Port:	
    Command:
      /usr/bin/dns-controller
      --watch-ingress=false
      --dns=aws-route53
      --zone=*/Z1GY2OSJNZX79E
      --zone=*/*
      -v=2
    Requests:
      cpu:		50m
      memory:		50Mi
    Environment:	<none>
    Mounts:		<none>
  Volumes:		<none>
Events:			<none>

I tried to kubectl replace with the file https://raw.githubusercontent.com/kubernetes/kops/release-1.6/upup/models/cloudup/resources/addons/dns-controller.addons.k8s.io/k8s-1.6.yaml.template, but it fails, stating:

error: error converting YAML to JSON: yaml: line 37: found unexpected ':'
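
That .template file presumably still contains kops templating directives, so it isn't plain YAML until kops renders it. Purely as a sketch (not what ultimately fixed this issue), the 1.6-style toleration could also be grafted onto the running Deployment with a JSON patch along these lines:

# hypothetical workaround, not taken from the thread; adds the tolerations field directly to the pod template
kubectl -n kube-system patch deployment dns-controller --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}]}]'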

@chrislovecnm
Contributor

I think this is a duplicate

@ashic
Author

ashic commented May 18, 2017

It starts off similar... two other tickets seem to be about creating the config map and redeploying weave. I've done that, and the cluster validates. However, the dns-controller ReplicaSet can't deploy its pod, due to taints (I think). kops get ig returns this, by the way:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-04-20T13:18:42Z
  labels:
    kops.k8s.io/cluster: udpmarkit.com
  name: master-eu-west-1c
spec:
  image: kope.io/k8s-1.6-debian-jessie-amd64-hvm-ebs-2017-05-02
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-west-1c

Is there a taint I need to apply?

@ashic
Author

ashic commented May 18, 2017

Thanks to @justinsb's deep diagnostic session, we found a solution. It looks like the 1.6 toleration got written, but then got overwritten by something (possibly an HA master still on 1.5).

kubectl get deployment dns-controller -oyaml -n kube-system

will show a missing tolerations section in the pod spec (not the 1.5 annotation - that'll still be there).
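
A quick way to confirm that the field really is missing (plain kubectl jsonpath output; an empty result means no tolerations are set on the pod template):

kubectl -n kube-system get deployment dns-controller \
  -o jsonpath='{.spec.template.spec.tolerations}'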

The solution was to run

kubectl edit namespace kube-system

and simply remove the line that has the dns-controller annotation. Once that's done, it self-heals, and dns-controller starts running as expected.
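
For anyone hunting for that line: it is the addon-manager annotation recording the applied dns-controller addon. The exact key and value below are an assumption from memory of how kops records addons, not a copy from a real cluster:

# hypothetical excerpt of `kubectl edit namespace kube-system`; deleting this
# annotation lets the kops addon manager re-apply the dns-controller manifest
metadata:
  annotations:
    addons.k8s.io/dns-controller.addons.k8s.io: '{"version":"1.6.1"}'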

@ashic ashic closed this as completed May 18, 2017
@chrislovecnm
Contributor

I am going to reopen this to see if we can recreate.

@chrislovecnm chrislovecnm reopened this May 18, 2017
@chrislovecnm
Contributor

This issue is being tracked upstream kubernetes/kubernetes#46073

@bcorijn
Contributor

bcorijn commented May 31, 2017

Just ran into this as well!

I rolled my public-topology cluster from 1.5.4 to 1.6.4 on Friday, and all went quite smoothly. Yesterday I noticed two of my three masters had regular pods scheduled on them and didn't have their taints properly set up. Stopping them caused the ASG to kick in with a new instance, which seemed fine. As a side effect, however, the dns-controller got into the state described above. Removing the dns-controller add-on annotation indeed caused it to self-heal.

Edit: spoke too soon, something fishy is still going on. dns-controller spun up a new ReplicaSet, but the dashboard has not picked up on it, and only shows the old, empty RS as "New Replica Set" on the deployment page...

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 25, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 24, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
