Rolling update continues even when cluster does not validate #7258

Closed
eherot opened this issue Jul 17, 2019 · 2 comments · Fixed by #7872
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

eherot (Contributor) commented Jul 17, 2019

1. What kops version are you running? The command kops version will display
this information.

❯ kops version
Version 1.12.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

❯ kubectl --context=staging version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-20T04:49:16Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate:"2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

❯ kops rolling-update cluster kops-cluster.staging.k8s --yes -i

Note: This is a v1.11.10 -> v1.12.8 upgrade. This happened during the upgrade of the worker nodes.

5. What happened after the commands executed?

Several nodes rolled normally, but partway through the upgrade the quay.io registry went offline (!!!), so several essential pods failed to start with ImagePullBackOff. Kops waited the expected 5m0s for the cluster to validate (which it obviously wasn't going to do), but then proceeded anyway (!!!). Here's the output from the event:

I0717 14:51:25.179160   99040 instancegroups.go:299] Stopping instance "i-0101a420a112430de", node "ip-172-30-145-161.ec2.internal", in group "stateful-nodes-us-east-1c.kops-cluster.staging.k8s" (this may take a while).
I0717 14:51:25.722198   99040 instancegroups.go:198] waiting for 4m0s after terminating instance
I0717 14:55:25.730292   99040 instancegroups.go:209] Validating the cluster.
I0717 14:55:28.103043   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:55:59.193347   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:56:29.196086   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:56:59.047951   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:57:28.948411   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:57:58.961009   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:58:28.978373   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:58:59.003760   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:59:29.104439   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:59:59.203114   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
E0717 15:00:28.113352   99040 instancegroups.go:214] Cluster did not validate within 5m0s
I0717 15:00:29.002986   99040 instancegroups.go:165] Draining the node: "ip-172-30-150-249.ec2.internal".
node/ip-172-30-150-249.ec2.internal cordoned
node/ip-172-30-150-249.ec2.internal cordoned

6. What did you expect to happen?

kops rolling-update should have stopped with an error when the cluster failed to validate.
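
To make that expectation concrete, here is a rough sketch in Go (hypothetical names, not kops's actual code): it keeps the 30s/5m0s retry behavior visible in the logs above, but returns the error to the caller so the rolling update can abort instead of draining the next node.

package main

import (
	"errors"
	"fmt"
	"time"
)

// validateCluster stands in for kops's cluster validation; here it always
// fails, mimicking a node stuck NotReady because of ImagePullBackOff.
func validateCluster() error {
	return errors.New(`node "ip-172-30-151-120.ec2.internal" is not ready`)
}

// waitForValidation retries validation every interval until timeout and,
// crucially, returns the final error so the caller can abort the roll.
func waitForValidation(interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := validateCluster()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("cluster did not validate within %s: %v", timeout, err)
		}
		fmt.Printf("cluster did not pass validation, will try again in %s: %v\n", interval, err)
		time.Sleep(interval)
	}
}

func main() {
	if err := waitForValidation(30*time.Second, 5*time.Minute); err != nil {
		// Expected behavior: stop here with an error instead of
		// draining the next node (ip-172-30-150-249.ec2.internal).
		fmt.Println("aborting rolling update:", err)
		return
	}
}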

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: kops-cluster.staging.k8s
spec:
  additionalPolicies:
    node: |
      [
        {
          "Action": [
            "sts:AssumeRole"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:iam::267230788984:role/secure-payment-serv",
            "arn:aws:iam::267230788984:role/ReadLambdaLogs"
          ]
        },
        {
          "Action": [
            "ec2:*Volume"
          ],
          "Effect": "Allow",
          "Resource": [
            "*"
          ]
        }

      ]
  api:
    loadBalancer:
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://gb-staging-kops-state/kops-cluster.staging.k8s
  dnsZone: staging.k8s
  docker:
    logDriver: json-file
    logLevel: warn
    logOpt:
    - max-size=10m
    - max-file=5
    storage: overlay2
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  hooks:
  - before:
    - network-pre.target
    - kubelet.service
    manifest: |
      Type=oneshot
      ExecStart=/usr/sbin/modprobe br_netfilter
      [Unit]
      Wants=network-pre.target
      [Install]
      WantedBy=multi-user.target
    name: fix-dns.service
    roles:
    - Node
    - Master
  - before:
    - locksmithd.service
    manifest: |
      Type=oneshot
      ExecStart=/usr/bin/systemctl mask --now locksmithd.service
    name: disable-locksmithd.service
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.8
  masterInternalName: api.internal.kops-cluster.staging.k8s
  masterPublicName: api.kops-cluster.staging.k8s
  networkCIDR: 172.30.0.0/16
  networkID: vpc-21945c47
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.30.128.0/21
    name: us-east-1a
    type: Private
    zone: us-east-1a
  - cidr: 172.30.136.0/21
    name: us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 172.30.144.0/21
    name: us-east-1c
    type: Private
    zone: us-east-1c
  - cidr: 172.30.32.0/23
    name: utility-us-east-1a
    type: Utility
    zone: us-east-1a
  - cidr: 172.30.34.0/23
    name: utility-us-east-1b
    type: Utility
    zone: us-east-1b
  - cidr: 172.30.36.0/23
    name: utility-us-east-1c
    type: Utility
    zone: us-east-1c
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
  updatePolicy: external

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T23:41:24Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: master-us-east-1a
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - us-east-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T23:41:24Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: master-us-east-1b
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1b
  role: Master
  subnets:
  - us-east-1b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T23:41:24Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: master-us-east-1c
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1c
  role: Master
  subnets:
  - us-east-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-11-30T17:17:42Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: monitoring-nodes
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: c4.large
  maxSize: 2
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: monitoring-nodes
    monitoring: enabled
    run_type: ephemeral
  role: Node
  subnets:
  - us-east-1c
  taints:
  - monitoring=enabled:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T23:41:24Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: nodes
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: c4.large
  maxSize: 6
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
    run_type: ephemeral
  role: Node
  subnets:
  - us-east-1a
  - us-east-1b
  - us-east-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-01-10T19:43:18Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: stateful-nodes-us-east-1a
spec:
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 2
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: stateful-nodes-us-east-1a
    run_type: stateful
  role: Node
  subnets:
  - us-east-1a
  taints:
  - stateful=true:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-01-10T19:43:51Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: stateful-nodes-us-east-1b
spec:
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 2
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: stateful-nodes-us-east-1b
    run_type: stateful
  role: Node
  subnets:
  - us-east-1b
  taints:
  - stateful=true:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-01-10T19:45:28Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: stateful-nodes-us-east-1c
spec:
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 2
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: stateful-nodes-us-east-1c
    run_type: stateful
  role: Node
  subnets:
  - us-east-1c
  taints:
  - stateful=true:NoSchedule

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

I would love to do this but I would no longer be starting from a stable baseline so the results would be fairly meaningless. Also, quay.io is still offline. ;-)

9. Anything else we need to know?

This is pretty bad; I hope we can figure out what happened even without debug logging!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Oct 31, 2019.
johngmyers (Member) commented Nov 1, 2019

Even when an error occurs while rolling a nodes instance group, such as a failure to validate the cluster, kops proceeds to do a rolling update of the next instance group.

In pkg/instancegroups/rollingupdate.go there is the comment:

				// TODO: Bail on error?
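
For illustration, bailing out could look roughly like the following sketch (a minimal standalone example; rollInstanceGroup and groups are stand-ins, not the actual identifiers in rollingupdate.go):

package main

import (
	"errors"
	"fmt"
)

// rollInstanceGroup stands in for the per-group update logic in
// pkg/instancegroups/rollingupdate.go; it returns an error when the group
// fails to roll, e.g. because the cluster did not validate afterwards.
func rollInstanceGroup(name string) error {
	if name == "stateful-nodes-us-east-1c" {
		return errors.New("cluster did not validate within 5m0s")
	}
	return nil
}

func main() {
	groups := []string{"nodes", "stateful-nodes-us-east-1c", "monitoring-nodes"}
	for _, g := range groups {
		if err := rollInstanceGroup(g); err != nil {
			// Today the error is effectively logged and the loop moves on
			// to the next instance group (the TODO above); bailing here
			// stops the rolling update instead.
			fmt.Printf("rolling update failed on instance group %q: %v\n", g, err)
			return
		}
		fmt.Printf("instance group %q rolled and validated\n", g)
	}
}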
