Rolling update continues even when cluster does not validate #7258

Closed
eherot opened this issue Jul 17, 2019 · 2 comments · Fixed by #7872
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

eherot (Contributor) commented Jul 17, 2019

1. What kops version are you running? The command kops version will display
this information.

❯ kops version
Version 1.12.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

❯ kubectl --context=staging version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-20T04:49:16Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate:"2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

❯ kops rolling-update cluster kops-cluster.staging.k8s --yes -i

Note: This is a v1.11.10 -> v1.12.8 upgrade. This happened during the upgrade of the worker nodes.

5. What happened after the commands executed?

Several nodes rolled normally, but partway through the upgrade the quay.io registry went offline (!!!), so several essential pods failed to start with ImagePullBackOff. Kops waited the expected 5m0s for the cluster to validate (which it obviously wasn't going to do), but then proceeded anyway (!!!). Here's the output from the event:

I0717 14:51:25.179160   99040 instancegroups.go:299] Stopping instance "i-0101a420a112430de", node "ip-172-30-145-161.ec2.internal", in group "stateful-nodes-us-east-1c.kops-cluster.staging.k8s" (this may take a while).
I0717 14:51:25.722198   99040 instancegroups.go:198] waiting for 4m0s after terminating instance
I0717 14:55:25.730292   99040 instancegroups.go:209] Validating the cluster.
I0717 14:55:28.103043   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:55:59.193347   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:56:29.196086   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:56:59.047951   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:57:28.948411   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:57:58.961009   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:58:28.978373   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:58:59.003760   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:59:29.104439   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:59:59.203114   99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
E0717 15:00:28.113352   99040 instancegroups.go:214] Cluster did not validate within 5m0s
I0717 15:00:29.002986   99040 instancegroups.go:165] Draining the node: "ip-172-30-150-249.ec2.internal".
node/ip-172-30-150-249.ec2.internal cordoned
node/ip-172-30-150-249.ec2.internal cordoned

6. What did you expect to happen?

kops rolling-update should have stopped with an error when the cluster failed to validate.
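
To make that expectation concrete, here is a rough sketch in Go (hypothetical names, not kops's actual code): it keeps the 30s/5m0s retry behavior visible in the logs above, but returns the error to the caller so the rolling update can abort instead of draining the next node.

package main

import (
	"errors"
	"fmt"
	"time"
)

// validateCluster stands in for kops's cluster validation; here it always
// fails, mimicking a node stuck NotReady because of ImagePullBackOff.
func validateCluster() error {
	return errors.New(`node "ip-172-30-151-120.ec2.internal" is not ready`)
}

// waitForValidation retries validation every interval until timeout and,
// crucially, returns the final error so the caller can abort the roll.
func waitForValidation(interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := validateCluster()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("cluster did not validate within %s: %v", timeout, err)
		}
		fmt.Printf("cluster did not pass validation, will try again in %s: %v\n", interval, err)
		time.Sleep(interval)
	}
}

func main() {
	if err := waitForValidation(30*time.Second, 5*time.Minute); err != nil {
		// Expected behavior: stop here with an error instead of
		// draining the next node (ip-172-30-150-249.ec2.internal).
		fmt.Println("aborting rolling update:", err)
		return
	}
}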

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: kops-cluster.staging.k8s
spec:
  additionalPolicies:
    node: |
      [
        {
          "Action": [
            "sts:AssumeRole"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:iam::267230788984:role/secure-payment-serv",
            "arn:aws:iam::267230788984:role/ReadLambdaLogs"
          ]
        },
        {
          "Action": [
            "ec2:*Volume"
          ],
          "Effect": "Allow",
          "Resource": [
            "*"
          ]
        }

      ]
  api:
    loadBalancer:
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://gb-staging-kops-state/kops-cluster.staging.k8s
  dnsZone: staging.k8s
  docker:
    logDriver: json-file
    logLevel: warn
    logOpt:
    - max-size=10m
    - max-file=5
    storage: overlay2
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  hooks:
  - before:
    - network-pre.target
    - kubelet.service
    manifest: |
      Type=oneshot
      ExecStart=/usr/sbin/modprobe br_netfilter
      [Unit]
      Wants=network-pre.target
      [Install]
      WantedBy=multi-user.target
    name: fix-dns.service
    roles:
    - Node
    - Master
  - before:
    - locksmithd.service
    manifest: |
      Type=oneshot
      ExecStart=/usr/bin/systemctl mask --now locksmithd.service
    name: disable-locksmithd.service
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.8
  masterInternalName: api.internal.kops-cluster.staging.k8s
  masterPublicName: api.kops-cluster.staging.k8s
  networkCIDR: 172.30.0.0/16
  networkID: vpc-21945c47
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.30.128.0/21
    name: us-east-1a
    type: Private
    zone: us-east-1a
  - cidr: 172.30.136.0/21
    name: us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 172.30.144.0/21
    name: us-east-1c
    type: Private
    zone: us-east-1c
  - cidr: 172.30.32.0/23
    name: utility-us-east-1a
    type: Utility
    zone: us-east-1a
  - cidr: 172.30.34.0/23
    name: utility-us-east-1b
    type: Utility
    zone: us-east-1b
  - cidr: 172.30.36.0/23
    name: utility-us-east-1c
    type: Utility
    zone: us-east-1c
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
  updatePolicy: external

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T23:41:24Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: master-us-east-1a
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - us-east-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T23:41:24Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: master-us-east-1b
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1b
  role: Master
  subnets:
  - us-east-1b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T23:41:24Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: master-us-east-1c
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1c
  role: Master
  subnets:
  - us-east-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-11-30T17:17:42Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: monitoring-nodes
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: c4.large
  maxSize: 2
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: monitoring-nodes
    monitoring: enabled
    run_type: ephemeral
  role: Node
  subnets:
  - us-east-1c
  taints:
  - monitoring=enabled:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T23:41:24Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: nodes
spec:
  associatePublicIp: false
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: c4.large
  maxSize: 6
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
    run_type: ephemeral
  role: Node
  subnets:
  - us-east-1a
  - us-east-1b
  - us-east-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-01-10T19:43:18Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: stateful-nodes-us-east-1a
spec:
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 2
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: stateful-nodes-us-east-1a
    run_type: stateful
  role: Node
  subnets:
  - us-east-1a
  taints:
  - stateful=true:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-01-10T19:43:51Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: stateful-nodes-us-east-1b
spec:
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 2
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: stateful-nodes-us-east-1b
    run_type: stateful
  role: Node
  subnets:
  - us-east-1b
  taints:
  - stateful=true:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-01-10T19:45:28Z
  labels:
    kops.k8s.io/cluster: kops-cluster.staging.k8s
  name: stateful-nodes-us-east-1c
spec:
  image: aws-marketplace/CoreOS-stable-2135.5.0-hvm-0d1e0bd0-eaea-4397-9a3a-c56f861d2a14-ami-02b51824b39a1d52a.4
  machineType: m3.medium
  maxSize: 2
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: stateful-nodes-us-east-1c
    run_type: stateful
  role: Node
  subnets:
  - us-east-1c
  taints:
  - stateful=true:NoSchedule

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

I would love to do this but I would no longer be starting from a stable baseline so the results would be fairly meaningless. Also, quay.io is still offline. ;-)

9. Anything else we need to know?

This is pretty bad; I hope we can figure out what happened even without debug logging!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Oct 31, 2019.
johngmyers (Member) commented Nov 1, 2019

Even when an error occurs while rolling a nodes instance group, such as a failure to validate the cluster, kops proceeds to do a rolling update of the next instance group.

In pkg/instancegroups/rollingupdate.go there is the comment:

				// TODO: Bail on error?
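
For illustration, bailing out could look roughly like the following sketch (a minimal standalone example; rollInstanceGroup and groups are stand-ins, not the actual identifiers in rollingupdate.go):

package main

import (
	"errors"
	"fmt"
)

// rollInstanceGroup stands in for the per-group update logic in
// pkg/instancegroups/rollingupdate.go; it returns an error when the group
// fails to roll, e.g. because the cluster did not validate afterwards.
func rollInstanceGroup(name string) error {
	if name == "stateful-nodes-us-east-1c" {
		return errors.New("cluster did not validate within 5m0s")
	}
	return nil
}

func main() {
	groups := []string{"nodes", "stateful-nodes-us-east-1c", "monitoring-nodes"}
	for _, g := range groups {
		if err := rollInstanceGroup(g); err != nil {
			// Today the error is effectively logged and the loop moves on
			// to the next instance group (the TODO above); bailing here
			// stops the rolling update instead.
			fmt.Printf("rolling update failed on instance group %q: %v\n", g, err)
			return
		}
		fmt.Printf("instance group %q rolled and validated\n", g)
	}
}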
