1. What kops version are you running? The command kops version will display this information.
❯ kops version
Version 1.12.2
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Note: This is a v1.11.10 -> v1.12.8 upgrade. This happened during the upgrade of the worker nodes.
5. What happened after the commands executed?
Several nodes rolled normally; however, during the upgrade process the quay.io registry went offline (!!!), so several essential pods failed to start with ImagePullBackOff. kops waited the appropriate 5m0s for the cluster to validate (which it obviously wasn't going to), but then proceeded anyway (!!!). Here's the output from the event (a rough sketch of the validate-and-retry loop follows the log):
I0717 14:51:25.179160 99040 instancegroups.go:299] Stopping instance "i-0101a420a112430de", node "ip-172-30-145-161.ec2.internal", in group "stateful-nodes-us-east-1c.kops-cluster.staging.k8s" (this may take a while).
I0717 14:51:25.722198 99040 instancegroups.go:198] waiting for 4m0s after terminating instance
I0717 14:55:25.730292 99040 instancegroups.go:209] Validating the cluster.
I0717 14:55:28.103043 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:55:59.193347 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:56:29.196086 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:56:59.047951 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:57:28.948411 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:57:58.961009 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:58:28.978373 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:58:59.003760 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:59:29.104439 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
I0717 14:59:59.203114 99040 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: node "ip-172-30-151-120.ec2.internal" is not ready.
E0717 15:00:28.113352 99040 instancegroups.go:214] Cluster did not validate within 5m0s
I0717 15:00:29.002986 99040 instancegroups.go:165] Draining the node: "ip-172-30-150-249.ec2.internal".
node/ip-172-30-150-249.ec2.internal cordoned
node/ip-172-30-150-249.ec2.internal cordoned
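For context, the log lines above follow a validate-with-retry pattern roughly like the sketch below. This is a hypothetical illustration of the 30s retry / 5m0s timeout behaviour seen in the log, not the actual kops source; all names are made up.

```go
// Hypothetical sketch of the validate-and-retry pattern suggested by the log
// output above. Not kops source code; names and structure are assumptions.
package main

import (
	"errors"
	"fmt"
	"time"
)

// validateCluster stands in for the real cluster validation call. In the
// incident above it kept failing because pods were stuck in ImagePullBackOff
// while quay.io was offline.
func validateCluster() error {
	return errors.New(`node "ip-172-30-151-120.ec2.internal" is not ready`)
}

// validateWithTimeout retries validation every interval until timeout expires,
// matching the "will try again in 30s until 5m0s expires" lines in the log.
func validateWithTimeout(interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := validateCluster()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("cluster did not validate within %v", timeout)
		}
		fmt.Printf("Cluster did not pass validation, will try again in %q until duration %q expires: %v\n",
			interval, timeout, err)
		time.Sleep(interval)
	}
}

func main() {
	// 30s retry interval and 5m0s timeout, as in the log above. The complaint
	// in this issue is that the resulting error was only logged, and the
	// rolling update then carried on to drain the next node anyway.
	if err := validateWithTimeout(30*time.Second, 5*time.Minute); err != nil {
		fmt.Println("E:", err)
	}
}
```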
6. What did you expect to happen?
kops rolling-update should have stopped with an error when the cluster failed to validate.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
I would love to do this but I would no longer be starting from a stable baseline so the results would be fairly meaningless. Also, quay.io is still offline. ;-)
9. Anything else we need to know?
This is pretty bad, I hope we can figure out what happened even without debugging!
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Even when there is an error, such as a failure to validate the cluster, while doing a rolling update of a node instance group, kops will proceed to do a rolling update of the next instance group.
In pkg/instancegroups/rollingupdate.go there is the comment:
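To make the concern concrete, here is a minimal Go sketch of the behaviour being described versus the behaviour the reporter expected. The types and function names are hypothetical, not the actual code in pkg/instancegroups/rollingupdate.go; the point is only that the per-group error is swallowed, so later instance groups are still rolled.

```go
// Hypothetical sketch: observed vs. expected handling of a per-instance-group
// error during a rolling update. Not the actual kops implementation.
package main

import (
	"errors"
	"fmt"
)

type instanceGroup struct{ name string }

// rollingUpdateGroup stands in for the per-group rolling update, which can
// fail (e.g. "cluster did not validate within 5m0s").
func rollingUpdateGroup(g instanceGroup) error {
	if g.name == "stateful-nodes-us-east-1c" {
		return errors.New("cluster did not validate within 5m0s")
	}
	return nil
}

// Observed behaviour: the error is logged and the loop moves on to the next group.
func rollingUpdateAllObserved(groups []instanceGroup) {
	for _, g := range groups {
		if err := rollingUpdateGroup(g); err != nil {
			fmt.Printf("error rolling group %s: %v (continuing anyway)\n", g.name, err)
		}
	}
}

// Expected behaviour: stop and surface the error before touching later groups.
func rollingUpdateAllExpected(groups []instanceGroup) error {
	for _, g := range groups {
		if err := rollingUpdateGroup(g); err != nil {
			return fmt.Errorf("aborting rolling update: group %s failed: %w", g.name, err)
		}
	}
	return nil
}

func main() {
	groups := []instanceGroup{
		{"stateful-nodes-us-east-1c"},
		{"nodes-us-east-1d"},
	}
	rollingUpdateAllObserved(groups)
	if err := rollingUpdateAllExpected(groups); err != nil {
		fmt.Println(err)
	}
}
```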