Rolling Upgrade 1.5 -> 1.6 Issues & Evaluation #2674

Closed
jrnt30 opened this issue Jun 2, 2017 · 4 comments

Comments

@jrnt30
Contributor

jrnt30 commented Jun 2, 2017

I wanted to provide an analysis of a variety of upgrade issues people have been seeing, which I believe are directly related to:

K8 Dropping Attributes: kubernetes/kubernetes#46073
DNS RS Issues: #2594
Weave issues: #2366

Process:

Currently, my understanding of the kops 1.5 -> 1.6 rolling upgrade is that it proceeds as follows (a command-level sketch follows the list):

  • Kops identifies a master to take out, then cordons and drains the node
  • Kops terminates the instance and the ASG creates a new one
  • As the new master node on 1.6 bootstraps itself, it registers itself with the master (still 1.5)
  • Kops determines there are several items from the bootstrap manifest that need to be updated (such as weave-net) and submits those to the cluster, which still has a 1.5 master
  • After some time, health checks pass and it moves on to the next node
  • Kops identifies the second master and cordons and drains that node as well
  • Kops terminates the instance and the ASG creates a second new node
  • The second "new" master on 1.6 bootstraps itself and registers with a master (most likely still 1.5, but not guaranteed)
  • Health checks pass and it moves on to the third node
  • Kops cordons, drains & terminates the third node
  • The third "new" master on 1.6 bootstraps itself and registers with a master (which is now 1.6)

Impact:

  • Since the first (and probably the second) 1.6 master node came up against a 1.5 master, several things occur (a quick way to verify this is sketched after this list)
  • Any of the 1.6-specific configuration attributes (such as tolerations) are either directly ignored or stripped out when the 1.5 master reconciles the state of the cluster (this can be seen by looking at the last-applied-configuration, as Justin noted and [we see as well](https://github.com/kubernetes/kops/issues/2605#issuecomment-305650544))
    -- There are no taints on the 1st and 2nd masters, but since the 3rd master comes up against a 1.6 cluster it does have them
    -- Any DaemonSet updates (such as weave-net) are missing their tolerations section
  • Since masters 1 and 2 have no taints, weave-net gets scheduled on them. Since master 3 does have the taints but the weave-net DaemonSet does not have the tolerations (they were stripped), the 3rd master will never have the DaemonSet pod placed on it.
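
A quick way to verify both symptoms from the list above (node name is a placeholder; weave-net lives in kube-system in a default kops install):

# do the masters actually carry the 1.6 NoSchedule taint?
kubectl describe node <master-node-name> | grep -i taints

# did the weave-net DaemonSet keep its tolerations after the 1.5 master reconciled it?
kubectl get ds weave-net -n kube-system -o yaml | grep -A 5 tolerations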

Potential Solution/Hack

I can see this happening with essentially every major release in which features graduate from annotations into "real" Kubernetes-supported attributes.

Due to the volume attachment for etcd, etc., I don't know that we could simply bring up a 4th node as the "new" master and force an election until that takes place; however, as a hack it may be possible to:

  • Bring up the first new node.
  • Run the new bootstrap manifests
  • Wait for it to become healthy
  • Force an election to favor the new node (I don't know the leadership election process well enough to know whether this is easy to force or just luck)
  • Taint the node and rerun the bootstrap manifests to force a re-evaluation against the new version, so the attributes are respected and maintained (a rough sketch follows this list)
  • Continue the rolling-update for additional nodes
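
A rough sketch of the taint-and-rerun step (node name and state store are placeholders; the taint key here follows the usual node-role.kubernetes.io/master convention and should be confirmed against what a master that bootstrapped under a 1.6 API server actually gets):

# manually re-apply the master taint the 1.5 control plane dropped
kubectl taint nodes <new-master-name> node-role.kubernetes.io/master=:NoSchedule

# re-run the bootstrap channel (normally done by protokube on the master) so the
# DaemonSets are re-evaluated against the 1.6 API and keep their tolerations
channels apply channel s3://<state-store>/addons/bootstrap-channel.yaml --v=4 --yes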

A more comprehensive solution which would be broader than Kops might be:

  • Have K8 always keep the equivalent of kubectl.kubernetes.io/last-applied-configuration for the resources
  • Have a validation loop that diffs the "initially applied" configuration to determine if there are structural changes it can now support which it could not before (such as the taints on the masters or the tolerations on the DaemonSets); a rough manual version of this diff is sketched below
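
A rough manual version of that diff today, using the weave-net DaemonSet as the example resource (bash, since it uses process substitution; the output will be noisy because the live object also carries status and defaulted fields):

# pull the recorded last-applied configuration off the DaemonSet
kubectl get ds weave-net -n kube-system \
  -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}' > last-applied.json

# export the live object and diff the two to spot fields (e.g. tolerations) the old API server dropped
kubectl get ds weave-net -n kube-system -o json > live.json
diff <(python -m json.tool last-applied.json) <(python -m json.tool live.json)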
@rdtr
Contributor

rdtr commented Jun 2, 2017

Possibly duplicate discussion?
kubernetes/kubernetes#46073

@jrnt30
Contributor Author

jrnt30 commented Jun 2, 2017

I believe they are related; I mentioned that one at the top, but I wanted to find a way to condense several kops-related issues and outline my findings.

@pnegahdar

pnegahdar commented Jun 4, 2017

I also attempted to add a 1.6 IG to a 1.5 cluster.

The initialization task fails with:

I0603 23:20:50.690862       1 kube_boot.go:156] ensuring that kubelet systemd service is running
I0603 23:20:50.714157       1 channels.go:47] checking channel: "s3://store/addons/bootstrap-channel.yaml"
I0603 23:20:50.714221       1 channels.go:34] Running command: channels apply channel s3://store/addons/bootstrap-channel.yaml --v=4 --yes
I0603 23:20:50.783737       1 channels.go:37] error running channels apply channel s3://strore/addons/bootstrap-channel.yaml --v=4 --yes:
I0603 23:20:50.783759       1 channels.go:38] Error: error querying kubernetes version: Get https://127.0.0.1/version: dial tcp 127.0.0.1:443: getsockopt: connection refused
Usage:
  channels apply channel [flags]

Flags:
  -f, --filename stringSlice   Apply from a local file
      --yes                    Apply update

Global Flags:
      --alsologtostderr                  log to standard error as well as files
      --config string                    config file (default is $HOME/.channels.yaml)
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --logtostderr                      log to standard error instead of files (default false)
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          log level for V logs (default 0)
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging


error querying kubernetes version: Get https://127.0.0.1/version: dial tcp 127.0.0.1:443: getsockopt: connection refused
I0603 23:20:50.783777       1 channels.go:50] apply channel output was: Error: error querying kubernetes version: Get https://127.0.0.1/version: dial tcp 127.0.0.1:443: getsockopt: connection refused
Usage:

And the kubelet fails to launch with error:

Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
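
That "cni config uninitialized" message generally means nothing has been written under /etc/cni/net.d on the node yet (here, presumably because the bootstrap channel never applied the Calico manifests); a quick check, assuming SSH access to the failing node:

# an empty directory here is why kubelet reports NetworkPluginNotReady
ls -l /etc/cni/net.d/

# kubelet's own view of the problem
journalctl -u kubelet --no-pager | grep -iE 'cni|network plugin'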

Current Cluster version: 1.5.2 (masters + nodes)
New Instance group node version: 1.6.4
Networking: Calico

Aside: given the backward compatibility of kube, it seems like the safest way to upgrade a cluster (or at least what I'm attempting) is the following (rough commands are sketched after this list):

  • Add new upgraded IGs to cluster with similar configuration as current IGs
  • Cordon the old IGs
  • Wait for a few natural deployment cycles so that things get rescheduled to the new IGs
  • Drain the remaining pods on the old IGs
  • Upgrade the masters
  • Delete the old IGs
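
A rough command-level version of that sequence, with placeholder IG, node and cluster names (flags vary by kops version, so this is a sketch rather than a definitive procedure):

# 1. create a new instance group (opens an editor to set version/image, machine type, subnets)
kops create ig nodes-v164 --name cluster.example.com

# 2-4. cordon the old nodes, let deployments reschedule naturally, then drain what is left
kubectl cordon <old-node>
kubectl drain <old-node> --ignore-daemonsets --delete-local-data

# 5. upgrade the masters via the normal rolling update
kops rolling-update cluster cluster.example.com --yes

# 6. remove the old instance group
kops delete ig nodes --name cluster.example.com --yes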

@jrnt30
Contributor Author

jrnt30 commented Sep 8, 2017

Closing, as others have discussed this and it's not particularly actionable.

jrnt30 closed this as completed Sep 8, 2017