Rolling Upgrade 1.5 -> 1.6 Issues & Evaluation #2674

Closed
jrnt30 opened this issue Jun 2, 2017 · 4 comments

Comments

@jrnt30
Contributor

jrnt30 commented Jun 2, 2017

I wanted to provide an analysis of a variety of upgrade issues people have been seeing, which I believe are directly related to:

K8 Dropping Attributes: kubernetes/kubernetes#46073
DNS RS Issues: #2594
Weave issues: #2366

Process:

Currently, my understanding of the kops 1.5 -> 1.6 rolling upgrade is that it proceeds as follows (a command-level sketch follows the list):

  • Kops identifies a master to take out, then cordons and drains the node
  • Kops terminates the instance and the ASG creates a new one
  • As the new master node on 1.6 bootstraps itself, it registers itself with the master (still 1.5)
  • Kops determines there are several items from the bootstrap manifest that need to be updated (such as weave-net) and submits those to the cluster, which still has a 1.5 master
  • After some time, health checks pass and it moves on to the next node
  • Kops identifies the second master and cordons and drains that node as well
  • Kops terminates the instance and the ASG creates a second new node
  • The second "new" master on 1.6 bootstraps itself and registers with a master (most likely still 1.5, but not guaranteed)
  • Health checks pass and it moves on to the third node
  • Kops cordons, drains & terminates the third node
  • The third "new" master on 1.6 bootstraps itself and registers with a master (which is now 1.6)

Impact:

  • Since the first (and probably the second) 1.6 master node came up against a 1.5 master, several things occur (a quick way to verify this is sketched after this list)
  • Any of the 1.6-specific configuration attributes (such as tolerations) are either directly ignored or stripped out when the 1.5 master reconciles the state of the cluster (this can be seen by looking at the last-applied-configuration, as Justin noted and [we see as well](https://github.com/kubernetes/kops/issues/2605#issuecomment-305650544))
    -- There are no taints on the 1st and 2nd masters, but since the 3rd master comes up against a 1.6 cluster it does have them
    -- Any DaemonSet updates (such as weave-net) are missing their tolerations section
  • Since masters 1 and 2 have no taints, weave-net gets scheduled on them. Since master 3 does have the taints but the weave-net DaemonSet does not have the tolerations (they were stripped), the 3rd master will never have the DaemonSet pod placed on it.
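
A quick way to verify both symptoms from the list above (node name is a placeholder; weave-net lives in kube-system in a default kops install):

# do the masters actually carry the 1.6 NoSchedule taint?
kubectl describe node <master-node-name> | grep -i taints

# did the weave-net DaemonSet keep its tolerations after the 1.5 master reconciled it?
kubectl get ds weave-net -n kube-system -o yaml | grep -A 5 tolerations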

Potential Solution/Hack

I can see this happening with essentially every major release in which features graduate from annotations into "real" Kubernetes-supported attributes.

Due to the volume attachment for etcd, etc., I don't know that we could simply bring up a 4th node as the "new" master and force an election until that takes place; however, as a hack it may be possible to:

  • Bring up the first new node.
  • Run the new bootstrap manifests
  • Wait for it to become healthy
  • Force an election to favor the new node (I don't know the leadership election process well enough to know whether this is easy to force or just luck)
  • Taint the node and rerun the bootstrap manifests to force a re-evaluation against the new version, so the attributes are respected and maintained (a rough sketch follows this list)
  • Continue the rolling-update for additional nodes
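
A rough sketch of the taint-and-rerun step (node name and state store are placeholders; the taint key here follows the usual node-role.kubernetes.io/master convention and should be confirmed against what a master that bootstrapped under a 1.6 API server actually gets):

# manually re-apply the master taint the 1.5 control plane dropped
kubectl taint nodes <new-master-name> node-role.kubernetes.io/master=:NoSchedule

# re-run the bootstrap channel (normally done by protokube on the master) so the
# DaemonSets are re-evaluated against the 1.6 API and keep their tolerations
channels apply channel s3://<state-store>/addons/bootstrap-channel.yaml --v=4 --yes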

A more comprehensive solution which would be broader than Kops might be:

  • Have K8 always keep the equivalent of kubectl.kubernetes.io/last-applied-configuration for the resources
  • Have a validation loop that diffs the "initially applied" configuration to determine if there are structural changes it can now support which it could not before (such as the taints on the masters or the tolerations on the DaemonSets); a rough manual version of this diff is sketched below
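
A rough manual version of that diff today, using the weave-net DaemonSet as the example resource (bash, since it uses process substitution; the output will be noisy because the live object also carries status and defaulted fields):

# pull the recorded last-applied configuration off the DaemonSet
kubectl get ds weave-net -n kube-system \
  -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}' > last-applied.json

# export the live object and diff the two to spot fields (e.g. tolerations) the old API server dropped
kubectl get ds weave-net -n kube-system -o json > live.json
diff <(python -m json.tool last-applied.json) <(python -m json.tool live.json)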
@rdtr
Contributor

rdtr commented Jun 2, 2017

Possibly duplicate discussion?
kubernetes/kubernetes#46073

@jrnt30
Contributor Author

jrnt30 commented Jun 2, 2017

I believe they are related; I mentioned that one at the top, but I wanted to find a way to condense several kops-related issues and outline my findings.

@pnegahdar

pnegahdar commented Jun 4, 2017

I also attempted to add a 1.6 IG to a 1.5 cluster.

The initialization task fails with:

I0603 23:20:50.690862       1 kube_boot.go:156] ensuring that kubelet systemd service is running
I0603 23:20:50.714157       1 channels.go:47] checking channel: "s3://store/addons/bootstrap-channel.yaml"
I0603 23:20:50.714221       1 channels.go:34] Running command: channels apply channel s3://store/addons/bootstrap-channel.yaml --v=4 --yes
I0603 23:20:50.783737       1 channels.go:37] error running channels apply channel s3://strore/addons/bootstrap-channel.yaml --v=4 --yes:
I0603 23:20:50.783759       1 channels.go:38] Error: error querying kubernetes version: Get https://127.0.0.1/version: dial tcp 127.0.0.1:443: getsockopt: connection refused
Usage:
  channels apply channel [flags]

Flags:
  -f, --filename stringSlice   Apply from a local file
      --yes                    Apply update

Global Flags:
      --alsologtostderr                  log to standard error as well as files
      --config string                    config file (default is $HOME/.channels.yaml)
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --logtostderr                      log to standard error instead of files (default false)
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          log level for V logs (default 0)
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging


error querying kubernetes version: Get https://127.0.0.1/version: dial tcp 127.0.0.1:443: getsockopt: connection refused
I0603 23:20:50.783777       1 channels.go:50] apply channel output was: Error: error querying kubernetes version: Get https://127.0.0.1/version: dial tcp 127.0.0.1:443: getsockopt: connection refused
Usage:

And the kubelet fails to launch with error:

Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
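
That "cni config uninitialized" message generally means nothing has been written under /etc/cni/net.d on the node yet (here, presumably because the bootstrap channel never applied the Calico manifests); a quick check, assuming SSH access to the failing node:

# an empty directory here is why kubelet reports NetworkPluginNotReady
ls -l /etc/cni/net.d/

# kubelet's own view of the problem
journalctl -u kubelet --no-pager | grep -iE 'cni|network plugin'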

Current Cluster version: 1.5.2 (masters + nodes)
New Instance group node version: 1.6.4
Networking: Calico

Aside: given the backward compatibility of kube, it seems like the safest way to upgrade a cluster (or at least what I'm attempting) is the following (rough commands are sketched after this list):

  • Add new upgraded IGs to cluster with similar configuration as current IGs
  • Cordon the old IGs
  • Wait for a few natural deployment cycles so that things get rescheduled to the new IGs
  • Drain the remaining pods on the old IGs
  • Upgrade the masters
  • Delete the old IGs
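
A rough command-level version of that sequence, with placeholder IG, node and cluster names (flags vary by kops version, so this is a sketch rather than a definitive procedure):

# 1. create a new instance group (opens an editor to set version/image, machine type, subnets)
kops create ig nodes-v164 --name cluster.example.com

# 2-4. cordon the old nodes, let deployments reschedule naturally, then drain what is left
kubectl cordon <old-node>
kubectl drain <old-node> --ignore-daemonsets --delete-local-data

# 5. upgrade the masters via the normal rolling update
kops rolling-update cluster cluster.example.com --yes

# 6. remove the old instance group
kops delete ig nodes --name cluster.example.com --yes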

@jrnt30
Contributor Author

jrnt30 commented Sep 8, 2017

Closing, as others have discussed this and it's not particularly actionable.

jrnt30 closed this as completed Sep 8, 2017