
[Update 4.11 -> 4.12] Master node becomes unreachable and loses network connectivity #1657

Closed
elluvium opened this issue Jul 14, 2023 · 5 comments

Describe the bug
Upgrading OKD from 4.11.0-0.okd-2022-12-02-145640 to 4.12.0-0.okd-2023-03-18-084815 makes the first master node unreachable; the overall update process then fails, leaving the cluster unreachable.
Logs on the master node show that NetworkManager is unable to create all of the required interfaces, the node loses network connectivity, and as a result kubelet fails to start.

Jul 14 10:30:14 localhost.localdomain kubenswrapper[3874]: failed to run Kubelet: could not init cloud provider "aws": unable to determine AWS zone from cloud provider config or EC2 instance metadata: RequestError: send request failed
Jul 14 10:30:14 localhost.localdomain kubenswrapper[3874]: caused by: Get "http://169.254.169.254/latest/meta-data/placement/availability-zone": dial tcp 169.254.169.254:80: connect: network is unreachable
Jul 14 10:30:14 localhost.localdomain kubenswrapper[3874]: >
Jul 14 10:30:14 localhost.localdomain systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 10:30:14 localhost.localdomain systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 14 10:30:14 localhost.localdomain systemd[1]: Failed to start kubelet.service - Kubernetes Kubelet.
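
For anyone hitting the same state: the node has no network connectivity, so SSH is unavailable; the checks below assume out-of-band access such as the EC2 serial console (a sketch, not the exact commands used here):

# inspect kubelet and the interfaces on the broken master
journalctl -b -u kubelet.service --no-pager | tail -n 50
ip -br addr        # the OVS-managed interfaces fail to come up, per the NetworkManager log below
curl -m 5 http://169.254.169.254/latest/meta-data/placement/availability-zone   # fails while networking is down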

NetworkManager:

Jul 14 10:09:36 localhost.localdomain NetworkManager[964]: [1689329376.6711] settings: could not load plugin 'ifcfg-rh' from file '/usr/lib64/NetworkManager/1.40.10-1.fc37/libnm-settings-plugin-ifcfg-rh.so': No such file or directory
Jul 14 10:09:36 localhost.localdomain NetworkManager[964]: [1689329376.7220] device (ens5): device ens5 could not be added to a ovs port: disconnected from ovsdb
Jul 14 10:09:36 localhost.localdomain NetworkManager[964]: [1689329376.7240] device (ens5): Activation: failed for connection 'ovs-if-phys0'

OVSDB:

Jul 14 10:09:36 localhost.localdomain systemd[1]: Starting ovsdb-server.service - Open vSwitch Database Unit...
Jul 14 10:09:36 localhost.localdomain chown[1113]: /usr/bin/chown: invalid user: ‘openvswitch:hugetlbfs’
Jul 14 10:09:36 localhost.localdomain sh[1121]: /usr/bin/chown: invalid user: ‘openvswitch:hugetlbfs’
Jul 14 10:09:36 localhost.localdomain ovs-ctl[1173]: id: 'openvswitch': no such user
Jul 14 10:09:36 localhost.localdomain ovs-ctl[1180]: setpriv: failed to parse reuid: ''
Jul 14 10:09:36 localhost.localdomain ovs-ctl[1196]: install: invalid user 'openvswitch'
Jul 14 10:09:36 localhost.localdomain ovsdb-server[1204]: ovs|00001|daemon_unix|EMER|(null): user openvswitch not found, aborting.
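
The ovsdb-server messages above all point to the openvswitch system user missing on the host. A quick way to confirm this on the node (standard commands; the exact output to expect is an assumption):

# check whether the user/group that openvswitch expects still exist
getent passwd openvswitch
getent group hugetlbfs
systemctl status ovsdb-server ovs-vswitchd --no-pager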

Version
IPI on AWS, 4.12.0-0.okd-2023-03-18-084815, OVNKubernetes.

How reproducible
100% so far with the OVNKubernetes CNI. With OpenShiftSDN the update completes successfully.

Log bundle
https://drive.google.com/file/d/1cxP6CITJdMfixgDx6NirPxCK2ZSnHnb7/view?usp=sharing

vrutkovs reopened this Jul 15, 2023

elluvium commented Jul 21, 2023

Today I've tried to take another upgrade path starting from 4.11.0-0.okd-2022-08-20-022919 -> 4.11.0-0.okd-2022-10-28-153352 -> 4.11.0-0.okd-2022-12-02-145640 -> 4.11.0-0.okd-2023-01-14-152430 -> 4.12.0-0.okd-2023-04-16-041331

And during the last part (4.11.0-0.okd-2023-01-14-152430 -> 4.12.0-0.okd-2023-04-16-041331) I got exactly the same behavior, with a failing master node losing network connectivity. Additionally, I tried to replace the unhealthy etcd member (that master node) with a new one, but it had no effect at all: the next master node simply started failing (a rough sketch of the member replacement procedure follows the must-gather output below). Attaching another must-gather log (which got stuck on etcd):

oc adm must-gather
[must-gather ] OUT Using must-gather plug-in image: quay.io/openshift/okd-content@sha256:5b649183c0c550cdfd9f164a70c46f1e23b9e5a7e5af05fc6836bdd5280fbd79
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: 4e55b4d8-6222-4d3e-bf34-f79f6118350b
ClusterVersion: Updating to "4.12.0-0.okd-2023-04-16-041331" from "4.11.0-0.okd-2023-01-14-152430" for 2 hours: Working towards 4.12.0-0.okd-2023-04-16-041331: 105 of 837 done (12% complete), waiting up to 40 minutes on etcd, kube-apiserver
ClusterOperators:
clusteroperator/authentication is progressing: APIServerDeploymentProgressing: deployment/apiserver.openshift-oauth-apiserver: 1/3 pods have been updated to the latest generation
clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
clusteroperator/etcd is degraded because ClusterMemberControllerDegraded: unhealthy members found during reconciling members
EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-203-237.eu-central-1.compute.internal is unhealthy
NodeControllerDegraded: The master nodes not ready: node "ip-10-0-203-237.eu-central-1.compute.internal" not ready since 2023-07-21 15:57:30 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
clusteroperator/image-registry is progressing: Progressing: The registry is ready
NodeCADaemonProgressing: The daemon set node-ca is deploying node pods
clusteroperator/kube-apiserver is degraded because NodeControllerDegraded: The master nodes not ready: node "ip-10-0-203-237.eu-central-1.compute.internal" not ready since 2023-07-21 15:57:30 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
clusteroperator/kube-controller-manager is degraded because NodeControllerDegraded: The master nodes not ready: node "ip-10-0-203-237.eu-central-1.compute.internal" not ready since 2023-07-21 15:57:30 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
clusteroperator/kube-scheduler is degraded because NodeControllerDegraded: The master nodes not ready: node "ip-10-0-203-237.eu-central-1.compute.internal" not ready since 2023-07-21 15:57:30 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
clusteroperator/machine-config is degraded because Unable to apply 4.12.0-0.okd-2023-04-16-041331: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, pool master has not progressed to latest configuration: controller version mismatch for rendered-master-0f5921caf80e630a51bf91367ba3e4a0 expected 87fedee690ae487f8ae044ac416000172c9576a5 has b36482885ba1304e122e7c01c26cd671dfdd0418: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-8aa689552aaec6284345f6657be77c5e, retrying]
clusteroperator/network is progressing: DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" update is rolling out (2 out of 3 updated)
DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
DaemonSet "/openshift-multus/multus-additional-cni-plugins" is not available (awaiting 1 nodes)
clusteroperator/openshift-apiserver is progressing: APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/3 pods have been updated to the latest generation
clusteroperator/storage is progressing: AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
[must-gather ] OUT namespace/openshift-must-gather-tr7nz created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-fqphk created
W0721 19:05:41.687865 50018 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "gather", "copy" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "gather", "copy" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "gather", "copy" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "gather", "copy" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
[must-gather ] OUT pod for plug-in image quay.io/openshift/okd-content@sha256:5b649183c0c550cdfd9f164a70c46f1e23b9e5a7e5af05fc6836bdd5280fbd79 created
[must-gather-mwcgz] POD 2023-07-21T16:05:43.935601332Z Gathering data for ns/openshift-cluster-version...
[must-gather-mwcgz] POD 2023-07-21T16:05:44.579581563Z Gathering data for ns/default...
[must-gather-mwcgz] POD 2023-07-21T16:05:45.379331101Z Gathering data for ns/openshift...
[must-gather-mwcgz] POD 2023-07-21T16:05:45.739692498Z Gathering data for ns/kube-system...
[must-gather-mwcgz] POD 2023-07-21T16:05:46.226478513Z Gathering data for ns/openshift-etcd...
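For reference, replacing an unhealthy etcd member as attempted above roughly looks like the following; this is a sketch of the documented procedure, not necessarily the exact steps taken here (pod name and member ID are placeholders):

# exec into an etcd pod on a healthy master
oc -n openshift-etcd rsh etcd-<healthy-master>
etcdctl member list -w table
etcdctl member remove <ID-of-unhealthy-member>
exit
# then clean up the old member's secrets/machine and let a replacement master join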

From the Amazon EC2 instance status checks it's clear that the node became unreachable.

It looks like I'm missing something important here but cannot figure out what exactly.

UPD: worker nodes have updated successfully

ip-10-0-175-8.eu-central-1.compute.internal Ready master 6h54m v1.24.6+5658434
ip-10-0-203-237.eu-central-1.compute.internal NotReady,SchedulingDisabled master 6h54m v1.24.6+5658434
ip-10-0-237-138.eu-central-1.compute.internal Ready worker 6h43m v1.25.8+27e744f
ip-10-0-241-149.eu-central-1.compute.internal Ready worker 6h47m v1.25.8+27e744f
ip-10-0-245-72.eu-central-1.compute.internal Ready master 33m v1.24.6+5658434
ip-10-0-254-52.eu-central-1.compute.internal Ready,SchedulingDisabled worker 6h43m v1.25.8+27e744f


elluvium commented Sep 6, 2023

After further investigation, I have discovered the following:

  1. This behavior does not correlate with the type of network plugin (reproducible with both).
  2. It became quite clear that it correlates with the initial AMI the cluster was installed from (a sketch for checking which AMI each machine uses follows this list).
    • When the upgrade from 4.11 to 4.12 starts from a freshly installed 4.11.0-0.okd-2022-12-02-145640 cluster, it finishes to any 4.12 version without any issues.
    • When the upgrade starts from the old 4.11.0-0.okd-2022-08-20-022919 version, the whole chain up to the latest 4.11 updates successfully, but the subsequent update to any 4.12 version fails with the issue described above.
  3. Only master nodes are affected.
  4. With the OpenShiftSDN plugin, a possible root cause is the failing sdn pod on that node (in this case the node still reports its status as available).
  5. With OVNKubernetes the node loses network connectivity entirely (as described in my first message, its status becomes NotReady).
  6. An update from 4.11.0-0.okd-2022-08-20-022919 to 4.12 is not possible at the moment.
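
For reference, a minimal sketch of how one might check which AMI each machine was provisioned from (the jsonpath assumes the AWS providerSpec layout and may need adjusting; run the curl from the node itself):

# AMI recorded in the Machine objects
oc -n openshift-machine-api get machines \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerSpec.value.ami.id}{"\n"}{end}'
# AMI reported by the instance metadata service (run on the node)
curl -s http://169.254.169.254/latest/meta-data/ami-id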

As stated here:

Q: I upgraded OpenShift and noticed that my AMI hasn't changed, is this normal?
Yes, see: openshift/enhancements#201 (As well as the rest of this document - we do in-place updates without changing the bootimage).

@vrutkovs could it be an incompatibility between 4.11.0-0.okd-2022-08-20-022919 AMI OS bootimage and 4.12 release-image?


vrutkovs commented Sep 7, 2023

It certainly could, but during the upgrade we also update the OS (after the network plugin, however).

When an upgrade starts from an old 4.11.0-0.okd-2022-08-20-022919 version

Well, it's possible it's an issue, as this path has not been tested. 4.11.0-0.okd-2022-12-02-145640 is definitely upgradable to 4.12 (see the CI test results in "Upgrades to").
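
In case it helps others verify their own path: a couple of standard commands show the current version and what the cluster reports as available updates (a sketch; the output depends on the configured channel):

oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
oc adm upgrade        # lists the recommended/available updates for the current channel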


elluvium commented Sep 7, 2023

Thanks for the confirmation. We'll continue the investigation since we have to update our old cluster anyway. One idea is to replace all master nodes with instances built from the 4.11.0-0.okd-2022-12-02-145640 AMI before the update.
If you have any other advice on a workaround for this update, we would be really grateful.

@elluvium

So, manual AMI replacement on master nodes helped
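
For anyone landing here later: one way such a replacement could be done (a sketch only; names are placeholders, and this is not necessarily the exact procedure used here) is to clone an existing master Machine, point it at the older 4.11 bootimage AMI, and retire the old master once etcd is healthy again:

oc -n openshift-machine-api get machine <old-master> -o yaml > new-master.yaml
# edit new-master.yaml: set a new .metadata.name and the desired .spec.providerSpec.value.ami.id,
# and drop .status, .spec.providerID, resourceVersion and uid
oc apply -f new-master.yaml
# once the new node has joined and etcd is healthy, remove the old etcd member
# (see the sketch earlier in the thread) and delete the old machine
oc -n openshift-machine-api delete machine <old-master>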
