
[Update 4.11 -> 4.12] Master node becomes unreachable and loses network connectivity #1657

Closed
elluvium opened this issue Jul 14, 2023 · 5 comments

Describe the bug
Upgrading OKD from 4.11.0-0.okd-2022-12-02-145640 to 4.12.0-0.okd-2023-03-18-084815 makes the first master node unreachable; the overall update process then fails, leaving the cluster unreachable.
Logs on the master node show that NetworkManager is unable to create all of the required interfaces, the node loses network connectivity, and as a result kubelet fails to start.

Jul 14 10:30:14 localhost.localdomain kubenswrapper[3874]: failed to run Kubelet: could not init cloud provider "aws": unable to determine AWS zone from cloud provider config or EC2 instance metadata: RequestError: send request failed
Jul 14 10:30:14 localhost.localdomain kubenswrapper[3874]: caused by: Get "http://169.254.169.254/latest/meta-data/placement/availability-zone": dial tcp 169.254.169.254:80: connect: network is unreachable
Jul 14 10:30:14 localhost.localdomain kubenswrapper[3874]: >
Jul 14 10:30:14 localhost.localdomain systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 10:30:14 localhost.localdomain systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 14 10:30:14 localhost.localdomain systemd[1]: Failed to start kubelet.service - Kubernetes Kubelet.
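
For anyone hitting the same state: the node has no network connectivity, so SSH is unavailable; the checks below assume out-of-band access such as the EC2 serial console (a sketch, not the exact commands used here):

# inspect kubelet and the interfaces on the broken master
journalctl -b -u kubelet.service --no-pager | tail -n 50
ip -br addr        # the OVS-managed interfaces fail to come up, per the NetworkManager log below
curl -m 5 http://169.254.169.254/latest/meta-data/placement/availability-zone   # fails while networking is down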

NetworkManager:

Jul 14 10:09:36 localhost.localdomain NetworkManager[964]: [1689329376.6711] settings: could not load plugin 'ifcfg-rh' from file '/usr/lib64/NetworkManager/1.40.10-1.fc37/libnm-settings-plugin-ifcfg-rh.so': No such file or directory
Jul 14 10:09:36 localhost.localdomain NetworkManager[964]: [1689329376.7220] device (ens5): device ens5 could not be added to a ovs port: disconnected from ovsdb
Jul 14 10:09:36 localhost.localdomain NetworkManager[964]: [1689329376.7240] device (ens5): Activation: failed for connection 'ovs-if-phys0'

OVSDB:

Jul 14 10:09:36 localhost.localdomain systemd[1]: Starting ovsdb-server.service - Open vSwitch Database Unit...
Jul 14 10:09:36 localhost.localdomain chown[1113]: /usr/bin/chown: invalid user: ‘openvswitch:hugetlbfs’
Jul 14 10:09:36 localhost.localdomain sh[1121]: /usr/bin/chown: invalid user: ‘openvswitch:hugetlbfs’
Jul 14 10:09:36 localhost.localdomain ovs-ctl[1173]: id: 'openvswitch': no such user
Jul 14 10:09:36 localhost.localdomain ovs-ctl[1180]: setpriv: failed to parse reuid: ''
Jul 14 10:09:36 localhost.localdomain ovs-ctl[1196]: install: invalid user 'openvswitch'
Jul 14 10:09:36 localhost.localdomain ovsdb-server[1204]: ovs|00001|daemon_unix|EMER|(null): user openvswitch not found, aborting.
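
The ovsdb-server messages above all point to the openvswitch system user missing on the host. A quick way to confirm this on the node (standard commands; the exact output to expect is an assumption):

# check whether the user/group that openvswitch expects still exist
getent passwd openvswitch
getent group hugetlbfs
systemctl status ovsdb-server ovs-vswitchd --no-pager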

Version
IPI on AWS, 4.12.0-0.okd-2023-03-18-084815, OVNKubernetes.

How reproducible
100% so far with the OVNKubernetes CNI. With OpenShiftSDN the update completes successfully.

Log bundle
https://drive.google.com/file/d/1cxP6CITJdMfixgDx6NirPxCK2ZSnHnb7/view?usp=sharing

vrutkovs reopened this Jul 15, 2023

elluvium commented Jul 21, 2023

Today I've tried to take another upgrade path starting from 4.11.0-0.okd-2022-08-20-022919 -> 4.11.0-0.okd-2022-10-28-153352 -> 4.11.0-0.okd-2022-12-02-145640 -> 4.11.0-0.okd-2023-01-14-152430 -> 4.12.0-0.okd-2023-04-16-041331

And during the last part (4.11.0-0.okd-2023-01-14-152430 -> 4.12.0-0.okd-2023-04-16-041331) I got exactly the same behavior, with a failing master node losing network connectivity. Additionally, I tried to replace the unhealthy etcd member (that master node) with a new one, but it had no effect at all: the next master node simply started failing (a rough sketch of the member replacement procedure follows the must-gather output below). Attaching another must-gather log (which got stuck on etcd):

oc adm must-gather
[must-gather ] OUT Using must-gather plug-in image: quay.io/openshift/okd-content@sha256:5b649183c0c550cdfd9f164a70c46f1e23b9e5a7e5af05fc6836bdd5280fbd79
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: 4e55b4d8-6222-4d3e-bf34-f79f6118350b
ClusterVersion: Updating to "4.12.0-0.okd-2023-04-16-041331" from "4.11.0-0.okd-2023-01-14-152430" for 2 hours: Working towards 4.12.0-0.okd-2023-04-16-041331: 105 of 837 done (12% complete), waiting up to 40 minutes on etcd, kube-apiserver
ClusterOperators:
clusteroperator/authentication is progressing: APIServerDeploymentProgressing: deployment/apiserver.openshift-oauth-apiserver: 1/3 pods have been updated to the latest generation
clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
clusteroperator/etcd is degraded because ClusterMemberControllerDegraded: unhealthy members found during reconciling members
EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-203-237.eu-central-1.compute.internal is unhealthy
NodeControllerDegraded: The master nodes not ready: node "ip-10-0-203-237.eu-central-1.compute.internal" not ready since 2023-07-21 15:57:30 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
clusteroperator/image-registry is progressing: Progressing: The registry is ready
NodeCADaemonProgressing: The daemon set node-ca is deploying node pods
clusteroperator/kube-apiserver is degraded because NodeControllerDegraded: The master nodes not ready: node "ip-10-0-203-237.eu-central-1.compute.internal" not ready since 2023-07-21 15:57:30 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
clusteroperator/kube-controller-manager is degraded because NodeControllerDegraded: The master nodes not ready: node "ip-10-0-203-237.eu-central-1.compute.internal" not ready since 2023-07-21 15:57:30 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
clusteroperator/kube-scheduler is degraded because NodeControllerDegraded: The master nodes not ready: node "ip-10-0-203-237.eu-central-1.compute.internal" not ready since 2023-07-21 15:57:30 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
clusteroperator/machine-config is degraded because Unable to apply 4.12.0-0.okd-2023-04-16-041331: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, pool master has not progressed to latest configuration: controller version mismatch for rendered-master-0f5921caf80e630a51bf91367ba3e4a0 expected 87fedee690ae487f8ae044ac416000172c9576a5 has b36482885ba1304e122e7c01c26cd671dfdd0418: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-8aa689552aaec6284345f6657be77c5e, retrying]
clusteroperator/network is progressing: DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" update is rolling out (2 out of 3 updated)
DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
DaemonSet "/openshift-multus/multus-additional-cni-plugins" is not available (awaiting 1 nodes)
clusteroperator/openshift-apiserver is progressing: APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/3 pods have been updated to the latest generation
clusteroperator/storage is progressing: AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
[must-gather ] OUT namespace/openshift-must-gather-tr7nz created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-fqphk created
W0721 19:05:41.687865 50018 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "gather", "copy" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "gather", "copy" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "gather", "copy" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "gather", "copy" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
[must-gather ] OUT pod for plug-in image quay.io/openshift/okd-content@sha256:5b649183c0c550cdfd9f164a70c46f1e23b9e5a7e5af05fc6836bdd5280fbd79 created
[must-gather-mwcgz] POD 2023-07-21T16:05:43.935601332Z Gathering data for ns/openshift-cluster-version...
[must-gather-mwcgz] POD 2023-07-21T16:05:44.579581563Z Gathering data for ns/default...
[must-gather-mwcgz] POD 2023-07-21T16:05:45.379331101Z Gathering data for ns/openshift...
[must-gather-mwcgz] POD 2023-07-21T16:05:45.739692498Z Gathering data for ns/kube-system...
[must-gather-mwcgz] POD 2023-07-21T16:05:46.226478513Z Gathering data for ns/openshift-etcd...
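For reference, replacing an unhealthy etcd member as attempted above roughly looks like the following; this is a sketch of the documented procedure, not necessarily the exact steps taken here (pod name and member ID are placeholders):

# exec into an etcd pod on a healthy master
oc -n openshift-etcd rsh etcd-<healthy-master>
etcdctl member list -w table
etcdctl member remove <ID-of-unhealthy-member>
exit
# then clean up the old member's secrets/machine and let a replacement master join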

From the Amazon EC2 instance status checks it's clear that the node became unreachable.

It looks like I'm missing something important here but cannot figure out what exactly.

UPD: worker nodes have updated successfully

ip-10-0-175-8.eu-central-1.compute.internal Ready master 6h54m v1.24.6+5658434
ip-10-0-203-237.eu-central-1.compute.internal NotReady,SchedulingDisabled master 6h54m v1.24.6+5658434
ip-10-0-237-138.eu-central-1.compute.internal Ready worker 6h43m v1.25.8+27e744f
ip-10-0-241-149.eu-central-1.compute.internal Ready worker 6h47m v1.25.8+27e744f
ip-10-0-245-72.eu-central-1.compute.internal Ready master 33m v1.24.6+5658434
ip-10-0-254-52.eu-central-1.compute.internal Ready,SchedulingDisabled worker 6h43m v1.25.8+27e744f


elluvium commented Sep 6, 2023

After further investigation, I have discovered the following:

  1. This behavior does not correlate with the type of network plugin (reproducible with both).
  2. It became quite clear that it correlates with the initial AMI the cluster was installed from (a sketch for checking which AMI each machine uses follows this list).
    • When the upgrade from 4.11 to 4.12 starts from a freshly installed 4.11.0-0.okd-2022-12-02-145640 cluster, it finishes to any 4.12 version without any issues.
    • When the upgrade starts from the old 4.11.0-0.okd-2022-08-20-022919 version, the whole chain up to the latest 4.11 updates successfully, but the subsequent update to any 4.12 version fails with the issue described above.
  3. Only master nodes are affected.
  4. With the OpenShiftSDN plugin, a possible root cause is the failing sdn pod on that node (in this case the node still reports its status as available).
  5. With OVNKubernetes the node loses network connectivity entirely (as described in my first message, its status becomes NotReady).
  6. An update from 4.11.0-0.okd-2022-08-20-022919 to 4.12 is not possible at the moment.
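
For reference, a minimal sketch of how one might check which AMI each machine was provisioned from (the jsonpath assumes the AWS providerSpec layout and may need adjusting; run the curl from the node itself):

# AMI recorded in the Machine objects
oc -n openshift-machine-api get machines \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerSpec.value.ami.id}{"\n"}{end}'
# AMI reported by the instance metadata service (run on the node)
curl -s http://169.254.169.254/latest/meta-data/ami-id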

As stated here:

Q: I upgraded OpenShift and noticed that my AMI hasn't changed, is this normal?
Yes, see: openshift/enhancements#201 (As well as the rest of this document - we do in-place updates without changing the bootimage).

@vrutkovs could it be an incompatibility between 4.11.0-0.okd-2022-08-20-022919 AMI OS bootimage and 4.12 release-image?


vrutkovs commented Sep 7, 2023

It certainly could, but during the upgrade we also update the OS (after the network plugin, however).

When an upgrade starts from an old 4.11.0-0.okd-2022-08-20-022919 version

Well, it's possible it's an issue, as this path has not been tested. 4.11.0-0.okd-2022-12-02-145640 is definitely upgradable to 4.12 (see the CI test results in "Upgrades to").
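
In case it helps others verify their own path: a couple of standard commands show the current version and what the cluster reports as available updates (a sketch; the output depends on the configured channel):

oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
oc adm upgrade        # lists the recommended/available updates for the current channel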


elluvium commented Sep 7, 2023

Thanks for the confirmation. We'll continue the investigation since we have to update our old cluster anyway. One idea is to replace all master nodes with instances built from the 4.11.0-0.okd-2022-12-02-145640 AMI before the update.
If you have any other advice on a workaround for this update, we would be really grateful.

@elluvium

So, manual AMI replacement on master nodes helped
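
For anyone landing here later: one way such a replacement could be done (a sketch only; names are placeholders, and this is not necessarily the exact procedure used here) is to clone an existing master Machine, point it at the older 4.11 bootimage AMI, and retire the old master once etcd is healthy again:

oc -n openshift-machine-api get machine <old-master> -o yaml > new-master.yaml
# edit new-master.yaml: set a new .metadata.name and the desired .spec.providerSpec.value.ami.id,
# and drop .status, .spec.providerID, resourceVersion and uid
oc apply -f new-master.yaml
# once the new node has joined and etcd is healthy, remove the old etcd member
# (see the sketch earlier in the thread) and delete the old machine
oc -n openshift-machine-api delete machine <old-master>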
