kcp-controller stuck during worker cluster upgrade #8106
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label.
Hi @jelmersnoeck, do you have any insight into the root cause behind PR #7841 (🌱 Add configurable etcd call timeout · kubernetes-sigs/cluster-api)?
@jessehu we haven't root-caused this yet, but saw the exact same behavior.
To get more insight and be able to take a further look into this issue, it would be super helpful to have more information. Some first things we could take a look at are:
Thanks @chrischdi! Here's more information about the issue. KCP YAML:
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
creationTimestamp: "2023-02-01T22:11:00Z"
finalizers:
- kubeadm.controlplane.cluster.x-k8s.io
generation: 2
labels:
cluster.x-k8s.io/cluster-name: sks-e2e-test-owk3ls-k8s-upgrade
managedFields:
- apiVersion: controlplane.cluster.x-k8s.io/v1beta1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.: {}
v:"kubeadm.controlplane.cluster.x-k8s.io": {}
f:labels:
.: {}
f:cluster.x-k8s.io/cluster-name: {}
f:ownerReferences:
.: {}
k:{"uid":"697e015e-acfc-41ef-b29d-b87397eb0928"}: {}
manager: manager
operation: Update
time: "2023-02-01T22:11:29Z"
- apiVersion: controlplane.cluster.x-k8s.io/v1beta1
fieldsType: FieldsV1
fieldsV1:
f:spec:
.: {}
f:kubeadmConfigSpec:
.: {}
f:clusterConfiguration:
.: {}
f:apiServer:
.: {}
f:extraArgs:
.: {}
f:admission-control-config-file: {}
f:audit-log-maxage: {}
f:audit-log-maxbackup: {}
f:audit-log-maxsize: {}
f:audit-log-path: {}
f:audit-policy-file: {}
f:enable-admission-plugins: {}
f:profiling: {}
f:request-timeout: {}
f:tls-cipher-suites: {}
f:extraVolumes: {}
f:clusterName: {}
f:controllerManager:
.: {}
f:extraArgs:
.: {}
f:allocate-node-cidrs: {}
f:bind-address: {}
f:profiling: {}
f:terminated-pod-gc-threshold: {}
f:dns:
.: {}
f:imageRepository: {}
f:etcd: {}
f:imageRepository: {}
f:networking: {}
f:scheduler:
.: {}
f:extraArgs:
.: {}
f:bind-address: {}
f:profiling: {}
f:files: {}
f:initConfiguration:
.: {}
f:localAPIEndpoint: {}
f:nodeRegistration:
.: {}
f:kubeletExtraArgs:
.: {}
f:event-qps: {}
f:pod-infra-container-image: {}
f:protect-kernel-defaults: {}
f:tls-cipher-suites: {}
f:name: {}
f:joinConfiguration:
.: {}
f:discovery: {}
f:nodeRegistration:
.: {}
f:kubeletExtraArgs:
.: {}
f:event-qps: {}
f:pod-infra-container-image: {}
f:protect-kernel-defaults: {}
f:tls-cipher-suites: {}
f:name: {}
f:preKubeadmCommands: {}
f:useExperimentalRetryJoin: {}
f:users: {}
f:machineTemplate:
.: {}
f:infrastructureRef: {}
f:nodeDrainTimeout: {}
f:replicas: {}
f:rolloutStrategy:
.: {}
f:rollingUpdate:
.: {}
f:maxSurge: {}
f:type: {}
f:version: {}
manager: sks-manager
operation: Update
time: "2023-02-01T22:22:40Z"
- apiVersion: controlplane.cluster.x-k8s.io/v1beta1
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:conditions: {}
f:initialized: {}
f:observedGeneration: {}
f:ready: {}
f:readyReplicas: {}
f:replicas: {}
f:selector: {}
f:unavailableReplicas: {}
f:updatedReplicas: {}
f:version: {}
manager: manager
operation: Update
subresource: status
time: "2023-02-01T22:25:33Z"
name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane
namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz
ownerReferences:
- apiVersion: cluster.x-k8s.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: Cluster
name: sks-e2e-test-owk3ls-k8s-upgrade
uid: 697e015e-acfc-41ef-b29d-b87397eb0928
resourceVersion: "6649"
uid: 07081de2-ab3b-4aa0-b2d9-99f248d74810
spec:
kubeadmConfigSpec:
clusterConfiguration:
apiServer:
extraArgs:
admission-control-config-file: /etc/kubernetes/admission.yaml
audit-log-maxage: "30"
audit-log-maxbackup: "10"
audit-log-maxsize: "100"
audit-log-path: /var/log/apiserver/audit.log
audit-policy-file: /etc/kubernetes/auditpolicy.yaml
enable-admission-plugins: AlwaysPullImages,EventRateLimit
profiling: "false"
request-timeout: 300s
tls-cipher-suites: TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_3DES_EDE_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_256_GCM_SHA384
extraVolumes:
- hostPath: /var/log/apiserver
mountPath: /var/log/apiserver
name: apiserver-log
pathType: DirectoryOrCreate
- hostPath: /etc/kubernetes/admission.yaml
mountPath: /etc/kubernetes/admission.yaml
name: admission-config
pathType: FileOrCreate
readOnly: true
- hostPath: /etc/kubernetes/auditpolicy.yaml
mountPath: /etc/kubernetes/auditpolicy.yaml
name: audit-policy
pathType: FileOrCreate
readOnly: true
clusterName: sks-e2e-test-owk3ls-k8s-upgrade
controllerManager:
extraArgs:
allocate-node-cidrs: "false"
bind-address: 0.0.0.0
profiling: "false"
terminated-pod-gc-threshold: "10"
dns:
imageRepository: registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/coredns
etcd: {}
imageRepository: registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191
networking: {}
scheduler:
extraArgs:
bind-address: 0.0.0.0
profiling: "false"
files:
- content: |
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
name: kube-vip
namespace: kube-system
spec:
containers:
- args:
- manager
env:
- name: cp_enable
value: "true"
- name: vip_interface
value: ens4
- name: address
value: 10.255.24.5
- name: port
value: "6443"
- name: vip_arp
value: "true"
- name: vip_leaderelection
value: "true"
- name: vip_leaseduration
value: "15"
- name: vip_renewdeadline
value: "10"
- name: vip_retryperiod
value: "2"
image: registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/kube-vip/kube-vip:v0.5.7
imagePullPolicy: IfNotPresent
name: kube-vip
resources: {}
securityContext:
capabilities:
add:
- NET_ADMIN
- NET_RAW
volumeMounts:
- mountPath: /etc/kubernetes/admin.conf
name: kubeconfig
hostAliases:
- hostnames:
- kubernetes
ip: 127.0.0.1
hostNetwork: true
volumes:
- hostPath:
path: /etc/kubernetes/admin.conf
type: FileOrCreate
name: kubeconfig
status: {}
owner: root:root
path: /etc/kubernetes/manifests/kube-vip.yaml
- content: |
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: EventRateLimit
configuration:
apiVersion: eventratelimit.admission.k8s.io/v1alpha1
kind: Configuration
limits:
- type: Server
burst: 20000
qps: 5000
owner: root:root
path: /etc/kubernetes/admission.yaml
- content: |
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: None
userGroups:
- system:nodes
- level: None
users:
- system:kube-scheduler
- system:volume-scheduler
- system:kube-controller-manager
- level: None
nonResourceURLs:
- /healthz*
- /version
- /swagger*
- level: Metadata
resources:
- resources: ["secrets", "configmaps", "tokenreviews"]
- level: Metadata
omitStages:
- RequestReceived
resources:
- resources: ["pods", "deployments"]
owner: root:root
path: /etc/kubernetes/auditpolicy.yaml
format: cloud-config
initConfiguration:
localAPIEndpoint: {}
nodeRegistration:
kubeletExtraArgs:
event-qps: "0"
pod-infra-container-image: registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/pause:3.6
protect-kernel-defaults: "true"
tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
name: '{{ ds.meta_data.hostname }}'
joinConfiguration:
discovery: {}
nodeRegistration:
kubeletExtraArgs:
event-qps: "0"
pod-infra-container-image: registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/pause:3.6
protect-kernel-defaults: "true"
tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
name: '{{ ds.meta_data.hostname }}'
preKubeadmCommands:
- if [ -d /etc/sysconfig/network-scripts/ ]; then
- echo DEFROUTE=no | tee -a /etc/sysconfig/network-scripts/ifcfg-* > /dev/null
- echo PEERDNS=no | tee -a /etc/sysconfig/network-scripts/ifcfg-* > /dev/null
- sed -i '/DEFROUTE\|PEERDNS/d' /etc/sysconfig/network-scripts/ifcfg-ens4
- nmcli connection reload && nmcli networking off && nmcli networking on
- fi
- '[ -f "/run/kubeadm/kubeadm.yaml" ] && printf "\n---\napiVersion: kubeproxy.config.k8s.io/v1alpha1\nkind:
KubeProxyConfiguration\nmetricsBindAddress: 0.0.0.0:10249\n" >> /run/kubeadm/kubeadm.yaml'
- hostname "{{ ds.meta_data.hostname }}"
- echo "::1 ipv6-localhost ipv6-loopback" >/etc/hosts
- echo "127.0.0.1 localhost" >>/etc/hosts
- echo "127.0.0.1 {{ ds.meta_data.hostname }}" >>/etc/hosts
- echo "{{ ds.meta_data.hostname }}" >/etc/hostname
- sed -i 's|".*/pause.*"|"registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/pause:3.6"|'
/etc/containerd/config.toml
- systemctl restart containerd
- while ! systemctl status containerd; do sleep 1; done
useExperimentalRetryJoin: true
users:
- lockPassword: false
name: smartx
passwd: $6$fbz1r5cI$ISJj4xg/DSGLlsIHYWFyuhrSkAtX76AH7RTRTs7h/WFbhVazjq51zNQGsuWYu1JsF1tfYCLCbo9xD3Nr8UY5w0
sudo: ALL=(ALL) NOPASSWD:ALL
machineTemplate:
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: ElfMachineTemplate
name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-v1-23-14-c7w63y
namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz
metadata: {}
nodeDrainTimeout: 5m0s
replicas: 3
rolloutStrategy:
rollingUpdate:
maxSurge: 1
type: RollingUpdate
version: v1.23.14
status:
conditions:
- lastTransitionTime: "2023-02-01T22:22:46Z"
message: Rolling 3 replicas with outdated spec (1 replicas up to date)
reason: RollingUpdateInProgress
severity: Warning
status: "False"
type: Ready
- lastTransitionTime: "2023-02-01T22:12:59Z"
status: "True"
type: Available
- lastTransitionTime: "2023-02-01T22:11:31Z"
status: "True"
type: CertificatesAvailable
- lastTransitionTime: "2023-02-01T22:25:33Z"
status: "True"
type: ControlPlaneComponentsHealthy
- lastTransitionTime: "2023-02-01T22:25:33Z"
status: "True"
type: EtcdClusterHealthy
- lastTransitionTime: "2023-02-01T22:17:30Z"
status: "True"
type: MachinesCreated
- lastTransitionTime: "2023-02-01T22:25:09Z"
status: "True"
type: MachinesReady
- lastTransitionTime: "2023-02-01T22:22:46Z"
message: Rolling 3 replicas with outdated spec (1 replicas up to date)
reason: RollingUpdateInProgress
severity: Warning
status: "False"
type: MachinesSpecUpToDate
- lastTransitionTime: "2023-02-01T22:22:43Z"
message: Scaling down control plane to 3 replicas (actual 4)
reason: ScalingDown
severity: Warning
status: "False"
type: Resized
initialized: true
observedGeneration: 2
ready: true
readyReplicas: 3
replicas: 4
selector: cluster.x-k8s.io/cluster-name=sks-e2e-test-owk3ls-k8s-upgrade,cluster.x-k8s.io/control-plane
unavailableReplicas: 1
updatedReplicas: 1
version: v1.22.16

4 Machines (3 low version + 1 high version)

low version Machine1:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
annotations:
controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: '{"etcd":{},"networking":{},"apiServer":{"extraArgs":{"admission-control-config-file":"/etc/kubernetes/admission.yaml","audit-log-maxage":"30","audit-log-maxbackup":"10","audit-log-maxsize":"100","audit-log-path":"/var/log/apiserver/audit.log","audit-policy-file":"/etc/kubernetes/auditpolicy.yaml","enable-admission-plugins":"AlwaysPullImages,EventRateLimit","profiling":"false","request-timeout":"300s","tls-cipher-suites":"TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_3DES_EDE_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_256_GCM_SHA384"},"extraVolumes":[{"name":"apiserver-log","hostPath":"/var/log/apiserver","mountPath":"/var/log/apiserver","pathType":"DirectoryOrCreate"},{"name":"admission-config","hostPath":"/etc/kubernetes/admission.yaml","mountPath":"/etc/kubernetes/admission.yaml","readOnly":true,"pathType":"FileOrCreate"},{"name":"audit-policy","hostPath":"/etc/kubernetes/auditpolicy.yaml","mountPath":"/etc/kubernetes/auditpolicy.yaml","readOnly":true,"pathType":"FileOrCreate"}]},"controllerManager":{"extraArgs":{"allocate-node-cidrs":"false","bind-address":"0.0.0.0","profiling":"false","terminated-pod-gc-threshold":"10"}},"scheduler":{"extraArgs":{"bind-address":"0.0.0.0","profiling":"false"}},"dns":{"imageRepository":"registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/coredns"},"imageRepository":"registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191","clusterName":"sks-e2e-test-owk3ls-k8s-upgrade"}'
creationTimestamp: "2023-02-01T22:11:32Z"
finalizers:
- machine.cluster.x-k8s.io
generation: 3
labels:
cluster.x-k8s.io/cluster-name: sks-e2e-test-owk3ls-k8s-upgrade
cluster.x-k8s.io/control-plane: ""
cluster.x-k8s.io/control-plane-name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane
managedFields:
- apiVersion: cluster.x-k8s.io/v1beta1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: {}
f:finalizers:
.: {}
v:"machine.cluster.x-k8s.io": {}
f:labels:
.: {}
f:cluster.x-k8s.io/cluster-name: {}
f:cluster.x-k8s.io/control-plane: {}
f:cluster.x-k8s.io/control-plane-name: {}
f:ownerReferences:
.: {}
k:{"uid":"07081de2-ab3b-4aa0-b2d9-99f248d74810"}: {}
f:spec:
.: {}
f:bootstrap:
.: {}
f:configRef: {}
f:dataSecretName: {}
f:clusterName: {}
f:infrastructureRef: {}
f:nodeDrainTimeout: {}
f:providerID: {}
f:version: {}
manager: manager
operation: Update
time: "2023-02-01T22:13:56Z"
- apiVersion: cluster.x-k8s.io/v1beta1
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:addresses: {}
f:bootstrapReady: {}
f:conditions: {}
f:infrastructureReady: {}
f:lastUpdated: {}
f:nodeInfo:
.: {}
f:architecture: {}
f:bootID: {}
f:containerRuntimeVersion: {}
f:kernelVersion: {}
f:kubeProxyVersion: {}
f:kubeletVersion: {}
f:machineID: {}
f:operatingSystem: {}
f:osImage: {}
f:systemUUID: {}
f:nodeRef: {}
f:observedGeneration: {}
f:phase: {}
manager: manager
operation: Update
subresource: status
time: "2023-02-01T22:29:05Z"
name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-9cg5m
namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz
ownerReferences:
- apiVersion: controlplane.cluster.x-k8s.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: KubeadmControlPlane
name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane
uid: 07081de2-ab3b-4aa0-b2d9-99f248d74810
resourceVersion: "7816"
uid: b7c17f00-18cd-42e6-910b-cae9a4983f5d
spec:
bootstrap:
configRef:
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfig
name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-hx9ps
namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz
uid: 973f9b62-6e3b-45fb-9fab-93c18bbdd4a6
dataSecretName: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-hx9ps
clusterName: sks-e2e-test-owk3ls-k8s-upgrade
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: ElfMachine
name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-qtrrf
namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz
uid: 054600bd-e693-4017-ae17-620280cf3abb
nodeDeletionTimeout: 10s
nodeDrainTimeout: 5m0s
providerID: elf://6b4dfb57-e3fb-4d67-bf5c-c5d87d92eb50
version: v1.22.16
status:
addresses:
- address: 10.255.1.149
type: InternalIP
- address: 240.255.0.1
type: InternalIP
- address: 10.244.7.146
type: InternalIP
- address: 100.64.254.254
type: InternalIP
- address: 10.255.35.0
type: InternalIP
- address: 100.64.254.254
type: InternalIP
- address: 10.255.35.1
type: InternalIP
- address: 100.64.254.254
type: InternalIP
- address: 10.255.35.9
type: InternalIP
- address: 100.64.254.254
type: InternalIP
- address: 10.255.35.34
type: InternalIP
bootstrapReady: true
conditions:
- lastTransitionTime: "2023-02-01T22:13:56Z"
status: "True"
type: Ready
- lastTransitionTime: "2023-02-01T22:17:13Z"
status: "True"
type: APIServerPodHealthy
- lastTransitionTime: "2023-02-01T22:11:32Z"
status: "True"
type: BootstrapReady
- lastTransitionTime: "2023-02-01T22:13:57Z"
status: "True"
type: ControllerManagerPodHealthy
- lastTransitionTime: "2023-02-01T22:13:57Z"
status: "True"
type: EtcdMemberHealthy
- lastTransitionTime: "2023-02-01T22:13:57Z"
status: "True"
type: EtcdPodHealthy
- lastTransitionTime: "2023-02-01T22:13:56Z"
status: "True"
type: InfrastructureReady
- lastTransitionTime: "2023-02-01T22:13:56Z"
status: "True"
type: NodeHealthy
- lastTransitionTime: "2023-02-01T22:13:57Z"
status: "True"
type: SchedulerPodHealthy
infrastructureReady: true
lastUpdated: "2023-02-01T22:13:56Z"
nodeInfo:
architecture: amd64
bootID: db973ed3-7a84-480c-a62f-4f0f4d3893e6
containerRuntimeVersion: containerd://1.6.4
kernelVersion: 4.18.0-372.9.1.el8.x86_64
kubeProxyVersion: v1.22.16
kubeletVersion: v1.22.16
machineID: 89e1bbf094cb41519341c3040fbef924
operatingSystem: linux
osImage: Rocky Linux 8.6 (Green Obsidian)
systemUUID: 89e1bbf0-94cb-4151-9341-c3040fbef924
nodeRef:
apiVersion: v1
kind: Node
name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-qtrrf
uid: c9e33a7c-e9e3-4fee-9569-70906836fe6c
observedGeneration: 3
phase: Running

low version Machine2:
low version Machine3:
high version Machine:
etcd member
the etcd leader log:
I can see that the etcd leader received the request to transfer leadership, but I don't see any relevant logs showing that the transfer was executed.
To find the real root cause, I think it would be necessary to go down to the etcd level and try to debug it there (e.g. check whether the leader can be moved manually, or whether the disk is fast enough to run etcd, which could be a cause). From the CAPI perspective: I think #7841 could already improve this particular issue. However, it would be interesting to see whether it solves this issue by retrying the leader move (which would happen via #7841 if I understood it correctly; at least the KCP controller would surface a timeout error and retry).
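For reference, a minimal sketch of such a manual check, assuming direct access to the workload cluster's etcd and the go.etcd.io/etcd/client/v3 package. The endpoint, TLS setup, and choice of transferee member are placeholders for illustration, not values taken from this cluster:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; a real check would use the etcd client URL and
	// TLS certificates of one of the control plane machines, and the
	// MoveLeader request must go to the current leader.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Bound every call so it errors out instead of hanging indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatalf("member list blocked or failed: %v", err)
	}
	for _, m := range members.Members {
		fmt.Printf("member %x: %s\n", m.ID, m.Name)
	}

	// Pick a transferee from the list above (placeholder: the first member).
	transfereeID := members.Members[0].ID
	if _, err := cli.MoveLeader(ctx, transfereeID); err != nil {
		log.Fatalf("move-leader did not complete: %v", err)
	}
	fmt.Println("leadership transferred")
}

If this manual move also hangs or times out, the problem is more likely on the etcd side (e.g. slow disk or an unhealthy member) than in KCP itself.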
As an FYI, going back through some occurrences of this, we've seen etcd client calls block in multiple spots. The initial issue we ran into was on the etcd members call.
/triage needs-information
@fabriziopandini we didn't find anything pointing us to a concrete issue and haven't investigated this further. With the etcd client timeouts in place, this hasn't come up for us again. I'm not sure if @huaqing1994 has found any further leads?
Thanks for the feedback. We can re-open this or open a new issue if new evidence is collected.
@fabriziopandini: Closing this issue. In response to this:
What steps did you take and what happened:
I've run into an occasional issue in our CI environment (about 3 times a month):
We use CAPI to create and upgrade a worker cluster (3 masters + 3 workers). I've found that the upgrade occasionally gets stuck after the first new control plane node is created successfully. Once the problem occurs, kcp-controller no longer prints any logs about this cluster.
The goroutine stack trace of kcp-controller shows that a certain reconcile has been blocked in etcd MoveLeader().
I found a related PR, #7841, which fixes kcp-controller getting stuck.
But I'd like to understand the reason for and the mechanism behind this problem. Has anyone else encountered it?
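As a rough illustration of that failure mode (a sketch only, not the actual KCP controller code): an etcd maintenance call made with a context that has no deadline can block the reconcile forever if the server never responds. Bounding the call with context.WithTimeout, which is the general idea behind the configurable etcd call timeout in #7841, turns the hang into an error the controller can log and retry. Endpoint and member ID below are placeholders:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// moveLeaderBounded attempts the leadership transfer with an upper bound on
// how long the caller is willing to wait; on timeout it returns a
// context.DeadlineExceeded error instead of blocking indefinitely.
func moveLeaderBounded(cli *clientv3.Client, transfereeID uint64, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	_, err := cli.MoveLeader(ctx, transfereeID)
	return err
}

func main() {
	// Placeholder endpoint, for illustration only.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Placeholder member ID; in KCP this would be the candidate etcd member.
	if err := moveLeaderBounded(cli, 0xdeadbeef, 30*time.Second); err != nil {
		log.Printf("leader move did not complete, will retry on the next reconcile: %v", err)
	}
}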
Environment:
- Kubernetes version (use kubectl version): 1.25.0
- OS (e.g. from /etc/os-release): CentOS 8

/kind bug