tidb-operator unable to recover an unhealthy cluster even with manual revert #4946
Comments
hi, you can first scale in tidb to replicas: 2 so the cluster becomes normal; then the operator can proceed with the sync logic. After changing the affinity rules, you can scale tidb-2 back out if you want.
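For anyone following along, here is a minimal sketch of that scale-in step using client-go's dynamic client. It assumes a TidbCluster named test-cluster in the default namespace and a kubeconfig at the default location (both are placeholders); the same merge patch can also be applied with kubectl.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The TidbCluster custom resource managed by tidb-operator.
	gvr := schema.GroupVersionResource{Group: "pingcap.com", Version: "v1alpha1", Resource: "tidbclusters"}

	// Merge-patch spec.tidb.replicas down to 2 so the unschedulable pod goes
	// away and the cluster can become healthy again.
	patch := []byte(`{"spec":{"tidb":{"replicas":2}}}`)
	_, err = client.Resource(gvr).Namespace("default").Patch(
		context.TODO(), "test-cluster", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("scaled tidb to 2 replicas")
}
```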
Thanks @yiduoyunQ for the suggestion!
hi hoyhbx, this error message is reported from https://github.com/pingcap/tidb-operator/blob/v1.4.4/pkg/manager/member/tidb_upgrader.go#L123-L133, and it blocks the Operator from continuing. The current Operator sync logic syncs each component type one by one, in a fixed order, so we need a determinate condition to check whether it can continue syncing the next component; the component pod's status is used for that check. For example, if we change the pod's status check after … What do you think about this?
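To make that gate concrete, here is a simplified illustration (my own sketch, not the actual tidb_upgrader.go code): the pod upgraded in the previous round must be Ready before the operator touches the next one, otherwise the TidbCluster is requeued.

```go
package member

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isPodReady reports whether the pod's Ready condition is True.
func isPodReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// gateNextUpgrade blocks the rolling upgrade while the previously upgraded
// pod is not Ready; returning an error makes the controller requeue the
// TidbCluster, which is what produces the "is not ready, requeuing" log
// message quoted later in this issue.
func gateNextUpgrade(upgraded *corev1.Pod) error {
	if !isPodReady(upgraded) {
		return fmt.Errorf("upgraded tidb pod: [%s/%s] is not ready, requeuing", upgraded.Namespace, upgraded.Name)
	}
	return nil
}
```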
If we are already in the clear-upgrade logic at https://github.com/pingcap/tidb-operator/blob/v1.4.4/pkg/manager/member/tidb_member_manager.go#L305, do we still need to check the pod's status, or can we just let it start the upgrade? @hanlins
Yeah, I think it is good practice to wait for the component to get ready and then apply the next upgrade. The operator is doing a good job here.
I think the status of the Pod may contain some more useful information to distinguish the two cases. For example, if the Pod has been in …
Yes. If we want to apply the next upgrade, we must determine, for every unhealthy case, whether it's safe to proceed.
@csuzhangxc Do you think this would be a feature request for the StatefulSet controller? Having a field in the StatefulSet status would greatly help operators recognize a failed upgrade.
Yes, this may be helpful.
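Purely as an illustration of the kind of StatefulSet status field being suggested (nothing like this exists in Kubernetes today; every name below is hypothetical):

```go
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// RollingUpdateStatus is a hypothetical extension of StatefulSetStatus that
// would let operators detect a stuck rolling update without guessing from
// individual pod states.
type RollingUpdateStatus struct {
	// Stuck would be set by the StatefulSet controller when the update has
	// not progressed for longer than some deadline.
	Stuck bool `json:"stuck,omitempty"`
	// Reason would carry the blocking pod's condition reason, e.g. "Unschedulable".
	Reason string `json:"reason,omitempty"`
	// LastProgressTime would record the last time the update made progress.
	LastProgressTime metav1.Time `json:"lastProgressTime,omitempty"`
}
```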
@yiduoyunQ @csuzhangxc Can we have an intermediate fix for this bug before upstream Kubernetes reacts? We are drafting an issue to send to Kubernetes at the same time. A potential fix for this bug is to recognize the error condition that the cluster is under, and proceed to upgrade the StatefulSet if the Pods have been in the error condition for longer than a timeout. The symptom of this bug is that there are some pods in "Pending" state with error conditions, e.g.:

{
  "last_probe_time": null,
  "last_transition_time": "2023-04-02T12:10:37+00:00",
  "message": "0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.",
  "reason": "Unschedulable",
  "status": "False",
  "type": "PodScheduled"
}

Note that the condition has the last_transition_time field. We propose to add a condition branch at tidb-operator/pkg/manager/member/tidb_upgrader.go, lines 123 to 133 (commit 4b1c55c), that checks whether the previous upgrade has failed (e.g. whether the Pod has been in the error condition for longer than a timeout, computed from the last_transition_time field in the conditions). If the previous upgrade has failed, the operator can proceed to the next upgrade. The timeout can be set to a conservatively large number, or be a field in the CRD so users can configure it.
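A minimal sketch of what that proposed branch could look like (hypothetical code, not part of tidb-operator; the timeout value and function name are made up for illustration): if the blocking pod has been Unschedulable for longer than the timeout, treat the previous upgrade as failed and let the sync continue instead of requeuing indefinitely.

```go
package member

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// failedUpgradeTimeout is a placeholder; the proposal suggests either a
// conservatively large constant or a user-configurable field in the CRD.
const failedUpgradeTimeout = 30 * time.Minute

// previousUpgradeFailed reports whether the pod has been stuck with a
// PodScheduled=False/Unschedulable condition for longer than the timeout,
// judged from the condition's last_transition_time.
func previousUpgradeFailed(pod *corev1.Pod, now time.Time) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodScheduled &&
			c.Status == corev1.ConditionFalse &&
			c.Reason == corev1.PodReasonUnschedulable &&
			now.Sub(c.LastTransitionTime.Time) > failedUpgradeTimeout {
			return true
		}
	}
	return false
}
```

The readiness check in tidb_upgrader.go would then proceed to the next pod when this returns true for the blocking pod, rather than requeuing forever.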
@hoyhbx this is something like a force upgrade, right? I think adding a new field may be better and wouldn't break the existing behavior. We may also need to specify under which conditions this behavior takes effect.
@csuzhangxc, yes, a feature like …
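For concreteness, here is a hypothetical sketch of what such an opt-in field might look like in the TidbCluster spec (the field name and type are invented for illustration and do not exist in tidb-operator):

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// TiDBSpec is abbreviated here; only the hypothetical new field is shown.
type TiDBSpec struct {
	// ... existing fields ...

	// FailedUpgradeTimeout, if set, would let the operator treat a pod that
	// has been unschedulable for longer than this duration as a failed
	// upgrade and continue the rolling update instead of requeuing forever.
	// Leaving it unset would keep today's behavior.
	FailedUpgradeTimeout *metav1.Duration `json:"failedUpgradeTimeout,omitempty"`
}
```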
Bug Report
What version of Kubernetes are you using?
What version of TiDB Operator are you using?
pingcap/tidb-operator:v1.3.2
What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?
default
What's the status of the TiDB cluster pods?
What did you do?
We first had a healthy tidb cluster.
Then we changed the affinity rule for scheduling the tidb pods. The tidb StatefulSet was updated with the new affinity rule and a rolling upgrade was performed. We then realized that the new affinity rule could not be satisfied given the current topology of the cluster: one replica could never be scheduled (see the illustrative sketch below).
We then reverted the CR to remove the new affinity rule.
But the operator then failed to update the tidb StatefulSet, leaving the cluster in an unhealthy state.
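For context, a hypothetical example (the label key and value are invented for illustration) of the kind of required node affinity that becomes unsatisfiable when no node carries the requested label, which is what left one tidb replica Pending:

```go
package main

import corev1 "k8s.io/api/core/v1"

// unsatisfiableAffinity builds a required node affinity that no node in the
// cluster can satisfy, so any pod that gets it stays Pending/Unschedulable.
func unsatisfiableAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "disktype", // hypothetical label key
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"nvme"}, // no node has this label value
					}},
				}},
			},
		},
	}
}
```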
What did you expect to see?
tidb-operator should be able to recover the cluster after the manual revert.
What did you see instead?
tidb-operator refuses to update the StatefulSet until all replicas in the cluster become ready.
The log line indicating this behavior is:
tidb_cluster_controller.go:124] TidbCluster: default/test-cluster, still need sync: tidbcluster: [default/test-cluster]'s upgraded tidb pod: [test-cluster-tidb-2] is not ready, requeuing