Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tidb-operator unable to recover an unhealthy cluster even with manual revert #4946

Open
hoyhbx opened this issue Mar 28, 2023 · 11 comments
Open

Comments

@hoyhbx
Copy link
Contributor

hoyhbx commented Mar 28, 2023

Bug Report

What version of Kubernetes are you using?

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9", GitCommit:"6df4433e288edc9c40c2e344eb336f63fad45cd2", GitTreeState:"clean", BuildDate:"2022-05-19T19:53:08Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using?

pingcap/tidb-operator:v1.3.2

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

default

What's the status of the TiDB cluster pods?

NAME                                      READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
test-cluster-discovery-779bb58fc7-q2kqv   1/1     Running   0          3m45s   10.244.4.11   kind-worker4   <none>           <none>
test-cluster-pd-0                         1/1     Running   0          3m45s   10.244.1.10   kind-worker3   <none>           <none>
test-cluster-pd-1                         1/1     Running   0          3m45s   10.244.4.12   kind-worker4   <none>           <none>
test-cluster-pd-2                         1/1     Running   0          3m45s   10.244.2.8    kind-worker    <none>           <none>
test-cluster-tidb-0                       2/2     Running   0          2m58s   10.244.2.9    kind-worker    <none>           <none>
test-cluster-tidb-1                       2/2     Running   0          2m58s   10.244.1.12   kind-worker3   <none>           <none>
test-cluster-tidb-2                       0/2     Pending   0          53s     <none>        <none>         <none>           <none>
test-cluster-tikv-0                       2/2     Running   0          3m31s   10.244.1.11   kind-worker3   <none>           <none>
test-cluster-tikv-1                       2/2     Running   0          3m31s   10.244.4.13   kind-worker4   <none>           <none>
test-cluster-tikv-2                       2/2     Running   0          3m31s   10.244.3.8    kind-worker2   <none>           <none>

What did you do?

We first have a healthy tidb cluster.
Then we changed the affinity rule for scheduling the tidb pods. The tidb statefulset is updated with the new affinity rule and performed a rolling upgrade. Then we realized that the new affinity rule cannot be satisfied for the current topology of the cluster, there is always one replica cannot be scheduled.
We then reverted the CR, to remove the new affinity rule.
But then the operator fails to update the tidb statefulset, leaving the cluster under unhealthy state.

What did you expect to see?
tidb-operator should be able to recover the cluster, with manual revert.

What did you see instead?
tidb-operator refuses to update the statefulset, before all replicas in the cluster become ready.
The log indicating this behavior is
tidb_cluster_controller.go:124] TidbCluster: default/test-cluster, still need sync: tidbcluster: [default/test-cluster]'s upgraded tidb pod: [test-cluster-tidb-2] is not ready, requeuing

@yiduoyunQ
Copy link
Contributor

hi, you can first scale-in tidb to replica:2, let cluster become normal, then operator can proceed the sync logic, after changing the affinity rules, you can scale-out back tidb-2 if you want.

@hoyhbx
Copy link
Contributor Author

hoyhbx commented Mar 31, 2023

Thanks @yiduoyunQ for the suggestion!
Is there a possible way to fix this fundamentally in operator? Because revert is probably the most intuitive way to recover the cluster after a misoperation. Or what is the challenge for the operator to recognize the statefulSet is never going to be entirely ready and start processing updates to statefulSet again?

@yiduoyunQ
Copy link
Contributor

yiduoyunQ commented Apr 1, 2023

hi hoyhbx, this error message is reported from https://github.com/pingcap/tidb-operator/blob/v1.4.4/pkg/manager/member/tidb_upgrader.go#L123-L133, it block Operator continue do upgradeTiDBPod.

Current Operator sync logic is sync each component type one by one in an order, so we need one determinate condition to check if it can continue sync the next component, the component pod's status is used here in upgrade sync logic.

For example if we change the pod's status check after upgradeTiDBPod, it may lead Operator continuously try upgrade TiDB pod if there is one or more TiDB pod's status always can not be ready.

What do you think about this?

@yiduoyunQ
Copy link
Contributor

if we already in the clear upgrade logic https://github.com/pingcap/tidb-operator/blob/v1.4.4/pkg/manager/member/tidb_member_manager.go#L305 , do we still need check the pod's status? or we can just let it start upgrade. @hanlins

@hoyhbx
Copy link
Contributor Author

hoyhbx commented Apr 1, 2023

Yeah I think it is a good practice to wait for the component to get ready and then apply next upgrade. The operator is doing a good job here.
But this conservative wait is sort of a double edged sword and causes this issue, I think the fundamental challenge to fix this issue here is to tell whether

  1. the Pods are not ready because the upgrade is in progress, and the pods will eventually become ready
  2. the Pods are not ready because of some unsatisfiable conditions, e.g. not schedulable

I think the status of the Pod may contain some more useful information to distinguish the two cases. For example, if the Pod has been in Pending state for over 5 minutes, then the operator can determine the upgrade has failed. Although the hardcoded timeout may be flaky, because in large clusters it may take a Pod over 5 minutes to get scheduled.

@csuzhangxc
Copy link
Member

yes. If we want to apply the next upgrade, we must tell every unhealthy case whether it's safe to process.

@hoyhbx
Copy link
Contributor Author

hoyhbx commented Apr 11, 2023

@csuzhangxc Do you think this would be a feature request for the statefulSet controller? Having a field in statefulSet status would greatly help operators to recognize failed upgrade

@csuzhangxc
Copy link
Member

@csuzhangxc Do you think this would be a feature request for the statefulSet controller? Having a field in statefulSet status would greatly help operators to recognize failed upgrade

Yes, this may be helpful.

@hoyhbx
Copy link
Contributor Author

hoyhbx commented Apr 27, 2023

@yiduoyunQ @csuzhangxc Can we have an intermediate fix for this bug before the upstream Kubernetes reacts? We are drafting an issue to send to Kubernetes at the same time.

A potential fix for this bug is to recognize the error condition that the cluster is under, and proceed to upgrade the statefulSet if the Pods have been in the error condition for a timeout.

The symptom of this bug is that there are some pods in "Pending" state, with error conditions, e.g.:

{
      "last_probe_time": null,
      "last_transition_time": "2023-04-02T12:10:37+00:00",
      "message": "0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
}

Note that the condition has the last_transition_time field which can be used to determine how long the Pod has been in this error state.
There are other error states which the Pod can be under which may manifest in other Conditions, e.g. ContainersReady, Initialized, Ready. All of them would contain the last_transition_time which we can use to decide if the next upgrade should proceed.

We propose to add a condition branch at

if !podutil.IsPodAvailable(pod, int32(minReadySeconds), metav1.Now()) {
readyCond := podutil.GetPodReadyCondition(pod.Status)
if readyCond == nil || readyCond.Status != corev1.ConditionTrue {
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s upgraded tidb pod: [%s] is not ready", ns, tcName, podName)
}
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s upgraded tidb pod: [%s] is not available, last transition time is %v", ns, tcName, podName, readyCond.LastTransitionTime)
}
if member, exist := tc.Status.TiDB.Members[podName]; !exist || !member.Health {
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s tidb upgraded pod: [%s] is not ready", ns, tcName, podName)
}
to check if the previous upgrade has failed (through timeout by checking the last_transition_time field in conditions). If the previous upgrade has failed, the operator can proceed to next upgrade. The timeout can be set with conservatively large number, or be a field in the CRD so users can configure.

@csuzhangxc
Copy link
Member

@hoyhbx this is something like a force upgrade, right? I think adding a new field may be better and wouldn't break the existing behavior. And may need to address which conditions this behavior will take effect.

@hoyhbx
Copy link
Contributor Author

hoyhbx commented Apr 28, 2023

@csuzhangxc , yes, a feature like force upgrade would fix this. Having a boolean property force for each statefulSet, and if the force is set to true, the wait conditions will be skipped and directly proceed to upgrade.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants