Describe the bug
We find that the cassandraDatacenter controller sometimes fails to scale down the Cassandra cluster because it decommissions the wrong Cassandra node. In scaleStatefulSet, the last pod returned by List is picked as the one to decommission:
newestPod := podsInRack[len(podsInRack)-1]
List fetches the pods from the local indexer. However, the indexer is backed by a map[string]interface{}, so no order is guaranteed for its keys. If two Cassandra pods ca-0 and ca-1 are running, List can return either [ca-0, ca-1] or [ca-1, ca-0]. In the latter case, the controller decommissions ca-0. Later, when the Kubernetes statefulset controller reconciles the statefulset, it deletes the pod with the largest ordinal, which is ca-1 in this case, so the decommissioning can be inconsistent with the pod deletion. More importantly, the statefulset controller only deletes a pod once all of its predecessor pods are healthy ("Running and Ready"). Since ca-0's Cassandra node has been decommissioned, the deletion of ca-1 is blocked forever.
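For context, here is a minimal standalone sketch (not the operator's code) of why no ordering can be assumed: Go randomizes iteration order over map keys, so any list assembled by ranging over a map[string]interface{}-backed store can come back in a different order on each reconcile.

package main

import "fmt"

func main() {
	// Stand-in for the indexer's backing store, which is a map[string]interface{}.
	store := map[string]interface{}{
		"ca-0": nil,
		"ca-1": nil,
	}
	// Building a list by ranging over the map yields the keys in no guaranteed order.
	var names []string
	for name := range store {
		names = append(names, name)
	}
	fmt.Println(names) // run repeatedly: sometimes [ca-0 ca-1], sometimes [ca-1 ca-0]
}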
To Reproduce
1. Create a cassandradatacenter cdc with node=2 (there are now two pods: ca-0 and ca-1).
2. Scale cdc down by changing node to 1. If List happens to return [ca-1, ca-0], ca-0 is decommissioned, the deletion of ca-1 gets stuck forever, and the scale-down never succeeds.
Note that the bug is nondeterministic, since map iteration order is not guaranteed: if [ca-0, ca-1] is returned, everything works fine. However, we have indeed observed the other order being returned, which causes the problem described above.
Expected behavior
A potential fix is to extract each pod's ordinal the same way the statefulset controller does (see the snippet below) and pick the pod with the largest ordinal to decommission.
Code in the statefulset controller that extracts the ordinal for a pod:
// getParentNameAndOrdinal gets the name of pod's parent StatefulSet and pod's ordinal as extracted from its Name. If
// the Pod was not created by a StatefulSet, its parent is considered to be empty string, and its ordinal is considered
// to be -1.
func getParentNameAndOrdinal(pod *v1.Pod) (string, int) {
	parent := ""
	ordinal := -1
	subMatches := statefulPodRegex.FindStringSubmatch(pod.Name)
	if len(subMatches) < 3 {
		return parent, ordinal
	}
	parent = subMatches[1]
	if i, err := strconv.ParseInt(subMatches[2], 10, 32); err == nil {
		ordinal = int(i)
	}
	return parent, ordinal
}
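For illustration, here is a minimal sketch of how scaleStatefulSet could pick the decommission target by ordinal instead of by List order. The package name, helper names (ordinalOf, pickPodToDecommission), and the simple name-suffix parsing are hypothetical, not taken from the operator or the statefulset controller:

package scaledown

import (
	"sort"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// ordinalOf extracts the trailing ordinal from a StatefulSet pod name such as
// "ca-1". It returns -1 if the name does not end in "-<number>".
func ordinalOf(pod *corev1.Pod) int {
	idx := strings.LastIndex(pod.Name, "-")
	if idx < 0 {
		return -1
	}
	n, err := strconv.Atoi(pod.Name[idx+1:])
	if err != nil {
		return -1
	}
	return n
}

// pickPodToDecommission sorts the pods by ordinal and returns the one with the
// largest ordinal, i.e. the same pod the statefulset controller deletes first
// during a scale-down.
func pickPodToDecommission(podsInRack []*corev1.Pod) *corev1.Pod {
	if len(podsInRack) == 0 {
		return nil
	}
	sort.Slice(podsInRack, func(i, j int) bool {
		return ordinalOf(podsInRack[i]) < ordinalOf(podsInRack[j])
	})
	return podsInRack[len(podsInRack)-1]
}

Sorting by ordinal rather than relying on the indexer's List order keeps the decommissioned Cassandra node consistent with the pod the statefulset controller will actually delete, so the scale-down can complete.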
Environment
OS: Linux
Kubernetes version: v1.18.9
kubectl version: v1.20.1
Go version: 1.13.9
Cassandra version: 3
Additional context
We are willing to file a patch for this issue.
yes, the patch would be very good! I am very sorry this one slipped through. If you have some cycles to fix this, it would be awesome.
I have not forgotten your first patch; I am just doing something in the background, so I will cut a release sooner or later, as what I am doing depends on the operator, but I've already merged that locally. If you manage to fix this, I will release the images with fixes for both issues.