Describe the bug
We find that the cassandraDatacenter controller sometimes fails to scale down the Cassandra cluster because it decommissions the wrong Cassandra node. In scaleStatefulSet, the last pod returned by List is picked as the one to decommission:
newestPod := podsInRack[len(podsInRack)-1]
List fetches the pods from the local indexer. However, the indexer is backed by a map[string]interface{}, so no order is guaranteed for its keys. If two Cassandra pods ca-0 and ca-1 are running, List can return either [ca-0, ca-1] or [ca-1, ca-0]. In the latter case, the controller decommissions ca-0. Later, when the Kubernetes statefulset controller reconciles the statefulset, it deletes the pod with the largest ordinal, which is ca-1 in this case, so the decommissioning can be inconsistent with the pod deletion. More importantly, the statefulset controller only deletes a pod once all of its predecessor pods are healthy ("Running and Ready"). Since ca-0's Cassandra node has been decommissioned, the deletion of ca-1 is blocked forever.
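For context, here is a minimal standalone sketch (not the operator's code) of why no ordering can be assumed: Go randomizes iteration order over map keys, so any list assembled by ranging over a map[string]interface{}-backed store can come back in a different order on each reconcile.

package main

import "fmt"

func main() {
	// Stand-in for the indexer's backing store, which is a map[string]interface{}.
	store := map[string]interface{}{
		"ca-0": nil,
		"ca-1": nil,
	}
	// Building a list by ranging over the map yields the keys in no guaranteed order.
	var names []string
	for name := range store {
		names = append(names, name)
	}
	fmt.Println(names) // run repeatedly: sometimes [ca-0 ca-1], sometimes [ca-1 ca-0]
}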
To Reproduce
1. Create a cassandradatacenter cdc with node=2 (there are now two pods: ca-0 and ca-1).
2. Scale cdc down by changing node to 1. If List happens to return [ca-1, ca-0], ca-0 is decommissioned, the deletion of ca-1 gets stuck forever, and the scale-down never succeeds.
Note that the bug is nondeterministic, since map iteration order is not guaranteed: if [ca-0, ca-1] is returned, everything works fine. However, we have indeed observed the other order being returned, which causes the problem described above.
Expected behavior
A potential fix is to extract each pod's ordinal the same way the statefulset controller does (see the snippet below) and pick the pod with the largest ordinal to decommission.
Code in the statefulset controller that extracts the ordinal for a pod:
// getParentNameAndOrdinal gets the name of pod's parent StatefulSet and pod's ordinal as extracted from its Name. If
// the Pod was not created by a StatefulSet, its parent is considered to be empty string, and its ordinal is considered
// to be -1.
func getParentNameAndOrdinal(pod *v1.Pod) (string, int) {
	parent := ""
	ordinal := -1
	subMatches := statefulPodRegex.FindStringSubmatch(pod.Name)
	if len(subMatches) < 3 {
		return parent, ordinal
	}
	parent = subMatches[1]
	if i, err := strconv.ParseInt(subMatches[2], 10, 32); err == nil {
		ordinal = int(i)
	}
	return parent, ordinal
}
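For illustration, here is a minimal sketch of how scaleStatefulSet could pick the decommission target by ordinal instead of by List order. The package name, helper names (ordinalOf, pickPodToDecommission), and the simple name-suffix parsing are hypothetical, not taken from the operator or the statefulset controller:

package scaledown

import (
	"sort"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// ordinalOf extracts the trailing ordinal from a StatefulSet pod name such as
// "ca-1". It returns -1 if the name does not end in "-<number>".
func ordinalOf(pod *corev1.Pod) int {
	idx := strings.LastIndex(pod.Name, "-")
	if idx < 0 {
		return -1
	}
	n, err := strconv.Atoi(pod.Name[idx+1:])
	if err != nil {
		return -1
	}
	return n
}

// pickPodToDecommission sorts the pods by ordinal and returns the one with the
// largest ordinal, i.e. the same pod the statefulset controller deletes first
// during a scale-down.
func pickPodToDecommission(podsInRack []*corev1.Pod) *corev1.Pod {
	if len(podsInRack) == 0 {
		return nil
	}
	sort.Slice(podsInRack, func(i, j int) bool {
		return ordinalOf(podsInRack[i]) < ordinalOf(podsInRack[j])
	})
	return podsInRack[len(podsInRack)-1]
}

Sorting by ordinal rather than relying on the indexer's List order keeps the decommissioned Cassandra node consistent with the pod the statefulset controller will actually delete, so the scale-down can complete.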
Environment
OS: Linux
Kubernetes version: v1.18.9
kubectl version: v1.20.1
Go version: 1.13.9
Cassandra version: 3
Additional context
We are willing to file a patch for this issue.
yes, the patch would be very good! I am very sorry this one slipped through. If you have some cycles to fix this, it would be awesome.
I have not forgotten your first patch; I am just doing something in the background, so I will cut a release sooner or later, as what I am doing depends on the operator, but I've already merged that locally. If you manage to fix this, I will release the images with fixes for both issues.