Bug Report
We find that scaling down a single DC rack (by reducing nodesPerRacks) can end up in a dirty state (the pod is deleted but the PVC is still there) if the operator crashes in the middle of a reconcile and restarts. This dirty state also prevents the operator from handling any future user requests.
More concretely, when scaling down the DC rack (StatefulSet), casskop does the following:
1. Detect that there is a decommission task and set the in-memory CR.podLastOperation.status (previously StatusOngoing) to StatusFinalizing (no Update has been issued to Kubernetes yet).
2. Update the StatefulSet's replicas to delete the decommissioned pod.
3. Update the CR at the end of the reconcile, which persists the change to podLastOperation (to StatusFinalizing).
4. In the next round of reconcile:
   i. If podLastOperation is StatusOngoing, try to get the decommissioned pod. If there is an error, the operator returns the error directly; otherwise it continues its reconcile.
   ii. If podLastOperation is StatusFinalizing, try to get the decommissioned pod. On a NotFound error, delete the PVC and set podLastOperation.status to StatusDone.
Say we set nodesPerRacks from 2 to 1. The operator runs the above steps. If the operator pod crashes after step 2, the decommissioned pod is deleted (as the StatefulSet is resized), but podLastOperation is still StatusOngoing (since step 3 has not executed yet). After the operator pod restarts, it goes to branch 4.i, and since the pod is already deleted, getting it returns a NotFound error. The operator simply ends this round of reconcile by returning the error and is never able to clean up the PVC or serve further user requests.
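To make the crash window concrete, here is a minimal, self-contained Go sketch of the ordering described above. The types, helpers, and constants (fakeCluster, reconcile, the status strings) are illustrative stand-ins, not casskop's actual code; the only point it demonstrates is that the status change is persisted after the StatefulSet resize.

```go
// A minimal sketch of the ordering described above, using illustrative
// stand-in types; none of these identifiers are casskop's real ones.
package main

import (
	"errors"
	"fmt"
)

const (
	StatusOngoing    = "Ongoing"
	StatusFinalizing = "Finalizing"
	StatusDone       = "Done"
)

// fakeCluster stands in for the persisted CassandraCluster CR plus the
// cluster state the operator observes (pod and PVC existence).
type fakeCluster struct {
	podLastOperationStatus string // persisted CR.podLastOperation.status
	podDeleted             bool   // decommissioned pod no longer exists
	pvcDeleted             bool   // PVC of the decommissioned pod cleaned up
}

// reconcile mirrors the current order: the status is flipped only in memory
// (step 1), the StatefulSet is resized (step 2), and the CR is persisted at
// the very end of the reconcile (step 3).
func reconcile(cc *fakeCluster, crashAfterScaleDown bool) error {
	switch cc.podLastOperationStatus {
	case StatusOngoing:
		if cc.podDeleted {
			// Branch 4.i: the pod is already gone, so the Get fails and the
			// error is returned before the PVC cleanup is ever reached.
			return errors.New("decommissioned pod not found")
		}
		inMemoryStatus := StatusFinalizing // step 1: in-memory only, no Update yet
		cc.podDeleted = true               // step 2: StatefulSet resized, pod deleted
		if crashAfterScaleDown {
			// Crash window: the pod is gone, but the persisted status is
			// still StatusOngoing.
			return errors.New("operator crashed before the CR update")
		}
		cc.podLastOperationStatus = inMemoryStatus // step 3: persist StatusFinalizing
	case StatusFinalizing:
		if cc.podDeleted {
			// Branch 4.ii: NotFound is expected here, so the PVC is deleted.
			cc.pvcDeleted = true
			cc.podLastOperationStatus = StatusDone
		}
	}
	return nil
}

func main() {
	cc := &fakeCluster{podLastOperationStatus: StatusOngoing}
	_ = reconcile(cc, true) // crash after step 2
	// After the restart the persisted status is still StatusOngoing, so every
	// retry hits branch 4.i and returns without ever cleaning up the PVC.
	err := reconcile(cc, false)
	fmt.Println("error:", err, "pvcDeleted:", cc.pvcDeleted)
}
```

Running this prints the NotFound-style error with pvcDeleted still false, which matches the stuck state described above.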
What did you do?
Set the nodesPerRacks from 2 to 1
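For reference, a minimal fragment of the CassandraCluster spec for this scenario. The topology/dc/rack layout and the dc/rack names are assumptions for illustration and may differ across casskop versions; nodesPerRacks and deletePVC are the fields this report actually refers to.

```yaml
spec:
  deletePVC: true          # PVC cleanup enabled (confirmed in the comments below)
  topology:
    dc:
      - name: dc1          # illustrative DC/rack names
        nodesPerRacks: 1   # changed from 2 to trigger the decommission and scale-down
        rack:
          - name: rack1
```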
What did you expect to see?
The pod and the PVC get deleted.
What did you see instead? Under which circumstances?
The pod is deleted but the PVC is still there, and any further user operation is refused by the operator.
Environment
casskop version: f87c8e0 (master branch)
Kubernetes version: 1.18.9
Cassandra version: 3.11
Possible Solution
A potential solution is to issue the Update directly after changing CR.podLastOperation.status to StatusFinalizing in step 1, so that even if the operator crashes in the middle of the reconcile, it can still resize the StatefulSet, delete the PVC, and eventually move to StatusDone.
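For illustration only, a sketch of the reordered flow, reusing the illustrative fakeCluster type, status constants, and imports from the sketch above (again, not casskop's real code):

```go
// reconcileFixed persists StatusFinalizing before resizing the StatefulSet,
// so a crash right after the scale-down still leaves the CR in
// StatusFinalizing and the next reconcile reaches the PVC cleanup branch.
func reconcileFixed(cc *fakeCluster, crashAfterScaleDown bool) error {
	switch cc.podLastOperationStatus {
	case StatusOngoing:
		// Step 1 (changed): issue the Update immediately instead of waiting
		// for the end of the reconcile.
		cc.podLastOperationStatus = StatusFinalizing
		// Step 2: resize the StatefulSet; the decommissioned pod goes away.
		cc.podDeleted = true
		if crashAfterScaleDown {
			return errors.New("operator crashed after the scale-down")
		}
	case StatusFinalizing:
		if cc.podDeleted {
			// The persisted status survived the crash, so the PVC cleanup
			// and the transition to StatusDone still happen on retry.
			cc.pvcDeleted = true
			cc.podLastOperationStatus = StatusDone
		}
	}
	return nil
}
```

With this ordering, a crash in the same window leaves the persisted status at StatusFinalizing, so the retry takes the cleanup branch and deletes the PVC instead of returning the NotFound error.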
Additional context
We are willing to send a PR to help fix this issue.
Can you confirm it happens when deletePVC is set to true? Otherwise it's expected behavior.
Yes, we set deletePVC to true, so the PVC is supposed to be deleted.
The PVC does not get deleted because the controller crashes at a particular point and cannot fulfill all the reconcile updates. We have read the source code very carefully to draw this conclusion. More concretely, the decommissioned pod is deleted while podLastOperation is still StatusOngoing. Although the controller can restart, it cannot make progress from this inconsistent state to delete the PVC.
We are currently trying to send a PR to fix it. A potential approach is to switch the update/delete order to avoid the inconsistent state.
This bug is hard to trigger, as it only manifests when the crash happens at a particular time. But once triggered, the controller will not be able to recover.
We actually have an open-source tool that can reliably reproduce this bug (when deletePVC is set to true), which helped us diagnose the problem. Please let us know if you would also like to reliably reproduce the bug, and we can help you with that.