Describe the bug
We find that reading stale pod information from the apiserver can make the controller accidentally delete the PVC used by a current pod. More concretely, if we scale a CassandraDataCenter down and then back up, the controller (after a restart) may read a stale view of the pods and believe one of the Cassandra pods is about to be deleted (due to the earlier scale-down). The controller will then try to delete the PVC of the "to-be-deleted" pod, while that PVC is actually in use by the current Cassandra pod. Currently, the controller lists PVCs using `datacenter`, `datacenterUID`, and `cluster`. All three fields remain unchanged across a scale-down/scale-up, so the controller cannot differentiate between the PVC used by the "to-be-deleted" pod and the one used by the current pod. One potential solution is to also use the pod UID when listing PVCs, so that even when seeing stale pod information the controller never tries to delete the PVC held by a current pod.
This issue is somewhat similar to #402, but the corresponding fix #403 (listing PVCs by the CassandraDataCenter UID) only helps differentiate between CassandraDataCenters that share the same name; it does not help when the controller reads stale pod information. As mentioned above, using the pod UID to differentiate between same-named pods when listing PVCs would help here.
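For illustration, here is a minimal controller-runtime sketch of the proposed listing change. The `listPVCsForPod` helper and the `example.com/...` label keys are made up for this sketch (the operator's real label scheme may differ); the point is only that including the owning pod's UID in the selector makes the PVC query immune to stale views of same-named pods.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listPVCsForPod lists only the PVCs that belong to one specific pod
// instance. Because the selector includes the pod's UID, a stale view in
// which an old pod of the same name is terminating can never match the
// PVC created for that pod's newer namesake.
func listPVCsForPod(ctx context.Context, c client.Client, pod *corev1.Pod) (*corev1.PersistentVolumeClaimList, error) {
	pvcs := &corev1.PersistentVolumeClaimList{}
	err := c.List(ctx, pvcs,
		client.InNamespace(pod.Namespace),
		client.MatchingLabels{
			// Existing selectors (placeholder label keys).
			"example.com/datacenter":    pod.Labels["example.com/datacenter"],
			"example.com/datacenterUID": pod.Labels["example.com/datacenterUID"],
			"example.com/cluster":       pod.Labels["example.com/cluster"],
			// Proposed addition: the UID of the pod the claim was
			// created for, stamped onto the PVC at creation time.
			"example.com/podUID": string(pod.UID),
		})
	return pvcs, err
}
```

This assumes the controller stamps the creating pod's UID onto the PVC as a label when the claim is created, so a recreated pod/PVC pair carries the new UID and a stale pod's old UID can no longer match any claim that still exists.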
To Reproduce
The issue happened in an HA k8s cluster:
1. Create a CassandraDataCenter `cdc` with `nodes=2` and `deletePVCs=true`. Two Cassandra pods, `ca1` and `ca2`, are created. The controller is talking to apiserver1.
2. Scale `cdc` down (by setting `nodes=1`). `ca2` and its PVC are deleted. Meanwhile, apiserver2 gets partitioned, so its watch cache stops at the moment `ca2` is tagged with a deletion timestamp.
3. Scale `cdc` up (by setting `nodes=2`). A new `ca2` and its PVC are created. Note that the new `ca2` shares the same name as the previously deleted one, and so does its PVC.
4. The controller crashes; after restarting, it talks to the stale apiserver2. From apiserver2's watch cache, the controller finds `ca2` tagged with a deletion timestamp. It cannot differentiate between the old `ca2`'s PVC (which is already deleted) and the existing one, so it deletes the current `ca2`'s PVC, which is unexpected behavior.
Expected behavior
The controller should not delete a PVC whose pod is not actually going to be deleted. This can be achieved by using the pod UID when listing PVCs, as mentioned above. Every pod has a unique UID even when it shares a name with an earlier pod, so the controller will never mistakenly delete a PVC that belongs to another pod instance.
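To make this concrete, below is a hedged sketch of a pre-deletion guard, under the same assumption as above that the PVC carries a hypothetical `example.com/podUID` label stamped at creation. `safeToDeletePVC` is an illustrative helper, not existing operator code.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// safeToDeletePVC reports whether a PVC may be deleted for an apparently
// terminating pod. A UID mismatch between the claim's recorded owner pod
// and the pod currently visible under that name means the claim belongs
// to a newer pod instance and must be kept.
func safeToDeletePVC(ctx context.Context, c client.Client, pvc *corev1.PersistentVolumeClaim, podName string) (bool, error) {
	pod := &corev1.Pod{}
	err := c.Get(ctx, types.NamespacedName{Namespace: pvc.Namespace, Name: podName}, pod)
	if apierrors.IsNotFound(err) {
		// No pod with this name exists at all: the claim is orphaned.
		return true, nil
	}
	if err != nil {
		return false, err
	}
	sameInstance := pvc.Labels["example.com/podUID"] == string(pod.UID)
	terminating := pod.DeletionTimestamp != nil
	// Delete only when the claim belongs to this exact pod instance and
	// that instance really is being torn down.
	return sameInstance && terminating, nil
}
```

In the reproduction above, the stale watch cache reports the old `ca2` (old UID, deletion timestamp) while the surviving PVC carries the new `ca2`'s UID, so a guard like this would keep the claim.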
Additional context
We are willing to file a patch for this issue, similar to what we did in #403.
Hi @srteam2020, I was wondering if you would like to say hi to us (we would love to say hi to you!) in a more private manner. Do not hesitate to reach me at stefan dot miklosovic at instaclustr dot com with anything (but not only) Cassandra related. We like to talk to people, and we don't try to push anything on you, don't worry :)
srteam2020 added a commit to srteam2020/cassandra-operator that referenced this issue on Mar 24, 2021