
K8SSAND-1837 ⁃ Succeeded field in CassandraTask status does not get updated for restart task #431

Closed
jsanda opened this issue Oct 15, 2022 · 4 comments · Fixed by #440
Labels
bug (Something isn't working), zh:Ready-For-Review

Comments

@jsanda
Contributor

jsanda commented Oct 15, 2022

What happened?
I created a CassandraTask to perform a rolling restart on a 3-node DC. The .status.succeeded field of the CassandraTask was never updated. I assume the same problem exists for .status.failed, but the restarts were all successful so I can't say for certain.

Did you expect to see something different?
.status.succeeded should get updated as per the API docs. I do see this behavior for other CassandraTask types.

How to reproduce it (as minimally and precisely as possible):
Create a 3-node CassandraDatacenter in a namespace named test:

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: test
  serverType: cassandra
  serverVersion: 4.0.3
  size: 3
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: standard

Once the CassandraDatacenter is ready, create a CassandraTask:

apiVersion: control.k8ssandra.io/v1alpha1
kind: CassandraTask
metadata:
  name: restart-dc
spec:
  datacenter:
    name: dc1
    namespace: test
  jobs:
    - name: restart-dc1
      command: restart
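
For reference, here is one way the missing updates can be observed programmatically once the restart finishes. This is a sketch only: the import path of the control API, the Go field names (Succeeded, Failed), and the assumption that the task lives in the test namespace alongside the DC are mine, not taken from the repo.

// Sketch only: import path and Status field names are assumptions based on
// the .status fields described above.
package example

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"

	controlapi "github.com/k8ssandra/cass-operator/apis/control/v1alpha1" // assumed path
)

// printTaskProgress fetches the restart-dc task created above and prints the
// counters this issue reports as never being updated for restart tasks.
func printTaskProgress(ctx context.Context, c client.Client) error {
	var task controlapi.CassandraTask
	// Assumes the task was created in the same "test" namespace as the DC.
	key := client.ObjectKey{Namespace: "test", Name: "restart-dc"}
	if err := c.Get(ctx, key, &task); err != nil {
		return err
	}
	// Expected to end at succeeded=3 for the 3-node DC; observed to stay at 0.
	fmt.Printf("succeeded=%d failed=%d\n", task.Status.Succeeded, task.Status.Failed)
	return nil
}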

Environment

  • Cass Operator version: v1.13.0

┆Issue is synchronized with this Jira Task by Unito
┆friendlyId: K8SSAND-1837
┆priority: Medium

@jsanda jsanda added the bug Something isn't working label Oct 15, 2022
@sync-by-unito sync-by-unito bot changed the title Succeeded field in CassandraTask does not updated for restart task K8SSAND-1837 ⁃ Succeeded field in CassandraTask does not updated for restart task Oct 15, 2022
@jsanda jsanda changed the title K8SSAND-1837 ⁃ Succeeded field in CassandraTask does not updated for restart task K8SSAND-1837 ⁃ Succeeded field in CassandraTask status does not get updated for restart task Oct 15, 2022
@Miles-Garnsey Miles-Garnsey self-assigned this Oct 27, 2022
@Miles-Garnsey
Member

Replicating a discussion that @burmanm and I are having offline in the interests of transparency.

Certain features were not completed when the CassandraTask stuff was closed out (link). The last list item 'CassandraTaskStatus has still fields from Job definition that will not be correct' is probably the relevant one here.

The problem (as near as I can ascertain it) is that the restart task effectively operates on an almost 'fire and forget' basis: it is hard to track completion beyond an OK from the API. The restart itself is handled by the STS controller, which AFAIK doesn't provide a UID for a given restart attempt that we could use to track its outcome.

We do currently appear to track the observed generation (which increments on restart), but this information is not propagated to the job status either. Even if it were, I'm not sure it would be sufficient to track success/failure here, because the observed generation actually seems to update at the commencement of an STS restart rather than when every pod has been restarted.
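
To make that concrete, here is a minimal sketch against the plain appsv1.StatefulSetStatus fields (not cass-operator code, just my reading of the StatefulSet semantics) of what "fully rolled" looks like compared to observedGeneration alone:

package example

import appsv1 "k8s.io/api/apps/v1"

// rolloutFinished reports whether a StatefulSet has finished rolling after its
// pod template was patched. ObservedGeneration catching up to Generation only
// means the StS controller has seen the new template; UpdatedReplicas is what
// tracks pods actually recreated at the new revision.
func rolloutFinished(sts *appsv1.StatefulSet) bool {
	st := sts.Status
	return st.ObservedGeneration == sts.Generation &&
		st.UpdatedReplicas == st.Replicas &&
		st.ReadyReplicas == st.Replicas
}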

Thoughts folks?

@burmanm
Contributor

burmanm commented Oct 31, 2022

I don't think we can see failures in any sane way with an StS restart; a failure would basically mean the StS was unable to update the Pod to the newest generation. But would something like this patch work (it only tracks completed):

diff --git a/controllers/control/cassandratask_controller.go b/controllers/control/cassandratask_controller.go
index 249602d..075cb41 100644
--- a/controllers/control/cassandratask_controller.go
+++ b/controllers/control/cassandratask_controller.go
@@ -101,6 +101,9 @@ type TaskConfiguration struct {
        // Functions not targeting the pod
        ValidateFunc   ValidatorFunc
        PreProcessFunc ProcessFunc
+
+       // Status tracking
+       Completed int
 }
 
 func (t *TaskConfiguration) Validate() error {
@@ -285,6 +288,9 @@ JobDefinition:
                        if err != nil {
                                return ctrl.Result{}, err
                        }
+
+                       completed = taskConfig.Completed
+
                        break JobDefinition
                case api.CommandReplaceNode:
                        r.replace(taskConfig)
diff --git a/controllers/control/jobs.go b/controllers/control/jobs.go
index 7d2b702..5b2a9bd 100644
--- a/controllers/control/jobs.go
+++ b/controllers/control/jobs.go
@@ -70,6 +70,7 @@ func (r *CassandraTaskReconciler) restartSts(ctx context.Context, sts []appsv1.S
                }
        }
 
+       restartedPods := 0
        for _, st := range sts {
                if st.Spec.Template.ObjectMeta.Annotations == nil {
                        st.Spec.Template.ObjectMeta.Annotations = make(map[string]string)
@@ -83,10 +84,18 @@ func (r *CassandraTaskReconciler) restartSts(ctx context.Context, sts []appsv1.S
                                status.CurrentReplicas == status.Replicas &&
                                status.ReadyReplicas == status.Replicas &&
                                status.ObservedGeneration == st.GetObjectMeta().GetGeneration() {
+
                                // This one has been updated, move on to the next one
+                               restartedPods += int(status.ReadyReplicas)
+
                                continue
                        }
 
+                       // TODO Verify the revision has updated before doing this..
+                       restartedPods += int(status.UpdatedReplicas)
+
+                       taskConfig.Completed = restartedPods
+
                        // This is still restarting
                        return ctrl.Result{RequeueAfter: jobRunningRequeue}, nil
                }

Somewhat cruel, but I don't think we should spend too much time on this type of status update..

@Miles-Garnsey
Member

Miles-Garnsey commented Nov 3, 2022

I've been reading the k8s source code that deals with STS restarts and also doing some testing. It looks to me as though the following things happen:

  1. During a restart, readyReplicas only decreases while the pod is bouncing. So you'll only ever see this diminish by one at a time (because the restart is rolling, and "readiness" here just seems to mean that the replica is available, not necessarily of the right generation).
  2. There is another field updatedReplicas which might be suitable (EDIT: just saw that you also used that in your suggested patch). When the restart is triggered, it seems to first disappear entirely, and then when it comes back it does track the number of pods that have restarted.

Because of (1) I'm not sure if we can do as you suggest here:

restartedPods += int(status.ReadyReplicas)

Because initially, when the first pod is restarted, you'll end up with restartedPods == 2, when in reality, only the first pod is starting its restart cycle.

Having said that, maybe the updatedReplicas field could be used to achieve something similar? There is still a problem if another restart is triggered after ours (which may make ours appear to fail or take longer), but I don't know whether that is avoidable.
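
For concreteness, a sketch of what counting via updatedReplicas could look like (plain appsv1 fields again; not the actual fix, and the generation guard just mirrors the TODO in the patch above):

package example

import appsv1 "k8s.io/api/apps/v1"

// restartedPodCount sums pods already recreated at the new revision across the
// datacenter's StatefulSets. The generation guard matters because updatedReplicas
// only reflects the new revision once the controller has observed the
// annotation patch that triggered the restart.
func restartedPodCount(stsList []appsv1.StatefulSet) int32 {
	var restarted int32
	for _, sts := range stsList {
		if sts.Status.ObservedGeneration < sts.Generation {
			// The restart trigger has not been observed yet; count nothing here.
			continue
		}
		restarted += sts.Status.UpdatedReplicas
	}
	return restarted
}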

> Somewhat cruel, but I don't think we should spend too much time on this type of status update..

I tend to agree here. What we're trying to do here doesn't appear to be idiomatic for K8s...

@burmanm
Contributor

burmanm commented Nov 3, 2022

In the first if {}, the restart of the StS has finished, so ReadyReplicas == UpdatedReplicas; but of course, for consistency, using UpdatedReplicas is better.

The restart task has parallelism denied, so there can't be two tasks restarting at the same time without the first one being marked as completed; thus even if you create 2 restart tasks, the counting should function fine.

@adejanovski adejanovski moved this to To Groom in K8ssandra Nov 8, 2022
@adejanovski adejanovski moved this from To Groom to Ready in K8ssandra Nov 8, 2022
@adejanovski adejanovski moved this from Ready to Ready For Review in K8ssandra Nov 8, 2022
@adejanovski adejanovski moved this from Ready For Review to Review in K8ssandra Nov 10, 2022
Repository owner moved this from Review to Done in K8ssandra Nov 14, 2022