ROX-20872: fix fleetshard reconciler race condition #1629

ludydoo · 2024-01-31T09:48:37Z

Description

Fixes the fleetshard cache bug.

How to reproduce:

1. Reconcile central v1
2. Reconcile central v2
3. Deployment is unhealthy (restarting)
4. Reconcile central v1
5. Reconcile central v2 (ignored/skipped)

The reconcile at step 5 is skipped because the hash is not stored at step 4, even though the central CR was updated at that step. At step 5, the hash of the request (v2) still matches the last stored hash (from step 2), so the reconciler skips it.

The effect is that v1 is deployed, but the reconciler thinks it reconciled v2.
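
The five steps above can be replayed with a minimal sketch of the pre-fix bookkeeping. This is not the actual fleetshard code: the `buggyReconciler` type, the `applied` field, the `reconcile` signature, and the use of MD5 to produce a `[16]byte` hash are assumptions made for illustration. Only the `lastCentralHash` name comes from the diff.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// buggyReconciler is a hypothetical, stripped-down model of the reconciler
// before the fix. Only the hash bookkeeping is modelled.
type buggyReconciler struct {
	lastCentralHash [16]byte // hash of the last request recorded as reconciled
	applied         string   // spec currently written to the central CR (assumed field)
}

// reconcile returns true when the request is skipped because its hash
// matches the stored one.
func (r *buggyReconciler) reconcile(spec string, deploymentReady bool) bool {
	hash := md5.Sum([]byte(spec)) // assumption: any 16-byte hash would do
	if hash == r.lastCentralHash {
		return true // nothing to do, according to the cache
	}
	r.applied = spec // the central CR is updated...
	if !deploymentReady {
		return false // ...but the early exit leaves the old hash in place
	}
	r.lastCentralHash = hash
	return false
}

func main() {
	r := &buggyReconciler{}
	r.reconcile("v1", true)            // step 1
	r.reconcile("v2", true)            // step 2: hash(v2) stored
	r.reconcile("v1", false)           // steps 3-4: CR updated, hash(v2) kept
	skipped := r.reconcile("v2", true) // step 5
	fmt.Printf("skipped=%v applied=%s\n", skipped, r.applied)
	// prints: skipped=true applied=v1
}
```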

The fix is to unset the hash if the reconciler exits early.

As an added precaution, the reconcile loop is also triggered if central was last reconciled more than 15 minutes ago.

SimonBaeumer

Nice job on the testing side and debugging 🚀

SimonBaeumer · 2024-02-01T13:54:25Z

fleetshard/pkg/central/reconciler/reconciler.go

+			r.lastCentralHash = centralHash
+			r.lastCentralHashTime = time.Now()
+		} else {
+			r.lastCentralHash = [16]byte{}


If the Central hash should not be updated, why delete it? I would have expected the code to leave the last Central hash unchanged rather than clear it.

ludydoo · 2024-02-02

That was the source of the bug: when the deployment is not ready, the hash is not updated even though the central CR has been updated.

openshift-ci · 2024-02-01T13:56:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ludydoo, SimonBaeumer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [SimonBaeumer,ludydoo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2024-02-05T06:29:37Z

New changes are detected. LGTM label has been removed.

ROX-20872 fix fleetshard reconciler race condition

bed0af1

ludydoo requested review from SimonBaeumer and johannes94 January 31, 2024 09:48

ludydoo temporarily deployed to development January 31, 2024 09:48 — with GitHub Actions Inactive

openshift-ci bot added the approved label Jan 31, 2024

ROX-20872 fix fleetshard reconciler race condition

d285803

ludydoo temporarily deployed to development January 31, 2024 09:59 — with GitHub Actions Inactive

SimonBaeumer approved these changes Feb 1, 2024


openshift-ci bot assigned SimonBaeumer Feb 1, 2024

openshift-ci bot added the lgtm label Feb 1, 2024

ROX-20872 fix fleetshard reconciler race condition

f9d51d5

openshift-ci bot removed the lgtm label Feb 5, 2024

ludydoo temporarily deployed to development February 5, 2024 06:29 — with GitHub Actions Inactive

ludydoo merged commit c48e249 into main Feb 6, 2024
9 checks passed

ludydoo deleted the ROX-20872-fleetshard-race-condition branch February 6, 2024 08:07
