Reconcile related drpcs when drcluster is deleted #1168
Conversation
- When filtering update events, consider also the update when the new object is marked for deletion.
- When filtering drpcs, consider all drpcs referencing a deleted drcluster.

With this change, when a drcluster is deleted, the drpc controller updates the VRG ManifestWork on the remaining cluster. Previously this happened minutes after a drcluster was deleted.

Notes:
- We don't get a delete event when the drcluster is deleted, but an update event. I don't know if this is a bug in controller-runtime or expected behavior.

Signed-off-by: Nir Soffer <[email protected]>
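A minimal sketch of the first part of the change, assuming controller-runtime's `client.Object` interface; the function name is illustrative and not the exact ramen code:

```go
package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateEventOfInterest is an illustrative name for the DRCluster update
// filter. After the existing failover activation checks, it also admits a
// DRCluster that has been marked for deletion, since the deletion is
// delivered as an update event carrying a deletionTimestamp (see the
// notes above).
func updateEventOfInterest(oldDRCluster, newDRCluster client.Object) bool {
	// ... existing failover activation checks on oldDRCluster/newDRCluster ...

	// The only remaining interesting update is a DRCluster being deleted.
	return !newDRCluster.GetDeletionTimestamp().IsZero()
}
```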
Testing with drenv
VRG diff shows that minio-on-dr1 was removed
Interesting events from ramen log
Timeline
Logs and resources
Testing multiple apps
Deploy and enable dr for 3 apps
Deploy 3 apps and relocate if needed so all run on cluster dr1.
State before simulating disaster:
Simulate disaster
Failover all apps
The failover will time out waiting for the PeerReady condition - expected.
State after all apps failed over (Available condition met):
Dump VRG before deleting the drcluster
Delete drcluster
Dump VRG after deleting the drcluster
VRGs diff:
Ramen hub logs: Filtering DRPCs during update event with deleted drcluster:
Updating the VRG manifest work
Disabling DR for all apps
State after disabling dr:
Logs and resources
Tested Nir's ramen image (quay.io/nirsof/ramen-operator:update-vrg-v1) today with the fix described in this PR, using downstream ODF 4.15 build 112, OCP 4.14.6, ACM 2.9.1. Logs collected for the ramen hub pod ~10-15 minutes after drcluster perf2 was deleted.
LGTM! Just one observation on the delete event that may need to be resolved.
- // Exhausted all failover activation checks, this update is NOT of interest
- return false
+ // Exhausted all failover activation checks, the only interesting update is deleting a drcluster.
+ return drClusterIsDeleted(newDRCluster)
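The helper on the new line presumably just checks the deletion timestamp; a sketch under that assumption (not copied from the PR), using the same `client.Object` import as the sketch above:

```go
// Assumed shape of drClusterIsDeleted: the DRCluster counts as deleted once
// its deletionTimestamp is set, even though finalizers keep the object
// around until the actual delete event.
func drClusterIsDeleted(drCluster client.Object) bool {
	return !drCluster.GetDeletionTimestamp().IsZero()
}
```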
Unsure if a delete of a resource is causing an Update event to trigger as well. I would have just returned true from line 269 above, which is where the predicate function is called when the watched resource is deleted.
I agree with Shyam. You are better off returning true on line 269 and allowing all drpcs to reconcile. Deleting a drcluster should NOT be a common use case.
I started this change by returning true in line 269, but unfortunately this does not work.
Deleting creates an update event in which the new object has a deletionTimeStamp. When the object is actually removed from the system, we get a delete event.
In our case the delete event is not relevant. It will happen when the drpolicy is deleted and ramen removes the finalizers from the drcluster.
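An illustration of that event flow using controller-runtime predicate hooks; this is a sketch of the behavior described above, not ramen's actual predicate:

```go
package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// drClusterEventFilter illustrates when each hook fires while deleting a
// DRCluster that carries finalizers.
var drClusterEventFilter = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		// Fires when the deletionTimestamp is first set (e.g. by
		// `kubectl delete drcluster ...`); finalizers keep the object in
		// the API server, so this is the event to react to.
		return !e.ObjectNew.GetDeletionTimestamp().IsZero()
	},
	DeleteFunc: func(e event.DeleteEvent) bool {
		// Fires only after the finalizers are removed (when the drpolicy
		// is deleted and ramen drops the drcluster finalizer), which is
		// too late to be useful for updating the VRG ManifestWork.
		return false
	},
}
```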
I see, so adding the `deletionTimeStamp` ends up firing for `Updated`. Got it.
So how about if in line 307, you return `true` every time the `objectNew` (newDRCluster) has a non-zero `deletionTimeStamp`?
I think avoiding adding the new code (`DRPCsUsingDRCluster` and `DRPCsUsingDRPolicy`) is desirable.
I did not try the change without modifying FilterDRCluster, but if we don't modify it, we will trigger a reconcile only for the DRPCs that are failing over to the deleted cluster, which in practice is no DRPC at all.
We can simplify by returning all DRPCs, but if we have to add code, it makes sense to add the right code, which is only a few more lines.
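A rough sketch of what a mapping like `DRPCsUsingDRCluster` could do; the import path and the API field names (`Spec.DRPolicyRef`, `Spec.DRClusters`) are assumptions for illustration, not necessarily the exact ramen API:

```go
package controllers

import (
	"context"

	// Assumed import path for the ramen API types.
	rmn "github.com/ramendr/ramen/api/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DRPCsUsingDRCluster returns every DRPC whose DRPolicy references the
// given cluster, so that deleting that DRCluster reconciles exactly the
// affected DRPCs. Field names are assumed for illustration.
func DRPCsUsingDRCluster(ctx context.Context, c client.Client, clusterName string) ([]rmn.DRPlacementControl, error) {
	drpolicies := &rmn.DRPolicyList{}
	if err := c.List(ctx, drpolicies); err != nil {
		return nil, err
	}

	drpcs := &rmn.DRPlacementControlList{}
	if err := c.List(ctx, drpcs); err != nil {
		return nil, err
	}

	var found []rmn.DRPlacementControl

	for _, drpc := range drpcs.Items {
		for _, drpolicy := range drpolicies.Items {
			if drpc.Spec.DRPolicyRef.Name != drpolicy.Name {
				continue
			}

			for _, cluster := range drpolicy.Spec.DRClusters {
				if cluster == clusterName {
					found = append(found, drpc)

					break
				}
			}
		}
	}

	return found, nil
}
```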
With this change, when a drcluster is deleted, the drpc controller should update the VRG ManifestWork on the remaining cluster. Previously this happened minutes after a drcluster was deleted.