-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update the vr condition #691
Conversation
It would be good to have an example in the PR description. Ideally do not abbreviate things like "vr" in commit subjects. |
cc55d78
to
4e4dac3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Pull request has been modified.
Type: v1alpha1.ConditionDegraded, | ||
Reason: v1alpha1.Error, | ||
ObservedGeneration: observedGeneration, | ||
Status: metav1.ConditionTrue, | ||
}) | ||
setStatusCondition(conditions, &metav1.Condition{ | ||
Message: "failed to resync", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we define a const/type for all these messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have those defined , which is printed as the reason in the conditions.
These message should be more descriptive in my opinion hence waiting for shyam's response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ShyamsundarR please have a look and share your opinion here.
@ShyamsundarR I have updated the pr to include specific errors for failed conditions, do take a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the messages are very short and not descriptive. The VR .Status.Message looks like good error text (maybe too verbose sometimes).
Do we assume that users will use .Status.Message for reporting errors? How do we know which condition is related to .Status.Message?
I think we need to include .Status.Message in the relevant .Conditions[N].Message instead of the short and non descriptive messages.
Ramen looks only at the VR conditions, and it will be very easy to use if we can use the condition message in the ramen condition. I don't see how we can use the global .Status.Message.
Type: v1alpha1.ConditionCompleted, | ||
Reason: v1alpha1.Promoted, | ||
ObservedGeneration: observedGeneration, | ||
Status: metav1.ConditionTrue, | ||
}) | ||
setStatusCondition(conditions, &metav1.Condition{ | ||
Message: "Volume is healthy", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message does not add enough info toe understand the meaning of the condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is a sucessful condition, we dont have any information to be printed, VR is in healthy state. Can you please share some examples of what other information you want to get printed here?
Type: v1alpha1.ConditionDegraded, | ||
Reason: v1alpha1.Healthy, | ||
ObservedGeneration: observedGeneration, | ||
Status: metav1.ConditionFalse, | ||
}) | ||
setStatusCondition(conditions, &metav1.Condition{ | ||
Message: "Volume is not resyncing", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, it is jut the text version of NotResynching or Resynching=false. The text should explain the status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same explanation as above, for this case, resync is not trigerred, hence volume is not resyncing, not sure what would be the detailed status for this?
@@ -46,26 +49,30 @@ func setPromotedCondition(conditions *[]metav1.Condition, observedGeneration int | |||
} | |||
|
|||
// sets conditions when volume promotion was failed. | |||
func setFailedPromotionCondition(conditions *[]metav1.Condition, observedGeneration int64) { | |||
func setFailedPromotionCondition(conditions *[]metav1.Condition, observedGeneration int64, err string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err is not a good name for an error message, this is the common name for objects implementing the error interface. Maybe use errorMesssage
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes can be, will update
Type: v1alpha1.ConditionDegraded, | ||
Reason: v1alpha1.Error, | ||
ObservedGeneration: observedGeneration, | ||
Status: metav1.ConditionTrue, | ||
}) | ||
setStatusCondition(conditions, &metav1.Condition{ | ||
Message: "Volume is not resyncing", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we repeating the same value multiple times? If we need to repeat the same message, they should be constants.
Type: v1alpha1.ConditionCompleted, | ||
Reason: v1alpha1.FailedToPromote, | ||
ObservedGeneration: observedGeneration, | ||
Status: metav1.ConditionFalse, | ||
}) | ||
setStatusCondition(conditions, &metav1.Condition{ | ||
Message: "error detected while promotion", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is strange that we set the standard message for this condition, but custom message for the previous condition. Based on the function name this should affect only the PromotedCondition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no here, even the volume is not healthy as some error has been detected, the reason for this state says that error has been detected, now what we can do here is print the error that has caused the failure, but status.Message has that information, hence I prefered not duplication the information. if you feel it is important, we can add the same error here as well.
Type: v1alpha1.ConditionCompleted, | ||
Reason: v1alpha1.FailedToPromote, | ||
ObservedGeneration: observedGeneration, | ||
Status: metav1.ConditionFalse, | ||
}) | ||
setStatusCondition(conditions, &metav1.Condition{ | ||
Message: "error detected while promotion", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition is not related to promotion. Why do we need to talk about promotion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this condition is reached while we were trying to promote and hence the error was detected right?
Type: v1alpha1.ConditionDegraded, | ||
Reason: v1alpha1.Error, | ||
ObservedGeneration: observedGeneration, | ||
Status: metav1.ConditionTrue, | ||
}) | ||
setStatusCondition(conditions, &metav1.Condition{ | ||
Message: "Volume is not resyncing", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message is capitalised, while other are not. I'm not sure what is the wanted format but all messages must be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohh missed this, will update
Type: v1alpha1.ConditionResyncing, | ||
Reason: v1alpha1.NotResyncing, | ||
ObservedGeneration: observedGeneration, | ||
Status: metav1.ConditionFalse, | ||
}) | ||
setStatusCondition(conditions, &metav1.Condition{ | ||
Message: "failed to meet prerequisite", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good summary of the error but we really want the reason that will help the admin debug this problem. What is the actual issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that will be in status.message
@@ -124,7 +124,7 @@ func (r *VolumeReplicationReconciler) Reconcile(ctx context.Context, req ctrl.Re | |||
err = validatePrefixedParameters(vrcObj.Spec.Parameters) | |||
if err != nil { | |||
logger.Error(err, "failed to validate parameters of volumeReplicationClass", "VRCName", instance.Spec.VolumeReplicationClass) | |||
setFailureCondition(instance) | |||
setFailureCondition(instance, "failed to validate parameters of volumeReplicationClass") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Start to look helpful, but which parameter was invalid?
@@ -160,7 +160,7 @@ func (r *VolumeReplicationReconciler) Reconcile(ctx context.Context, req ctrl.Re | |||
pvc, pv, pvErr = r.getPVCDataSource(logger, nameSpacedName) | |||
if pvErr != nil { | |||
logger.Error(pvErr, "failed to get PVC", "PVCName", instance.Spec.DataSource.Name) | |||
setFailureCondition(instance) | |||
setFailureCondition(instance, "failed to get PVC") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
failed to find the PVC? "Get" is a technical term, the function used to read an object from the api server, the human term is finding or looking up.
But it is not clear if the issue is missing pvc (ErrNotExist) or an error accessing the api server (maybe temporary error?). When we fix the code to handle both case we can provide a more useful error message.
internal/controller/replication.storage/volumereplication_controller.go
Outdated
Show resolved
Hide resolved
The following comments have been addressed in the above PR:
The VR after this change looks like below:
cc @nirs @Madhu-1 @nixpanic @ShyamsundarR please add your reviews to the changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we create useful message now, but it is not clear to me when and why messageFromDriver apply to which condition.
Can you share example VR status in all the error cases? It can be simulated by faking failures, or reproducing failure conditions on a real system (e.g. trying to replicate a cloned pvc with the wrong dr policy).
If you provide an image, I can test this in ramen testing env.
Type: v1alpha1.ConditionCompleted, | ||
Reason: v1alpha1.FailedToPromote, | ||
ObservedGeneration: observedGeneration, | ||
Status: metav1.ConditionFalse, | ||
}) | ||
setStatusCondition(conditions, &metav1.Condition{ | ||
Message: fmt.Sprintf("%s: %s", v1alpha1.MessageFailedPromoted, messageFromDriver), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is little bit confusing - we have:
- message
- messageFromDriver
It seems that message is related to Completed condition, and messageFromDriver is related to DegradtedCondition. But the name do not give any hint about these relations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
message is any error message that is being detected in csiaddons, and messagefromdriver is the rpc msgs that comes from ceph csi driver
@@ -112,7 +112,7 @@ func (r *VolumeReplicationReconciler) Reconcile(ctx context.Context, req ctrl.Re | |||
// Get VolumeReplicationClass | |||
vrcObj, err := r.getVolumeReplicationClass(logger, instance.Spec.VolumeReplicationClass) | |||
if err != nil { | |||
setFailureCondition(instance) | |||
setFailureCondition(instance, "failed to get volumeReplication class", err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is very hard to tell where the message and messageFromDriver will be set looking at this code. Maybe the function should renamed?
sure, let me create an image and share with you |
I install the image provided by @yati1998 using this: It passed ramen e2e job: I'm testing the message for validation issue now. |
I does not work. I deployed a deployment using pvc restored from snapshot, and tried to protect is. This is expected to fail since the dr policy is not enabled for flattening. I get empty messages for all conditions instead of the expected error message.
|
Changes to reflect message in condition to include errors from the RPCs is good, moves it closer to the condition that faced the error. This still does not catch certain errors as @yati1998 pointed out, e.g MirrorEnable returns the -22 and no description, so we miss the context further that the error was actually something like this:
Which is being addressed in ceph/ceph-csi#4941 we may need other corner cases as it may arise to report better failures/errors. Otherwise the move to report known errors and failures in messages with this PR looks good. (not adding a +1 as I am not looking at specific of the change overall) |
Hey @ShyamsundarR this error was updated further, I think you are using older version of cephcsi, due to which it seems just the error code is returned. https://github.com/ceph/ceph-csi/pull/4678/files#diff-d3713dcbb075d877c69230700478ec60c5fe029d742dae0f72499d9f6833617f |
Quite possible, the builds in use were downstream and possibly ones in release-4.16. Overall, the new PR to improve the error message from Ceph-CSI is helpful and relates to this PR, hence cross posting it here. |
The test was wrong, we used the manifests form 0.10.0. Testing again with the manifests from this branch and a new version of the image: |
Works now with fixed ramen change:
The duplicate .Status.message can be eliminating, but maybe someone is depending on this? |
Yes we can't eliminate that. |
since @nirs has also tested the change is statisfied with the current modifications, I would request a final review on this. |
Use a development build of csi-addons adding .Message to the all conditions: csi-addons/kubernetes-csi-addons#691 This is only needed for testing, I'll remove this before merging. Signed-off-by: Nir Soffer <[email protected]>
Based on the messages addded in csi-addons/kubernetes-csi-addons#691. We want to propagate the error messages to the protected pvcs conditions. Signed-off-by: Nir Soffer <[email protected]>
this commit updates the volumereplication conditions to include descriptive message for every operations Signed-off-by: yati1998 <[email protected]>
@nirs @nixpanic @ShyamsundarR can you please give your final review here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks awesome, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Based on the messages addded in csi-addons/kubernetes-csi-addons#691. We want to propagate the error messages to the protected pvcs conditions. Signed-off-by: Nir Soffer <[email protected]>
Based on the messages addded in csi-addons/kubernetes-csi-addons#691. We want to propagate the error messages to the protected pvcs conditions. Signed-off-by: Nir Soffer <[email protected]>
this commit updates the Volume Replication(VR) conditions to
include descriptive message for every
operations
Example of the VR status: