-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve log during restore PVs and PVS for VolRep to reflect the core error #1460
base: main
Are you sure you want to change the base?
Conversation
controllers/vrg_volrep.go
Outdated
@@ -1877,6 +1878,8 @@ func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error) | |||
// Restore all PVs found in the s3 store. If any failure, the next profile will be retried | |||
pvCount, err = v.restorePVsFromObjectStore(objectStore, s3ProfileName) | |||
if err != nil { | |||
v.log.Error(err, fmt.Sprintf("error fetching PV cluster data from S3 profile %s", s3ProfileName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the failing function name, this should be: "cannot restore PVs from s3 store $q"
Since we retry using the next profile, is this really an error or we can use INFO level message?
(We should really use WARNING log level but the logger we used does not have that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to warning, added new context
controllers/vrg_volrep.go
Outdated
@@ -1889,7 +1892,13 @@ func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error) | |||
// Ignoring PVC restore errors helps with the upgrade from ODF-4.12.x to 4.13 | |||
pvcCount, err = v.restorePVCsFromObjectStore(objectStore, s3ProfileName) | |||
|
|||
if err != nil || pvCount != pvcCount { | |||
if err != nil { | |||
v.log.Error(err, fmt.Sprintf("error fetching PVC cluster data from S3 profile %s", s3ProfileName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about the message and log level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to warning, added new context
controllers/vrg_volrep.go
Outdated
continue | ||
} | ||
|
||
if pvCount != pvcCount { | ||
v.log.Info(fmt.Sprintf("Warning: Mismatch in PV/PVC count %d/%d (%v)", | ||
pvCount, pvcCount, err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use the same style to do WARNING level messages for pv/pvc restore failures - this is the same use case - we try to use one profile and something failed, so we move to next profile.
Using log.Error() we get a traceback and I'm not sure it is helpful when expected errors and the error message is not enough.
controllers/vrg_volrep.go
Outdated
@@ -1847,6 +1847,7 @@ func (v *VRGInstance) restorePVsAndPVCsForVolRep(result *ctrl.Result) (int, erro | |||
return count, nil | |||
} | |||
|
|||
//nolint:funlen | |||
func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error) { | |||
err := errors.New("s3Profiles empty") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This err will contain the last error calling one of:
- v.reconciler.ObjStoreGetter.ObjectStore()
- v.restorePVsFromObjectStore()
- v.restorePVCsFromObjectStore()
If we fail to restore pvs, pvcs or pvcount/pvccount do not match, we return this err at line 1919.
Since we log all errors during the loop, this change does not fix the issue of logging the same error twice.
I think this function need a rewrite - it is too long and it is too hard to get error handling right with some much code.
I think extracting a helper for restoring from single s3 profile will make it much easier to handle errors.
The helper for restoring from single s3 store can return error early:
func (v *VRGInstance) restorePVsAndPVCsFromS3Profile(profileName string) error {
get object store...
if failed:
return err
restore pvs...
if failed:
return err
restore pvcs...
if failed:
return err
check pvs/pvcs counts...
if failed:
return err
return nil
}
No logs are needed in this function - it just return an error. The function for restoring from all s3 stores can call it twice, logging error for each call.
If we failed to restore from both profiles, we need to return a new error about failing to restore from both profiles, since we already logged the error for every s3 profile.
controllers/vrg_volrep.go
Outdated
@@ -1913,19 +1922,13 @@ func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error) | |||
func (v *VRGInstance) restorePVsFromObjectStore(objectStore ObjectStorer, s3ProfileName string) (int, error) { | |||
pvList, err := downloadPVs(objectStore, v.s3KeyPrefix()) | |||
if err != nil { | |||
v.log.Error(err, fmt.Sprintf("error fetching PV cluster data from S3 profile %s", s3ProfileName)) | |||
|
|||
return 0, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add context to the error?
return 0, fmt.Errorf("error restoring PVs from profile %q: %w", s3ProfileName, err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nirs I am adding context in the caller function. It is now the same in both restorePVsFromObjectStore and restorePVCsFromObjectStore - these functions are returning core error.
controllers/vrg_volrep.go
Outdated
v.log.Error(err, fmt.Sprintf("Resolve PV conflict in the S3 store %s to deploy the application", s3ProfileName)) | ||
|
||
return 0, fmt.Errorf("%s: %w", errMsg, err) | ||
return 0, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add context?
return 0, fmt.Errorf("error restoring PVs from profile %q: %w", s3ProfileName, err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added context in caller function
controllers/vrg_volrep.go
Outdated
@@ -1934,8 +1937,6 @@ func (v *VRGInstance) restorePVsFromObjectStore(objectStore ObjectStorer, s3Prof | |||
func (v *VRGInstance) restorePVCsFromObjectStore(objectStore ObjectStorer, s3ProfileName string) (int, error) { | |||
pvcList, err := downloadPVCs(objectStore, v.s3KeyPrefix()) | |||
if err != nil { | |||
v.log.Error(err, fmt.Sprintf("error fetching PVC cluster data from S3 profile %s", s3ProfileName)) | |||
|
|||
return 0, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, context for the error may help.
Maybe we get good enough errors - the best way to tell is to simulate failure and look at the logs with the errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding context in the caller function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you share logs showing real failures with this change?
We can simulate issues by deleting pvs/pvcs from s3 store (e.g using mc) or making the s3 storage inaccessible (e.g. scale down minio deployment).
controllers/vrg_volrep.go
Outdated
@@ -1877,6 +1878,8 @@ func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error) | |||
// Restore all PVs found in the s3 store. If any failure, the next profile will be retried | |||
pvCount, err = v.restorePVsFromObjectStore(objectStore, s3ProfileName) | |||
if err != nil { | |||
v.log.Info(fmt.Sprintf("Warning: cannot restore PVs from s3 profile %s (%v)", s3ProfileName, err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! new logs are consistent and provide good context for the actual error.
@nirs the real problem is not in PV/PVC restore, but in underline failure (for example, Kube objects group recover error), but inside function restorePVsAndPVCsFromS3 we were suppressed the real error and printed general error. My fix is adding real error message to be propagated to the top |
… error Signed-off-by: Elena Gershkovich <[email protected]>
During restore PVs and PVCs for VolRep we are not reflecting the real errors, but printing more general ones. This PR is fixing the problem, now we are able to see, for example, if failure is due to error in restoring k8s resources or due to error in restoring PV/PVCs.
Fixes: #1452