Make data mover fail early #7052

Merged
1 commit merged on Dec 5, 2023

Conversation

qiuming-best
Contributor

Fixes #6562

codecov bot commented Nov 2, 2023

Codecov Report

Attention: 32 lines in your changes are missing coverage. Please review.

Comparison is base (6ac7ff1) 61.70% compared to head (03dff10) 61.77%.
Report is 3 commits behind head on main.

Files | Patch % | Lines
pkg/controller/data_download_controller.go | 44.82% | 15 Missing and 1 partial ⚠️
pkg/controller/data_upload_controller.go | 48.38% | 15 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7052      +/-   ##
==========================================
+ Coverage   61.70%   61.77%   +0.07%     
==========================================
  Files         258      259       +1     
  Lines       27740    27903     +163     
==========================================
+ Hits        17117    17238     +121     
- Misses       9420     9460      +40     
- Partials     1203     1205       +2     

} else if pod.Status.Phase == corev1api.PodPending {
	// Check the conditions for Pending reason to see if it's unschedulable
	for _, condition := range pod.Status.Conditions {
		if condition.Reason == corev1api.PodReasonUnschedulable {
Contributor

// PodReasonUnschedulable reason in PodScheduled PodCondition means that the scheduler
// can't schedule the pod right now, for example due to insufficient resources in the cluster.

I don't think we can always fail early in this case. If resources free up later, the pod would have started successfully, but we would have failed it unnecessarily.

Contributor Author

I removed the unschedulable check for now. We can add it back later for a more specific scenario.

@qiuming-best force-pushed the data-mover-fail-early branch from 481dc65 to 6d34ace on November 3, 2023 03:03
@qiuming-best
Contributor Author

To avoid misjudging pods under uncertain conditions, the function IsPodUnrecoverable has been left as it is for now. In the future, we can add more specific handling for cases we are more certain about.
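
For readers following the thread, here is a minimal sketch of what a conservative check of this kind could look like. It is not the code from this PR: the `Pod`, `ContainerStatus`, and `PodPhase` types are hypothetical stand-ins for the Kubernetes API types, and the choice of which states count as unrecoverable (a `Failed` or `Unknown` phase, or an image pull failure) is an assumption made for illustration. In line with the review above, a pod that is merely `Pending` is not treated as unrecoverable.

```go
package main

import "fmt"

// PodPhase and the types below are hypothetical stand-ins for the
// Kubernetes core/v1 types, reduced to the fields this sketch needs.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
	PodUnknown PodPhase = "Unknown"
)

type ContainerStateWaiting struct {
	Reason string
}

type ContainerStatus struct {
	Name    string
	Waiting *ContainerStateWaiting
}

type Pod struct {
	Namespace         string
	Name              string
	Phase             PodPhase
	ContainerStatuses []ContainerStatus
}

// isPodUnrecoverable reports whether the pod is in a state that this sketch
// assumes it will not leave on its own, together with a human-readable reason.
// A Pending pod (for example, one that is temporarily unschedulable) is
// deliberately NOT treated as unrecoverable.
func isPodUnrecoverable(pod Pod) (bool, string) {
	if pod.Phase == PodFailed || pod.Phase == PodUnknown {
		return true, fmt.Sprintf("pod is in abnormal phase %s", pod.Phase)
	}
	for _, cs := range pod.ContainerStatuses {
		if cs.Waiting == nil {
			continue
		}
		// Assumption: image pull failures are treated as unrecoverable here.
		if cs.Waiting.Reason == "ImagePullBackOff" || cs.Waiting.Reason == "ErrImageNeverPull" {
			return true, fmt.Sprintf("container %s cannot pull its image: %s", cs.Name, cs.Waiting.Reason)
		}
	}
	return false, ""
}

func main() {
	pods := []Pod{
		{Namespace: "velero", Name: "pending-pod", Phase: PodPending},
		{Namespace: "velero", Name: "failed-pod", Phase: PodFailed},
	}
	for _, p := range pods {
		unrecoverable, reason := isPodUnrecoverable(p)
		fmt.Printf("%s/%s unrecoverable=%t reason=%q\n", p.Namespace, p.Name, unrecoverable, reason)
	}
}
```

Returning the reason alongside the boolean lets the caller surface it in a status message instead of only logging it.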

err := UpdateDataUploadWithRetry(context.Background(), r.client, types.NamespacedName{Namespace: du.Namespace, Name: du.Name}, r.logger.WithField("dataupload", du.Name),
	func(dataUpload *velerov2alpha1api.DataUpload) {
		dataUpload.Spec.Cancel = true
		dataUpload.Status.Message = fmt.Sprintf("dataupload mark as cancel to failed early for exposing pod %s/%s is in abnormal status", pod.Namespace, pod.Name)
Contributor

It would be better to append the reason why the pod is unrecoverable to dataUpload.Status.Message, which helps with troubleshooting.
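
As an illustration of this suggestion, the sketch below builds a status message that carries the reason. The helper name `cancelMessage` and the exact wording of the message are hypothetical and are not taken from the PR. The `reason` argument is assumed to come from whatever check decided that the pod is unrecoverable.

```go
package main

import "fmt"

// cancelMessage is a hypothetical helper that formats the status message
// written when a data mover request is cancelled early. The reason is
// appended so that it is visible on the resource itself, not only in logs.
func cancelMessage(kind, podNamespace, podName, reason string) string {
	return fmt.Sprintf(
		"%s is marked as cancelled to fail early because exposing pod %s/%s is in abnormal status: %s",
		kind, podNamespace, podName, reason,
	)
}

func main() {
	fmt.Println(cancelMessage("dataupload", "velero", "du-pod", "pod is in abnormal phase Failed"))
}
```

The same helper would serve both the upload and download controllers, with only the `kind` argument differing.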

Contributor Author

updated

err := UpdateDataDownloadWithRetry(context.Background(), r.client, types.NamespacedName{Namespace: dd.Namespace, Name: dd.Name}, r.logger.WithField("datadownlad", dd.Name),
	func(dataDownload *velerov2alpha1api.DataDownload) {
		dataDownload.Spec.Cancel = true
		dataDownload.Status.Message = fmt.Sprintf("datadownload mark as cancel to failed early for exposing pod %s/%s is in abnormal status", pod.Namespace, pod.Name)
Contributor

It would be better to append the reason why the pod is unrecoverable to dataDownload.Status.Message, which helps with troubleshooting.

Contributor Author

updated

	}
	log.Debug("Exposed pod is in abnormal status, and datadownload is marked as cancel")
} else {
	log.Debug("Waiting for exposed pod running...")
Contributor

I suggest removing this log. It doesn't look necessary, and it will be printed many times. We should maintain the quality of the logs, even for debug logs.
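
To illustrate the concern about repeated messages: code that runs on every reconcile pass logs the same line each time unless something suppresses it. The PR simply removed the log. The sketch below shows a different option, logging only when the observed state changes. The `transitionLogger` type is hypothetical, is not part of Velero, and is not safe for concurrent use as written.

```go
package main

import "fmt"

// transitionLogger is a hypothetical helper that remembers the last state
// logged for each key and emits a message only when that state changes.
type transitionLogger struct {
	last map[string]string
	emit func(msg string)
}

func newTransitionLogger(emit func(msg string)) *transitionLogger {
	return &transitionLogger{last: make(map[string]string), emit: emit}
}

// Log emits "key: state" unless the same state was already logged for key.
// It returns true when a message was emitted.
func (l *transitionLogger) Log(key, state string) bool {
	if prev, ok := l.last[key]; ok && prev == state {
		return false
	}
	l.last[key] = state
	l.emit(fmt.Sprintf("%s: %s", key, state))
	return true
}

func main() {
	logger := newTransitionLogger(func(msg string) { fmt.Println(msg) })
	// Simulate several reconcile passes observing the same pod.
	for _, state := range []string{"Pending", "Pending", "Pending", "Running"} {
		logger.Log("velero/exposing-pod", state)
	}
}
```

Whether the extra bookkeeping is worth it depends on how useful the message is. For a message with little diagnostic value, removing it as suggested is the simpler choice.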

Contributor Author

updated

@reasonerjt added this to the v1.13 milestone on Dec 4, 2023
@reasonerjt removed their request for review on December 5, 2023 02:58
@qiuming-best merged commit 2fa785a into vmware-tanzu:main on Dec 5, 2023
24 of 25 checks passed
5 participants