Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 6913: Velero Built-in Datamover: Backup stucks in phase WaitingForPluginOperations when Node Agent pod gets restarted #6914

Merged
merged 1 commit into from
Oct 26, 2023

Conversation

shubham-pampattiwar
Copy link
Collaborator

Thank you for contributing to Velero!

Please add a summary of your change

Adds changes proposed here: #6913 (comment)

Does your change fix a particular issue?

Fixes #6913

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

sseago
sseago previously approved these changes Oct 3, 2023
@codecov
Copy link

codecov bot commented Oct 3, 2023

Codecov Report

Merging #6914 (ee271b7) into main (ad114f8) will increase coverage by 0.25%.
Report is 34 commits behind head on main.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##             main    #6914      +/-   ##
==========================================
+ Coverage   60.78%   61.04%   +0.25%     
==========================================
  Files         250      252       +2     
  Lines       26629    26855     +226     
==========================================
+ Hits        16187    16394     +207     
- Misses       9293     9306      +13     
- Partials     1149     1155       +6     
Files Coverage Δ
pkg/controller/data_download_controller.go 75.27% <100.00%> (ø)
pkg/controller/data_upload_controller.go 68.38% <66.66%> (-0.07%) ⬇️

... and 20 files with indirect coverage changes

Copy link
Member

@kaovilai kaovilai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be useful to future data movers to keep Canceling phase

pkg/apis/velero/v2alpha1/data_upload_types.go Outdated Show resolved Hide resolved
@sseago
Copy link
Collaborator

sseago commented Oct 4, 2023

@kaovilai The DU/DD controllers don't do anything with this phase, so I'm not sure what value there is in defining it as valid.

sseago
sseago previously approved these changes Oct 10, 2023
original := du.DeepCopy()
du.Status.Phase = velerov2alpha1api.DataUploadPhaseCanceling
Copy link
Contributor

@Lyndon-Li Lyndon-Li Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two cases:

  1. If the dataPath is still there (fsBackup is not nil), we cannot update the DUCR to cancelled directly. Instead, we need to notify the dataPath (by calling fsBackup.Cancel()), wait for it to cancel the data transfer and then it will call back OnDataUploadCancelled
  2. If the dataPath is not there, we need to cancel the DUCR in the main flow

Both of the cases must be supported. However:

  • For the existing code, case 2 is not supported
  • For the changes in this PR, case 1 will not be supported

So please reconstruct the code to support the both case

original := du.DeepCopy()
du.Status.Phase = velerov2alpha1api.DataUploadPhaseCanceling
du.Status.Phase = velerov2alpha1api.DataUploadPhaseCanceled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For case 2, instead of patching the DUCR directly here, let's call OnDataUploadCancelled instead, it does something more:

  • Update the completionTimeStamp
  • Cleanup intermediate object

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shubham-pampattiwar I think the updated version still doesn't handle case 2.
When there is no fsbackup, then we set Canceled directly, but when there is, we need to cancel the fsbackup rather than just setting the phase.

Copy link
Contributor

@Lyndon-Li Lyndon-Li Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shubham-pampattiwar
For 1, we need to call fsBackup.Cancel() and NOT set du.Status.Phase (the same as the original code).
For 2, we just need to call OnDataUploadCancelled

@Lyndon-Li
Copy link
Contributor

Please also rebase the code, the codespell error will disappear.

@Lyndon-Li
Copy link
Contributor

@shubham-pampattiwar As discussed, please also help to take care of the DataDownload part.

@shubham-pampattiwar
Copy link
Collaborator Author

@Lyndon-Li @sseago Added fix for data download controller too, PTAL, Thanks !

@shubham-pampattiwar
Copy link
Collaborator Author

@Lyndon-Li Updated the PR, Please take another look.

return ctrl.Result{}, nil
}

log.Info("Data download is being canceled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to line 290, to be similar to Data upload case?

original := dd.DeepCopy()
dd.Status.Phase = velerov2alpha1api.DataDownloadPhaseCanceling
dd.Status.Phase = velerov2alpha1api.DataDownloadPhaseCanceled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before fsRestore cancels itself, we should not set DD to DataDownloadPhaseCanceled. We need to wait fsRestore to cancel itself and then it calls OnDataDownloadCancelled.

So just rollback to the original code for this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, Updated the PR.

@@ -289,18 +289,17 @@ func (r *DataDownloadReconciler) Reconcile(ctx context.Context, req ctrl.Request
if dd.Spec.Cancel {
fsRestore := r.dataPathMgr.GetAsyncBR(dd.Name)
if fsRestore == nil {
r.OnDataDownloadCancelled(ctx, dd.GetNamespace(), dd.GetName())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a log here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-positioned the earlier log

original := du.DeepCopy()
du.Status.Phase = velerov2alpha1api.DataUploadPhaseCanceling
du.Status.Phase = velerov2alpha1api.DataUploadPhaseCanceled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as DD, rollback to the original code and wait fsBackup to cancel itself

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Shubham Pampattiwar <[email protected]>

add changelog file

Signed-off-by: Shubham Pampattiwar <[email protected]>

keep canceling phase const

Signed-off-by: Shubham Pampattiwar <[email protected]>

fix data download as well

Signed-off-by: Shubham Pampattiwar <[email protected]>

address PR feedback

Signed-off-by: Shubham Pampattiwar <[email protected]>

minor fixes

Signed-off-by: Shubham Pampattiwar <[email protected]>
@sseago sseago merged commit 1e0fc77 into vmware-tanzu:main Oct 26, 2023
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants