Volume Snapshot Data Movement Design #5968

Lyndon-Li · 2023-03-08T04:46:07Z

Add the design for Volume Snapshot Data Movement

codecov-commenter · 2023-03-08T04:57:05Z

Codecov Report

Merging #5968 (d58d187) into main (94fec66) will decrease coverage by 0.19%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #5968      +/-   ##
==========================================
- Coverage   39.94%   39.75%   -0.19%     
==========================================
  Files         254      256       +2     
  Lines       22361    23237     +876     
==========================================
+ Hits         8932     9239     +307     
- Misses      12773    13300     +527     
- Partials      656      698      +42

Impacted Files	Coverage Δ
pkg/backup/backup.go	`52.58% <0.00%> (-26.30%)`	⬇️
pkg/backup/item_collector.go	`54.01% <0.00%> (-5.87%)`	⬇️
...kupitemaction/v2/restartable_backup_item_action.go	`61.64% <0.00%> (-3.58%)`	⬇️
pkg/controller/download_request_controller.go	`68.22% <0.00%> (-2.08%)`	⬇️
...k/backupitemaction/v2/backup_item_action_server.go	`31.29% <0.00%> (-1.25%)`	⬇️
internal/hook/item_hook_handler.go	`87.11% <0.00%> (-0.81%)`	⬇️
pkg/controller/backup_controller.go	`55.15% <0.00%> (-0.73%)`	⬇️
pkg/persistence/object_store.go	`52.63% <0.00%> (-0.56%)`	⬇️
pkg/cmd/server/server.go	`6.14% <0.00%> (-0.49%)`	⬇️
pkg/backup/request.go	`100.00% <0.00%> (ø)`
... and 11 more


reasonerjt · 2023-03-14T08:53:10Z

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

+For backup, we intend to create an extensible architecture for various snapshot types, snapshot accesses and data accesses. For example, the snapshot specific operations are isolated in the Data Mover Plugin and the Exposer. In this way, we only need to change these two modules for variations. Likewise, the data access details are isolated into uploaders, so different uploaders could be plugged into the workflow smoothly.  
+
+For restore, we intend to create a generic workflow that could work for all backups. This means the restore is backup source independent. Therefore, for example, we can restore a CSI snapshot backup to another cluster with no CSI facilities or with a CSI driver different from that of the source cluster.  
+We still have the Exposer module for restore and it is to expose the target volume to the data path. Therefore, we still have the flexibility to introduce different ways to expose the target volume.  


In particular, in the diagram, it looks like the data mover controller should be responsible for creating the PV. This has to be clarified, i.e. the data mover provider will handle provisioning the PV and Velero will not do it.

Yes, the PV is created by the data mover. This is clarified in the Create Target PV section. I have added one more line to clarify that Velero should not create the PV if data movement restore is involved.

reasonerjt · 2023-03-14T08:53:56Z

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

+
+## Components
+**Velero**: Velero controls the backup/restore workflow, it calls BIA/RIA V2 to backup/restore an object that involves data movement, specifically, a PVC or a PV.  
+**BIA/RIA V2**: BIA/RIA V2 are the protocols between Velero and the data mover plugins. They support asynchronous operations so that a Velero backup/restore is not marked as completed until the data movement is done; in the meantime, Velero is free to process other backups during the data movement.  


Not sure if we need to separate the BIA v2 and DMP.

Categorizing the CSI plugin as a Data Mover Plugin doesn't sound quite right to me.

Here BIA/RIA V2 means the interface/protocol and framework, and Data Mover Plugin means the module that implements the interface.
The CSI plugin may not be the only Data Mover Plugin. For example, the volume snapshotter could be integrated with data movement in the future as another Data Mover Plugin.

reasonerjt · 2023-03-14T08:58:09Z

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

+Below is the restore workflow:  
+![restore-workflow.png](restore-workflow.png)  
+
+## Components


I suggest we separate the components that are only relevant to internal DM, which are not interesting to other DM providers.

Components like Node-Agent, Exposer, VGDP and Uploader seem to fall within the scope of the Velero internal Data Mover.

Yes, we should make clear which components are specific to the built-in DM. I have modified the doc to clarify this.

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

reasonerjt · 2023-03-14T09:14:03Z

I suggest we add a section that introduces the high-level interaction between the data movement controller and Velero, and then, in a separate section, zoom into the details of the internal data mover. That way other data mover providers can focus on the first section, and the contract is clearer.

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

blackpiglet · 2023-04-12T13:13:33Z

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

+
+**Acquire Object Lock**  
+**Release Object Lock**  
+There are multiple instances of Data Uploader Controllers and when a DUCR is created, only one of the instances should handle the CR.  


The Data Uploader Controllers should be implemented in the Node-Agent, so the CR should be handled by the Node-Agent pod that shares the same node as the uploading volume, then there is only one candidate, and there is no need to have the lock.

Actually, the Data Uploader Controller starts to process the DUCR as soon as it is created (its phase is New or ""), and then the controller creates the backupPVC. This means the early phases, before the backupPV is provisioned, need the lock to reach consensus on which controller is responsible for creating the backupPVC.

Lyndon-Li · 2023-04-20T11:58:32Z

@shubham-pampattiwar
After a discussion, we think the workflow for the data mover backup deletion is too complicated, since a DataUploadDelete CRD and controller need to be involved.
Therefore, I modified the workflow in the design. Please help to review it and let us know of any concerns or suggestions.

@reasonerjt @ywk253100 Please also help to review the same.

shubham-pampattiwar · 2023-04-20T17:02:37Z

@Lyndon-Li Can we not just use a DeleteItemAction plugin for DataUpload CRs? So whenever a backup is deleted the CRs also get deleted.

Lyndon-Li · 2023-04-21T01:49:02Z

@shubham-pampattiwar

> Can we not just use a DeleteItemAction plugin for DataUpload CRs? So whenever a backup is deleted the CRs also get deleted.

Firstly, the aim here is to delete the backup data stored in the backup repo, and only the specific DM knows where the backup data is and how to delete it. So the question is how we let the DM know that the backup is to be deleted.
Secondly, we cannot let the controller monitor the deletion event of DataUpload CRs, because kubectl delete DataUpload -n velero could also generate the same event, and we don't want to delete the backup data if a user runs this command by mistake.

Therefore, Velero needs a private mechanism to notify the DM when it is handling a deletebackup request; then the DM does its own work to delete the backup data.
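The distinction drawn here can be sketched as follows: a toy data mover that ignores the plain deletion of a CR and removes repository data only on an explicit request from Velero. The event names, the `DataMover` type and its methods are all hypothetical and are not taken from the design; the sketch only illustrates why the two signals must be told apart.

```go
package main

import "fmt"

// EventKind distinguishes the two signals discussed above (hypothetical names).
type EventKind int

const (
	// CRDeleted: the DataUpload CR was removed from the cluster,
	// for example by a manual "kubectl delete".
	CRDeleted EventKind = iota
	// BackupDeleteRequested: Velero explicitly notified the data mover
	// that the backup itself is being deleted.
	BackupDeleteRequested
)

// Event carries the signal and the backup it refers to.
type Event struct {
	Kind   EventKind
	Backup string
}

// DataMover is a toy data mover that tracks which backups still have data in the repo.
type DataMover struct {
	repo map[string]bool
}

// NewDataMover creates a data mover holding data for the given backups.
func NewDataMover(backups ...string) *DataMover {
	dm := &DataMover{repo: map[string]bool{}}
	for _, b := range backups {
		dm.repo[b] = true
	}
	return dm
}

// Handle deletes repository data only on an explicit request from Velero.
// A bare CR deletion is deliberately ignored.
func (dm *DataMover) Handle(e Event) {
	if e.Kind == BackupDeleteRequested {
		delete(dm.repo, e.Backup)
	}
}

// HasData reports whether the repo still holds data for the backup.
func (dm *DataMover) HasData(backup string) bool {
	return dm.repo[backup]
}

func main() {
	dm := NewDataMover("backup-1")

	dm.Handle(Event{Kind: CRDeleted, Backup: "backup-1"})
	fmt.Println(dm.HasData("backup-1")) // true: an accidental CR deletion is harmless

	dm.Handle(Event{Kind: BackupDeleteRequested, Backup: "backup-1"})
	fmt.Println(dm.HasData("backup-1")) // false: data removed on explicit request
}
```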

shubham-pampattiwar · 2023-04-21T16:54:23Z

> @shubham-pampattiwar
>
> > Can we not just use a DeleteItemAction plugin for DataUpload CRs? So whenever a backup is deleted the CRs also get deleted.
>
> Firstly, the aim here is to delete the backup data stored in the backup repo, and only the specific DM knows where the backup data is and how to delete it. So the question is how we let the DM know that the backup is to be deleted. Secondly, we cannot let the controller monitor the deletion event of DataUpload CRs, because kubectl delete DataUpload -n velero could also generate the same event, and we don't want to delete the backup data if a user runs this command by mistake.
>
> Therefore, Velero needs a private mechanism to notify the DM when it is handling a deletebackup request; then the DM does its own work to delete the backup data.

@Lyndon-Li Thanks for the elaborate explanation, I think I understand now, got confused with the deletion of in-cluster CRs vs Data in backup repository. The modified delete workflow looks sane to me 👍

Signed-off-by: Lyndon-Li <[email protected]>

reasonerjt

Let's approve it.

If we find that minor changes are required during the implementation, let's make sure they are also reflected in an incremental change to this doc.

reasonerjt · 2023-04-19T07:57:03Z

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

+**Node-Agent**: Node-Agent is an existing Velero module that will be used to host VBDM.  
+**Exposer**: Exposer is to expose the snapshot/target volume as a path/device name/endpoint that are recognizable by Velero generic data path. For different snapshot types/snapshot accesses, the Exposer may be different. This isolation guarantees that when we want to support other snapshot types/snapshot accesses, we only need to replace with a new Exposer and keep other components as is.  
+**Velero Generic Data Path (VGDP)**: VGDP is the collection of modules introduced in the [Unified Repository design][1]. Velero uses these modules to finish data transmission for various purposes. It includes uploaders and the backup repository.  
+**Uploader**: Uploader is the module in VGDP that reads data from the source and writes it to the backup repository for backup, and reads data from the backup repository and writes it to the restore target for restore. At present, only the file system uploader is supported. In the future, the block level uploader will be added. For the file system uploader, only the Kopia uploader will be used; Restic will not be integrated with VBDM.  


Restic will not be integrated with VBDM.

We want to highlight this, because it means that if a user wants to use the data mover, Kopia is THE uploader for all fs-level backups.

reasonerjt · 2023-04-19T07:58:41Z

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

+
+**Velero**: Velero controls the backup/restore workflow, it calls BIA/RIA V2 to backup/restore an object that involves data movement, specifically, a PVC or a PV.  
+**BIA/RIA V2**: BIA/RIA V2 are the protocols between Velero and the data mover plugins. They support asynchronous operations so that a Velero backup/restore is not marked as completed until the data movement is done; in the meantime, Velero is free to process other backups during the data movement.  
+**Data Mover Plugin (DMP)**: DMP implements BIA/RIA V2 and invokes the corresponding data mover by creating the DataUpload/DataDownload CRs. DMP is also responsible for taking a snapshot of the source volume, so it is a snapshot type specific module. For CSI snapshot data movement, the CSI plugin could be extended as a DMP. This also means that the CSI plugin will fully implement BIA/RIA V2 and support some more methods like Progress, Cancel, etc.  


This is trivial, but theoretically a DMP may not have to be responsible for taking the snapshot.

For example, a developer may create a BIA plugin A to take snapshot X and return it as an additional item.
Then there's BIA plugin B to handle the snapshot X and move it.

This requirement could be met technically. The implementation would look like this:

The DMP we are talking about in this design exposes a BIA for a PVC, which takes a snapshot of the PVC and then submits a DUCR to launch a DM.

Another kind of plugin (whether we call it a DMP or not) could expose a BIA for a VS (wherever the VS comes from). This kind of DMP doesn't take a snapshot but directly launches a DM.

The current design doesn't cover the second case. If required, we can add it in an incremental design.

reasonerjt · 2023-04-20T08:28:56Z

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

+Besides ```additionalItem``` (as the 2nd return value), the Execute method will return one more resource list called ```itemToUpdate```, which means the items to be updated and persisted when the async operation completes. For details, visit [general progress monitoring design][2].   
+Specifically, this mechanism will be used to persist DUCR into the persisted backup data, in other words, DUCR will be returned as ```itemToUpdate``` from the Execute method. DUCR contains all the information the restore requires, so during restore, DUCR will be extracted from the backup data.  
+Additionally, in the same way, a DMP could add any other items into the persisted backup data.  
+The Execute method also returns the ```operationID```, which uniquely identifies the asynchronous operation. This ```operationID``` is generated by plugins. The [general progress monitoring design][2] doesn't restrict the format of the ```operationID```; for the Velero CSI plugin, the ```operationID``` is a combination of the backup CR UID and the source PVC (represented by the ```item``` parameter) UID.  


So this means there's re-try for the upload within one backup?

Actually, the progress monitoring design doesn't have a retry mechanism itself, so whether or not a retry happens is decided by the DM. At present, VBDM doesn't have a retry mechanism.

shubham-pampattiwar

LGTM ! Thank you @Lyndon-Li

jglick · 2023-07-12T20:14:39Z

design/volume-snapshot-data-movement/volume-snapshot-data-movement.md

+
+
+[1]: ../unified-repo-and-kopia-integration/unified-repo-and-kopia-integration.md
+[2]: ../general-progress-monitoring.md


Broken link; now at https://github.com/vmware-tanzu/velero/blob/main/design/Implemented/general-progress-monitoring.md (missing implemented/ infix)

Thanks for noticing this.
I tried to add the Implemented infix, but finally realized that we cannot do this. At the release time of 1.12, we will also move this design to the Implemented folder. We will do the same for the unified-repo-and-kopia-integration folder (we missed doing so in the 1.11 release), and then all the links will be fixed.

Therefore, let's leave it as is for now; it will fix itself at release time.

Lyndon-Li force-pushed the velero-data-movement-design branch from d80d1b6 to c45a735 Compare March 8, 2023 04:47

github-actions bot added the has-changelog label Mar 8, 2023

Lyndon-Li force-pushed the velero-data-movement-design branch 2 times, most recently from 0913403 to c909c1d Compare March 8, 2023 04:53

Lyndon-Li force-pushed the velero-data-movement-design branch from c909c1d to 5dfc612 Compare March 8, 2023 05:07

Lyndon-Li self-assigned this Mar 8, 2023

Lyndon-Li force-pushed the velero-data-movement-design branch 7 times, most recently from 35241d3 to 7b44c0c Compare March 10, 2023 03:38

Lyndon-Li marked this pull request as ready for review March 10, 2023 03:51

github-actions bot requested review from blackpiglet and ywk253100 March 10, 2023 03:51

Lyndon-Li requested review from reasonerjt, qiuming-best, shubham-pampattiwar and sseago March 10, 2023 03:51

Lyndon-Li force-pushed the velero-data-movement-design branch 3 times, most recently from f3b503e to 8dc8fa6 Compare March 14, 2023 09:05

reasonerjt reviewed Mar 14, 2023


Lyndon-Li force-pushed the velero-data-movement-design branch 2 times, most recently from f6c8b5a to 4e4950a Compare March 14, 2023 10:56

yanji09 requested changes Mar 20, 2023


Lyndon-Li force-pushed the velero-data-movement-design branch from 4e4950a to 800de5f Compare March 20, 2023 03:51

Lyndon-Li force-pushed the velero-data-movement-design branch 4 times, most recently from 325b710 to 6f827fc Compare March 29, 2023 04:21

Lyndon-Li force-pushed the velero-data-movement-design branch 3 times, most recently from 354a2e9 to 07fef22 Compare April 10, 2023 04:10

This was referenced Apr 12, 2023

Volume Snapshot Data Movement - CRD changes #6112

Closed

[Epic] Volume Snapshot Data Movement #6113

Closed

blackpiglet reviewed Apr 12, 2023


Lyndon-Li force-pushed the velero-data-movement-design branch 2 times, most recently from 38b511a to df32c43 Compare April 20, 2023 11:55

Lyndon-Li force-pushed the velero-data-movement-design branch from df32c43 to 8e5a6a4 Compare April 20, 2023 12:00

Lyndon-Li mentioned this pull request Apr 28, 2023

Add data mover CRD under v2alpha1 #6176

Merged

kaovilai mentioned this pull request May 2, 2023

OADP-1668: Volumesnapshot related CR’s namely Volumesnapshot and VolumeSnapshotcontent are not being included by the OADP Version of Velero server in the backup bundle openshift/oadp-operator#974

Closed

Lyndon-Li mentioned this pull request May 8, 2023

Data Mover restore integrate to VolumePopulator #6239

Closed

velero data movement design

dd40f7b

Signed-off-by: Lyndon-Li <[email protected]>

Lyndon-Li force-pushed the velero-data-movement-design branch from 8e5a6a4 to dd40f7b Compare May 16, 2023 10:43

reasonerjt approved these changes May 23, 2023


sseago approved these changes May 24, 2023


shubham-pampattiwar approved these changes May 24, 2023


yanji09 approved these changes May 24, 2023


Lyndon-Li merged commit 3ad091d into vmware-tanzu:main May 24, 2023

Lyndon-Li mentioned this pull request Jul 11, 2023

Design doc for data movement layer #4112

Closed

jglick reviewed Jul 12, 2023


Lyndon-Li mentioned this pull request Aug 20, 2024

Data Mover - Low level integration with 3rd data movers #8130

Open
