Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design for Handling backup of volumes by resources filters #5773

Merged
merged 4 commits into from
Mar 10, 2023

Conversation

qiuming-best
Copy link
Contributor

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #(issue)
#5035

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@github-actions github-actions bot added the Area/Design Design Documents label Jan 17, 2023
@qiuming-best qiuming-best changed the title Design for Handle backup of volumes by resources filters Design for Handling backup of volumes by resources filters Jan 17, 2023
@qiuming-best qiuming-best marked this pull request as draft January 17, 2023 07:18
@codecov-commenter
Copy link

codecov-commenter commented Jan 17, 2023

Codecov Report

Merging #5773 (516be60) into main (598333d) will increase coverage by 0.23%.
The diff coverage is n/a.

❗ Current head 516be60 differs from pull request most recent head 7759424. Consider uploading reports for the commit 7759424 to get more accurate results

@@            Coverage Diff             @@
##             main    #5773      +/-   ##
==========================================
+ Coverage   40.67%   40.90%   +0.23%     
==========================================
  Files         243      248       +5     
  Lines       21035    21554     +519     
==========================================
+ Hits         8555     8817     +262     
- Misses      11857    12100     +243     
- Partials      623      637      +14     
Impacted Files Coverage Δ
pkg/util/kube/predicate.go 49.09% <0.00%> (-16.77%) ⬇️
pkg/restore/change_pvc_node_selector.go 62.50% <0.00%> (-4.97%) ⬇️
pkg/persistence/object_store.go 53.09% <0.00%> (-3.10%) ⬇️
pkg/podvolume/restorer.go 9.09% <0.00%> (-2.48%) ⬇️
internal/storage/storagelocation.go 78.78% <0.00%> (-1.77%) ⬇️
pkg/install/resources.go 51.13% <0.00%> (-1.19%) ⬇️
pkg/plugin/framework/action_resolver.go 5.00% <0.00%> (-0.76%) ⬇️
pkg/controller/pod_volume_backup_controller.go 38.74% <0.00%> (-0.41%) ⬇️
pkg/controller/pod_volume_restore_controller.go 25.00% <0.00%> (-0.40%) ⬇️
pkg/controller/backup_controller.go 55.64% <0.00%> (-0.20%) ⬇️
... and 28 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@qiuming-best qiuming-best force-pushed the volumes-filter-design branch 2 times, most recently from 063b507 to a388d6c Compare January 17, 2023 07:35
@qiuming-best qiuming-best force-pushed the volumes-filter-design branch 2 times, most recently from 943f15c to 7f44611 Compare January 17, 2023 13:27
@qiuming-best
Copy link
Contributor Author

hi @pradeepkchaturvedi, Here is one draft design pr for handling backup volume by resources filters. If you have any opinion, please comment on it.

@qiuming-best qiuming-best force-pushed the volumes-filter-design branch from 7f44611 to 7833ea5 Compare January 18, 2023 03:29
design/handle-backup-of-volumes-by-resources-filters.md Outdated Show resolved Hide resolved
- option to skip folders

### B. Accurately select target volume to backup
Some volumes are only used for logging while others volumes are for DBs, and only need to backup DBs data. users need `one specific resource seletor` to accurately select target volumes to backup.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we support that if these two PVs have the same storage class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For those PVs has the same attributes and are hard to distinguish, we could use GVRN ways to accurately select PV to backup or skip. Which would be a group of combinations of groupResource + namespacedNames

design/handle-backup-of-volumes-by-resources-filters.md Outdated Show resolved Hide resolved
design/handle-backup-of-volumes-by-resources-filters.md Outdated Show resolved Hide resolved
@qiuming-best qiuming-best force-pushed the volumes-filter-design branch 9 times, most recently from dfe780e to 641b48e Compare February 15, 2023 07:43
#### Versioning
Here we introduced the version field in the YAML config to contain break changes in case of some compatibility problems:
```yaml
version: v1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason to choose to ConfigMap over a new CRD for resource policy? CRD would allow API versioning of the policy resources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added more details to the design now.

the reason we choose Configmap is:

CRD is more like one kind of resource with status, Kubernates API Server handles the lifecycle of a CR and handles it in different statuses. Compared to CRD, Configmap is more focused to store data. And we only want store policies, so Configmap is more suitable.

CRD would allow API versioning also needs our Velero codes to explain it, so it may be no different to use Configmap, we don't change the version of Configmap, but maintain the version in data part of Configmap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did think about adding CR, but favor CM over CR for 2 reasons:

  1. We don't want to introduce the full set of subcommands immediately to manage the lifecycle of these CRs, we may make it a CR in the future if this approach is verified.
  2. There's also debate that, this piece of data is purely for reference, we don't want to introduce a controller to reconcile the status if we make it a CR.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status subresource is optional. Policy engines like OPA and Kyverno use CRD to represent policies. With CRD, at least users are familiar with the corresponding K8s versioning scheme.

Comment on lines +98 to +118
- conditions:
csi:
driver: aws.efs.csi.driver

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there will be a catch-all condition that we can use at the end? For example if user wants to skip all PVs except storage class abc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whether should we have a catch-all condition is depend on user configuration. If users configured it, velero would support it.

The current situation would be the ignored volumes will not be backed up.
suppose we have volumes A, B, C, and D, through all kinds of resources filters, A and B are defined to include, D is excluded, Velero will backup volumes A and B, and D won't be backed up. the ignored C will not be backed up by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that's a good enhancement, thanks @rnarenpujari

cc @qiuming-best

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current situation would be the ignored volumes will not be backed up.
suppose we have volumes A, B, C, and D, through all kinds of resources filters, A and B are defined to include, D is excluded, Velero will backup volumes A and B, and D won't be backed up. the ignored C will not be backed up by default.

OK skip was not a good example. If we want to snapshot storage class abc but use restic for everything else, then a catch-all/wildcard condition at the end would be nice to have I think.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume in that case, you just defined 2 conditions; one for storage class abc with volume-snapshot, and the wildcard condition with restic.

@qiuming-best how would one represent a policy with wildcard conditions?

design/handle-backup-of-volumes-by-resources-filters.md Outdated Show resolved Hide resolved
@ihcsim
Copy link

ihcsim commented Feb 15, 2023

@reasonerjt @qiuming-best I am not convinced that we need to special-case the filtering of persistent volumes. The introduction of a non-K8s API and corresponding versioning scheme to represent volume policies feels a bit much. Feels like all the conditions described in the proposal can be represented using the new include/exclude resource filters in #5838, plus the field selector proposal at #5842. With the exception of the proposed non-skip action types, the rest of the volume policies are really just about filtering resources, right? Why not try to integrate with the new filtering mechanism in #5838?

Can you elaborate a bit more on the file-system-backup and volume-snapshot action type and their parameters? Feels like this is something that can be expressed using the existing CRD, independent of the filtering mechanism.

The field selector mechanism proposed in #5842 uses the fields.Fields to represent the filter attributes. It is possible to provide our implementation of fields.Selector to satisfy the conditions example in this proposal. Then the Backup resource can look something like:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-backup
  namespace: velero
spec:
  includedClusterScopeResources:
  - groupResource:
    group: "v1"
    resource: "persistentvolumes"
    selectors:
      capacity: "0,100Gi"
      csi:
        driver: aws.ebs.csi.driver
        fsType: ext4
      storageClass:
      - gp2
      - ebs-sc
  excludedClusterScopeResources:
  - groupResource:
      group: "v1"
      resource: "persistentvolumes"
      selectors:
        csi:
          driver: aws.efs.csi.driver
  - groupResource:
      group: "v1"
      resource: "persistentvolumes"
      selectors:
        nfs: {}

Building on top of #5838, I believe this approach can significantly reduce the implementation and maintenance scope, where we just need to focus on the handling and conversion of the field selectors. (K8s only supports a handful of field selectors on the server-side.)

Even if we think the proposed volume policy is necessary, I don't think #5842 contradicts this proposal. It addresses issues like #5152 and #5118 which this proposal doesn't. Based on our discussion on the community call, I would like to get a sense of how this policy can be extended to include other cluster-scoped resources like CRDs, APIService, ClusterRole, filtering by their names. I suspect we might end up with supporting different sets of conditions for different resources (unlike field selectors where K8s provide a sensible default).

@qiuming-best qiuming-best force-pushed the volumes-filter-design branch 5 times, most recently from cfcf707 to bcb501b Compare February 16, 2023 08:55
@reasonerjt
Copy link
Contributor

@ihcsim
Thanks for the comment.

The reason there's the action and parameters fields is that we wanna make sure there're flexibilities so that the user can choose to do more actions against them in addition to include/exclude. For example in one backup, for volumes provisioned via storageClass A, with certain capacities, the user may wanna choose to use file-system-back approach (restic/kopia) to backup the data, and for some volumes that meet certain criteria, the user wanna choose to take snapshots, and some certain volumes will be skipped, etc.

The fieldSelector in #5842 may be sufficient for filtering resources based on names, but as for PV there are quite a few fields, and nested structures in the spec. In our internal discussion, we found matching the values of the fields does not satisfy the use case, for example, capacity is a range in the policy data, and to apply the policy against a PV we do not simply check if the value in the selector is the same as in the spec, therefore we introduce the condition which does not have to be a 1:1 mapping to each field in the spec.

I agree the policy proposed in the PR does not contradict #5842. I was thinking we may extend the policy to add resourcePolicies which support more general conditions like type: "customresourcedefinitions", name: "name1, name2". I slightly prefer not to add too many fields into the spec of backup CR, but I'm open for discussion in this part.

// VolumePolicy defined policy to conditions to match Volumes and related action to handle matched Volumes
type VolumePolicy struct {
// Conditions defined list of conditions to match Volumes
Conditions map[string]interface{} `yaml:"conditions"`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our internal discussion, we found matching the values of the fields does not satisfy the use case, for example, capacity is a range in the policy data, and to apply the policy against a PV we do not simply check if the value in the selector is the same as in the spec, therefore we introduce the condition which does not have to be a 1:1 mapping to each field in the spec.

@reasonerjt As proposed, the field selector approach can be extended (client-side) to work like this. AIUI, there aren't ways to do server-side filtering for fields like capacity, right? We will have to retrieve all the PVs and then apply the filters on the client-side. If that's true, the underlying implementation will be the same regardless of whether it's fields.Fields, field.Selector, or Conditions in the API. (Fields like name, namespace and status will work without additional implementation because K8s know how to filter by those fields.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting we should introduce a 1:1 mapping to the PV, or put this chunk of data into backup CR?

IMO, 1:1 mapping may be sufficient but a more flexible condition map can be easier to create and maintain, I don't see a very strong reason we must keep this 1:1 mapping to the fileds of PV resource.

Comment on lines +98 to +118
- conditions:
csi:
driver: aws.efs.csi.driver
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume in that case, you just defined 2 conditions; one for storage class abc with volume-snapshot, and the wildcard condition with restic.

@qiuming-best how would one represent a policy with wildcard conditions?

#### Versioning
Here we introduced the version field in the YAML config to contain break changes in case of some compatibility problems:
```yaml
version: v1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status subresource is optional. Policy engines like OPA and Kyverno use CRD to represent policies. With CRD, at least users are familiar with the corresponding K8s versioning scheme.


Volume policies are a list and if the target volumes match the first policy, the latter will be ignored, which would reduce the complexity of matching volumes especially when there are multiple complex volumes policies.

#### Action
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason there's the action and parameters fields is that we wanna make sure there're flexibilities so that the user can choose to do more actions against them in addition to include/exclude.

@reasonerjt Seems to me there is potentially a "filtering" and a "grouping" requirements. If grouping is the main use case where we want to apply certain actions to a group of volumes, I wonder if a volume group CRD might be clearer and more structured than a data in a ConfigMap. The Backup CR can reference the volume groups, instead of the ConfigMap.

If grouping and filtering can be tackled separately, the Backup CRD spec already has filter-related fields. We can change it from []string to a struct to fulfill the Conditions requirement. We should keep #5842, which goes with #5838, separated from this proposal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not about grouping, it's about variety of actions, we wanna support actions other than "include/exclude".

As for putting the filter in or out of Backup CR, we wanna put it out of the CR for two reasons:

  1. Control the size of backup CR. We've been continuously adding new fields to the CRD to add new features, despite it's already v1, I hope we can reduce such changes, and will open another issue to discuss the v1 -> v2 transformation of velero CRDs.
  2. Make the policy re-usable across backups. There has been some requirement from other adopters that they wanna introduce a policy CR and use them with different backups, this is a step towards that direction.

Copy link
Contributor Author

@qiuming-best qiuming-best Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"include/exclude" could not further policy distinct when one backup both have volume-snapshot and file-system-backup, so grouping does not work, we introduced actions way

Copy link

@ihcsim ihcsim Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Policy re-use sounds like a good use case to me. If the ultimate goal is a policy CR, why not start with a CR right from the start? As for not adding new fields to backup, I think you will have to add something to it anyway, even if it's just a ref (or a list of refs), right?

Regardless, can we keep and review #5842, and possibly add it to #5838, as I feel like it's different from the policy concept in this proposal? Thanks.

@qiuming-best qiuming-best force-pushed the volumes-filter-design branch 2 times, most recently from 7f9ec68 to b11b839 Compare March 1, 2023 03:27
reasonerjt
reasonerjt previously approved these changes Mar 1, 2023
Copy link
Contributor

@reasonerjt reasonerjt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @qiuming-best for putting this together.

Copy link
Collaborator

@shubham-pampattiwar shubham-pampattiwar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a few comments

// Action defined as one action for a specific way of backup
type Action struct {
// Type defined specific type of action, it could be 'file-system-backup', 'volume-snapshot', or 'skip' currently
Type string `yaml:"type"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiuming-best @reasonerjt If we support only a handful type of actions shouldn't the Action.Type be of some struct instead of string ? like:

type VolumeActionType string
const Skip VolumeActionType = "skip"
const VolumeSnapshot VolumeActionType = "volume-snapshot" etc

type Action struct {
    Type VolumeActionType 
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll modify it

## Implementation
This implementation should be included in Velero v1.11.0

Currently, in Velero v1.11.0 we only support `Action`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add backup level Validation where in we check whether the specified action in the policy is supported by velero install instance or not. For example, If the user specified fs-backup but the velero install does not have restic or kopia then in that case we could fail the backup even before we start processing it ? WYGT ?

Copy link
Contributor Author

@qiuming-best qiuming-best Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

users could config many kinds of Action Type, they can config both fs-backup and volume-snapshot types of Action. When doing the backup for one specific volume, we will first check which types of Action the volume matched, if all policies are validated and returned it could use fs-backup and volume-snapshot both ok, and the current backup is through CSI plugin to take a snapshot, it will take a snapshot for volume-snapshot matched even though users have configured anther fs-backup for it.

If we add lots of validation related to the installation or environments it will make our validation complicated, and some redundant configurations by users are also acceptable.

- backup.velero.io/must-include-additional-items

So it should be careful with the combination of volumes resources policies and the above resources filters.
- When volume resource policies conflict with the above resource filters, we should respect the above resource filters. For example, if the user used the opt-out approach to `backup.velero.io/backup-volumes-excludes` annotation on the pod and also defined include volume in volumes resources filters configuration, we should respect the opt-out approach to skip backup of the volume.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ Also, please be sure to add this in the velero documentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Signed-off-by: Ming <[email protected]>
Copy link
Collaborator

@shubham-pampattiwar shubham-pampattiwar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiuming-best qiuming-best merged commit 54042c3 into vmware-tanzu:main Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants