Velero restore completes with warning while restoring services and endpoints backup #6280

Closed
pavansokkenagaraj opened this issue May 17, 2023 · 8 comments · Fixed by #6315

Comments

@pavansokkenagaraj

What steps did you take and what happened:

In Velero v1.10.3 and v1.11.0, restoring a backup that contains a Deployment, a Service, and Endpoints completes with a warning:

time="2023-05-17T13:48:49Z" level=warning msg="Namespace test, resource restore warning: could not restore, Endpoints \"example-nginx\" already exists. Warning: the in-cluster version is different than the backed-up version." logSource="pkg/controller/restore_controller.go:509" restore=velero/qakotsregression-g5mw7

What did you expect to happen:
Velero restore completes without warning

The following information will help us better understand what's going on:
In Velero v1.11.0 and v1.10.3, the services resource was added to HighPriorities, which causes Services to be restored before Endpoints. When the Endpoints are subsequently restored, there is a conflict because the Subsets.Addresses of the Endpoints that Kubernetes creates for the new Service differ from the backed-up version.
Corresponding code:

var defaultRestorePriorities = restore.Priorities{
	HighPriorities: []string{
		"customresourcedefinitions",
		"namespaces",
		"storageclasses",
		"volumesnapshotclass.snapshot.storage.k8s.io",
		"volumesnapshotcontents.snapshot.storage.k8s.io",
		"volumesnapshots.snapshot.storage.k8s.io",
		"persistentvolumes",
		"persistentvolumeclaims",
		"serviceaccounts",
		"secrets",
		"configmaps",
		"limitranges",
		"pods",
		// we fully qualify replicasets.apps because prior to Kubernetes 1.16, replicasets also
		// existed in the extensions API group, but we back up replicasets from "apps" so we want
		// to ensure that we prioritize restoring from "apps" too, since this is how they're stored
		// in the backup.
		"replicasets.apps",
		"clusterclasses.cluster.x-k8s.io",
		"services",
	},
	LowPriorities: []string{
		"clusterbootstraps.run.tanzu.vmware.com",
		"clusters.cluster.x-k8s.io",
		"clusterresourcesets.addons.cluster.x-k8s.io",
	},
}

Whereas in Velero v1.10.2, the restore completed without warnings: Services and Endpoints were not in HighPriorities, so they were restored after the high-priority resources in alphabetical order (Endpoints before Services):

var defaultRestorePriorities = restore.Priorities{
	HighPriorities: []string{
		"customresourcedefinitions",
		"namespaces",
		"storageclasses",
		"volumesnapshotclass.snapshot.storage.k8s.io",
		"volumesnapshotcontents.snapshot.storage.k8s.io",
		"volumesnapshots.snapshot.storage.k8s.io",
		"persistentvolumes",
		"persistentvolumeclaims",
		"serviceaccounts",
		"secrets",
		"configmaps",
		"limitranges",
		"pods",
		// we fully qualify replicasets.apps because prior to Kubernetes 1.16, replicasets also
		// existed in the extensions API group, but we back up replicasets from "apps" so we want
		// to ensure that we prioritize restoring from "apps" too, since this is how they're stored
		// in the backup.
		"replicasets.apps",
		"clusterclasses.cluster.x-k8s.io",
	},
	LowPriorities: []string{
		"clusterbootstraps.run.tanzu.vmware.com",
		"clusters.cluster.x-k8s.io",
		"clusterresourcesets.addons.cluster.x-k8s.io",
	},
}

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help.

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero describe restore qakotsregression-g5mw7 --details
$ velero describe restore qakotsregression-g5mw7 --details
Name:         qakotsregression-g5mw7
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:                       Completed
Total items to be restored:  9
Items restored:              9

Started:    2023-05-17 13:48:48 +0000 UTC
Completed:  2023-05-17 13:48:49 +0000 UTC

Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    test:  could not restore, Endpoints "example-nginx" already exists. Warning: the in-cluster version is different than the backed-up version.

Backup:  qakotsregression-g5mw7

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  included

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  true

Existing Resource Policy:   <none>
ItemOperationTimeout:       1h0m0s

Preserve Service NodePorts:  auto

Resource List:
  apps/v1/Deployment:
    - test/example-nginx(created)
  apps/v1/ReplicaSet:
    - test/example-nginx-85b445fb65(created)
  discovery.k8s.io/v1/EndpointSlice:
    - test/example-nginx-gbvm2(created)
  v1/ConfigMap:
    - test/example-config(created)
  v1/Endpoints:
    - test/example-nginx(failed)
  v1/Pod:
    - test/example-nginx-85b445fb65-n5b2v(created)
  v1/Secret:
    - test/kotsadm-replicated-registry(created)
    - test/qakotsregression-registry(created)
  v1/Service:
    - test/example-nginx(created)
  • velero restore logs qakotsregression-g5mw7
...
...

time="2023-05-17T13:48:49Z" level=info msg="Getting client for /v1, Kind=Service" logSource="pkg/restore/restore.go:918" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="restore status includes excludes: <nil>" logSource="pkg/restore/restore.go:1189" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Executing item action for services" logSource="pkg/restore/restore.go:1196" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Attempting to restore Service: example-nginx" logSource="pkg/restore/restore.go:1337" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="the managed fields for test/example-nginx is patched" logSource="pkg/restore/restore.go:1522" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Restored 6 items out of an estimated total of 9 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:669" name=example-nginx namespace=test progress= resource=services restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Getting client for apps/v1, Kind=Deployment" logSource="pkg/restore/restore.go:918" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="restore status includes excludes: <nil>" logSource="pkg/restore/restore.go:1189" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Executing item action for deployments.apps" logSource="pkg/restore/restore.go:1196" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Executing ChangeImageNameAction" cmd=/velero logSource="pkg/restore/change_image_name_action.go:68" pluginName=velero restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Done executing ChangeImageNameAction" cmd=/velero logSource="pkg/restore/change_image_name_action.go:81" pluginName=velero restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Attempting to restore Deployment: example-nginx" logSource="pkg/restore/restore.go:1337" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="the managed fields for test/example-nginx is patched" logSource="pkg/restore/restore.go:1522" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Restored 7 items out of an estimated total of 9 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:669" name=example-nginx namespace=test progress= resource=deployments.apps restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Getting client for /v1, Kind=Endpoints" logSource="pkg/restore/restore.go:918" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="restore status includes excludes: <nil>" logSource="pkg/restore/restore.go:1189" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Attempting to restore Endpoints: example-nginx" logSource="pkg/restore/restore.go:1337" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Restored 8 items out of an estimated total of 9 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:669" name=example-nginx namespace=test progress= resource=endpoints restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Getting client for discovery.k8s.io/v1, Kind=EndpointSlice" logSource="pkg/restore/restore.go:918" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="restore status includes excludes: <nil>" logSource="pkg/restore/restore.go:1189" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Attempting to restore EndpointSlice: example-nginx-gbvm2" logSource="pkg/restore/restore.go:1337" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="the managed fields for test/example-nginx-gbvm2 is patched" logSource="pkg/restore/restore.go:1522" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Restored 9 items out of an estimated total of 9 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:669" name=example-nginx-gbvm2 namespace=test progress= resource=endpointslices.discovery.k8s.io restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Waiting for all pod volume restores to complete" logSource="pkg/restore/restore.go:551" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Done waiting for all pod volume restores to complete" logSource="pkg/restore/restore.go:567" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Waiting for all post-restore-exec hooks to complete" logSource="pkg/restore/restore.go:571" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="Done waiting for all post-restore exec hooks to complete" logSource="pkg/restore/restore.go:579" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=warning msg="Namespace test, resource restore warning: could not restore, Endpoints \"example-nginx\" already exists. Warning: the in-cluster version is different than the backed-up version." logSource="pkg/controller/restore_controller.go:509" restore=velero/qakotsregression-g5mw7
time="2023-05-17T13:48:49Z" level=info msg="restore completed" logSource="pkg/controller/restore_controller.go:512" restore=velero/qakotsregression-g5mw7

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.11.0 and v1.10.3
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): v1.27
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@sseago
Collaborator

sseago commented May 17, 2023

That warning just means that the resource already existed before the restore was executed. By default, Velero doesn't attempt to modify resources to be restored if they already exist, so there will be a warning if the resource already exists but its content differs from what was in the backup. If you need the version from the backup instead of what's already in the cluster, you have a couple of options:

  1. delete the resource before restoring
  2. set the existing resource policy to update -- when this is set, Velero will attempt to update resources that already exist rather than just warning and moving on. In some cases this will fail: if the resource is immutable, or one of the fields that differs is an immutable field, Velero falls back to warning the user that it couldn't make the change (see the example command below). Documentation for this feature is here: https://velero.io/docs/v1.11/restore-reference/#restore-existing-resource-policy
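
A minimal sketch of option 2 from the CLI, assuming the --existing-resource-policy flag introduced in Velero v1.9 (the restore name here is illustrative, not from this issue):

# Create a restore that updates existing resources instead of only warning
# when the in-cluster version differs from the backed-up version.
$ velero restore create example-restore-update \
      --from-backup qakotsregression-g5mw7 \
      --existing-resource-policy=update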

@pavansokkenagaraj
Author

@sseago

  1. delete the resource before restoring

The Endpoints and Services are deleted before the restore. What happens is:

  1. When Velero restores the Service first, the Kubernetes API creates new Endpoints for the Service, since none exist yet.
  2. Later, Velero restores the backed-up Endpoints; at this point the Endpoints created in the previous step already exist and their content differs, so the restore returns a warning.

  2. set the existing resource policy to update -- when this is set, Velero will attempt to update resources that already exist rather than just warning and moving on. In some cases this will fail: if the resource is immutable, or one of the fields that differs is an immutable field, Velero falls back to warning the user that it couldn't make the change. Documentation for this feature is here: https://velero.io/docs/v1.11/restore-reference/#restore-existing-resource-policy

This would be a regression, as the behaviour changed from v1.10.2 to v1.10.3/v1.11.0.
In Velero v1.10.2, with resource policy <none>, Velero restored the resources without any warnings (it restored Endpoints first, then Services).

So, in versions v1.10.3/v1.11.0 and later, Velero restores Services first and Endpoints later, and the warning will always occur because the Kubernetes API reconciles the Service and creates new Endpoints that later conflict with the Endpoints being restored.
Is this the expected behaviour of Velero restore for versions starting from v1.10.3/v1.11.0?

Shouldn't Endpoints be restored before Services (by adding endpoints to HighPriorities before services), so that the restore behaves like versions up to v1.10.2?

@sgalsaleh
Contributor

As @pavansokkenagaraj mentioned, I also think that Endpoints need to be restored before Services, as Kubernetes automatically creates Endpoints for a Service when the corresponding Pods exist. Since Pods are also in the high-priority list and are restored before Services, Endpoints will have to be restored before Services as well.

@sseago
Collaborator

sseago commented May 17, 2023

Ahh, I see now. So it looks like services were added as a high priority resource but endpoints were not. So we need endpoints added to high priority before service. Also, it looks like this only worked before by luck -- because Endpoints happen to fall earlier in the alphabetical sort.

@pavansokkenagaraj
Author

Also, it looks like this only worked before by luck -- because Endpoints happen to fall earlier in the alphabetical sort.

Yes. 🤞🏽

@reasonerjt
Contributor

I believe the change was introduced purposefully.

@ywk253100 please clarify.

@ywk253100
Contributor

The service was put into the high-priority list to fix an issue with the AKO operator, but it seems the Endpoints issue is a regression.

@sseago
Collaborator

sseago commented May 22, 2023

Right. Adding service to high priority is fine. But I think we need to add endpoint as well, right before service, because there are use cases where endpoints must be restored before services.
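
For illustration, a minimal sketch of what that change could look like in the default priority list (this follows the direction of the fix merged in #6315; the exact code may differ):

var defaultRestorePriorities = restore.Priorities{
	HighPriorities: []string{
		// ... earlier entries unchanged ...
		"replicasets.apps",
		"clusterclasses.cluster.x-k8s.io",
		// Restore Endpoints before Services so the backed-up Endpoints are
		// created before the endpoints controller reconciles the new Service.
		"endpoints",
		"services",
	},
	LowPriorities: []string{
		"clusterbootstraps.run.tanzu.vmware.com",
		"clusters.cluster.x-k8s.io",
		"clusterresourcesets.addons.cluster.x-k8s.io",
	},
}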

ywk253100 added commits referencing this issue (May 29 to Jun 20, 2023):
Restore Endpoints before Services

Fixes #6280

Signed-off-by: Wenkai Yin(尹文开) <[email protected]>