
[1.15 dev] DataUploads fail in DataMover count constrained environments with timeouts #8316

Closed
msfrucht opened this issue Oct 17, 2024 · 3 comments

@msfrucht
Contributor

What steps did you take and what happened:

Backups with snapshotMoveData enabled, a constrained datamover count, and a larger number of PVCs fail with a timeout in the datamover: 'context deadline exceeded'.

In this test the number of datamovers was constrained to a maximum of 1, using the node selector of both the node-agent and loadAffinity to restrict datamover pod creation to a single node.

The environment consisted of a single Apache HTTPD pod attached to a Ceph RBD PVC filled with about 20GB of data. The PVC was then cloned 9 times for a total of 10 PVCs.

Three of the data movements failed due to the datamover timing out while waiting for the node-agent to move the DataUpload phase into InProgress.

kind: ConfigMap
apiVersion: v1
metadata:
  name: node-agent-config
  namespace: oadp-1-4
  uid: efff65d7-796c-4bdd-a20b-736d283722f8
  resourceVersion: '16459796'
  creationTimestamp: '2024-10-16T23:38:19Z'
data:
  node-agent-config: |
    {
      "loadConcurrency": {"globalConfig": 1},
      "loadAffinity": [
        {"nodeSelector": {"matchLabels": {"kubernetes.io/hostname": "bnr-hub01-hcp-greenstar-sp-fd520828-f5825"}}}
      ],
      "podResources": {},
      "backupPVC": {
        "ibm-spectrum-fusion-mgmt-sc": {"storageClass": "ibm-spectrum-fusion-mgmt-sc", "readOnly": true},
        "ocs-external-storagecluster-cephfs": {"storageClass": "ocs-external-storagecluster-cephfs", "readOnly": true}
      }
    }

If you are using velero v1.7.0+:

bundle-2024-10-17-10-37-39.tar.gz

The node-agent config is included above because it is not currently collected by the log collector.

What did you expect to happen:

All 10 data movements to succeed and the resources of the "delay" namespace to be backed up.

Anything else you would like to add:
The issue is caused by inefficient use of cluster resources.

The Expose process runs during the Accepted phase and moves the DataUpload into Prepared. This step does too much.

On backup this process copies the VolumeSnapshot and VolumeSnapshotContent, creates the backup PVC, and creates the datamover Pod. On restore it creates a copy of the application PVC in the install namespace. The overall problem is the same.

Creating the datamover Pod this early is the problem.

Inside /pkg/datamover/backup_micro_service.go RunCancelableDataPath and /pkg/datamover/restore_micro_service.go RunCancelableDataPath there is a countdown, 30 seconds by default, waiting for the DataUpload or DataDownload to move into the InProgress state.

On the node-agent controller, even after expose is run, the InProgress state is not reached unless the concurrency check succeeds.

In this constrained environment, with sufficient data to back up, this reliably fails. In the datamover, the context deadline causes the wait to drop out, the Pod goes into an Error state, and it is no longer running.

The Backup ends in a PartiallyFailed state with status.backupItemOperationsFailed: 3, the count being the number of failed data movements.
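Below is a minimal Go sketch of the kind of wait described above, not the upstream Velero implementation; the getPhase helper and the 30-second deadline are hypothetical stand-ins for illustration only.

```go
// Minimal sketch of the wait described above, not the upstream Velero code.
// getPhase is a hypothetical stand-in for reading the DataUpload phase from
// the API server; in the constrained environment it never reports InProgress
// within the deadline.
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func getPhase(ctx context.Context) (string, error) {
	return "Prepared", nil // the concurrency check blocks, so InProgress is never reached
}

func waitForInProgress(ctx context.Context, timeout time.Duration) error {
	// PollUntilContextTimeout gives up after `timeout`; the wait then fails
	// with a context deadline error and the datamover pod exits with Error.
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			phase, err := getPhase(ctx)
			if err != nil {
				return false, err
			}
			return phase == "InProgress", nil
		})
}

func main() {
	err := waitForInProgress(context.Background(), 30*time.Second)
	fmt.Println(err) // reports a deadline-exceeded style error after 30s
}
```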

Proposed Solution

There are two problems and this proposed solution may solve both.

  1. The Backups fail due to this timeout. Increasing the timeout only delays the problem (the attached logs are from a build with some code changes that make the state-change issue clearer and with the timeout increased from the default 30s to 10m in the datamover and node-agent).
  2. The datamover pod runs on the cluster doing nothing but waiting for an InProgress state change. Even though the resources actually used at this point are low, any requests.cpu and requests.memory reserve useful resources that go unused, and all datamovers are started at the same time. For workloads with a hundred or more PVCs this is not great, since the reserved resources have to be sized for the largest volume multiplied by the number of PVCs.

Proposed solution:
New state: DatamoverDeploying

Instead of deploying the datamover during Prepared, move the concurrency check and the datamover pod creation into this new state.

To move from Accepted -> Prepared, on backup only copy the VolumeSnapshot and VolumeSnapshotContent and create the PVC. On restore, only create the PVC from the application PVC spec. Do not create the datamover container.

State change:
Prepared -> DatamoverDeploying:

This state is where the concurrency check takes place. Until the concurrency check succeeds, nothing changes. When it succeeds, deploy the datamover.

DatamoverDeploying -> InProgress
The datamover should quickly move into the InProgress state and continue the backup, since it no longer has to poll for the InProgress state for long.

The datamover's check for InProgress may no longer be necessary, since a deployed datamover implies the concurrency check has already passed.
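As a rough illustration of the proposed flow, the sketch below models the phase progression as a reconciler-style step function. The phase names follow the proposal; helpers such as exposeSnapshot, concurrencySlotAvailable, and deployDatamoverPod are hypothetical and only mark where each step would happen (in practice the datamover itself would report InProgress).

```go
// Rough sketch of the proposed phase progression; not actual Velero code.
package main

import "fmt"

type Phase string

const (
	PhaseAccepted           Phase = "Accepted"
	PhasePrepared           Phase = "Prepared"
	PhaseDatamoverDeploying Phase = "DatamoverDeploying"
	PhaseInProgress         Phase = "InProgress"
)

func exposeSnapshot()                {} // copy VS/VSC, create the backup PVC
func concurrencySlotAvailable() bool { return true }
func deployDatamoverPod()            {}

// reconcile advances a DataUpload by one step, as a controller would on each pass.
func reconcile(phase Phase) Phase {
	switch phase {
	case PhaseAccepted:
		// Expose only: no datamover pod is created here.
		exposeSnapshot()
		return PhasePrepared
	case PhasePrepared:
		// Nothing is deployed yet; move on to wait for a concurrency slot.
		return PhaseDatamoverDeploying
	case PhaseDatamoverDeploying:
		// The concurrency check gates pod creation, so idle datamover pods
		// never sit around reserving requests.cpu / requests.memory.
		if !concurrencySlotAvailable() {
			return PhaseDatamoverDeploying // requeue, no change
		}
		deployDatamoverPod()
		return PhaseInProgress // in practice reported by the datamover itself
	default:
		return phase
	}
}

func main() {
	for p := PhaseAccepted; p != PhaseInProgress; {
		p = reconcile(p)
		fmt.Println(p)
	}
}
```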

Environment:

  • Velero version (use velero version): 1.15 dev (2024-10-10 build, with added logging changes to better expose and understand the issue)
  • Velero features (use velero client config get features): EnableCSI
  • Kubernetes version (use kubectl version):

Client Version: 4.15.12
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.16.8
Kubernetes Version: v1.29.7+4510e9c

  • Kubernetes installer & version: OpenShift 4.16
  • Cloud provider or hardware configuration: Red Hat HyperShift virtualized OpenShift cluster
  • OS (e.g. from /etc/os-release): Red Hat Enterprise Linux 9.2 (Plow)

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet
Contributor

dataMoverPrepareTimeout time.Duration

There is a timeout timer for the DataUpload expose stage. Is that enough to fix this issue?

@Lyndon-Li
Contributor

The attached log doesn't match the latest upstream code, and I don't see this behavior either:

Exists a countdown, default 30 seconds, waiting for the DataUpload or DataDownload to be moved into InProgress state

There is a wait for the DU/DD to go to InProgress in the VGDP pod, but there is NO timeout.
As a principle of the data mover micro service design (#7576), the VGDP pod won't quit by itself unless a crash or error happens.
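For contrast with the timeout sketched earlier in the issue, a minimal sketch of an unbounded wait of this kind is below; duPhase is a hypothetical helper and this is not the VGDP pod's actual code. The poll stops only on error or when the pod's context is cancelled, never on a deadline.

```go
// Minimal sketch of an unbounded wait on the DU/DD phase (illustrative only).
package datamover

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// duPhase is a hypothetical helper standing in for reading the DU/DD status.
func duPhase(ctx context.Context) (string, error) { return "Prepared", nil }

func waitForInProgress(ctx context.Context) error {
	// No timeout: the poll runs until the phase flips, an error occurs,
	// or the pod's context is cancelled.
	return wait.PollUntilContextCancel(ctx, 2*time.Second, true,
		func(ctx context.Context) (bool, error) {
			phase, err := duPhase(ctx)
			if err != nil {
				return false, err
			}
			return phase == "InProgress", nil
		})
}
```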

@msfrucht Please double check the code/image being tested

@msfrucht
Contributor Author

The backup failure was caused by internal changes that unintentionally set a context deadline. Resolved internally.

A retry with upstream Velero succeeds.

bundle-2024-10-23-15-47-30.tar.gz

The issue of excessive datamover pods running and doing nothing but checking the DataUpload phase, while holding node CPU and memory requests, remains. The existing proposal would still resolve it.

That is a feature request, not a defect, so I am closing this issue and will open a new feature request.
