
DataMover: DataUpload and DataDownload resources aren't distributed equally among the workers #6734

Open
duduvaa opened this issue Aug 31, 2023 · 9 comments · Fixed by #6926


duduvaa commented Aug 31, 2023

What steps did you take and what happened:

Ran DataMover backup and restore. During the tests, I monitored the DataUpload and DataDownload resources and noticed they weren't distributed equally among the worker nodes.
This causes the tests to take a long time, and when running several cycles of the same test, the results are inconsistent.

Another issue: the test does not run at maximum concurrency (1 resource per node).

What did you expect to happen:
The resources should be distributed among all workers as equally as possible.

The following information will help us better understand what's going on:

Anything else you would like to add:

5 backup cycles: duration and DataUpload distribution (namespace with 100 PVs):
-0:23:48
worker000-r640 : 3, worker001-r640 : 61, worker002-r640 : 0, worker003-r640 : 4, worker004-r640 : 32, worker005-r640 : 0
-0:16:39
worker000-r640 : 11, worker001-r640 : 23, worker002-r640 : 14, worker003-r640 : 20, worker004-r640 : 19, worker005-r640 : 13
-0:17:53
worker000-r640 : 20, worker001-r640 : 17, worker002-r640 : 15, worker003-r640 : 16, worker004-r640 : 9, worker005-r640 : 23
-0:18:45
worker000-r640 : 24, worker001-r640 : 15, worker002-r640 : 22, worker003-r640 : 17, worker004-r640 : 6, worker005-r640 : 16
-0:28:39
worker000-r640 : 26, worker001-r640 : 15, worker002-r640 : 20, worker003-r640 : 20, worker004-r640 : 2, worker005-r640 : 17

5 restore cycles: DataDownload distribution (namespace with 100 PVs):
-worker000-r640 : 5, worker001-r640 : 51, worker002-r640 : 0, worker003-r640 : 17, worker004-r640 : 27, worker005-r640 : 0
-worker000-r640 : 24, worker001-r640 : 13, worker002-r640 : 0, worker003-r640 : 22, worker004-r640 : 24, worker005-r640 : 17
-worker000-r640 : 28, worker001-r640 : 12, worker002-r640 : 10, worker003-r640 : 23, worker004-r640 : 14, worker005-r640 : 13
-worker000-r640 : 21, worker001-r640 : 18, worker002-r640 : 10, worker003-r640 : 18, worker004-r640 : 17, worker005-r640 : 16
-worker000-r640 : 15, worker001-r640 : 17, worker002-r640 : 11, worker003-r640 : 21, worker004-r640 : 21, worker005-r640 : 15

5 restore cycles: duration and DataDownload distribution (namespace with 50 PVs):
-0:29:39
worker000-r640 : 0, worker001-r640 : 29, worker002-r640 : 0, worker003-r640 : 20, worker004-r640 : 1, worker005-r640 : 0
-0:17:20
worker000-r640 : 0, worker001-r640 : 20, worker002-r640 : 0, worker003-r640 : 14, worker004-r640 : 16, worker005-r640 : 0
-0:19:32
worker000-r640 : 2, worker001-r640 : 20, worker002-r640 : 11, worker003-r640 : 6, worker004-r640 : 11, worker005-r640 : 0
-0:18:29
worker000-r640 : 0, worker001-r640 : 14, worker002-r640 : 3, worker003-r640 : 13, worker004-r640 : 18, worker005-r640 : 2
-0:14:26
worker000-r640 : 1, worker001-r640 : 11, worker002-r640 : 14, worker003-r640 : 8, worker004-r640 : 12, worker005-r640 : 4

Environment:

  • Velero version: main (Velero-1.12), last commit:
    commit 30e54b0 (HEAD -> main, origin/main, origin/HEAD)
    Author: Daniel Jiang [email protected]
    Date: Wed Aug 16 15:45:00 2023 +0800

  • Velero features (use velero client config get features):

    ./velero client config get features

    features:

  • Kubernetes version (use kubectl version):

    oc version

Client Version: 4.12.9
Kustomize Version: v4.5.7
Server Version: 4.12.9
Kubernetes Version: v1.25.7+eab9cc9

  • Cloud provider or hardware configuration:
    OCP running on bare-metal (BM) servers
    3 master & 6 worker nodes

    oc get nodes

NAME STATUS ROLES AGE VERSION
master-0 Ready control-plane,master 148d v1.25.7+eab9cc9
master-1 Ready control-plane,master 148d v1.25.7+eab9cc9
master-2 Ready control-plane,master 148d v1.25.7+eab9cc9
worker000-r640 Ready worker 148d v1.25.7+eab9cc9
worker001-r640 Ready worker 148d v1.25.7+eab9cc9
worker002-r640 Ready worker 148d v1.25.7+eab9cc9
worker003-r640 Ready worker 148d v1.25.7+eab9cc9
worker004-r640 Ready worker 148d v1.25.7+eab9cc9
worker005-r640 Ready worker 148d v1.25.7+eab9cc9

  • OS (e.g. from /etc/os-release):
    Red Hat Enterprise Linux CoreOS 412.86.202303211731-0
    Part of OpenShift 4.12, RHCOS is a Kubernetes native operating system

cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202303211731-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202303211731-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202303211731-0"

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
sseago (Collaborator) commented Aug 31, 2023

Which node-agent pod handles a DataUpload or DataDownload is determined by which node the backupPod or restorePod is running on. Currently Velero creates these pods without any particular configuration to restrict or control where they run, so the node distribution is determined by the Kubernetes scheduler, not by Velero. We could consider modifying this via node selectors, affinity, or topology spread constraints; the latter may be the way to go here.
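
If topology spread constraints were the chosen route, a minimal sketch might look like the following (illustrative only, not Velero's actual code; the helper name and the velero.io/data-mover label key are assumptions, and the pod itself would need to carry the matching label):

```go
package podspread

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// addSpreadConstraint is a hypothetical helper showing how a soft topology
// spread constraint could be attached to the backupPod/restorePod spec so
// the scheduler spreads data mover pods across nodes.
func addSpreadConstraint(pod *corev1.Pod) {
	pod.Spec.TopologySpreadConstraints = append(pod.Spec.TopologySpreadConstraints,
		corev1.TopologySpreadConstraint{
			MaxSkew:           1,                        // keep per-node counts within 1 of each other
			TopologyKey:       "kubernetes.io/hostname", // spread across individual nodes
			WhenUnsatisfiable: corev1.ScheduleAnyway,    // soft constraint: never block scheduling
			LabelSelector: &metav1.LabelSelector{
				// hypothetical label identifying data mover pods
				MatchLabels: map[string]string{"velero.io/data-mover": "true"},
			},
		})
}
```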

shawn-hurley (Contributor) commented:

Something to consider: the scheduler knows better than we do what the current constraints are on each node. I would worry that artificially spreading out the resources may cause other issues (like overcommitting a very important node) or something along those lines.

I would be very cautious about getting into the scheduling game, IMO. I think a better option is to work on making each node able to handle more than one at a time.

As for inconsistent performance results, isn't that pretty indicative of something running on K8s, i.e. that there is a probable range for performance? Or am I incorrect in this thought process? (This is me learning :) )

sseago (Collaborator) commented Aug 31, 2023

@shawn-hurley Hmm, yeah, it may be better to leave this as-is. Looking back at the distribution posted above, it strikes me that many of the runs are actually reasonably well distributed, although certain nodes get well below the average. But maybe those nodes were already overcommitted at that time?

As for each node handling more than one at a time, there's already an issue opened for that and it's targeted for 1.13.

Lyndon-Li (Contributor) commented:

We discussed this topic (Velero's own load balancer) during the initial data mover discussions:

  1. On one hand, the Kubernetes scheduler knows the CPU and memory resources well, and it also knows the affinities and topologies, all of which matter for Velero data mover workload distribution.
  2. On the other hand, the Kubernetes scheduler doesn't handle some other requirements that Velero data mover workload distribution cares about, for example network bandwidth usage: if CPU and memory are sufficient on all nodes, the scheduler may assign multiple backup/restore pods to one node, where network bandwidth is probably the bottleneck. Moreover, even when network bandwidth is sufficient, Velero has a concurrency config for each node, which the Kubernetes scheduler doesn't consider either.

Therefore, the ultimate solution may be a combination of the Kubernetes scheduler and Velero's own load balancer.
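
As a rough illustration of how such a supplement could sit on top of the scheduler (a sketch under assumed types, not Velero's implementation), a Velero-side picker could track in-flight VGDP loads per node and admit a new DataUpload/DataDownload only on a node that is still under its configured concurrency, while CPU/memory/affinity decisions stay with the scheduler:

```go
package loadbalance

// balancer is a hypothetical Velero-side supplement to the Kubernetes
// scheduler: it tracks per-node in-flight VGDP loads against a per-node
// concurrency limit, something the scheduler itself does not consider.
type balancer struct {
	limit    map[string]int // configured max concurrent loads per node
	inFlight map[string]int // DataUploads/DataDownloads currently running per node
}

// pickNode returns the eligible node with the fewest in-flight loads,
// or "" if every candidate is at its limit (the caller would then queue
// the request and retry later).
func (b *balancer) pickNode(candidates []string) string {
	best := ""
	bestLoad := 0
	for _, node := range candidates {
		load := b.inFlight[node]
		if load >= b.limit[node] {
			continue // node is already at its configured concurrency
		}
		if best == "" || load < bestLoad {
			best, bestLoad = node, load
		}
	}
	return best
}
```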

sseago (Collaborator) commented Sep 1, 2023

@Lyndon-Li If we do our own, we may want to make it configurable, i.e. able to be turned on or off (not sure which the default should be); that way, if one option gives bad performance, users can try the other.

Lyndon-Li (Contributor) commented:

As mentioned above, we need the capability of the Kubernetes scheduler as well as some supplements.
Ideally, we make a combination: Velero implements only the supplements, and the Kubernetes scheduler works as-is alongside Velero's part. Then we won't need a fallback.
Otherwise, if we cannot make them work together and Velero has to implement the Kubernetes scheduler's part as well, then we will need to make it configurable in case Velero's implementation has bugs or gets out of sync with the latest Kubernetes release.

Lyndon-Li (Contributor) commented:

Reopening this issue, as #6926 has not completely fixed the problem: the restore part is not fixed, and even for the backup part there is not as much intelligence in assigning the data upload overhead as a load balancer would provide.

Let's keep the issue open for new ideas for fixes.

kaovilai (Member) commented Feb 5, 2024

Is Design for data mover node selection #7383 related?

Lyndon-Li (Contributor) commented:

> Is Design for data mover node selection #7383 related?

No, #7383 is about node selection (including/excluding nodes), not about spreading VGDP across the nodes.
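
To make the distinction concrete (purely illustrative, not the #7383 design): node selection decides which nodes are allowed to run VGDP at all, for example via include/exclude sets, while this issue is about how the load is spread across whatever nodes remain eligible:

```go
package nodeselect

// selectNodes is a hypothetical illustration of include/exclude node
// selection (the concern of #7383). Spreading the VGDP load across the
// returned nodes (the concern of this issue) is a separate step.
func selectNodes(all []string, include, exclude map[string]bool) []string {
	var eligible []string
	for _, node := range all {
		if exclude[node] {
			continue // excluded nodes never run VGDP
		}
		if len(include) > 0 && !include[node] {
			continue // when an include list is given, only listed nodes qualify
		}
		eligible = append(eligible, node)
	}
	return eligible
}
```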
