DataMover - datauploads and datadownloads resources aren't distributed equally among the workers. #6734
Comments
Which node-agent pod handles a DataUpload or DataDownload is determined by which node the backupPod or restorePod runs on. Currently Velero creates these pods without any particular configuration to restrict or control where they run, so the node distribution is determined by the Kubernetes scheduler, not by Velero. We could consider modifying this via node selectors, affinity, or topology spread constraints -- the latter may be the way to go here.
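For illustration only, here is a minimal Go sketch (using the standard corev1 types, not Velero's actual code) of what attaching a topology spread constraint to the backupPod/restorePod spec could look like. The velero.io/exposer label and the helper name are hypothetical:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // withSpreadConstraint appends a topology spread constraint to a pod spec so the
    // scheduler tries to keep the matching pods evenly spread across nodes.
    func withSpreadConstraint(spec *corev1.PodSpec) {
        spec.TopologySpreadConstraints = append(spec.TopologySpreadConstraints, corev1.TopologySpreadConstraint{
            MaxSkew:           1,                        // tolerate at most 1 pod of imbalance between nodes
            TopologyKey:       "kubernetes.io/hostname", // spread across individual nodes
            WhenUnsatisfiable: corev1.ScheduleAnyway,    // prefer spreading, but never block scheduling
            LabelSelector: &metav1.LabelSelector{
                // Hypothetical label -- Velero would need to stamp its exposer pods with it.
                MatchLabels: map[string]string{"velero.io/exposer": "true"},
            },
        })
    }

    func main() {
        spec := corev1.PodSpec{}
        withSpreadConstraint(&spec)
        fmt.Printf("%+v\n", spec.TopologySpreadConstraints)
    }

Using ScheduleAnyway keeps this a soft preference, so it would not block backups when a node is cordoned or full.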
Something to consider: the scheduler knows better than we do what the current constraints are on each node. I would worry that artificially spreading out the resources may cause other issues (like overcommitting a very important node) or something along those lines. I would be very cautious about getting into the scheduling game, IMO. I think a better option is to work on making each node able to handle more than one at a time. As for inconsistent performance results, isn't that pretty indicative of something running on K8s -- that there is a probable range for performance? Or am I incorrect in that thought process? (This is me learning :) )
@shawn-hurley Hmm, yeah, it may be better to leave this as-is. Looking back at the distribution posted above, it strikes me that many of the runs are actually reasonably well-distributed, although certain nodes get well less than average. But maybe those nodes were already overcommitted at the time? As for each node handling more than one at a time, there's already an issue open for that and it's targeted for 1.13.
We discussed this topic during the initial data mover design discussions, including the idea of Velero's own load balancer.
Therefore, the ultimate solution may be a combination of the Kubernetes scheduler and Velero's own load balancer.
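To make the "own load balancer" idea concrete, here is a minimal Go sketch of the supplement side: count in-progress data movements per eligible node, pick the least-loaded one, and leave actual placement to the Kubernetes scheduler (e.g. via a node selector on the backupPod). The function and map names are illustrative only, not Velero APIs:

    package main

    import "fmt"

    // pickLeastLoadedNode returns the node currently running the fewest data movements.
    // Ties are broken arbitrarily (map iteration order); a real implementation would
    // also need to consider node resources and per-node concurrency limits.
    func pickLeastLoadedNode(inProgress map[string]int) string {
        best, bestCount := "", -1
        for node, count := range inProgress {
            if bestCount < 0 || count < bestCount {
                best, bestCount = node, count
            }
        }
        return best
    }

    func main() {
        // Example counts of in-progress DataUploads per worker node.
        inProgress := map[string]int{
            "worker000-r640": 3,
            "worker001-r640": 0,
            "worker002-r640": 1,
        }
        fmt.Println("next backupPod should be pinned to:", pickLeastLoadedNode(inProgress))
    }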
@Lyndon-Li If we do our own, we may want to make it configurable -- turning it on or off (not sure which should be the default) -- so that if one option is providing bad performance, users could try the other.
As mentioned above, we need the capability of the Kubernetes scheduler as well as some supplements to it.
Reopening this issue, as #6926 has not completely fixed the problem --- the restore part is not fixed; and even for the backup part, there is not as much intelligence in assigning data upload overhead as a LD provides. Let's keep the issue open for new ideas for fixes.
Is Design for data mover node selection #7383 related?
What steps did you take and what happened:
Running data mover backup and restore. During the tests, I monitored the datauploads and datadownloads resources and noticed they weren't distributed equally among the workers.
This causes the tests to take a long time, and running a few cycles of the same test gives inconsistent results.
Another issue: the test does not run at maximum concurrency (1 resource per node).
What did you expect to happen:
The resources should be distributed among all workers as equally as possible.
The following information will help us better understand what's going on:
Anything else you would like to add:
5 backup cycles -- duration and datauploads distribution (ns with 100 PVs):
-0:23:48
worker000-r640 : 3, worker001-r640 : 61, worker002-r640 : 0, worker003-r640 : 4, worker004-r640 : 32, worker005-r640 : 0
-0:16:39
worker000-r640 : 11, worker001-r640 : 23, worker002-r640 : 14, worker003-r640 : 20, worker004-r640 : 19, worker005-r640 : 13
-0:17:53
worker000-r640 : 20, worker001-r640 : 17, worker002-r640 : 15, worker003-r640 : 16, worker004-r640 : 9, worker005-r640 : 23
-0:18:45
worker000-r640 : 24, worker001-r640 : 15, worker002-r640 : 22, worker003-r640 : 17, worker004-r640 : 6, worker005-r640 : 16
-0:28:39
worker000-r640 : 26, worker001-r640 : 15, worker002-r640 : 20, worker003-r640 : 20, worker004-r640 : 2, worker005-r640 : 17
5 restore cycles -- datadownloads distribution (ns with 100 PVs):
-worker000-r640 : 5, worker001-r640 : 51, worker002-r640 : 0, worker003-r640 : 17, worker004-r640 : 27, worker005-r640 : 0
-worker000-r640 : 24, worker001-r640 : 13, worker002-r640 : 0, worker003-r640 : 22, worker004-r640 : 24, worker005-r640 : 17
-worker000-r640 : 28, worker001-r640 : 12, worker002-r640 : 10, worker003-r640 : 23, worker004-r640 : 14, worker005-r640 : 13
-worker000-r640 : 21, worker001-r640 : 18, worker002-r640 : 10, worker003-r640 : 18, worker004-r640 : 17, worker005-r640 : 16
-worker000-r640 : 15, worker001-r640 : 17, worker002-r640 : 11, worker003-r640 : 21, worker004-r640 : 21, worker005-r640 : 15
5 restore cycles -- duration and datadownloads distribution (ns with 50 PVs):
-0:29:39
worker000-r640 : 0, worker001-r640 : 29, worker002-r640 : 0, worker003-r640 : 20, worker004-r640 : 1, worker005-r640 : 0
-0:17:20
worker000-r640 : 0, worker001-r640 : 20, worker002-r640 : 0, worker003-r640 : 14, worker004-r640 : 16, worker005-r640 : 0
-0:19:32
worker000-r640 : 2, worker001-r640 : 20, worker002-r640 : 11, worker003-r640 : 6, worker004-r640 : 11, worker005-r640 : 0
-0:18:29
worker000-r640 : 0, worker001-r640 : 14, worker002-r640 : 3, worker003-r640 : 13, worker004-r640 : 18, worker005-r640 : 2
-0:14:26
worker000-r640 : 1, worker001-r640 : 11, worker002-r640 : 14, worker003-r640 : 8, worker004-r640 : 12, worker005-r640 : 4
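For reference, per-node counts like the ones above can be reproduced with a short script. A rough Go sketch follows; it assumes DataUpload CRs are served under the velero.io/v2alpha1 API group in the velero namespace and that status.node records the processing node -- adjust the group/version, namespace, and resource (datadownloads for the restore side) as needed:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Build a client from the local kubeconfig.
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := dynamic.NewForConfigOrDie(cfg)

        // Assumed GVR for DataUpload CRs; use "datadownloads" for restores.
        gvr := schema.GroupVersionResource{Group: "velero.io", Version: "v2alpha1", Resource: "datauploads"}
        list, err := client.Resource(gvr).Namespace("velero").List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }

        // Count resources per node, assuming status.node holds the processing node's name.
        perNode := map[string]int{}
        for _, item := range list.Items {
            node, _, _ := unstructured.NestedString(item.Object, "status", "node")
            perNode[node]++
        }
        for node, n := range perNode {
            fmt.Printf("%s : %d\n", node, n)
        }
    }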
Environment:
Velero version: main (Velero-1.12), last commit:
commit 30e54b0 (HEAD -> main, origin/main, origin/HEAD)
Author: Daniel Jiang [email protected]
Date: Wed Aug 16 15:45:00 2023 +0800
Velero features (use velero client config get features):
./velero client config get features
features:
Kubernetes version (use kubectl version):
oc version
Client Version: 4.12.9
Kustomize Version: v4.5.7
Server Version: 4.12.9
Kubernetes Version: v1.25.7+eab9cc9
OCP running on bare-metal servers
3 master & 6 worker nodes
oc get nodes
NAME STATUS ROLES AGE VERSION
master-0 Ready control-plane,master 148d v1.25.7+eab9cc9
master-1 Ready control-plane,master 148d v1.25.7+eab9cc9
master-2 Ready control-plane,master 148d v1.25.7+eab9cc9
worker000-r640 Ready worker 148d v1.25.7+eab9cc9
worker001-r640 Ready worker 148d v1.25.7+eab9cc9
worker002-r640 Ready worker 148d v1.25.7+eab9cc9
worker003-r640 Ready worker 148d v1.25.7+eab9cc9
worker004-r640 Ready worker 148d v1.25.7+eab9cc9
worker005-r640 Ready worker 148d v1.25.7+eab9cc9
OS (e.g. from /etc/os-release):
Red Hat Enterprise Linux CoreOS 412.86.202303211731-0
Part of OpenShift 4.12, RHCOS is a Kubernetes native operating system
cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202303211731-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202303211731-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202303211731-0"
Vote on this issue!
This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.