
DataMover: DataUpload and DataDownload resources aren't distributed equally among the workers #6734

Open
duduvaa opened this issue Aug 31, 2023 · 9 comments · Fixed by #6926


duduvaa commented Aug 31, 2023

What steps did you take and what happened:

Ran DataMover backup and restore. During the tests, I monitored the DataUpload and DataDownload resources and noticed they weren't distributed equally among the worker nodes.
This causes the tests to take a long time, and when running several cycles of the same test, the results are inconsistent.

Another issue: the test does not run at maximum concurrency (1 resource per node).

What did you expect to happen:
The resources should be distributed among all workers as equally as possible.

The following information will help us better understand what's going on:

Anything else you would like to add:

5 backup cycles: duration and DataUpload distribution (namespace with 100 PVs):
-0:23:48
worker000-r640 : 3, worker001-r640 : 61, worker002-r640 : 0, worker003-r640 : 4, worker004-r640 : 32, worker005-r640 : 0
-0:16:39
worker000-r640 : 11, worker001-r640 : 23, worker002-r640 : 14, worker003-r640 : 20, worker004-r640 : 19, worker005-r640 : 13
-0:17:53
worker000-r640 : 20, worker001-r640 : 17, worker002-r640 : 15, worker003-r640 : 16, worker004-r640 : 9, worker005-r640 : 23
-0:18:45
worker000-r640 : 24, worker001-r640 : 15, worker002-r640 : 22, worker003-r640 : 17, worker004-r640 : 6, worker005-r640 : 16
-0:28:39
worker000-r640 : 26, worker001-r640 : 15, worker002-r640 : 20, worker003-r640 : 20, worker004-r640 : 2, worker005-r640 : 17

5 restore cycles: DataDownload distribution (namespace with 100 PVs):
-worker000-r640 : 5, worker001-r640 : 51, worker002-r640 : 0, worker003-r640 : 17, worker004-r640 : 27, worker005-r640 : 0
-worker000-r640 : 24, worker001-r640 : 13, worker002-r640 : 0, worker003-r640 : 22, worker004-r640 : 24, worker005-r640 : 17
-worker000-r640 : 28, worker001-r640 : 12, worker002-r640 : 10, worker003-r640 : 23, worker004-r640 : 14, worker005-r640 : 13
-worker000-r640 : 21, worker001-r640 : 18, worker002-r640 : 10, worker003-r640 : 18, worker004-r640 : 17, worker005-r640 : 16
-worker000-r640 : 15, worker001-r640 : 17, worker002-r640 : 11, worker003-r640 : 21, worker004-r640 : 21, worker005-r640 : 15

5 restore cycles: duration and DataDownload distribution (namespace with 50 PVs):
-0:29:39
worker000-r640 : 0, worker001-r640 : 29, worker002-r640 : 0, worker003-r640 : 20, worker004-r640 : 1, worker005-r640 : 0
-0:17:20
worker000-r640 : 0, worker001-r640 : 20, worker002-r640 : 0, worker003-r640 : 14, worker004-r640 : 16, worker005-r640 : 0
-0:19:32
worker000-r640 : 2, worker001-r640 : 20, worker002-r640 : 11, worker003-r640 : 6, worker004-r640 : 11, worker005-r640 : 0
-0:18:29
worker000-r640 : 0, worker001-r640 : 14, worker002-r640 : 3, worker003-r640 : 13, worker004-r640 : 18, worker005-r640 : 2
-0:14:26
worker000-r640 : 1, worker001-r640 : 11, worker002-r640 : 14, worker003-r640 : 8, worker004-r640 : 12, worker005-r640 : 4

Environment:

  • Velero version: main (Velero-1.12), last commit:
    commit 30e54b0 (HEAD -> main, origin/main, origin/HEAD)
    Author: Daniel Jiang [email protected]
    Date: Wed Aug 16 15:45:00 2023 +0800

  • Velero features (use velero client config get features):

    ./velero client config get features

    features:

  • Kubernetes version (use kubectl version):

    oc version

Client Version: 4.12.9
Kustomize Version: v4.5.7
Server Version: 4.12.9
Kubernetes Version: v1.25.7+eab9cc9

  • Cloud provider or hardware configuration:
    OCP running on bare-metal (BM) servers
    3 master & 6 worker nodes

    oc get nodes

NAME STATUS ROLES AGE VERSION
master-0 Ready control-plane,master 148d v1.25.7+eab9cc9
master-1 Ready control-plane,master 148d v1.25.7+eab9cc9
master-2 Ready control-plane,master 148d v1.25.7+eab9cc9
worker000-r640 Ready worker 148d v1.25.7+eab9cc9
worker001-r640 Ready worker 148d v1.25.7+eab9cc9
worker002-r640 Ready worker 148d v1.25.7+eab9cc9
worker003-r640 Ready worker 148d v1.25.7+eab9cc9
worker004-r640 Ready worker 148d v1.25.7+eab9cc9
worker005-r640 Ready worker 148d v1.25.7+eab9cc9

  • OS (e.g. from /etc/os-release):
    Red Hat Enterprise Linux CoreOS 412.86.202303211731-0
    Part of OpenShift 4.12, RHCOS is a Kubernetes native operating system

cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202303211731-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202303211731-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202303211731-0"

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
sseago (Collaborator) commented Aug 31, 2023

Which node-agent pod handles a DataUpload or DataDownload is determined by which node the backupPod or restorePod is running on. Currently Velero creates these pods without any particular configuration to restrict or control where they run, so the node distribution is determined by the Kubernetes scheduler, not by Velero. We could consider modifying this via node selectors, affinity, or topology spread constraints; the latter may be the way to go here.
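
If topology spread constraints were the chosen route, a minimal sketch might look like the following (illustrative only, not Velero's actual code; the helper name and the velero.io/data-mover label key are assumptions, and the pod itself would need to carry the matching label):

```go
package podspread

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// addSpreadConstraint is a hypothetical helper showing how a soft topology
// spread constraint could be attached to the backupPod/restorePod spec so
// the scheduler spreads data mover pods across nodes.
func addSpreadConstraint(pod *corev1.Pod) {
	pod.Spec.TopologySpreadConstraints = append(pod.Spec.TopologySpreadConstraints,
		corev1.TopologySpreadConstraint{
			MaxSkew:           1,                        // keep per-node counts within 1 of each other
			TopologyKey:       "kubernetes.io/hostname", // spread across individual nodes
			WhenUnsatisfiable: corev1.ScheduleAnyway,    // soft constraint: never block scheduling
			LabelSelector: &metav1.LabelSelector{
				// hypothetical label identifying data mover pods
				MatchLabels: map[string]string{"velero.io/data-mover": "true"},
			},
		})
}
```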

shawn-hurley (Contributor) commented:

Something to consider: the scheduler knows better than we do what the current constraints are on each node. I would worry that artificially spreading out the resources may cause other issues (like overcommitting a very important node) or something along those lines.

I would be very cautious about getting into the scheduling game, IMO. I think a better option is to work on making each node able to handle more than one at a time.

As for inconsistent performance results, isn't that pretty indicative of something running on K8s, i.e. that there is a probable range for performance? Or am I incorrect in this thought process? (This is me learning :) )

sseago (Collaborator) commented Aug 31, 2023

@shawn-hurley Hmm, yeah, it may be better to leave this as-is. Looking back at the distribution posted above, it strikes me that many of the runs are actually reasonably well distributed, although certain nodes get well below the average. But maybe those nodes were already overcommitted at that time?

As for each node handling more than one at a time, there's already an issue opened for that and it's targeted for 1.13.

Lyndon-Li (Contributor) commented:

We discussed this topic (Velero's own load balancer) during the initial data mover discussions:

  1. On one hand, the Kubernetes scheduler knows the CPU and memory resources well, and it also knows the affinities and topologies, all of which matter for Velero data mover workload distribution.
  2. On the other hand, the Kubernetes scheduler doesn't handle some other requirements that Velero data mover workload distribution cares about, for example network bandwidth usage: if CPU and memory are sufficient on all nodes, the scheduler may assign multiple backup/restore pods to one node, where network bandwidth is probably the bottleneck. Moreover, even when network bandwidth is sufficient, Velero has a concurrency config for each node, which the Kubernetes scheduler doesn't consider either.

Therefore, the ultimate solution may be a combination of the Kubernetes scheduler and Velero's own load balancer.
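
As a rough illustration of how such a supplement could sit on top of the scheduler (a sketch under assumed types, not Velero's implementation), a Velero-side picker could track in-flight VGDP loads per node and admit a new DataUpload/DataDownload only on a node that is still under its configured concurrency, while CPU/memory/affinity decisions stay with the scheduler:

```go
package loadbalance

// balancer is a hypothetical Velero-side supplement to the Kubernetes
// scheduler: it tracks per-node in-flight VGDP loads against a per-node
// concurrency limit, something the scheduler itself does not consider.
type balancer struct {
	limit    map[string]int // configured max concurrent loads per node
	inFlight map[string]int // DataUploads/DataDownloads currently running per node
}

// pickNode returns the eligible node with the fewest in-flight loads,
// or "" if every candidate is at its limit (the caller would then queue
// the request and retry later).
func (b *balancer) pickNode(candidates []string) string {
	best := ""
	bestLoad := 0
	for _, node := range candidates {
		load := b.inFlight[node]
		if load >= b.limit[node] {
			continue // node is already at its configured concurrency
		}
		if best == "" || load < bestLoad {
			best, bestLoad = node, load
		}
	}
	return best
}
```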

sseago (Collaborator) commented Sep 1, 2023

@Lyndon-Li If we do our own, we may want to make it configurable, i.e. able to be turned on or off (not sure which the default should be); that way, if one option gives bad performance, users can try the other.

Lyndon-Li (Contributor) commented:

As mentioned above, we need the capability of the Kubernetes scheduler as well as some supplements.
Ideally, we make a combination: Velero implements only the supplements, and the Kubernetes scheduler works as-is alongside Velero's part. Then we won't need a fallback.
Otherwise, if we cannot make them work together and Velero has to implement the Kubernetes scheduler's part as well, then we will need to make it configurable in case Velero's implementation has bugs or gets out of sync with the latest Kubernetes release.

Lyndon-Li (Contributor) commented:

Reopening this issue, as #6926 has not completely fixed the problem: the restore part is not fixed, and even for the backup part there is not as much intelligence in assigning the data upload overhead as a load balancer would provide.

Let's keep the issue open for new ideas for fixes.

kaovilai (Member) commented Feb 5, 2024

Is Design for data mover node selection #7383 related?

Lyndon-Li (Contributor) commented:

> Is Design for data mover node selection #7383 related?

No, #7383 is about node selection (including/excluding nodes), not about spreading VGDP across the nodes.
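
To make the distinction concrete (purely illustrative, not the #7383 design): node selection decides which nodes are allowed to run VGDP at all, for example via include/exclude sets, while this issue is about how the load is spread across whatever nodes remain eligible:

```go
package nodeselect

// selectNodes is a hypothetical illustration of include/exclude node
// selection (the concern of #7383). Spreading the VGDP load across the
// returned nodes (the concern of this issue) is a separate step.
func selectNodes(all []string, include, exclude map[string]bool) []string {
	var eligible []string
	for _, node := range all {
		if exclude[node] {
			continue // excluded nodes never run VGDP
		}
		if len(include) > 0 && !include[node] {
			continue // when an include list is given, only listed nodes qualify
		}
		eligible = append(eligible, node)
	}
	return eligible
}
```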
