Data mover backup node black list - Don't run in specified node #7036

Closed
Lyndon-Li opened this issue Oct 31, 2023 · 7 comments

@Lyndon-Li
Contributor

The data mover backup exposer is generally able to select the node on which the data movement runs. This lets us fulfill the following user requirement:
Sometimes, if a node is running very critical workloads, users don't want data movements to run on that node.

We can develop a node black list mechanism: the nodes in the list will not host data movements.
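
To illustrate the idea, here is a minimal sketch of how a node black list could be expressed as a Kubernetes-style `NotIn` match expression on the well-known `kubernetes.io/hostname` label. The function names `black_list_to_expression` and `node_allowed` are hypothetical and are not part of Velero; the sketch only shows the selection logic, not how the exposer would apply it.

```python
# Hypothetical sketch: these helpers are illustrative, not Velero APIs.

HOSTNAME_LABEL = "kubernetes.io/hostname"  # well-known Kubernetes node label


def black_list_to_expression(black_list):
    """Translate a node black list into a NotIn match expression."""
    return {
        "key": HOSTNAME_LABEL,
        "operator": "NotIn",
        "values": sorted(black_list),
    }


def node_allowed(node_labels, expression):
    """Evaluate a NotIn expression against one node's labels."""
    if expression["operator"] != "NotIn":
        raise ValueError("this sketch only handles NotIn")
    value = node_labels.get(expression["key"])
    # NotIn matches when the label is absent or its value is not listed.
    return value not in expression["values"]


expr = black_list_to_expression({"node-b", "node-a"})
```

With this expression, a node labelled `kubernetes.io/hostname: node-a` would be rejected, while any node outside the list would still be eligible to host data movements.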

@Lyndon-Li Lyndon-Li added area/datamover Enhancement/User End-User Enhancement to Velero labels Oct 31, 2023
@Lyndon-Li Lyndon-Li self-assigned this Oct 31, 2023
@Lyndon-Li Lyndon-Li changed the title Data mover backup black list - Don't run in specified node Data mover backup node black list - Don't run in specified node Oct 31, 2023
@Lyndon-Li
Contributor Author

Another use case for this requirement: #7185

@Lyndon-Li
Contributor Author

Another use case for this requirement: #7243

@balbiv

balbiv commented Jan 31, 2024

@Lyndon-Li thank you for adding this to the milestones. Because the Helm deployment allows setting nodeSelectors for the node-agent, I created a dedicated node pool for it so that data movement (CSI Snapshot) has no impact on critical production applications. However, I found that the pod responsible for mounting the backup PVC (running /velero-helper pause) is allowed to schedule on every node. If the backup pod could take over the nodeSelectors/tolerations from the node-agent daemonset, that would already be a big improvement. It would also fix the problem that the backup pod cannot be started because the image pull policy is set to Never.

@Lyndon-Li
Contributor Author

Lyndon-Li commented Feb 5, 2024

@balbiv
Thanks for the suggestion. The current plan is to create a dedicated loadAffinity configMap instead of coupling node-agent pod scheduling to data mover backup pod scheduling, for the following reasons:

  • The node-agent does not only run snapshot data movement; it serves several other purposes, e.g., it also runs fs-backup/restore. So users may still not want to run data mover backups on all the nodes where node-agent pods reside.
  • The alternative plan is to have backupPods automatically inherit the node-agent pods' node-selection configurations. However, while we are clear about some of the configurations, e.g., nodeSelectors, we are not confident about others. Specifically, we are not confident that simply inheriting all the configurations from the node-agent would produce the same scheduling result for the backupPod as for the node-agent pod, since we think the daemonset scheduler behaves differently from the plain pod scheduler.

In the end, we decided not to take the node-agent's node selection into consideration. If any such configurations exist, users must apply them to the loadAffinity configMap in an appropriate way; see the design PR #7383.

This is the initial plan and we may change it according to comments on the PR, so if you have any ideas, please comment in the PR.
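
As a rough illustration of what a loadAffinity configuration might carry, the sketch below builds a JSON payload that pins data mover backups to a dedicated pool while excluding specific nodes. The field layout (`loadAffinity`, `nodeSelector`, `matchLabels`, `matchExpressions`) is an assumption modelled on Kubernetes label selectors, the `pool: backup` label and node name are made up, and `build_load_affinity` is a hypothetical helper; the design PR #7383 is the authoritative source for the actual schema.

```python
import json

# Hypothetical sketch: the schema below is an assumption, see design PR #7383.


def build_load_affinity(excluded_nodes, required_labels=None):
    """Build an assumed loadAffinity payload for a configMap data entry."""
    selector = {}
    if required_labels:
        # Only nodes carrying these labels may host data mover backups.
        selector["matchLabels"] = dict(required_labels)
    if excluded_nodes:
        # Nodes named here are excluded via the well-known hostname label.
        selector["matchExpressions"] = [
            {
                "key": "kubernetes.io/hostname",
                "operator": "NotIn",
                "values": sorted(excluded_nodes),
            }
        ]
    return {"loadAffinity": [{"nodeSelector": selector}]}


# Example values are illustrative only.
config = build_load_affinity(["critical-node-1"], {"pool": "backup"})
print(json.dumps(config, indent=2))
```

Keeping this in its own configMap, rather than deriving it from the node-agent daemonset, means the same node-selection rules have to be written twice if both should match, which is the trade-off described above.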

@SCLogo

SCLogo commented Nov 26, 2024

I know this is quite an old ticket, but do you have any idea how to set tolerations? I have nodes with taints set.

@kaovilai
Member

I see it for maintenance jobs but not for node-agents.

@kaovilai
Member

@SCLogo follow: #2898


7 participants