Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design for node-agent concurrency #6950

Merged
merged 1 commit into from
Nov 8, 2023

Conversation

Lyndon-Li
Copy link
Contributor

@Lyndon-Li Lyndon-Li commented Oct 13, 2023

For issue #6663, Add the design for node-agent concurrency

@codecov
Copy link

codecov bot commented Oct 13, 2023

Codecov Report

Merging #6950 (a0edad9) into main (74ed994) will increase coverage by 0.43%.
Report is 57 commits behind head on main.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #6950      +/-   ##
==========================================
+ Coverage   60.67%   61.11%   +0.43%     
==========================================
  Files         250      252       +2     
  Lines       26515    26916     +401     
==========================================
+ Hits        16088    16449     +361     
- Misses       9283     9302      +19     
- Partials     1144     1165      +21     

see 29 files with indirect coverage changes

design/node-agent-concurrency.md Outdated Show resolved Hide resolved
### Global data path manager
As for the code implementation, data path manager is to maintain the total number of the running VGDP instances and ensure the limit is not excceeded. At present, there is one data path manager instance per controller, as a result, the concurrent numbers are calculated separately for each controller. This doesn't help to limit the concurrency among different requesters.
Therefore, we need to create one global data path manager instance server-wide, and pass it to different controllers. The instance will be created at node-agent server startup.
The concurrent number is required to initiate a data path manager, the number comes from either Per-node concurrent number or Global concurrent number.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we control the limit of the running VGDP instances? Is this already implemented by Velero or Kopia?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it has been implemented, check the dataPath manager code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to also update the PVB, PVR, DataUpload, and DataDownload status to reflect the reason why they are still not running?

	fsBackup, err := r.dataPathMgr.CreateFileSystemBR(pvb.Name, pVBRRequestor, ctx, r.Client, pvb.Namespace, callbacks, log)
	if err != nil {
		if err == datapath.ConcurrentLimitExceed {
			return ctrl.Result{Requeue: true, RequeueAfter: time.Minute}, nil
		} else {
			return r.errorOut(ctx, &pvb, err, "error to create data path", log)
		}
	}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will do this in the code change PR.


## Goals

- Define the behaviors of concurrent VGDP instances in node-agent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we support configuring the concurrency in the backup/restore level in the future, will this feature be impacted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not impacted. The two controls could work together, backups/restores are not probably all in the same node; while in the same node, the VGDP concurrency mechanism has higher priority over backup/restore level concurrency, as the former is for resources consideration, if there are not enough resources, the backups/restores level configurations are not be fulfilled.

@Lyndon-Li Lyndon-Li force-pushed the issue-fix-6663-design branch 2 times, most recently from 9962e43 to d219b0e Compare October 25, 2023 13:44
@Lyndon-Li Lyndon-Li force-pushed the issue-fix-6663-design branch 3 times, most recently from d1daa89 to 8163ad0 Compare October 26, 2023 10:21
Copy link
Contributor

@shawn-hurley shawn-hurley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of questions on the design of this API

design/node-agent-concurrency.md Outdated Show resolved Hide resolved
## Non-Goals
- VGDP instances from different nodes always run in concurrent since in most common cases the resources are isolated. For special cases that some resources are shared across nodes, there is no support at present
- In practice, restores run in prioritized scenarios, e.g., disarster recovery. However, the current design doesn't consider this difference, a VGDP instance for a restore is blocked if it reaches to the limit of the concurrency, even though the ones block it are for backups. If users do meet some problems here, they should consider to stop the backups first
- Sometimes, users wants to block VGDP instances from running in a specific node, this is not considered in the current design. To archive this, more modules need to be considered, just setting VGDP instances' concurrent number to 0 won't bring the ideal reasult
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would be the result then? and why would it not be ideal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the details in the design.

### Per-node concurrent number
We allow users to specify different concurrent number per node, for example, users can allow 3 concurrent instances in Node-1, 2 instances in Node-2 and 1 instance in Node-3. This is for below considerations:
- The resources may be different among nodes. Then users could specify smaller concurrent number for nodes with less resources while larger number for the ones with more resources
- Help users to isolate critical environments. Users may run some critical workloads in some specified nodes, since VGDP instances may take large resource consumption, users may only want to run them in the other nodes or run less number of instances in the nodes with critical workloads
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users may only want to run them in the other nodes

How does this work if I can't set the number to zero and we do not consider turning off the node agent for specific nodes.

Feels like there is a mismatch between goals/non-goals and solution, can you help me understand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the number is 0 for either the global number of per-node number, it will be ignored.

There were indeed some struggles on supporting the 0 number (blocking mode), but finally I gave it up for now. So there were some mismatch in the doc. I've removed this sentence.

In future, if we see cases hardly require this, let's think about a solution from the entire workflow. As mentioned in the doc, only do it from VGDP instance level doesn't work


The range of Per-node concurrent number is the same with Global concurrent number.
Per-node concurrent number is preferable to Global concurrent number, so it will overwrite the Global concurrent number for that node.
Per-node concurrent number is implemented in a separate configMap named ```node-agent-configs```. This configMap is not created by Velero, users should create it manually on demand. The configMap should be created before node-agent server starts, otherwise, node-agent server should be restarted in order to make the effect.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we watch for this particular config map instead and make the changes necessary when they are made?

Copy link
Contributor Author

@Lyndon-Li Lyndon-Li Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. From me, this number is not frequently changed, so we don't need to watch it, but retrieve it once at server starts.
  2. There are some more details to cover. If we allow it, users may change the number in the configure any time, there will be a chaos. E.g., for one time, there are 3 concurrencies and for the other time, there are 2 concurrencies.
  3. It makes the implementation complex, the dataPathManager is a singleton instance, if the number is not set only once when initiating it, we will need a watch mechanism and also a contest solving mechanism to update the number

Therefore, if there is not a case that strongly requires this, I would prefer not adding this complexy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having behavior that matches kube-like behavior (constant reconciliation) is complexity that is worth adding.

The ```node-agent-configs``` configMap may be used for other purpose of configuring node-agent in future, at present, there is only one kind of data ```data-path-concurrency```. Its data struct is like below:

```go
type DataPathConcurrency struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it considered to make this a CRD singleton cluster scoped CR?

Copy link
Contributor Author

@Lyndon-Li Lyndon-Li Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I get your question, let me try to answer it. By the design, this is singleton for a single Velero installation. If there are more than one installation in a cluster, I didn't consider it for the design, but technically, yes, all the Velero installations will follow the same cm.

Copy link
Contributor

@shawn-hurley shawn-hurley Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So either, this should be a cluster-scoped resource as a CRD (not a config map). Or it should be namespace scoped configmap and tied to the velero node agent configuration. I think it will be confusing otherwise. Magic config maps that have special meaning used to be needed, but I think we have options to make it more declarative and therefore more explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as my understanding, this per-node number is a static configuration, so I don't understand the benefit to make it a CRD.
If you are talking about dynamically watching it, as mentioned here, I don't think it is necessary to dynamically watch a configuration that is not to be frequently changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be frequently changed, BUT when you change it, what is the process now, and how does it affect running backups?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So both the global number change and per-node number change won't take effect unless node-agent server restarts and node-agent server is only recommended to restart when there are no running backups/restores

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried about the brittleness of this with a backup and restore system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. For the already running backups/restores, I do believe this is a reasonable behavior, we cannot peel a data path thread (which for example has backed up tones of data) whenever the number is changed by users.
  2. For the backups/restores to be created/submitted, we can use the new number without restarting node-agent server, but I don't think it is something must have, especially in the first version.

For 2, I am open for a vote of this. It is just efforts/complexity vs. user friendliness

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I am following 1 unless you are spelling out the agreement on letting running threads continue running and reducing the number over time.

I believe that if you implemented the system described in 1 and 2 it would be more resilient and have a better UX (especially when you say need to bump up/down these resources for a $REASON.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the agreement on letting running threads continue running and reducing the number over time

Yes, this is what I meant, the limit should only work on the new coming VGDP instance creations.

```go
type DataPathConcurrency struct {
// ConfigRules specifies the configs by node identified by label selector
ConfigRules map[string]int `json:"configRules"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a string, we should use a LabelSelector, IMO.

Also, something to consider: if multiple labels are matched, which one is selected? I suggest the safe thing is the lesser of the entries.

Copy link
Contributor Author

@Lyndon-Li Lyndon-Li Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if multiple labels are matched, which one is selected? I suggest the safe thing is the lesser of the entries

Yes, that is what we are going to do. See this statement in the design "If one node falls into more than one elements, e.g., if node1 also has the label beta.kubernetes.io/instance-type=Standard_B4ms, the smallest number (3) will be used. "

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a string, we should use a LabelSelector, IMO.

See this comment. The string is in the pattern of LabelSelector, so the code will convert a string to a LabelSelector. This is for two considerations:

  1. LabelSelector cannot be a key of map. To use LabelSelector, the structure will be complex
  2. A string is convenient for users to write the config, they don't need to write "{}" or the assignment to "matchLabel"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don' know if this makes sense. There is a well-defined struct for label selector, and if you don't use that, it won't be very clear.

I would suggest using a CRD for this and the well-known label selector struct. I would probably just do a simple list. Then, you can order that list by the size of the threads. you will loop through the selectors, and once one is found, it will be the lowest option.

I don't think you gain anything from having a map here, as to see if a node is selected you have to look at the key, so you will have to loop over the map. I think that a list would be a better choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try with a list of label selector. I have ever considered this, but it eventually involves two data structures, so I just dropped

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, changed to LabelSelector

If one node falls into more than one elements, e.g., if ```node1``` also has the label ```beta.kubernetes.io/instance-type=Standard_B4ms```, the smallest number (3) will be used.

### Global data path manager
As for the code implementation, data path manager is to maintain the total number of the running VGDP instances and ensure the limit is not excceeded. At present, there is one data path manager instance per controller, as a result, the concurrent numbers are calculated separately for each controller. This doesn't help to limit the concurrency among different requesters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand.

This is saying that the node agent controller will be responsible for the number of concurrent DataUploads/DataDownloads it can handle and ensure this is true? Can you help me understand why this is not the thread of the workers for the controller and needs something in a global manager?

Copy link
Contributor Author

@Lyndon-Li Lyndon-Li Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is saying that the node agent controller will be responsible for the number of concurrent

No, this is the current/existing behavior. We need to change the control to the node-agent server wide, as you mentioned in a global manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Can you please help me?

Is this because we have two things (DataUploads and DataDownloads) with workers, and you want to split them?

If this is the case, then isn't there a concern that a node agent will take ownership, because its worker is free just spinning because there are no available "dataManager" threads? Then other node agents that are in the system, that may have free "dataManager" threads, will be sitting idle?

Did I miss us handling this in the design?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have mislead you. The thing is, in the existing code, there are one dataPathMgr in one controller, so in order to count the overall VGDP instances no matter they are from a DataUpload, DataDownload, PodVolumeBackup or PodVolumeRestore, we need to make a singleton dataPathMgr. See the existing code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I do think that, because a node takes ownwership of something and then can determine if it can do work on it, we may want to consider how to make sure this is something that can more adequately be spread out.

@Lyndon-Li Lyndon-Li force-pushed the issue-fix-6663-design branch from 8163ad0 to f1ddae3 Compare October 27, 2023 01:48
## Solution

### Global concurrent number
We add a node-agent server parameter ```data-path-concurrent-num``` to accept a concurrent number that will be applied to all nodes if the per-node number is not specified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this param is for the node-agent server, isn't it always per-node, because different nodes run different node-agent server instances?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for node-agent server, see this comment


The range of Per-node concurrent number is the same with Global concurrent number.
Per-node concurrent number is preferable to Global concurrent number, so it will overwrite the Global concurrent number for that node.
Per-node concurrent number is implemented in a separate configMap named ```node-agent-configs```. This configMap is not created by Velero, users should create it manually on demand. The configMap should be created before node-agent server starts, otherwise, node-agent server should be restarted in order to make the effect.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the configmap is an alternative to the param for node-agent server, I'd vote for node-agent server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not an alternative, the thing is, we want a number that applied to all nodes and we also want some nodes have their particular number.


### Global data path manager
As for the code implementation, data path manager is to maintain the total number of the running VGDP instances and ensure the limit is not excceeded. At present, there is one data path manager instance per controller, as a result, the concurrent numbers are calculated separately for each controller. This doesn't help to limit the concurrency among different requesters.
Therefore, we need to create one global data path manager instance server-wide, and pass it to different controllers. The instance will be created at node-agent server startup.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by server-wide, do you mean node-agent server or the whole velero setup? I assume it's node-agent server, if that's the node-agent server, I don't think it's global

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is node-agent server. Maybe we can give it a better name, what about "Ordinary Number"?

- Help users to isolate critical environments. Users may run some critical workloads in some specified nodes, since VGDP instances may take large resource consumption, users may want to run less number of instances in the nodes with critical workloads

The range of Per-node concurrent number is the same with Global concurrent number.
Per-node concurrent number is preferable to Global concurrent number, so it will overwrite the Global concurrent number for that node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term "Global concurrent number" is a little confusing to me. IMO this concurrent number is to control the concurrency within one node-agent server, and if this is true, the number is per-node. And the "Global current number" is just a shortcut to help user to set the currency for every node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the "Global concurrent number" applies to every node by default and if a per-node number is also set for a node, the per-node number overwrites the number in that node.
For example, say Global Number is 3, and there are 3 nodes -- node-1, node-2, node-3; and users also configured a per-node number for node-2, say 5, then the concurrent number for each node are:
node-1: 3, node-2:5, node-3:3

@shawn-hurley
Copy link
Contributor

shawn-hurley commented Oct 27, 2023

I would like to see an alternative explored, where

  1. we handle a node not having workers running for one of DataDownloads OR DataUploads or both. I think this is going to be an everyday thing. You don't want a node that is doing AI work with GPU's to have to have a worker running. Especially if you say you have a node with tons of resources for this purpose.
  2. One that uses a CRD that uses well defined label selector syntax. I would also like to consider being able to break up if a node does DataDownloads or DataRestores. I would also add to this, consideration of something other than a per node singleton that manages all the data path stuff.

@Lyndon-Li
Copy link
Contributor Author

we handle a node not having workers running for one of DataDownloads OR DataUploads or both. I think this is going to be an everyday thing. You don't want a node that is doing AI work with GPU's to have to have a worker running. Especially if you say you have a node with tons of resources for this purpose.

I agree it is a common requirement. However, as I mentioned, we cannot simply make this from define the concurrency of VGDP.
We need to consider the reality of different backup types and consider more modules like the exposer. E.g., we cannot make this for fs-backup, because the VGDP must run in the same node that the source pod is running.
Therefore, to support this, I would suggest we a new design which would be very different from the current one

@Lyndon-Li
Copy link
Contributor Author

I would also like to consider being able to break up if a node does DataDownloads or DataRestores. I would also add to this, consideration of something other than a per node singleton that manages all the data path stuff.

I agree this is a reasonable requirement. But again, we cannot finish this from the VGDP level only. Things are different for backup types. E.g., for both fs restore and data mover restore, pods are first created and then VGDP instances are launched, the thing is, in some cases, a VGDP instance must run in the same node where the pod is scheduled to, otherwise, the restored volume may not be mounted to the pod if there are some topology/affinity/taint settings.

So the same, these requirements is not the current design could fulfill, because it is only aiming to the number of VGDP instances

@shawn-hurley
Copy link
Contributor

I think that if we agree these are reasonable requirements, then the design should also solve them. I wonder if we could take some time in a community meeting or set up a one-off call, to discuss and brainstorm if that would be helpful?

@Lyndon-Li
Copy link
Contributor Author

I think that if we agree these are reasonable requirements, then the design should also solve them. I wonder if we could take some time in a community meeting or set up a one-off call, to discuss and brainstorm if that would be helpful?

Any way is fine to me. We can discuss in the next community meeting. Meanwhile, I am not sure if we are on the same page for everything (regardless our opinions), if not, we can run a sync-up call first.

@Lyndon-Li
Copy link
Contributor Author

we handle a node not having workers running for one of DataDownloads OR DataUploads or both.
I would also like to consider being able to break up if a node does DataDownloads or DataRestores.

@shawn-hurley For these two questions, I think we are still not in the same page, because I thought you regarded them as something we can make, but I don't think so --- considering all types, PVB/PVR, data mover backup/restore, these are something we cannot achieve right now and for a long time in future. For the details, we need to consider from end to end:

Why can't we support --- "we handle a node not having workers running for one of DataDownloads OR DataUploads or both":
For all the types, only data mover backup could support it right now. For PVB/PVR, at the time when launching the VGDP instance, the source/restored pod is already there in a specific node, and volume is already mounted in that node, we must run VGDP in the node, no way to choose. So it is a vain to set VGDP limit in that node to 0, unless we want the PVB/PVR never to finish. For data mover restore, in the case that the volume binding mode is WaitForFirstConsumer, things are the same, we cannot run the VGDP in a node we choose, but only in the node that the restored pod is scheduled to.

This means, unless enhancing the above pod schedule related behavior, we have no way to support it. For sure, it will be a hard work.

Why can't we support --- "being able to break up if a node does DataDownloads or DataRestores":
For the same reasons as mentioned above, VGDP instances for PVB, PVR and DataRestores cannot run in a node we choose, so it is a vain to split them as the ones for backups and for restores.

@Lyndon-Li Lyndon-Li force-pushed the issue-fix-6663-design branch 5 times, most recently from 57a307e to 2d33d45 Compare October 30, 2023 10:46
@Lyndon-Li
Copy link
Contributor Author

Done on some further changes to address comments:

  • Change the rule structs to use list and LabelSelector
  • Move the GlobalNum to the configMap as well
  • Add a sample for the configMap

Please continue the review.

@shawn-hurley
Copy link
Contributor

Thank you for explaining where we were not on the same page. I think that helped a lot for me to understand the concerns. I think now we have to go up the stack a second and talk about the feature and what it means it for users.

Is it worth breaking up what can be expected by users based on which backup they choose? This would allow us to do the "more correct" thing for DataUploads and continue to support PVB/PVR requests and DataDownloads that have to get set for a specific node.

I think that it is reasonable that the main need here is to control nodes that are uploading data. If you are restoring, something went wrong, and you probably want most/all available resources to do that, from a UX perspective, it seems reasonable.

What are the downsides of this?

@Lyndon-Li
Copy link
Contributor Author

Is it worth breaking up what can be expected by users based on which backup they choose? This would allow us to do the "more correct" thing for DataUploads

This is what I was planning to do in a a separate PR/design (another issue #7036 is opened for it), since its implementation/usability deviates the current PR:

  1. The modules to implement this is different, that is, exposer vs. VGDP
  2. The scope is different, that is, data mover backup vs. all tasks with moving data
  3. The user experience is different, that is, black list vs. VGDP concurrency

@Lyndon-Li Lyndon-Li force-pushed the issue-fix-6663-design branch from 2d33d45 to d2cfd9f Compare October 31, 2023 02:32
Copy link
Contributor

@shawn-hurley shawn-hurley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerns around naming now that I finally understand the issues, thank you @Lyndon-Li for being patient while I caught up.

I hope the comments make sense in that light, let me know if I am still misunderstanding!


The data structure for ```data-path-concurrency``` is as below:
```go
type DataPathConcurrency struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that, this in conjunction with the changes that you are describing for the data upload change, may be confusing.

I think that, at least now, i understand the DataPath name and what it means, but I don't think it is clear to the end users what this is managing.

Can we call this external concurrency or node load concurrency or something, it feels like DataPath is very much an internal concept that does convey specific meaning.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, DataPath is more like a design/architecture level concept.
We have discussed a little about this in the community meeting, but haven't get the final answer. Some of the previous nominations -- "dataPathConcurrency", "dataMoverConcurrency".

Personally, I think "loadConcurrency" (we don't need to mention "node" because it is under the node-agent-configs) is close to what we are doing here and is not ambiguous. Waiting for others' opinions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this naming question, we decide not to block the merge of this design. We can keep the discussion on this, and when we want to change the name, if the design has been merged, we come back to make slight change to this design.
So this discussion can continue until the FC of 1.13.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lyndon-Li "loadConcurrency" works for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadConcurrency works for me as well. Docs on this guy will be necessary, but I now better understand what this is doing. Thank you for taking the time, and I agree if inspiration strikes and folks come up with a better name, awesome!


## Solution

We introduce a configMap named ```node-agent-configs``` for users to specify the node-agent related configurations. This configMap is not created by Velero, users should create it manually on demand.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which namespace would this config live in? How would two node agent daemon sets, if we have multiple installs know which one to look at?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I have modify the design to include it --- the configMap is namespace specific and only works for node-agent in the specific namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. I worry about the UX once you have three and try to understand what is going on in a node across the three pods, but I think we can solve it if/when we hit that problem.

@Lyndon-Li Lyndon-Li force-pushed the issue-fix-6663-design branch from d2cfd9f to a0edad9 Compare November 1, 2023 03:35
@shawn-hurley
Copy link
Contributor

/lgtm

@ywk253100 ywk253100 merged commit 866fbb5 into vmware-tanzu:main Nov 8, 2023
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants