
v1alpha2 planning #24

Closed · 4 tasks done
ffromani opened this issue Jan 23, 2023 · 17 comments

@ffromani
Contributor

ffromani commented Jan 23, 2023

Proposed changes for v1alpha2, to be discussed:

  • add constants for values we use repeatedly, like "Node" (and possibly more), and for the maximum supported number of NUMA zones (64). Update: these are scheduler/Topology Manager constraints, not API constraints; the API itself has no requirement here, so we defer further evaluation to at least v1alpha3 (a rough sketch of such constants follows this list)
  • add a top-level node Attributes field, like the one we already have for Zones
  • deprecate topologyPolicy (remove in v1beta1). Document how node-local Topology Manager configuration can be expressed (via the newly added Attributes rather than labels)
  • expose TM policy options as a top-level Attributes object (originally: find a way to expose them as labels, as an extension of point 2 above)
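
As mentioned in the first item, a rough sketch of what such constants could look like; the identifiers below are illustrative, not the final API:

```go
package v1alpha2

// Illustrative constants only: the exact names are not settled, and the
// 64-zone limit is a scheduler/Topology Manager constraint, not an API one.
const (
	// ZoneTypeNode is the zone type value we already use repeatedly ("Node").
	ZoneTypeNode = "Node"

	// MaxNUMAZones is the maximum number of NUMA zones the scheduler and the
	// Topology Manager side currently handle.
	MaxNUMAZones = 64
)
```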

Spun off from #21 to enable more incremental evolution.

@ffromani
Contributor Author

ffromani commented Jan 23, 2023

In v1alpha1, the NodeResourceTopology (NRT) object exposes a TopologyPolicies field at object level (= at compute-node level) which maps 1:1 to the combination of the kubelet Topology Manager scope and policy configured on the node.
See here for details.

This data is exposed to make life easier for the scheduler plugin, the sole consumer of this information.
The scheduler plugin uses it to reproduce the behavior the Topology Manager (TM) would apply on the node, so it can filter out and/or score nodes appropriately.

Additionally, this field is (planned to be) used to group together nodes using the same TM configuration. The scheduler can verify that nodes are properly grouped by the same TM config by checking this value. This verification is not very strong, but it is the only data available for the scheduler to consume.

It is worth stressing that the primary intended use, by and large, is to run the same logic as the TM on the scheduler side.

The information about the TM configuration, however, doesn't really belong in NRT. NRT should report only resources, not configuration data. We can get the same information using a more Kubernetes-native mechanism: labels.
The updating agents (RTE, NFD, ...) can add labels to convey this information. The open question to settle before implementing this design, and thus deprecating and later removing the TopologyPolicies field, is:

  1. should labels map 1:1 to the configuration option names and values from the kubelet configuration, or rather
  2. should labels abstract the kubelet configuration into its intended behavior?

Pros and cons (a sketch of options 1 and 3 follows at the end of this comment):

1. expose the kubelet configuration, duplicating the kubeletconfiguration API
PRO: straightforward mapping and trivial implementation
CON: leaks and duplicates kubelet configuration details
CON: needs to be kept in sync with the kubelet

2. just expose the kubeletconfiguration API
PRO: straightforward mapping and trivial implementation
CON: adds another dependency: clients need to consume both this API and the kubeletconfiguration API

3. a new set of labels derived from the kubelet configuration
PRO: decoupled from the kubelet configuration
CON: the translation layer has a cost (documentation, implementation)
???: still needs to be kept in sync with the kubelet, but less urgently
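
To make options 1 and 3 concrete, here is a minimal sketch of the node labels each approach could produce for a node running the single-numa-node policy with pod scope. All label keys and values below are illustrative, not an agreed naming scheme; the derived keys follow the naming floated in the follow-up comment below:

```go
package main

import "fmt"

func main() {
	// Option 1: labels mirroring the kubelet configuration 1:1.
	// Keys are illustrative and simply reuse the KubeletConfiguration
	// field names (topologyManagerPolicy, topologyManagerScope).
	mirrored := map[string]string{
		"kubelet.config.k8s.io/topologyManagerPolicy": "single-numa-node",
		"kubelet.config.k8s.io/topologyManagerScope":  "pod",
	}

	// Option 3: labels derived from (and decoupled from) the kubelet
	// configuration, expressing the intended behavior instead.
	derived := map[string]string{
		"topology.node.k8s.io/alignment-scope":      "pod",
		"topology.node.k8s.io/alignment-constraint": "full",
	}

	fmt.Println(mirrored)
	fmt.Println(derived)
}
```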

@ffromani
Contributor Author

My take: NRT wants to generalize what the TM does, so binding ourselves to the TM is a step back. For these reasons it seems better not to depend directly on the kubelet configuration; we should define our own labels and our own enum values for these labels.

The TM scope concept is generic and forward-looking enough that we can translate it into something like

topology.node.k8s.io/alignment-scope =["container", "pod"]

(or resource-alignment-scope?)

A less mature proposal for expressing the TM policy constraints could be

topology.node.k8s.io/alignment-constraint =["none", "minimize", "full"]

(values very much subject to change)
full -> "fully constrained" -> single numa node
minimize -> "use the minimal number of zones" -> restricted
none (or no label) -> "best-effort" or "none" (aka no guarantees/constraints)
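
A tiny sketch of the translation this implies, mapping the kubelet Topology Manager policy names onto the tentative alignment-constraint values above (both the values and the mapping are proposals, very much subject to change):

```go
package main

import "fmt"

// alignmentConstraint maps a kubelet Topology Manager policy name to the
// tentative alignment-constraint value sketched above; this mapping is a
// proposal, not a settled API.
func alignmentConstraint(tmPolicy string) string {
	switch tmPolicy {
	case "single-numa-node":
		return "full" // fully constrained: a single NUMA node
	case "restricted":
		return "minimize" // use the minimal number of zones
	default: // "best-effort", "none"
		return "none" // no guarantees/constraints
	}
}

func main() {
	for _, policy := range []string{"single-numa-node", "restricted", "best-effort", "none"} {
		fmt.Printf("%s -> %s\n", policy, alignmentConstraint(policy))
	}
}
```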

@PiotrProkop
Contributor

PiotrProkop commented Jan 24, 2023

One thing we are missing here is TopologyManagerPolicyOptions and how we want to express them via labels.
Something like topology.node.k8s.io/alignment-hints?
And I think that the best-effort and none policies should be kept separate, because if we can pick the right node, the best-effort policy will align resources the same way the restricted policy does.

@ffromani
Contributor Author

Right, both are good points. Exposing options gets hairier. From what you can see at this point in time, would we need to expose all options or just a selected few, from the scheduler's perspective?

@PiotrProkop
Contributor

Right now it's easy because we only have one option 😄 I feel like there won't be a lot of them in the future, thanks to DRA and NRI, so I am happy with exposing only a selected few, and we should include only prefer-closest-numa-nodes in the initial support.

@ffromani
Contributor Author

Thanks. This is going to be a balancing act. We're shaping up the label API and the design direction with help and feedback from the NFD maintainers; I'll keep you in the loop, tagging you on GH discussions as needed.

@ffromani
Contributor Author

Here's the (very) initial proposal, so we can look at some code and start to get a feel for how it could look: #25

@marquiz
Contributor

marquiz commented Jan 25, 2023

I'm just wondering whether labels are the way we want to advertise this information. Just a thought, but could the recently introduced NodeFeature CRD in NFD be used for this?

@ffromani
Contributor Author

I think what we need is a way to categorize and group nodes using common attributes. Labels can be a way to do this, but I am (and I think we are) very open to alternatives. I'm not up to speed with the NodeFeature CRD, so I'll happily read up on it.

@marquiz
Contributor

marquiz commented Jan 25, 2023

BTW, one unrelated improvement in the types would be to also enumerate ZoneTypes other than just "NUMA", e.g. all CPU topology levels identified by the Linux kernel, i.e. package, die, cluster (a somewhat overloaded term in the K8s context 😅), core(?)

@kad

kad commented Jan 25, 2023

Probably instead of "numa" -> "memory node" or something similar, with emphasis on memory partitioning.
For package/die/cluster/core -> those should be something like "cpu xxxx" to emphasize partitioning by CPU topology.

@swatisehgal
Contributor

> add constants for values we use repeatedly, like "Node" (and possibly more). Also add constants for max supported NUMA zones (64)

I think this makes sense in general and we should do this, but I would like to point out that the maximum number of supported NUMA zones is 8, as can be seen in the Topology Manager documentation.

> 1. should labels map 1:1 to the configuration option names and values from the kubelet configuration, or rather
> 2. should labels abstract the kubelet configuration into its intended behavior?
>
> 3. a new set of labels derived from the kubelet configuration
> PRO: decoupled from the kubelet configuration
> CON: the translation layer has a cost (documentation, implementation)
> ???: still needs to be kept in sync with the kubelet, but less urgently
>
> My take: NRT wants to generalize what the TM does, so binding ourselves to the TM is a step back. For these reasons it seems better not to depend directly on the kubelet configuration; we should define our own labels and our own enum values for these labels.

Can you elaborate on why you think using TM policies is a step back? We are running an algorithm in the scheduler plugin to approximate the allocation decision made by the Topology Manager in the kubelet. I notice that we are trying to decouple from the kubelet configuration, but I am probably not fully understanding why that is a problem. I am open to learning and getting a better understanding of this.

I am biased towards labels with a 1:1 mapping between the configured kubelet policy/policy options and the created labels, rather than abstracted versions of them. It seems more intuitive and saves us from having to translate and maintain additional documentation. Maybe it's just me, but I feel it might cause confusion and would be hard to explain that the labels exposed by NFD are related to the Topology Manager configuration but still don't match the configured values. What if someone, independent of the topology-aware scheduling use case, wants to use these labels to target a workload at a node with a specific Topology Manager policy configured? If we use these new labels, we would be diverging from the well-recognized policies documented in the official Kubernetes documentation.

When it comes to keeping the values synced with the kubelet: since there are plans to graduate Topology Manager to GA, I expect the names of the existing policies and policy options to continue to exist for the foreseeable future. If new policies or policy options are added, we would have to take those into account regardless of whether we use a 1:1 mapping or an abstracted label representation.

Maybe the abstraction of the policies is motivated by the ongoing discussions in the community about making resource management pluggable in the kubelet for more flexibility? Thinking out loud here, but if that is the case, we would have to figure out the fate of Topology Manager and its policies moving forward. I expect the resource managers and the existing policies to remain supported for the next few releases (especially now that there are plans to graduate to GA). A clear migration path would have to be identified in the community in case responsibility is offloaded to resource plugins and the resource managers are deprecated.

Having said that, I recognize we are heading in the right direction by moving topologyPolicies out of the NRT API and into NFD, where it is more suitable, so I don't want to slow down the momentum. We can pick the approach that everyone prefers and pivot later if needed.

@ffromani
Contributor Author

Thanks @swatisehgal, that is a good point. I'm reconsidering the approach to label representation a bit, also taking into account the earlier feedback from @marquiz. I'll take the minimum time needed to properly formalize my thoughts and then update this thread.

@ffromani
Contributor Author

I had an offline chat with others, including @kad, and we agreed that the API should not define the labels after all. They should be a contract between the producer and the consumer, and it's at the very least too early to freeze labels and values in the API, if we should do this at all.

We also agreed that an Attributes field, like the one we already have in Zone, makes sense in the top-level object, so I'm adding it to the proposal (and into #25).

The combination of these two factors enables a nice upgrade path: NFD (or RTE, or any other agent) can add the node's TM configuration as top-level attributes, which the scheduler plugin (or any other consumer) can read and use.
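
A minimal sketch of that upgrade path, assuming the top-level Attributes field reuses the same name/value pairs already used inside Zone. The type definitions below are inlined stand-ins (the real ones live in the API package, see #25, and may differ in detail), and the attribute names are illustrative, since they are part of the producer/consumer contract rather than the API:

```go
package main

import "fmt"

// Stand-ins for the API types; the actual definitions may differ.
type AttributeInfo struct {
	Name  string
	Value string
}

type AttributeList []AttributeInfo

func main() {
	// The updating agent (NFD, RTE, ...) publishes the node's TM
	// configuration as top-level attributes; names are illustrative.
	attrs := AttributeList{
		{Name: "topologyManagerPolicy", Value: "single-numa-node"},
		{Name: "topologyManagerScope", Value: "pod"},
	}

	// A consumer (e.g. the scheduler plugin) reads them back, without the
	// API itself having to freeze any label names or values.
	for _, attr := range attrs {
		fmt.Printf("%s=%s\n", attr.Name, attr.Value)
	}
}
```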

ffromani added a commit to k8stopologyawareschedwg/resource-topology-exporter that referenced this issue Jan 30, 2023
expose TM configuration as attributes of a special Zone.
This is in addition to TopologyPolicies field, which is
being deprecated. See:
- k8stopologyawareschedwg/noderesourcetopology-api#24
- k8stopologyawareschedwg/noderesourcetopology-api#25

Signed-off-by: Francesco Romani <[email protected]>
@fmuyassarov

> We also agreed that an Attributes field, like the one we already have in Zone, makes sense in the top-level object, so I'm adding it to the proposal (and into #25).

Also, I guess the TTL we discussed on the call for cleaning up stale NRT CRs can go in the top-level attributes too.
Also, just a reminder from the same call: it would be good to add a short name to the CRD, probably nrt.

@ffromani
Contributor Author

ffromani commented Feb 1, 2023

> Also, I guess the TTL we discussed on the call for cleaning up stale NRT CRs can go in the top-level attributes too.

Yes. We're going to add other extra fields as well.

> Also, just a reminder from the same call: it would be good to add a short name to the CRD, probably nrt.

Planned for v1beta1
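
For future reference, a sketch of where the short name could be declared once we get there, assuming the CRD manifest is generated with controller-gen from kubebuilder markers (if the CRD YAML is maintained by hand, the equivalent is adding nrt under spec.names.shortNames):

```go
// Package v1beta1 is a stand-in to show where the short name would be
// declared; the real type carries the full NodeResourceTopology spec.
package v1beta1

// NodeResourceTopology is a stand-in declaration for illustration only.
// +kubebuilder:resource:shortName=nrt
type NodeResourceTopology struct{}
```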

@ffromani
Contributor Author

ffromani commented Feb 1, 2023

Closed by #25.

ffromani closed this as completed Feb 1, 2023
ffromani added a commit to k8stopologyawareschedwg/resource-topology-exporter that referenced this issue Feb 2, 2023
expose TM configuration as attributes.
This is in addition to TopologyPolicies field, which is
being deprecated. See:
- k8stopologyawareschedwg/noderesourcetopology-api#24
- k8stopologyawareschedwg/noderesourcetopology-api#25

Signed-off-by: Francesco Romani <[email protected]>
ffromani added a commit to k8stopologyawareschedwg/podfingerprint that referenced this issue Feb 2, 2023
Since the NRT API is moving towards top-level attributes
(see:
k8stopologyawareschedwg/noderesourcetopology-api#24)
add the preferred Attribute Name alongside the existing Annotation Name.

Signed-off-by: Francesco Romani <[email protected]>
ffromani added a commit to k8stopologyawareschedwg/podfingerprint that referenced this issue Feb 2, 2023
Since the NRT API is moving towards top-level attributes
(see:
k8stopologyawareschedwg/noderesourcetopology-api#24)
add the preferred Attribute Name alongside the existing annotation Name.

Signed-off-by: Francesco Romani <[email protected]>
ffromani added a commit to k8stopologyawareschedwg/resource-topology-exporter that referenced this issue Feb 7, 2023
expose TM configuration as attributes.
This is in addition to TopologyPolicies field, which is
being deprecated. See:
- k8stopologyawareschedwg/noderesourcetopology-api#24
- k8stopologyawareschedwg/noderesourcetopology-api#25

Signed-off-by: Francesco Romani <[email protected]>