
v1alpha2 planning #24

Closed · 4 tasks done
ffromani opened this issue Jan 23, 2023 · 17 comments

@ffromani
Contributor

ffromani commented Jan 23, 2023

Proposed changes for v1alpha2, to be discussed:

  • add constants for values we use repeatedly, like "Node" (and possibly more), and for the maximum supported number of NUMA zones (64). Update: these are scheduler/Topology Manager constraints, not API constraints; the API itself has no requirement here, so we defer further evaluation to at least v1alpha3 (a rough sketch of such constants follows this list)
  • add a top-level node Attributes field, like the one we already have for Zones
  • deprecate topologyPolicy (remove in v1beta1). Document how node-local Topology Manager configuration can be expressed (via the newly added Attributes rather than labels)
  • expose TM policy options as a top-level Attributes object (originally: find a way to expose them as labels, as an extension of point 2 above)
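
As mentioned in the first item, a rough sketch of what such constants could look like; the identifiers below are illustrative, not the final API:

```go
package v1alpha2

// Illustrative constants only: the exact names are not settled, and the
// 64-zone limit is a scheduler/Topology Manager constraint, not an API one.
const (
	// ZoneTypeNode is the zone type value we already use repeatedly ("Node").
	ZoneTypeNode = "Node"

	// MaxNUMAZones is the maximum number of NUMA zones the scheduler and the
	// Topology Manager side currently handle.
	MaxNUMAZones = 64
)
```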

Spun off from #21 to enable more incremental evolution.

@ffromani
Contributor Author

ffromani commented Jan 23, 2023

In v1alpha1, the NodeResourceTopology (NRT) object exposes a TopologyPolicies field at object level (= at compute-node level) which maps 1:1 to the combination of the kubelet Topology Manager scope and policy configured on the node.
See here for details.

This data is exposed to make life easier for the scheduler plugin, the sole consumer of this information.
The scheduler plugin uses it to reproduce the behavior the Topology Manager (TM) would apply on the node, so it can filter out and/or score nodes appropriately.

Additionally, this field is (planned to be) used to group together nodes using the same TM configuration. The scheduler can verify that nodes are properly grouped by the same TM config by checking this value. This verification is not very strong, but it is the only data available for the scheduler to consume.

It is worth stressing that the primary intended use, by and large, is to run the same logic as the TM on the scheduler side.

The information about the TM configuration, however, doesn't really belong in NRT. NRT should report only resources, not configuration data. We can get the same information using a more Kubernetes-native mechanism: labels.
The updating agents (RTE, NFD, ...) can add labels to convey this information. The open question to settle before implementing this design, and thus deprecating and later removing the TopologyPolicies field, is:

  1. should labels map 1:1 to the configuration option names and values from the kubelet configuration, or rather
  2. should labels abstract the kubelet configuration into its intended behavior?

Pros and cons (a sketch of options 1 and 3 follows at the end of this comment):

1. expose the kubelet configuration, duplicating the kubeletconfiguration API
PRO: straightforward mapping and trivial implementation
CON: leaks and duplicates kubelet configuration details
CON: needs to be kept in sync with the kubelet

2. just expose the kubeletconfiguration API
PRO: straightforward mapping and trivial implementation
CON: adds another dependency: clients need to consume both this API and the kubeletconfiguration API

3. a new set of labels derived from the kubelet configuration
PRO: decoupled from the kubelet configuration
CON: the translation layer has a cost (documentation, implementation)
???: still needs to be kept in sync with the kubelet, but less urgently
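
To make options 1 and 3 concrete, here is a minimal sketch of the node labels each approach could produce for a node running the single-numa-node policy with pod scope. All label keys and values below are illustrative, not an agreed naming scheme; the derived keys follow the naming floated in the follow-up comment below:

```go
package main

import "fmt"

func main() {
	// Option 1: labels mirroring the kubelet configuration 1:1.
	// Keys are illustrative and simply reuse the KubeletConfiguration
	// field names (topologyManagerPolicy, topologyManagerScope).
	mirrored := map[string]string{
		"kubelet.config.k8s.io/topologyManagerPolicy": "single-numa-node",
		"kubelet.config.k8s.io/topologyManagerScope":  "pod",
	}

	// Option 3: labels derived from (and decoupled from) the kubelet
	// configuration, expressing the intended behavior instead.
	derived := map[string]string{
		"topology.node.k8s.io/alignment-scope":      "pod",
		"topology.node.k8s.io/alignment-constraint": "full",
	}

	fmt.Println(mirrored)
	fmt.Println(derived)
}
```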

@ffromani
Contributor Author

My take: NRT wants to generalize what the TM does, so binding ourselves to the TM is a step back. For these reasons it seems better not to depend directly on the kubelet configuration; we should define our own labels and our own enum values for these labels.

The TM scope concept is generic and forward-looking enough that we can translate it into something like

topology.node.k8s.io/alignment-scope =["container", "pod"]

(or resource-alignment-scope?)

A less mature proposal for expressing the TM policy constraints could be

topology.node.k8s.io/alignment-constraint =["none", "minimize", "full"]

(values very much subject to change)
full -> "fully constrained" -> single numa node
minimize -> "use the minimal number of zones" -> restricted
none (or no label) -> "best-effort" or "none" (aka no guarantees/constraints)
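
A tiny sketch of the translation this implies, mapping the kubelet Topology Manager policy names onto the tentative alignment-constraint values above (both the values and the mapping are proposals, very much subject to change):

```go
package main

import "fmt"

// alignmentConstraint maps a kubelet Topology Manager policy name to the
// tentative alignment-constraint value sketched above; this mapping is a
// proposal, not a settled API.
func alignmentConstraint(tmPolicy string) string {
	switch tmPolicy {
	case "single-numa-node":
		return "full" // fully constrained: a single NUMA node
	case "restricted":
		return "minimize" // use the minimal number of zones
	default: // "best-effort", "none"
		return "none" // no guarantees/constraints
	}
}

func main() {
	for _, policy := range []string{"single-numa-node", "restricted", "best-effort", "none"} {
		fmt.Printf("%s -> %s\n", policy, alignmentConstraint(policy))
	}
}
```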

@PiotrProkop
Contributor

PiotrProkop commented Jan 24, 2023

One thing we are missing here is TopologyManagerPolicyOptions and how we want to express them via labels.
Something like topology.node.k8s.io/alignment-hints?
And I think that the best-effort and none policies should be kept separate, because if we can pick the right node, the best-effort policy will align resources the same way the restricted policy does.

@ffromani
Contributor Author

Right, both are good points. Exposing options gets hairier. From what you can see at this point in time, would we need to expose all options or just a selected few, from the scheduler's perspective?

@PiotrProkop
Contributor

Right now it's easy because we only have one option 😄 I feel like there won't be a lot of them in the future, thanks to DRA and NRI, so I am happy with exposing only a selected few, and we should include only prefer-closest-numa-nodes in the initial support.

@ffromani
Contributor Author

Thanks. This is going to be a balancing act. We're shaping up the label API and the design direction with help and feedback from the NFD maintainers; I'll keep you in the loop, tagging you on GH discussions as needed.

@ffromani
Contributor Author

Here's the (very) initial proposal, so we can look at some code and start to get a feel for how it could look: #25

@marquiz
Contributor

marquiz commented Jan 25, 2023

I'm just wondering whether labels are the way we want to advertise this information. Just a thought, but could the recently introduced NodeFeature CRD in NFD be used for this?

@ffromani
Contributor Author

I think what we need is a way to categorize and group nodes using common attributes. Labels can be a way to do this, but I am (and I think we are) very open to alternatives. I'm not up to speed with the NodeFeature CRD, so I'll happily read up on it.

@marquiz
Contributor

marquiz commented Jan 25, 2023

BTW, one unrelated improvement in the types would be to also enumerate ZoneTypes other than just "NUMA", e.g. all CPU topology levels identified by the Linux kernel, i.e. package, die, cluster (a somewhat overloaded term in the K8s context 😅), core(?)

@kad

kad commented Jan 25, 2023

Probably instead of "numa" -> "memory node" or something similar, with emphasis on memory partitioning.
For package/die/cluster/core -> those should be something like "cpu xxxx" to emphasize partitioning by CPU topology.

@swatisehgal
Contributor

> add constants for values we use repeatedly, like "Node" (and possibly more). Also add constants for max supported NUMA zones (64)

I think this makes sense in general and we should do this, but I would like to point out that the maximum number of supported NUMA zones is 8, as can be seen in the Topology Manager documentation.

> 1. should labels map 1:1 to the configuration option names and values from the kubelet configuration, or rather
> 2. should labels abstract the kubelet configuration into its intended behavior?
>
> 3. a new set of labels derived from the kubelet configuration
> PRO: decoupled from the kubelet configuration
> CON: the translation layer has a cost (documentation, implementation)
> ???: still needs to be kept in sync with the kubelet, but less urgently
>
> My take: NRT wants to generalize what the TM does, so binding ourselves to the TM is a step back. For these reasons it seems better not to depend directly on the kubelet configuration; we should define our own labels and our own enum values for these labels.

Can you elaborate on why you think using TM policies is a step back? We are running an algorithm in the scheduler plugin to approximate the allocation decision made by the Topology Manager in the kubelet. I notice that we are trying to decouple from the kubelet configuration, but I am probably not fully understanding why that is a problem. I am open to learning and getting a better understanding of this.

I am biased towards labels with a 1:1 mapping between the configured kubelet policy/policy options and the created labels, rather than abstracted versions of them. It seems more intuitive and saves us from having to translate and maintain additional documentation. Maybe it's just me, but I feel it might cause confusion and would be hard to explain that the labels exposed by NFD are related to the Topology Manager configuration but still don't match the configured values. What if someone, independent of the topology-aware scheduling use case, wants to use these labels to target a workload at a node with a specific Topology Manager policy configured? If we use these new labels, we would be diverging from the well-recognized policies documented in the official Kubernetes documentation.

When it comes to keeping the values synced with the kubelet: since there are plans to graduate Topology Manager to GA, I expect the names of the existing policies and policy options to continue to exist for the foreseeable future. If new policies or policy options are added, we would have to take those into account regardless of whether we use a 1:1 mapping or an abstracted label representation.

Maybe the abstraction of the policies is motivated by the ongoing discussions in the community about making resource management pluggable in the kubelet for more flexibility? Thinking out loud here, but if that is the case, we would have to figure out the fate of Topology Manager and its policies moving forward. I expect the resource managers and the existing policies to remain supported for the next few releases (especially now that there are plans to graduate to GA). A clear migration path would have to be identified in the community in case responsibility is offloaded to resource plugins and the resource managers are deprecated.

Having said that, I recognize we are heading in the right direction by moving topologyPolicies out of the NRT API and into NFD, where it is more suitable, so I don't want to slow down the momentum. We can pick the approach that everyone prefers and pivot later if needed.

@ffromani
Contributor Author

Thanks @swatisehgal, that is a good point. I'm reconsidering the approach to label representation a bit, also taking into account the earlier feedback from @marquiz. I'll take the minimum time needed to properly formalize my thoughts and then update this thread.

@ffromani
Contributor Author

I had an offline chat with others, including @kad, and we agreed that the API should not define the labels after all. They should be a contract between the producer and the consumer, and it's at the very least too early to freeze labels and values in the API, if we should do this at all.

We also agreed that an Attributes field, like the one we already have in Zone, makes sense in the top-level object, so I'm adding it to the proposal (and into #25).

The combination of these two factors enables a nice upgrade path: NFD (or RTE, or any other agent) can add the node's TM configuration as top-level attributes, which the scheduler plugin (or any other consumer) can read and use.
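
A minimal sketch of that upgrade path, assuming the top-level Attributes field reuses the same name/value pairs already used inside Zone. The type definitions below are inlined stand-ins (the real ones live in the API package, see #25, and may differ in detail), and the attribute names are illustrative, since they are part of the producer/consumer contract rather than the API:

```go
package main

import "fmt"

// Stand-ins for the API types; the actual definitions may differ.
type AttributeInfo struct {
	Name  string
	Value string
}

type AttributeList []AttributeInfo

func main() {
	// The updating agent (NFD, RTE, ...) publishes the node's TM
	// configuration as top-level attributes; names are illustrative.
	attrs := AttributeList{
		{Name: "topologyManagerPolicy", Value: "single-numa-node"},
		{Name: "topologyManagerScope", Value: "pod"},
	}

	// A consumer (e.g. the scheduler plugin) reads them back, without the
	// API itself having to freeze any label names or values.
	for _, attr := range attrs {
		fmt.Printf("%s=%s\n", attr.Name, attr.Value)
	}
}
```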

ffromani added a commit to k8stopologyawareschedwg/resource-topology-exporter that referenced this issue Jan 30, 2023
expose TM configuration as attributes of a special Zone.
This is in addition to TopologyPolicies field, which is
being deprecated. See:
- k8stopologyawareschedwg/noderesourcetopology-api#24
- k8stopologyawareschedwg/noderesourcetopology-api#25

Signed-off-by: Francesco Romani <[email protected]>
@fmuyassarov

> We also agreed that an Attributes field, like the one we already have in Zone, makes sense in the top-level object, so I'm adding it to the proposal (and into #25).

Also, I guess the TTL we discussed on the call for cleaning up stale NRT CRs can go in the top-level attributes too.
Also, just a reminder from the same call: it would be good to add a short name to the CRD, probably nrt.

@ffromani
Contributor Author

ffromani commented Feb 1, 2023

> Also, I guess the TTL we discussed on the call for cleaning up stale NRT CRs can go in the top-level attributes too.

Yes. We're going to add other extra fields as well.

> Also, just a reminder from the same call: it would be good to add a short name to the CRD, probably nrt.

Planned for v1beta1
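
For future reference, a sketch of where the short name could be declared once we get there, assuming the CRD manifest is generated with controller-gen from kubebuilder markers (if the CRD YAML is maintained by hand, the equivalent is adding nrt under spec.names.shortNames):

```go
// Package v1beta1 is a stand-in to show where the short name would be
// declared; the real type carries the full NodeResourceTopology spec.
package v1beta1

// NodeResourceTopology is a stand-in declaration for illustration only.
// +kubebuilder:resource:shortName=nrt
type NodeResourceTopology struct{}
```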

@ffromani
Contributor Author

ffromani commented Feb 1, 2023

Closed by #25.

ffromani closed this as completed Feb 1, 2023
ffromani added a commit to k8stopologyawareschedwg/resource-topology-exporter that referenced this issue Feb 2, 2023
expose TM configuration as attributes.
This is in addition to TopologyPolicies field, which is
being deprecated. See:
- k8stopologyawareschedwg/noderesourcetopology-api#24
- k8stopologyawareschedwg/noderesourcetopology-api#25

Signed-off-by: Francesco Romani <[email protected]>
ffromani added a commit to k8stopologyawareschedwg/podfingerprint that referenced this issue Feb 2, 2023
Since the NRT API is moving towards top-level attributes
(see:
k8stopologyawareschedwg/noderesourcetopology-api#24)
add the preferred Attribute Name alongside the existing Annotation Name.

Signed-off-by: Francesco Romani <[email protected]>
ffromani added a commit to k8stopologyawareschedwg/podfingerprint that referenced this issue Feb 2, 2023
Since the NRT API is moving towards top-level attributes
(see:
k8stopologyawareschedwg/noderesourcetopology-api#24)
add the preferred Attribute Name alongside the existing annotation Name.

Signed-off-by: Francesco Romani <[email protected]>
ffromani added a commit to k8stopologyawareschedwg/resource-topology-exporter that referenced this issue Feb 7, 2023
expose TM configuration as attributes.
This is in addition to TopologyPolicies field, which is
being deprecated. See:
- k8stopologyawareschedwg/noderesourcetopology-api#24
- k8stopologyawareschedwg/noderesourcetopology-api#25

Signed-off-by: Francesco Romani <[email protected]>