Add `requiredDuringSchedulingRequiredDuringExecution` to ClusterResourcePlacement affinity #715

nojnhuh · 2024-03-07T19:58:15Z

In ClusterResourcePlacement's affinity definitions, adding requiredDuringSchedulingRequiredDuringExecution would enable the scheduler to react to underlying changes to a member cluster over time that affect its ability to run certain workloads.

One concrete use case might be to ensure that workloads only run on clusters that contain GPU nodes. As nodes are added to and removed from a cluster, whether or not any GPU nodes exist in a cluster may change over time. As a cluster operator detects these changes and updates some label on the member clusters to indicate whether or not GPU nodes are available, Fleet would automatically reschedule workloads that require GPU nodes onto a different member cluster.

The text was updated successfully, but these errors were encountered:

nojnhuh · 2024-03-08T19:10:02Z

@ryanzhang-oss I'm starting to dig into this so if you could please assign me to this issue I'd appreciate it!

ryanzhang-oss · 2024-03-12T06:36:50Z

@nojnhuh Even k8s does not support requiredDuringSchedulingRequiredDuringExecution, I wonder why do we want to support that? Also what does "requiredDuringSchedulingRequiredDuringExecution" mean semantically?

nojnhuh · 2024-03-12T16:39:12Z

This would mean the same thing as the placeholder definition for the same field for placing a Pod on a Node, but for scheduling workloads onto clusters: https://github.com/kubernetes/kubernetes/blob/634fc1b4836b3a500e0d715d71633ff67690526a/staging/src/k8s.io/api/core/v1/types.go#L3449-L3456

// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to an update), the system
// will try to eventually evict the pod from its node.

This would help with the use case I outlined above where conditions on a member cluster change such that it's no longer suitable to run certain workloads. Then fleet can reschedule affected workloads without relying on a change to the ClusterResourcePlacement to trigger the reschedule.

ryanzhang-oss · 2024-03-12T19:09:03Z

Thanks @nojnhuh.

Just to clarify, there are two cases, to schedule a workload to newly GPU available cluster is actually a requiredDuringScheduleTime case since the workload is not scheduled if there is no GPU cluster available. The workload will be scheduled to a cluster automatically when we detect that GPU is added to the cluster. This is already supported today in fleet.

On the flip side, when a workload is already running in a cluster, we don't evict it unless the cluster is deleted, the same behavior as k8s. I think there is a reason why k8s never implements that feature. The main reason is continuously trying to reschedule all workloads will add huge load to our scheduler which is the performance bottleneck. Since we didn't get any feature request to support this from our customers, we don't think the benefit out weight the huge performance hit.

We can revisit this if there are strong use cases coming from customers and even with that, I suspect we need to scope down the semantics to ensure performance.

jackfrancis · 2024-03-12T22:04:57Z

There is an active KEP right now in upstream k8s to solve for RequiredDuringSchedulingRequiredDuringExecution:

KEP4328: Affinity Based Eviction kubernetes/enhancements#4329

The intent to solve for this is longstanding:

https://github.com/kubernetes/kubernetes/blob/v1.29.2/pkg/apis/core/types.go#L2916-L2925

Additionally, the widely used descheduler project implements this as well for folks who have needed this functionality prior to its landing in k/k:

https://github.com/kubernetes-sigs/descheduler/blob/v0.29.0/README.md#removepodsviolatingnodeaffinity

The main reason is continuously trying to reschedule all workloads will add huge load to our scheduler which is the performance bottleneck

The above is a true statement. We wouldn't want to continuously reschedule. Rather we would want to continuously determine "do I need to reschedule?", which would looks something like (1) ensuring that ClusterResourcePlacement status is current and reflects the underlying state of the scheduled resources and (2) introspecting those status and engaging a reschedule trigger when the declared status goal state status (e.g., Running) was unrealized beyond some configurable TTL.

I would like to be both a customer and implementer of this in fleet, so it makes sense to me to keep the issue open as a reference for the resultant PR.

ryanzhang-oss · 2024-03-12T23:54:18Z

Thanks, Jack. I am keeping this issue open. However, I don't think there is a way to determine "do I need to reschedule" without actually scheduling it. Also, just continuously "determine" is already a huge cost.

IMO, the right way to solve this problem is to deploy a descheduler instead of within the scheduler. We are planning for a descheduler already.

In any case, we would like to see a design first before moving forward with any code change.

ryanzhang-oss · 2024-03-12T23:59:37Z

- Add `node-affinity-eviction` controller to ensure pods being evicted if the selectors are no longer met at runtime.

This is a "descheduler" to me

jackfrancis · 2024-03-13T00:15:31Z

Thx for re-opening!

However, I don't think there is a way to determine "do I need to reschedule" without actually scheduling it.

This is the way:

A discrete actor performs the descheduling on a particular cluster (either the standard descheduler does it, or in the future, if k/k itself supports it then it would do it).
A multi-cluster actor (e.g., ClusterResourcePlacement) looks for scheduled workloads that are "not running" as the trigger for "do I need to reschedule?".

The multi-cluster actor does not need to actually schedule anything in order to determine if it needs to be rescheduled. It simply needs to be aware of the delta between its desired goal state (this workload is operational on cluster XYZ) and the actual goal state (this workload is stuck Pending on cluster XYZ). When such a delta is observed, the entire E2E multi-cluster scheduling operation kicks in, with the new nuance that cluster XYZ is no longer considered as a target cluster for scheduling (we already know that the workload doesn't run there).

ryanzhang-oss · 2024-03-13T04:33:25Z

So IIRC,

We need a single cluster de-scheduler first which is out of the scope of this project.
We need a way for the multi-cluster agent to know what does "running" mean for any resources.

I wonder how do you solve the second part?

In addition, the second part is actually already part of the advanced rollout feature as we will provide options for customer when we detect the resources placed are not in goal state. Currently, we don't plan to offer "reschedule" option but that's not hard to add.
The hard part is actually how to determine what "running" means for an arbitrary resource. I don't see any way other than providing a "hook" for the users to tell us but it's a quite involved process. We are working on the UX.

zhiying-lin · 2024-03-13T06:54:16Z

The current KEP4329 has not been approved yet. I have the same question listed in the https://github.com/kubernetes/enhancements/pull/4329/files#r1478023120. not sure what's the benefit of adding into the node affinity instead of using the descheduler and what's the boundary between these two? Probably we can hold until sig gets to the conclusion?

jackfrancis · 2024-03-13T15:06:26Z

In addition, the second part is actually already part of the advanced rollout feature as we will provide options for customer when we detect the resources placed are not in goal state. Currently, we don't plan to offer "reschedule" option but that's not hard to add.

Cool, are there PRs that are implementing "advanced rollout"?

zhiying-lin · 2024-03-14T03:37:31Z

https://github.com/Azure/fleet/pull/689/files is the one we support checking the availability of the native resource. More PRs are coming.

ryanzhang-oss assigned nojnhuh and unassigned nojnhuh Mar 12, 2024

ryanzhang-oss closed this as completed Mar 12, 2024

ryanzhang-oss reopened this Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `requiredDuringSchedulingRequiredDuringExecution` to ClusterResourcePlacement affinity #715

Add `requiredDuringSchedulingRequiredDuringExecution` to ClusterResourcePlacement affinity #715

nojnhuh commented Mar 7, 2024

nojnhuh commented Mar 8, 2024

ryanzhang-oss commented Mar 12, 2024

nojnhuh commented Mar 12, 2024

ryanzhang-oss commented Mar 12, 2024 •

edited

Loading

jackfrancis commented Mar 12, 2024

ryanzhang-oss commented Mar 12, 2024 •

edited

Loading

ryanzhang-oss commented Mar 12, 2024

jackfrancis commented Mar 13, 2024

ryanzhang-oss commented Mar 13, 2024 •

edited

Loading

zhiying-lin commented Mar 13, 2024

jackfrancis commented Mar 13, 2024

zhiying-lin commented Mar 14, 2024

Add requiredDuringSchedulingRequiredDuringExecution to ClusterResourcePlacement affinity #715

Add requiredDuringSchedulingRequiredDuringExecution to ClusterResourcePlacement affinity #715

Comments

nojnhuh commented Mar 7, 2024

nojnhuh commented Mar 8, 2024

ryanzhang-oss commented Mar 12, 2024

nojnhuh commented Mar 12, 2024

ryanzhang-oss commented Mar 12, 2024 • edited Loading

jackfrancis commented Mar 12, 2024

ryanzhang-oss commented Mar 12, 2024 • edited Loading

ryanzhang-oss commented Mar 12, 2024

jackfrancis commented Mar 13, 2024

ryanzhang-oss commented Mar 13, 2024 • edited Loading

zhiying-lin commented Mar 13, 2024

jackfrancis commented Mar 13, 2024

zhiying-lin commented Mar 14, 2024

Add `requiredDuringSchedulingRequiredDuringExecution` to ClusterResourcePlacement affinity #715

Add `requiredDuringSchedulingRequiredDuringExecution` to ClusterResourcePlacement affinity #715

ryanzhang-oss commented Mar 12, 2024 •

edited

Loading

ryanzhang-oss commented Mar 12, 2024 •

edited

Loading

ryanzhang-oss commented Mar 13, 2024 •

edited

Loading