Custom percentageOfNodesToScore in a PreFilter plugin #14

Closed
yuanchen8911 opened this issue Jun 26, 2020 · 15 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@yuanchen8911
Member

yuanchen8911 commented Jun 26, 2020

The scheduler option percentageOfNodesToScore controls how many nodes should be checked when scheduling a pod. It has a significant impact on scheduling performance.

To better balance scheduling performance and quality for the different needs of diverse workloads, one idea is to introduce a PreFilter plugin that updates the default global value if a custom threshold is specified through a Pod label.

This plugin sets the value of percentageOfNodesToScore according to the value of a Pod label. For example:

parameter.scheduling.sigs.k8s.io/percentageOfNodesToScore: 10

We’d like to have your input and suggestions, particularly

  1. Is it a valid and useful feature?

  2. Is it possible to implement? A problem we noticed is that the current scheduling framework does not provide a mechanism for plugins to access and update the scheduler options. Would it be possible to extend the plugin APIs with an additional argument, e.g. a pointer to the scheduler options, as sketched below?
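
Purely as an illustration of what such an API change could look like (the SchedulerOptions type and the extra opts argument are hypothetical, not part of the existing framework):

// SchedulerOptions is a hypothetical struct the framework could expose to plugins.
type SchedulerOptions struct {
    PercentageOfNodesToScore int32
}

// PreFilterPluginWithOptions is a hypothetical variant of the PreFilter extension
// point that also receives a pointer to the scheduler options, so a plugin could
// adjust PercentageOfNodesToScore for the pod currently being scheduled.
type PreFilterPluginWithOptions interface {
    framework.Plugin
    PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, opts *SchedulerOptions) *framework.Status
}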

Thanks a lot!

yuanchen8911 changed the title from "Custom percentageOfNodesToScore" to "Custom percentageOfNodesToScore in a PreFilter plugin" Jun 26, 2020
@alculquicondor

alculquicondor commented Jun 26, 2020

introduce a PreFilter plugin that updates the default global value if a custom threshold is specified through a Pod label

Given that it would affect other workloads, I don't think it's a good idea.

One approach could be to make percentageOfNodesToScore part of a scheduling profile and then have pods use the profile that gives them better performance if they don't care much about scoring. But if the whole point is faster scheduling, there are better things to do:

  • Have a faster scheduler (we have introduced several performance improvements in 1.17, 1.18 and 1.19). Have you had a chance to test it?
  • If the pods actually don't care much about scoring, you could have a stripped-down profile that has 0 or very few scoring plugins.

@yuanchen8911
Member Author

@alculquicondor Thanks for your comment.

  1. The idea is to have a per-pod custom parameter. It should not affect other pods' scheduling performance, as it only changes the threshold for the pod currently being scheduled.

A PreFilter plugin changes the parameter value for the pod to be scheduled before the Filter stage. When the scheduler (pkg/scheduler/core/generic_scheduler.go) calls the following function, numFeasibleNodesToFind, it will use the new value of g.percentageOfNodesToScore. We will need to save the default global value and restore it for subsequent pods that do not specify a custom value.

// numFeasibleNodesToFind returns the number of feasible nodes that once found, the scheduler stops
// its search for more feasible nodes.
func (g *genericScheduler) numFeasibleNodesToFind(numAllNodes int32) (numNodes int32) {
    if numAllNodes < minFeasibleNodesToFind || g.percentageOfNodesToScore >= 100 {
        return numAllNodes
    }
    adaptivePercentage := g.percentageOfNodesToScore
    if adaptivePercentage <= 0 {
        basePercentageOfNodesToScore := int32(50)
        adaptivePercentage = basePercentageOfNodesToScore - numAllNodes/125
        if adaptivePercentage < minFeasibleNodesPercentageToFind {
            adaptivePercentage = minFeasibleNodesPercentageToFind
        }
    }

    numNodes = numAllNodes * adaptivePercentage / 100
    if numNodes < minFeasibleNodesToFind {
        return minFeasibleNodesToFind
    }

    return numNodes
}
  2. The need is driven by use cases like scheduling a large number of pods (e.g., a job/deployment with thousands of pods) in ultra-large clusters (several thousand nodes), where scheduling performance becomes an issue. Our preliminary results show that percentageOfNodesToScore has a big impact on scheduling performance. The difference between 10% and 50% in a ~1K-node cluster is significant.
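
As illustrative arithmetic (assumed numbers, not measurements): in a 1,000-node cluster, a threshold of 10% means the scheduler stops after finding 100 feasible nodes, while 50% makes it look for 500, so the Score phase has roughly 5x as many nodes to evaluate for every pod.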

@yuanchen8911
Member Author

yuanchen8911 commented Jun 26, 2020

Below is a possible implementation of a dummy CustomParameter PreFilter plugin.

const (
    // Name is the name of the plugin used in the plugin registry and configurations.
    Name = "CustomParameter"
    // PercentageOfNodesToScore specifies how many feasible nodes to return in the filter phase
    PercentageOfNodesToScore = "parameter.scheduling.sigs.k8s.io/percentageOfNodesToScore"
)

// CustomParameters is a plugin that sets scheduling configuration parameters specified as pod labels.
// The current implementation supports PercentageOfNodesToScore only.
type CustomParameters struct {
    parameterNames []string
}

var _ framework.PreFilterPlugin = &CustomParameters{}

// Name returns the plugin's name.
func (cp *CustomParameters) Name() string {
    return Name
}

// PreFilter is a dummy implementation that sets PercentageOfNodesToScore if it is specified in the Pod's labels.
func (cp *CustomParameters) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) *framework.Status {

    for _, paramName := range cp.parameterNames {
        value := getParameterValue(pod, paramName)
        switch paramName {
        case PercentageOfNodesToScore:
            if value == "" {
                klog.Infof("Custom parameter-- Pod: %s, PercentageOfNodesToScore: <default value>", pod.Name)
            } else {
                val, err := strconv.Atoi(value)
                if err != nil {
                    return framework.NewStatus(framework.Error, fmt.Sprintf("invalid parameter value for percentageOfNodesToScore: %v", value))
                }
                //TODO: set percentageOfNodesToScore in sched.Algorithm.percentageOfNodesToScore (Not Supported Yet)
                // ...
                klog.Infof("Custom Parameter-- Pod: %s, PercentageOfNodesToScore: %d\n", pod.Name, val)
            }
        }
    }
    return framework.NewStatus(framework.Success, "")
}
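
For completeness, here is a minimal sketch of the getParameterValue helper assumed above; it simply looks the parameter up in the pod's labels (illustrative, not part of the framework):

// getParameterValue returns the value of the given custom parameter from the
// pod's labels, or an empty string if the label is not set.
func getParameterValue(pod *v1.Pod, paramName string) string {
    return pod.Labels[paramName]
}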

@alculquicondor

  1. Why not modify your global default?
  2. Have you tested 1.18 and 1.19?

@yuanchen8911
Member Author

yuanchen8911 commented Jun 26, 2020

  • Why not modify your global default?

The motivation is to customize the threshold for different workloads running in the same cluster. For example, a long-running service that cares a lot about scheduling quality can use a high threshold to achieve better placement, while a large batch job looking for a quick turnaround may set a lower threshold for faster scheduling.

  • Have you tested 1.18 and 1.19?

We use 1.18.

Again, the assumption is scheduling a large number of pods (> 1k pods) in an ultra large cluster (>1k nodes).

@alculquicondor

Again, the assumption is scheduling a large number of pods (> 1k pods) in an ultra large cluster (>1k nodes).

We run such big clusters as well, except that we only target 100 pods/s

You could consider using multiple profiles and disabling some (or all) plugins in one of them. Then, your jobs use the optimized profile. And you can pair that with a percentage of nodes to score that makes sense for both types of workloads, such as 30%. Note that we use the percentage in a windowed fashion, so eventually all nodes are tested for a big enough deployment.
If that's not enough, we can make a case for adding percentage of nodes to score to a profile.

Making a pod directly affect scheduler configuration might not be the best API. Scheduling profiles are a much better way.

@yuanchen8911
Member Author

yuanchen8911 commented Jun 26, 2020

Thanks, a per-profile threshold will certainly be helpful, but the value is set statically. A PreFilter plugin could support a dynamic and adaptive threshold at the single-pod level, which would provide greater flexibility. I agree we need to think more about the use cases and tradeoffs.

@Huang-Wei
Contributor

At the moment, PercentageOfNodesToScore is a global parameter, so I'm not a fan of changing it in PreFilter or other places.

Given that PercentageOfNodesToScore is used in numFeasibleNodesToFind(), which is called in every scheduling cycle, it makes more sense to me to change it to a profile-level parameter, so that users can configure different filter/score thresholds in each profile.

@Huang-Wei
Contributor

Huang-Wei commented Jun 27, 2020

A side note: PercentageOfNodesToScore has nothing to do with Scoring... I thought it impacts both Filter and Score, but it does nothing in Score (g.prioritizeNodes).

Anyway, if we read it as "the eventual percentage of nodes to score", then it's not that misleading.

@yuanchen8911
Member Author

A side note: PercentageOfNodesToScore has nothing to do with Scoring... I thought it impacts both Filter and Score, but it does nothing in Score (g.prioritizeNodes).

Anyway, if we read it as "the eventual percentage of nodes to score", then it's not that misleading.

It controls the number of feasible nodes found by the filter stage and hence how many nodes will be scored and ranked in the score stage. It therefore affects the scoring stage's performance, which can be the bottleneck.

@yuanchen8911
Member Author

At the moment, PercentageOfNodesToScore is a global parameter, so I'm not a fan of changing it in PreFilter or other places.

Given that PercentageOfNodesToScore is used in numFeasibleNodesToFind(), which is called in every scheduling cycle, it makes more sense to me to change it to a profile-level parameter, so that users can configure different filter/score thresholds in each profile.

Thanks, we will explore this option.

@yuanchen8911
Member Author

Opened an issue: support per-scheduling-profile configuration in kube-scheduler (kubernetes/kubernetes#93270)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Oct 19, 2020
@alculquicondor

/close

discussion moved to kubernetes/kubernetes#93270

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close

discussion moved to kubernetes/kubernetes#93270

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
