Custom percentageOfNodesToScore in a PreFilter plugin #14

Closed
yuanchen8911 opened this issue Jun 26, 2020 · 15 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@yuanchen8911
Member

yuanchen8911 commented Jun 26, 2020

The scheduler option percentageOfNodesToScore controls how many nodes should be checked when scheduling a pod. It has a significant impact on scheduling performance.

To better balance scheduling performance and quality for the different needs of diverse workloads, one idea is to introduce a PreFilter plugin that updates the default global value if a custom threshold is specified through a Pod label.

This plugin sets the value of percentageOfNodesToScore according to the value of a Pod label. For example:

parameter.scheduling.sigs.k8s.io/percentageOfNodesToScore: 10

We’d like to have your input and suggestions, particularly

  1. Is it a valid and useful feature?

  2. Is it possible to implement? A problem we noticed is that the current scheduling framework does not provide a mechanism for plugins to access and update the scheduler options. Would it be possible to extend the plugin APIs with an additional argument, e.g. a pointer to the scheduler options, as sketched below?
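
Purely as an illustration of what such an API change could look like (the SchedulerOptions type and the extra opts argument are hypothetical, not part of the existing framework):

// SchedulerOptions is a hypothetical struct the framework could expose to plugins.
type SchedulerOptions struct {
    PercentageOfNodesToScore int32
}

// PreFilterPluginWithOptions is a hypothetical variant of the PreFilter extension
// point that also receives a pointer to the scheduler options, so a plugin could
// adjust PercentageOfNodesToScore for the pod currently being scheduled.
type PreFilterPluginWithOptions interface {
    framework.Plugin
    PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, opts *SchedulerOptions) *framework.Status
}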

Thanks a lot!

yuanchen8911 changed the title from "Custom percentageOfNodesToScore" to "Custom percentageOfNodesToScore in a PreFilter plugin" Jun 26, 2020
@alculquicondor

alculquicondor commented Jun 26, 2020

introduce a PreFilter plugin that updates the default global value if a custom threshold is specified through a Pod label

Given that it would affect other workloads, I don't think it's a good idea.

One approach could be to make percentageOfNodesToScore part of a scheduling profile and then have pods use the profile that gives them better performance if they don't care much about scoring. But if the whole point is faster scheduling, there are better things to do:

  • Have a faster scheduler (we have introduced several performance improvements in 1.17, 1.18 and 1.19). Have you had a chance to test it?
  • If the pods actually don't care much about scoring, you could have a stripped-down profile that has 0 or very few scoring plugins.

@yuanchen8911
Member Author

@alculquicondor Thanks for your comment.

  1. The idea is to have a per-pod custom parameter. It should not affect other pods' scheduling performance, as it only changes the threshold for the pod currently being scheduled.

A PreFilter plugin changes the parameter value for the pod to be scheduled before the Filter stage. When the scheduler (pkg/scheduler/core/generic_scheduler.go) calls the following function, numFeasibleNodesToFind, it will use the new value of g.percentageOfNodesToScore. We will need to save the default global value and restore it for subsequent pods that do not specify a custom value.

// numFeasibleNodesToFind returns the number of feasible nodes that once found, the scheduler stops
// its search for more feasible nodes.
func (g *genericScheduler) numFeasibleNodesToFind(numAllNodes int32) (numNodes int32) {
    if numAllNodes < minFeasibleNodesToFind || g.percentageOfNodesToScore >= 100 {
        return numAllNodes
    }
    adaptivePercentage := g.percentageOfNodesToScore
    if adaptivePercentage <= 0 {
        basePercentageOfNodesToScore := int32(50)
        adaptivePercentage = basePercentageOfNodesToScore - numAllNodes/125
        if adaptivePercentage < minFeasibleNodesPercentageToFind {
            adaptivePercentage = minFeasibleNodesPercentageToFind
        }
    }

    numNodes = numAllNodes * adaptivePercentage / 100
    if numNodes < minFeasibleNodesToFind {
        return minFeasibleNodesToFind
    }

    return numNodes
}
  2. The need is driven by use cases like scheduling a large number of pods (e.g., a job/deployment with thousands of pods) in ultra-large clusters (several thousand nodes), where scheduling performance becomes an issue. Our preliminary results show that percentageOfNodesToScore has a big impact on scheduling performance. The difference between 10% and 50% in a ~1K-node cluster is significant.
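
As illustrative arithmetic (assumed numbers, not measurements): in a 1,000-node cluster, a threshold of 10% means the scheduler stops after finding 100 feasible nodes, while 50% makes it look for 500, so the Score phase has roughly 5x as many nodes to evaluate for every pod.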

@yuanchen8911
Member Author

yuanchen8911 commented Jun 26, 2020

Below is a possible implementation of a dummy CustomParameter PreFilter plugin.

const (
    // Name is the name of the plugin used in the plugin registry and configurations.
    Name = "CustomParameter"
    // PercentageOfNodesToScore specifies how many feasible nodes to return in the filter phase
    PercentageOfNodesToScore = "parameter.scheduling.sigs.k8s.io/percentageOfNodesToScore"
)

// CustomParameters is a plugin that sets scheduling configuration parameters specified as pod labels.
// The current implementation supports PercentageOfNodesToScore only.
type CustomParameters struct {
    parameterNames []string
}

var _ framework.PreFilterPlugin = &CustomParameters{}

// Name returns the plugin's name.
func (cp *CustomParameters) Name() string {
    return Name
}

// PreFilter is a dummy implementation that sets PercentageOfNodesToScore if it is specified in the Pod's labels.
func (cp *CustomParameters) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) *framework.Status {

    for _, paramName := range cp.parameterNames {
        value := getParameterValue(pod, paramName)
        switch paramName {
        case PercentageOfNodesToScore:
            if value == "" {
                klog.Infof("Custom parameter-- Pod: %s, PercentageOfNodesToScore: <default value>", pod.Name)
            } else {
                val, err := strconv.Atoi(value)
                if err != nil {
                    return framework.NewStatus(framework.Error, fmt.Sprintf("invalid parameter value for percentageOfNodesToScore: %v", value))
                }
                //TODO: set percentageOfNodesToScore in sched.Algorithm.percentageOfNodesToScore (Not Supported Yet)
                // ...
                klog.Infof("Custom Parameter-- Pod: %s, PercentageOfNodesToScore: %d\n", pod.Name, val)
            }
        }
    }
    return framework.NewStatus(framework.Success, "")
}
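
For completeness, here is a minimal sketch of the getParameterValue helper assumed above; it simply looks the parameter up in the pod's labels (illustrative, not part of the framework):

// getParameterValue returns the value of the given custom parameter from the
// pod's labels, or an empty string if the label is not set.
func getParameterValue(pod *v1.Pod, paramName string) string {
    return pod.Labels[paramName]
}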

@alculquicondor

  1. Why not modify your global default?
  2. Have you tested 1.18 and 1.19?

@yuanchen8911
Member Author

yuanchen8911 commented Jun 26, 2020

  • Why not modify your global default?

The motivation is to customize the threshold for different workloads running in the same cluster. For example, a long-running service that cares a lot about scheduling quality can use a high threshold to achieve better placement, while a large batch job looking for a quick turnaround may set a lower threshold for faster scheduling.

  • Have you tested 1.18 and 1.19?

We use 1.18.

Again, the assumption is scheduling a large number of pods (> 1k pods) in an ultra large cluster (>1k nodes).

@alculquicondor

Again, the assumption is scheduling a large number of pods (> 1k pods) in an ultra large cluster (>1k nodes).

We run such big clusters as well, except that we only target 100 pods/s

You could consider using multiple profiles and disabling some (or all) plugins in one of them. Then, your jobs use the optimized profile. And you can pair that with a percentage of nodes to score that makes sense for both types of workloads, such as 30%. Note that we use the percentage in a windowed fashion, so eventually all nodes are tested for a big enough deployment.
If that's not enough, we can make a case for adding percentage of nodes to score to a profile.

Making a pod directly affect scheduler configuration might not be the best API. Scheduling profiles are a much better way.

@yuanchen8911
Member Author

yuanchen8911 commented Jun 26, 2020

Thanks, a per-profile threshold will certainly be helpful, but the value is set statically. A PreFilter plugin could support a dynamic and adaptive threshold at the single-pod level, which would provide greater flexibility. I agree we need to think more about the use cases and tradeoffs.

@Huang-Wei
Contributor

At the moment, PercentageOfNodesToScore is a global parameter, so I'm not a fan of changing it in PreFilter or other places.

Given that PercentageOfNodesToScore is used in numFeasibleNodesToFind(), which is called in every scheduling cycle, it makes more sense to me to change it to a profile-level parameter, so that users can configure different filter/score thresholds in each profile.

@Huang-Wei
Contributor

Huang-Wei commented Jun 27, 2020

A side note: PercentageOfNodesToScore has nothing to do with Scoring... I thought it impacts both Filter and Score, but it does nothing in Score (g.prioritizeNodes).

Anyway, if we read it as "the eventual percentage of nodes to score", then it's not that misleading.

@yuanchen8911
Member Author

A side note: PercentageOfNodesToScore has nothing to do with Scoring... I thought it impacts both Filter and Score, but it does nothing in Score (g.prioritizeNodes).

Anyway, if we read it as "the eventual percentage of nodes to score", then it's not that misleading.

It controls the number of feasible nodes found by the filter stage and hence how many nodes will be scored and ranked in the score stage. It therefore affects the scoring stage's performance, which can be the bottleneck.

@yuanchen8911
Member Author

At the moment, PercentageOfNodesToScore is a global parameter, so I'm not a fan of changing it in PreFilter or other places.

Given that PercentageOfNodesToScore is used in numFeasibleNodesToFind(), which is called in every scheduling cycle, it makes more sense to me to change it to a profile-level parameter, so that users can configure different filter/score thresholds in each profile.

Thanks, we will explore this option.

@yuanchen8911
Member Author

Opened an issue: support per-scheduling-profile configuration in kube-scheduler (kubernetes/kubernetes#93270)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Oct 19, 2020
@alculquicondor

/close

discussion moved to kubernetes/kubernetes#93270

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close

discussion moved to kubernetes/kubernetes#93270

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
