Refactor PredicateFn for allocate and preempt actions #2916
Conversation
Force-pushed from f015bec to 39d87b0
```diff
@@ -96,13 +97,22 @@ func (alloc *Action) Execute(ssn *framework.Session) {
 	pendingTasks := map[api.JobID]*util.PriorityQueue{}

 	allNodes := ssn.NodeList
-	predicateFn := func(task *api.TaskInfo, node *api.NodeInfo) error {
+	predicateFn := func(task *api.TaskInfo, node *api.NodeInfo) ([]*api.Status, error) {
```
@wangyang0616 Would a single *api.Status be enough?
The predicate plugin integrates multiple filter plugins of kube-scheduler, so the filtering result of each plugin needs to be collected and judged on the action side. For now it has to return a list; if a more elegant solution comes up in the future, we can adopt it as an optimization.
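For context, a minimal sketch of the status-list shape under discussion, using simplified stand-ins for the api types (illustrative only; the real api.Status carries more detail, such as which plugin produced the result):

```go
package main

import "fmt"

// Simplified stand-ins for volcano's api types.
type StatusCode int

const (
	Success StatusCode = iota
	Unschedulable
	UnschedulableAndUnresolvable
	Error
)

type Status struct {
	Code   StatusCode
	Reason string
}

// The refactored predicate returns one Status per integrated
// kube-scheduler filter plugin, plus an error for hard failures.
type PredicateFn func(task, node string) ([]*Status, error)

func main() {
	var fn PredicateFn = func(task, node string) ([]*Status, error) {
		return []*Status{
			{Code: Success},
			{Code: Unschedulable, Reason: "insufficient volcano.sh/gpu-memory"},
		}, nil
	}
	statuses, err := fn("default/task-0", "node-a")
	if err != nil {
		fmt.Println("predicate error:", err)
		return
	}
	for _, s := range statuses {
		fmt.Printf("code=%d reason=%q\n", s.Code, s.Reason)
	}
}
```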
```diff
-		return api.NewFitError(task, node, reason)
+		return nil, api.NewFitError(task, node, reason)
 	}
 	predicateStatus, err := ssn.PredicateFn(task, node)
```
It's better to make the action code clear and short.
Thank you for your review. This is mainly the core logic that distinguishes allocate from preempt, and it was modified using the previous logical architecture. I will consider later whether it can be further optimized.
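That core distinction can be sketched as follows, reusing the Status stand-ins from the sketch above (the helper names are illustrative, not the PR's exact code):

```go
// allocateFits reports whether every filter plugin returned Success;
// allocate and backfill only place tasks on such nodes.
func allocateFits(statuses []*Status) bool {
	for _, s := range statuses {
		if s != nil && s.Code != Success {
			return false
		}
	}
	return true
}

// preemptFits also tolerates Unschedulable, because evicting victim
// pods can free the missing resources; UnschedulableAndUnresolvable
// and Error still rule the node out.
func preemptFits(statuses []*Status) bool {
	for _, s := range statuses {
		if s == nil {
			continue
		}
		if s.Code != Success && s.Code != Unschedulable {
			return false
		}
	}
	return true
}
```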
Force-pushed from aa31812 to a4600b8
```go
	}
	for _, status := range predicateStatus {
		if status != nil && status.Code != api.Success {
			return nil, api.NewFitError(task, node, status.Reason)
```
just return the first failure?
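If the aggregation the reviewer hints at were preferred instead, a variant could collect every non-Success reason rather than short-circuiting (a sketch with the same stand-in types as above):

```go
// allFailureReasons gathers every non-Success reason instead of
// stopping at the first failure.
func allFailureReasons(statuses []*Status) []string {
	var reasons []string
	for _, s := range statuses {
		if s != nil && s.Code != Success {
			reasons = append(reasons, s.Reason)
		}
	}
	return reasons
}
```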
```diff
@@ -250,4 +254,32 @@ func (alloc *Action) Execute(ssn *framework.Session) {
 	}
 }

+func prePredicateforAllocate(ssn *framework.Session, task *api.TaskInfo, allNodes []*api.NodeInfo, job *api.JobInfo) error {
```
The prePredicateforAllocate is similar to prePredicateforBackfill and prePredicateforReclaim.
done
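One shape the consolidation could take, sketched with the stand-in types above plus a hypothetical Session stand-in (not the PR's exact code): a single pre-predicate helper shared by the actions, with the action name appearing only in the error message.

```go
// Session is a stand-in for framework.Session; only the piece used
// by the shared helper is modeled here.
type Session struct {
	PrePredicateFn func(task string) error
}

// prePredicate is one helper shared by allocate, backfill, and
// reclaim; the action name is used only for error reporting.
func prePredicate(ssn *Session, task, action string) error {
	if err := ssn.PrePredicateFn(task); err != nil {
		return fmt.Errorf("prepredicate failed in %s action for task %s: %v", action, task, err)
	}
	return nil
}
```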
```diff
@@ -101,4 +96,45 @@ func (backfill *Action) Execute(ssn *framework.Session) {
 	}
 }

+func predicateforBackfill(ssn *framework.Session, task *api.TaskInfo, node *api.NodeInfo, fe *api.FitErrors) error {
```
Can we consolidate predicateforBackfill, predicateforReclaim, predicateforAllocate, and predicateforPreempt?
done
@Yikun Would you please have a look at the spark failure?
Force-pushed from 94124f7 to 461742b
Force-pushed from 4058273 to 8e060d9
Force-pushed from f1d8522 to 886e438
```diff
@@ -107,3 +107,47 @@ func taskGroupID(task *api.TaskInfo) string {
 func NewPredicateHelper() PredicateHelper {
 	return &predicateHelper{taskPredicateErrorCache: map[string]map[string]error{}}
 }
+
+type PredicateStatus interface {
+	IsContainsUnschedulable() bool
```
The IsContainsUnschedulable name is inconsistent with the implementation.
It's supposed to be deleted here, done.
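The later hunks use util.StatusSets helpers instead. For reference, a minimal sketch of how such aggregate queries can be answered over the status slice, reusing the earlier stand-in types; the real util.StatusSets in this PR may differ:

```go
// StatusSets wraps the per-plugin results so an action can ask
// aggregate questions about the whole filter pass.
type StatusSets []*Status

func (ss StatusSets) contains(code StatusCode) bool {
	for _, s := range ss {
		if s != nil && s.Code == code {
			return true
		}
	}
	return false
}

func (ss StatusSets) ContainsUnschedulable() bool {
	return ss.contains(Unschedulable)
}

func (ss StatusSets) ContainsUnschedulableAndUnresolvable() bool {
	return ss.contains(UnschedulableAndUnresolvable)
}
```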
```go
			task.Namespace, task.Name, node.Name, err)
		fe.SetNodeError(node.Name, err)
		continue
	}

	if statusSets.ContainsUnschedulable() || statusSets.ContainsUnschedulableAndUnresolvable() ||
		statusSets.ContainsErrorSkipOrWait() {
		err := fmt.Errorf("predicates failed in backfill for task <%s/%s> on node <%s>, status is not success",
```
This is used to generate the errors structure. The log output has been updated.
```go
	var statusSets util.StatusSets
	statusSets, err := ssn.PredicateFn(task, node)
	if err != nil {
		klog.V(3).Infof("backfill predicates failed for task <%s/%s> on node <%s>: %v",
```
Please make the log clear for users. For example, change backfill predicates failed xxx to predicate failed in backfill action xxx.
done
/lgtm
/approve
/lgtm
/lgtm
Force-pushed from 0683e75 to cb8a186
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: william-wang. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
The prow bot issue is fixed and the lgtm needs to be marked again.
/lgtm
fix: #2739
Problem phenomenon:
Related issue: Reclaim is not working with GPUSharingEnable if resource used is volcano.sh/gpu-memory #2739
Submit low-priority jobs that occupy all the GPU resources of the cluster, then submit high-priority jobs requesting GPU resources: the high-priority jobs fail to preempt the GPU resources.
Problem cause:
Scheduling requires filtering nodes first and then scoring them; the corresponding phases in Volcano are predicate and nodeorder, and preemption also goes through these two stages. In the predicate stage, each plugin's predicateFn is called to filter nodes; if a plugin judges that a node does not meet the scheduling requirements and returns an error, the node is filtered out.
When vgpu's predicateFn (FilterNode) performs its node filtering judgment, it checks the node's remaining GPU resources and returns an error if they cannot satisfy the pod. This implementation is reasonable for allocate, but in preempt scenarios the question of whether a pod can be scheduled onto a node is whether the node's full resources can satisfy the pod, not its remaining resources, which causes this problem.
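Reduced to its essence, the two checks differ only in which quantity bounds the request (illustrative names; the real vgpu filter works on per-device structures):

```go
// For allocate, the pod must fit within the node's currently free
// resources; for preempt, the node's total capacity is the bound,
// since victims can be evicted to free the difference.
func fitsForAllocate(freeGPUMemory, requested int64) bool {
	return freeGPUMemory >= requested
}

func fitsForPreempt(totalGPUMemory, requested int64) bool {
	return totalGPUMemory >= requested
}
```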
Extended troubleshooting for similar issues:
Beyond the vgpu problem above, checking the implementation of other resources and plugins shows that similar problems exist in many places: for example, the predicate plugin's upper limit on the number of pods a node supports, and the kube-scheduler-compatible Filter logic for node ports, pod affinity, node volumes, pod topology spread, and so on. A general solution is needed to fix all of these.
Fix:
Solution 1:
Split the predicate function and classify resources into those that can be preempted and those that cannot. Corresponding PR: Split the predicate function to solve the resource filtering problem encountered during preemption #2818
Shortcoming:
Splitting the predicate function exposes too many filtering functions, the semantic distinction is not clear enough, and it is difficult for developers and users to understand and maintain.
Solution 2:
Extend the return value of the predicate with a Status type that indicates, after a node is filtered, whether scheduling is allowed, preemption is allowed, or preemption is not allowed: allocate and backfill accept nodes whose predicate status is Success, while preempt and reclaim accept nodes whose predicate status is Success or Unschedulable. Corresponding PR: Predicate adapts allocate and preempt #2916
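The acceptance rule of Solution 2, expressed per action with the StatusCode stand-ins from the earlier sketches:

```go
// acceptedCodes lists which predicate status codes each action
// treats as "node is usable" under Solution 2.
var acceptedCodes = map[string][]StatusCode{
	"allocate": {Success},
	"backfill": {Success},
	"preempt":  {Success, Unschedulable},
	"reclaim":  {Success, Unschedulable},
}
```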
After comprehensive consideration, Solution 2 is more suitable.