
When the Gang policy (JobReadyFn) is enabled, the cache in the Predicate phase of the scheduling cycle causes a scheduling failure #3666

Closed
Kyrie336 opened this issue Aug 9, 2024 · 3 comments · Fixed by #3649
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


Kyrie336 commented Aug 9, 2024

Description

The submitted tasks have the following characteristics:
1. Pods in a Job are not identical; they may have different resource requirements, different node affinities, etc.
2. The Gang policy (JobReadyFn) is enabled.

In this case, all Pods in a Job share the same Predicate cache. Once one Pod fails the Predicate on a node, that node is never considered again for any other Pod in the Job. This can cause the whole task to fail to schedule.

I don't think you can assume that all Pods under a Job are identical, so they shouldn't share the same cache. Could the Predicate cache be made a configurable switch?

Note: I used a third-party platform to submit the Job, not a Volcano Job.
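The failure mode described above can be sketched in Go. This is a minimal, self-contained illustration of a job-scoped predicate cache, not Volcano's actual code; the names (`predicateCache`, `fits`, `schedule`) and the single-resource predicate are hypothetical stand-ins for the real predicate plugins.

```go
package main

import "fmt"

// predicateCache caches node-fit results per job, approximating the
// behavior described in this issue. Names here are illustrative only.
type predicateCache struct {
	// key: jobID + "/" + nodeName -> fit result of the FIRST pod checked
	results map[string]bool
}

func newPredicateCache() *predicateCache {
	return &predicateCache{results: map[string]bool{}}
}

// fits stands in for the real predicates (resources, affinity, ...):
// here, simply whether the pod's CPU request fits the node's capacity.
func fits(cpuReq, nodeCPU int) bool { return cpuReq <= nodeCPU }

// schedule checks a pod against a node, consulting the job-level cache
// first. The cached result is reused for every pod in the job,
// regardless of that pod's own spec.
func (c *predicateCache) schedule(jobID, node string, cpuReq, nodeCPU int) bool {
	key := jobID + "/" + node
	if cached, ok := c.results[key]; ok {
		return cached
	}
	ok := fits(cpuReq, nodeCPU)
	c.results[key] = ok
	return ok
}

func main() {
	c := newPredicateCache()
	// Pod A needs 8 CPUs, node-1 has 4: the predicate fails and is cached.
	fmt.Println(c.schedule("job-1", "node-1", 8, 4)) // false
	// Pod B of the same job needs only 2 CPUs and would fit node-1,
	// but the cached failure from pod A is returned instead.
	fmt.Println(c.schedule("job-1", "node-1", 2, 4)) // false, though it fits
}
```

With gang scheduling, Pod B's spurious failure can then fail the whole job's readiness check.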

Steps to reproduce the issue

Describe the results you received and expected

I don't think you can assume that all Pods under a Job are identical, so they shouldn't share the same cache. Could the Predicate cache be made a configurable switch?

What version of Volcano are you using?

release v1.9.0

Any other relevant information

No response

@Kyrie336 Kyrie336 added the kind/bug Categorizes issue or PR as related to a bug. label Aug 9, 2024
@lowang-bh
Member

Thanks for your feedback. There is already a PR #3649 to improve it.

@Kyrie336
Author

Kyrie336 commented Aug 9, 2024

> Thanks for your feedback. There is already a PR #3649 to improve it.

Thanks, I see it. With that PR there are now separate scheduling caches per TaskRole. However, when developing custom scheduling plugins, there are scenarios where Pods of the same TaskRole predicate different nodes. The current solution reduces the extensibility of the Predicate extension point. Could you consider making the cache configurable, or some other approach?
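For illustration, here is a hedged sketch of the kind of configurable cache scope being asked for. The `cacheScope` knob and its values are hypothetical; Volcano does not expose such a switch as of v1.9.0, and the role-keyed behavior is only approximated here.

```go
package main

import "fmt"

// cacheScope is a hypothetical configuration knob, not a real Volcano option.
type cacheScope int

const (
	scopeJob  cacheScope = iota // one entry per job/node (the reported behavior)
	scopeRole                   // one entry per taskRole/node (#3649's approach)
	scopeOff                    // no caching: every pod is predicated directly
)

// cacheKey returns the cache key for a pod/node pair under the given scope,
// and whether caching is enabled at all.
func cacheKey(scope cacheScope, jobID, role, pod, node string) (string, bool) {
	switch scope {
	case scopeJob:
		return jobID + "/" + node, true
	case scopeRole:
		return jobID + "/" + role + "/" + node, true
	default:
		return "", false // caching disabled: the pod name never matters
	}
}

func main() {
	// Two pods of the same taskRole still collide under scopeRole...
	k1, _ := cacheKey(scopeRole, "job-1", "worker", "worker-0", "node-1")
	k2, _ := cacheKey(scopeRole, "job-1", "worker", "worker-1", "node-1")
	fmt.Println(k1 == k2) // true: worker-1 inherits worker-0's result
	// ...while scopeOff forces a fresh predicate run per pod.
	_, cached := cacheKey(scopeOff, "job-1", "worker", "worker-0", "node-1")
	fmt.Println(cached) // false
}
```

The collision under `scopeRole` is exactly the extensibility concern above: a custom plugin that predicates same-role pods onto different nodes still gets stale results unless caching can be turned off.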

@Kyrie336
Author

@lowang-bh Could you take another look at this problem? I think this is an important bug fix. Thanks.
