
When the Gang policy (JobReadyFn) is enabled, the cache in the Predicate phase of the scheduling cycle causes a scheduling failure #3666

Closed
Kyrie336 opened this issue Aug 9, 2024 · 3 comments · Fixed by #3649
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


Kyrie336 commented Aug 9, 2024

Description

The submitted tasks have the following characteristics:
1. Pods in a Job are not identical; they may have different resource requirements, different node affinities, etc.
2. The Gang policy (JobReadyFn) is enabled.

In this case, all Pods in a Job share the same Predicate cache. Once one Pod fails the Predicate on a node, that node is never considered again for any other Pod in the Job. This can cause the whole task to fail to schedule.

I don't think you can assume that all Pods under a Job are identical, so they shouldn't share the same cache. Could the Predicate cache be made a configurable switch?

Note: I used a third-party platform to submit the Job, not a Volcano Job.
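The failure mode described above can be sketched in Go. This is a minimal, self-contained illustration of a job-scoped predicate cache, not Volcano's actual code; the names (`predicateCache`, `fits`, `schedule`) and the single-resource predicate are hypothetical stand-ins for the real predicate plugins.

```go
package main

import "fmt"

// predicateCache caches node-fit results per job, approximating the
// behavior described in this issue. Names here are illustrative only.
type predicateCache struct {
	// key: jobID + "/" + nodeName -> fit result of the FIRST pod checked
	results map[string]bool
}

func newPredicateCache() *predicateCache {
	return &predicateCache{results: map[string]bool{}}
}

// fits stands in for the real predicates (resources, affinity, ...):
// here, simply whether the pod's CPU request fits the node's capacity.
func fits(cpuReq, nodeCPU int) bool { return cpuReq <= nodeCPU }

// schedule checks a pod against a node, consulting the job-level cache
// first. The cached result is reused for every pod in the job,
// regardless of that pod's own spec.
func (c *predicateCache) schedule(jobID, node string, cpuReq, nodeCPU int) bool {
	key := jobID + "/" + node
	if cached, ok := c.results[key]; ok {
		return cached
	}
	ok := fits(cpuReq, nodeCPU)
	c.results[key] = ok
	return ok
}

func main() {
	c := newPredicateCache()
	// Pod A needs 8 CPUs, node-1 has 4: the predicate fails and is cached.
	fmt.Println(c.schedule("job-1", "node-1", 8, 4)) // false
	// Pod B of the same job needs only 2 CPUs and would fit node-1,
	// but the cached failure from pod A is returned instead.
	fmt.Println(c.schedule("job-1", "node-1", 2, 4)) // false, though it fits
}
```

With gang scheduling, Pod B's spurious failure can then fail the whole job's readiness check.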

Steps to reproduce the issue

Describe the results you received and expected

I don't think you can assume that all Pods under a Job are identical, so they shouldn't share the same cache. Could the Predicate cache be made a configurable switch?

What version of Volcano are you using?

release v1.9.0

Any other relevant information

No response

@Kyrie336 Kyrie336 added the kind/bug Categorizes issue or PR as related to a bug. label Aug 9, 2024
@lowang-bh
Member

Thanks for your feedback. There is already a PR #3649 to improve it.

@Kyrie336
Author

Kyrie336 commented Aug 9, 2024

> Thanks for your feedback. There is already a PR #3649 to improve it.

Thanks, I see it. With that PR there are now separate scheduling caches per TaskRole. However, when developing custom scheduling plugins, there are scenarios where Pods of the same TaskRole predicate different nodes. The current solution reduces the extensibility of the Predicate extension point. Could you consider making the cache configurable, or some other approach?
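For illustration, here is a hedged sketch of the kind of configurable cache scope being asked for. The `cacheScope` knob and its values are hypothetical; Volcano does not expose such a switch as of v1.9.0, and the role-keyed behavior is only approximated here.

```go
package main

import "fmt"

// cacheScope is a hypothetical configuration knob, not a real Volcano option.
type cacheScope int

const (
	scopeJob  cacheScope = iota // one entry per job/node (the reported behavior)
	scopeRole                   // one entry per taskRole/node (#3649's approach)
	scopeOff                    // no caching: every pod is predicated directly
)

// cacheKey returns the cache key for a pod/node pair under the given scope,
// and whether caching is enabled at all.
func cacheKey(scope cacheScope, jobID, role, pod, node string) (string, bool) {
	switch scope {
	case scopeJob:
		return jobID + "/" + node, true
	case scopeRole:
		return jobID + "/" + role + "/" + node, true
	default:
		return "", false // caching disabled: the pod name never matters
	}
}

func main() {
	// Two pods of the same taskRole still collide under scopeRole...
	k1, _ := cacheKey(scopeRole, "job-1", "worker", "worker-0", "node-1")
	k2, _ := cacheKey(scopeRole, "job-1", "worker", "worker-1", "node-1")
	fmt.Println(k1 == k2) // true: worker-1 inherits worker-0's result
	// ...while scopeOff forces a fresh predicate run per pod.
	_, cached := cacheKey(scopeOff, "job-1", "worker", "worker-0", "node-1")
	fmt.Println(cached) // false
}
```

The collision under `scopeRole` is exactly the extensibility concern above: a custom plugin that predicates same-role pods onto different nodes still gets stale results unless caching can be turned off.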

@Kyrie336
Author

@lowang-bh Could you take another look at this problem? I think this is an important bug fix. Thanks.
