
Optimize log output when resources are not enough #2993

Closed
hwdef opened this issue Jul 24, 2023 · 11 comments · Fixed by #3538
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@hwdef
Member

hwdef commented Jul 24, 2023

What would you like to be added:

Optimize the scheduler's logs when resources are low by printing exactly which resources are insufficient.

Why is this needed:

It makes it easier to identify the reason resources are insufficient.

kubernetes scheduler vs. volcano

kubernetes scheduler:
[screenshot of the kube-scheduler event message]

volcano:
[screenshot of the Volcano event message]

Volcano just prints NotEnoughResources.
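
The screenshots are not reproduced here; for illustration, the difference looks roughly like this (the kube-scheduler line follows its standard unschedulable-event wording, with made-up node counts):

```
# kube-scheduler:
0/3 nodes are available: 3 Insufficient cpu.

# volcano:
NotEnoughResources
```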

@hwdef hwdef added the kind/feature label Jul 24, 2023
@srikanth-iyengar

srikanth-iyengar commented Aug 12, 2023

Hello @hwdef, I'd like to inquire if I could select this particular issue as my initial task for contributing to Volcano:

Is it feasible to determine the type of resource scarcity causing a job blockage? One potential approach could involve querying the Kubernetes cluster for available memory and CPU. By comparing these metrics with the job's resource allocation requests, we could potentially pinpoint the source of the resource-allocation failure. Your guidance on whether this is an appropriate first task would be greatly appreciated.

@hwdef
Member Author

hwdef commented Aug 12, 2023

Yes, you can fix this. Our goal is for the error messages in the case of insufficient resources to be consistent with those of the native Kubernetes scheduler.

@srikanth-iyengar

@hwdef Can you please confirm whether my approach is correct,
i.e. querying the Kubernetes cluster for available memory and CPU. By comparing these metrics with the job's resource allocation requests, we can find which resource is causing the issue and log it.

@hwdef
Member Author

hwdef commented Aug 12, 2023

> @hwdef Can you please confirm whether my approach is correct, i.e. querying the Kubernetes cluster for available memory and CPU. By comparing these metrics with the job's resource allocation requests, we can find which resource is causing the issue and log it.

By and large this is correct, but I don't think you need to actively query the Kubernetes cluster for the available CPU and remaining memory; Volcano should already be maintaining these in its cache. And we don't just need to know about CPU and memory; there are also extended resources such as nvidia.com/gpu.
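
In code, the check described here might look roughly like the minimal Go sketch below. The `Resource` struct is a simplified stand-in for Volcano's `api.Resource`, not the real type, and `insufficientResources` is a hypothetical helper:

```go
package main

import "fmt"

// Resource is a simplified stand-in for Volcano's api.Resource:
// milli-CPU, memory in bytes, plus named scalar (extended) resources.
type Resource struct {
	MilliCPU float64
	Memory   float64
	Scalar   map[string]float64 // e.g. "nvidia.com/gpu"
}

// insufficientResources returns the name of every dimension where the
// request exceeds what is idle, so the log can say exactly what is short.
func insufficientResources(request, idle Resource) []string {
	var short []string
	if request.MilliCPU > idle.MilliCPU {
		short = append(short, "cpu")
	}
	if request.Memory > idle.Memory {
		short = append(short, "memory")
	}
	for name, req := range request.Scalar {
		if req > idle.Scalar[name] { // missing keys read as 0
			short = append(short, name)
		}
	}
	return short
}

func main() {
	request := Resource{MilliCPU: 2000, Memory: 4 << 30,
		Scalar: map[string]float64{"nvidia.com/gpu": 1}}
	idle := Resource{MilliCPU: 8000, Memory: 2 << 30,
		Scalar: map[string]float64{}}
	for _, r := range insufficientResources(request, idle) {
		fmt.Printf("Insufficient %s\n", r) // e.g. "Insufficient memory"
	}
}
```

The idle amounts would come from Volcano's node cache rather than from live cluster queries, which is the point of the correction above.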

@srikanth-iyengar

OK, got it.
I will start working on this.

@srikanth-iyengar

/assign

@lowang-bh
Member

lowang-bh commented Aug 13, 2023

Hi @srikanth-iyengar, you can refer to the following messages from a deployment with 2 pods.

The first pod's scheduling-failure message:
[screenshot of the first pod's event message]

Another pod's message:
[screenshot of the second pod's event message]

The podgroup's message:
[screenshot of the podgroup's event message]

@srikanth-iyengar

srikanth-iyengar commented Aug 13, 2023

[screenshot of the proposed log output]
Will this solution resolve the issue with the resource-unavailability log?

I am covering all the resources by looping through the required resources, like this (a rough sketch of such a loop follows):
[screenshot of the loop over the required resources]
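
The loop in the screenshot is not reproduced here; as a rough, self-contained guess at what such a loop could assemble (the resource names are illustrative, and the joined format mirrors kube-scheduler's style rather than Volcano's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Names of the dimensions found insufficient, e.g. by a per-resource
	// check like the one sketched earlier in this thread.
	short := []string{"cpu", "nvidia.com/gpu"}
	parts := make([]string, 0, len(short))
	for _, name := range short {
		parts = append(parts, "Insufficient "+name)
	}
	// Prints: Insufficient cpu, Insufficient nvidia.com/gpu
	fmt.Println(strings.Join(parts, ", "))
}
```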

@srikanth-iyengar

> Hi @srikanth-iyengar, you can refer to the following messages from a deployment with 2 pods.
>
> The first pod's scheduling-failure message: [screenshot]
>
> Another pod's message: [screenshot]
>
> The podgroup's message: [screenshot]

Thanks for the info @lowang-bh

@srikanth-iyengar

Hey @hwdef and @lowang-bh, could one of you give me a heads-up on whether everything looks good in the PR?

@srikanth-iyengar

/assign
