RFVE: Support GPU share #624

Closed
k82cn opened this issue Dec 19, 2019 · 16 comments
Labels
area/scheduling
kind/feature: Categorizes issue or PR as related to a new feature.
kind/RFE: Categorizes issue or PR as related to design.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Comments

@k82cn
Member

k82cn commented Dec 19, 2019

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature
/priority important-soon
/area scheduling

Description:

Several scenarios require GPU sharing, e.g. inference workloads and dev environments; it would be better to support that by default.

@volcano-sh-bot volcano-sh-bot added the kind/feature, priority/important-soon, and area/scheduling labels Dec 19, 2019
@k82cn k82cn added this to the v0.4 milestone Dec 19, 2019
@k82cn
Member Author

k82cn commented Dec 19, 2019

/cc @carmark @Jeffwan

@Rui-Tang
Contributor

@Rui-Tang
Contributor

These two projects share GPU by sharing GPU memory

@k82cn k82cn changed the title from "Support GPU share" to "RFVE: Support GPU share" Dec 24, 2019
@k82cn
Member Author

k82cn commented Dec 24, 2019

I know those two projects; it would be great if we could build gpushare-scheduler-extender as a plugin of Volcano :)

/cc @cheyang

@k82cn
Member Author

k82cn commented Dec 24, 2019

And it should be fine to import the code from that repo, similar to what we have done with the algorithms of the default scheduler :)

@k82cn
Member Author

k82cn commented Dec 24, 2019

/kind rfve

@volcano-sh-bot volcano-sh-bot added the kind/RFE label Dec 24, 2019
@Jeffwan
Member

Jeffwan commented Dec 26, 2019

@k82cn We will probably release a simple vGPU device plugin later. The GPU share idea is pretty straightforward: map each physical GPU to virtual GPUs and advertise them as custom resources to the API server. Ali's solution needs an extra extender to handle the complex case of multi-GPU nodes. For us, I think we will probably ask users to use it only on single-chip nodes, which simplifies the case a lot. I will collect more feedback and share it here.

I would say that for cloud providers who don't allow users to change the master scheduler configuration, extender support in volcano (as a secondary scheduler) will help.
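
To illustrate the "map physical GPU to virtual GPUs" idea from this comment, here is a minimal sketch in Go. It is not the plugin discussed above, whose design is not shown in this thread. The `VirtualGPU` type, the `SplitGPUs` function, and the `-vgpu-N` ID scheme are all hypothetical names chosen for illustration, and the step of advertising the resulting devices to the kubelet through the device plugin API is deliberately left out.

```go
package main

import "fmt"

// VirtualGPU is a hypothetical record that ties one advertised virtual
// device back to the physical GPU that backs it.
type VirtualGPU struct {
	ID         string // ID a device plugin could advertise (assumed naming scheme)
	PhysicalID string // physical GPU backing this virtual device
}

// SplitGPUs maps every physical GPU to perGPU virtual GPUs.
// It returns nil when perGPU is not positive.
func SplitGPUs(physical []string, perGPU int) []VirtualGPU {
	if perGPU <= 0 {
		return nil
	}
	vgpus := make([]VirtualGPU, 0, len(physical)*perGPU)
	for _, p := range physical {
		for i := 0; i < perGPU; i++ {
			vgpus = append(vgpus, VirtualGPU{
				ID:         fmt.Sprintf("%s-vgpu-%d", p, i),
				PhysicalID: p,
			})
		}
	}
	return vgpus
}

func main() {
	// A single-chip node split into four virtual GPUs.
	for _, v := range SplitGPUs([]string{"GPU-0"}, 4) {
		fmt.Println(v.ID, "->", v.PhysicalID)
	}
}
```

On a single-chip node every virtual device maps back to the same physical GPU, which is why that case needs no extra placement logic in the scheduler.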

@k82cn
Member Author

k82cn commented Dec 27, 2019

> I would say that for cloud providers who don't allow users to change the master scheduler configuration, extender support in volcano (as a secondary scheduler) will help.

Are you going to use volcano as the default scheduler in your cluster? I'm OK with supporting the extender in volcano :)

@k82cn
Member Author

k82cn commented Dec 27, 2019

> We will probably release a simple vGPU device plugin later

Will you open source it? If so, we'd like to leverage it in volcano :)

@Jeffwan
Member

Jeffwan commented Dec 27, 2019

>> I would say that for cloud providers who don't allow users to change the master scheduler configuration, extender support in volcano (as a secondary scheduler) will help.

> Are you going to use volcano as the default scheduler in your cluster? I'm OK with supporting the extender in volcano :)

I mean that on AWS, users cannot use a scheduler extender since they cannot touch the default scheduler's configuration. If volcano provides some scheduling support for custom resources, these users can adopt it as an alternative.

>> We will probably release a simple vGPU device plugin later

> Will you open source it? If so, we'd like to leverage it in volcano :)

It's under internal discussion and I will try to move it forward ASAP. It's similar to gpushare-device-plugin but simpler. My understanding is that volcano aims to provide a generic scheduling solution for these kinds of custom resources? Using gpushare-scheduler-extender as an example, it does some bin packing and filters out cases where the virtual GPU resources would span two different physical GPUs.
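
The two behaviours mentioned here (bin packing, and rejecting placements that would spread one request over two physical GPUs) can be sketched roughly as follows. This is an illustrative assumption, not the actual gpushare-scheduler-extender code: the `GPU` type, the `PickGPU` function, and the MiB-based accounting are hypothetical.

```go
package main

import "fmt"

// GPU is a hypothetical view of one physical device's free memory.
type GPU struct {
	ID      string
	FreeMiB int
}

// PickGPU returns the index of the physical GPU a request should land on,
// or -1 if no single GPU can hold it.
//
// Filter:   a request must fit entirely on one physical GPU; free memory
//           on different GPUs is never added together.
// Bin pack: among the GPUs that fit, choose the one with the least free
//           memory, so that emptier GPUs stay available for larger requests.
func PickGPU(gpus []GPU, requestMiB int) int {
	best := -1
	for i, g := range gpus {
		if g.FreeMiB < requestMiB {
			continue // does not fit on this GPU alone
		}
		if best == -1 || g.FreeMiB < gpus[best].FreeMiB {
			best = i
		}
	}
	return best
}

func main() {
	gpus := []GPU{{"GPU-0", 4096}, {"GPU-1", 10240}, {"GPU-2", 6144}}
	fmt.Println(PickGPU(gpus, 5000))  // tightest GPU that still fits
	fmt.Println(PickGPU(gpus, 12000)) // no single GPU is large enough
}
```

In the second call the node has enough free memory in total, but the request is still rejected because satisfying it would require memory from more than one physical GPU.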

@k82cn
Member Author

k82cn commented Dec 27, 2019

>>> I would say that for cloud providers who don't allow users to change the master scheduler configuration, extender support in volcano (as a secondary scheduler) will help.

>> Are you going to use volcano as the default scheduler in your cluster? I'm OK with supporting the extender in volcano :)

> I mean that on AWS, users cannot use a scheduler extender since they cannot touch the default scheduler's configuration. If volcano provides some scheduling support for custom resources, these users can adopt it as an alternative.

Great! We do need to support GPU share, and it would be great for it to be an alternative for your service :)

>>> We will probably release a simple vGPU device plugin later

>> Will you open source it? If so, we'd like to leverage it in volcano :)

> It's under internal discussion and I will try to move it forward ASAP. It's similar to gpushare-device-plugin but simpler. My understanding is that volcano aims to provide a generic scheduling solution for these kinds of custom resources? Using gpushare-scheduler-extender as an example, it does some bin packing and filters out cases where the virtual GPU resources would span two different physical GPUs.

Volcano aims to be a batch system, similar to Slurm/YARN; we will include more components in this project, including a device plugin if necessary, as we'd like to provide an e2e solution for batch workloads :)

@k82cn
Member Author

k82cn commented Jan 5, 2020

There's a really long discussion in upstream: kubernetes/kubernetes#52757

@Jeffwan
Member

Jeffwan commented Jan 20, 2020

That's true. Our solution is based on kubernetes/kubernetes#52757 (comment)

@xiaogaozi
Contributor

kubernetes/kubernetes#52757 (comment) provides a new solution from the Tencent Cloud team

@stale

stale bot commented Aug 18, 2020

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale label Aug 18, 2020
@stale

stale bot commented Oct 17, 2020

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Oct 17, 2020

5 participants