RFVE: Support GPU share #624

k82cn · 2019-12-19T02:12:50Z

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature
/priority important-soon
/area scheduling

Description:

Several scenarios require GPU sharing, e.g. inference workloads and dev environments; it would be better to support it by default.

k82cn · 2019-12-19T02:14:49Z

/cc @carmark @Jeffwan

Rui-Tang · 2019-12-20T08:29:56Z

Hi, I found two relevant projects.
https://github.com/AliyunContainerService/gpushare-device-plugin
https://github.com/AliyunContainerService/gpushare-scheduler-extender

Rui-Tang · 2019-12-20T08:30:42Z

These two projects share a GPU by sharing its GPU memory.

k82cn · 2019-12-24T08:07:08Z

I know those two projects; it would be great if we could build gpushare-scheduler-extender as a plugin of Volcano :)

/cc @cheyang

k82cn · 2019-12-24T08:08:22Z

And it should be fine to import the code from that repo, similar to what we have done with the algorithms of the default scheduler :)

k82cn · 2019-12-24T08:12:16Z

/kind rfve

Jeffwan · 2019-12-26T22:05:54Z

@k82cn We will probably release a simple vGPU device plugin later. The GPU share idea is pretty straightforward: map a physical GPU to virtual GPUs and advertise them as custom resources to the API server. Ali's solution needs an extra extender to handle the complex case of multi-GPU nodes. For us, I think we will probably ask users to use it only on single-chip nodes, which simplifies the case a lot. I will collect more feedback and share it here.

I would say that for some of the cloud providers who don't allow users to change master scheduler configurations, extender support in Volcano (as a secondary scheduler) will help.
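The mapping described above can be illustrated with a minimal sketch. It is not the plugin mentioned in this comment: the resource name `example.com/vgpu`, the device ID format, and the helper names are all hypothetical, and a real device plugin would register these devices with the kubelet rather than print a dictionary.

```python
from dataclasses import dataclass

# Hypothetical extended resource name; a real plugin would pick its own.
RESOURCE_NAME = "example.com/vgpu"


@dataclass(frozen=True)
class VirtualGPU:
    device_id: str     # ID that would be advertised, e.g. "GPU-0-vgpu-1"
    physical_gpu: str  # physical GPU backing this virtual device


def virtual_devices(physical_gpus, slices_per_gpu):
    """Expand each physical GPU into `slices_per_gpu` virtual devices."""
    return [
        VirtualGPU(f"{gpu}-vgpu-{i}", gpu)
        for gpu in physical_gpus
        for i in range(slices_per_gpu)
    ]


def advertised_capacity(devices):
    """Node capacity as it would appear under the extended resource name."""
    return {RESOURCE_NAME: len(devices)}


# Single-chip node, as suggested above: every virtual GPU maps to the
# same physical GPU, so no placement decision is needed inside the node.
devices = virtual_devices(["GPU-0"], slices_per_gpu=4)
print(advertised_capacity(devices))  # {'example.com/vgpu': 4}
```

On a single-GPU node the scheduler only has to count virtual devices, which is why that restriction avoids the need for an extender.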

k82cn · 2019-12-27T01:32:30Z

> I would say that for some of the cloud providers who don't allow users to change master scheduler configurations, extender support in Volcano (as a secondary scheduler) will help.

Are you going to use Volcano as the default scheduler in your cluster? I'm OK with supporting extenders in Volcano :)

k82cn · 2019-12-27T01:33:34Z

> We will probably release a simple vGPU device plugin later.

Will you open source it? If so, we'd like to leverage it in Volcano :)

Jeffwan · 2019-12-27T04:32:38Z

> > I would say that for some of the cloud providers who don't allow users to change master scheduler configurations, extender support in Volcano (as a secondary scheduler) will help.
>
> Are you going to use Volcano as the default scheduler in your cluster? I'm OK with supporting extenders in Volcano :)

I mean that on AWS, users cannot use a scheduler extender since they cannot touch the default scheduler configs. If Volcano provides some scheduling support for custom resources, these users can adopt it as an alternative.

> > We will probably release a simple vGPU device plugin later.
>
> Will you open source it? If so, we'd like to leverage it in Volcano :)

It's under internal discussion and I will try to move it forward ASAP. It's similar to gpushare-device-plugin but simpler. As I understand it, Volcano aims to provide a generic scheduling solution for these kinds of custom resources? Using gpushare-scheduler-extender as an example, it does some bin packing and filters out cases where the requested virtual GPU resources would span two different physical GPUs.
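The filter and bin-pack behaviour mentioned above can be sketched roughly as follows. This is an illustration only, not code from gpushare-scheduler-extender: the function names and the representation of a node as a list of free memory per physical GPU (in GiB) are assumptions.

```python
def fits_on_single_gpu(free_mem_per_gpu, request):
    """Filter step: the request must fit entirely on one physical GPU."""
    return any(free >= request for free in free_mem_per_gpu)


def pick_gpu_binpack(free_mem_per_gpu, request):
    """Bin-pack step: choose the fitting GPU with the least free memory.

    Returns the GPU index, or None when no single GPU can hold the request.
    """
    candidates = [
        (free, idx)
        for idx, free in enumerate(free_mem_per_gpu)
        if free >= request
    ]
    if not candidates:
        return None
    return min(candidates)[1]


# Hypothetical node with two GPUs that have 3 GiB and 5 GiB free.
node = [3, 5]
print(fits_on_single_gpu(node, 6))  # False: 8 GiB free in total, but split across two GPUs
print(pick_gpu_binpack(node, 2))    # 0: the tighter fit leaves the larger GPU intact
print(pick_gpu_binpack(node, 4))    # 1: only the second GPU has enough room
```

Summing free memory per node, as a plain extended-resource count would, accepts the 6 GiB request above, which is the case the filter is meant to reject.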

k82cn · 2019-12-27T06:19:05Z

> > > I would say that for some of the cloud providers who don't allow users to change master scheduler configurations, extender support in Volcano (as a secondary scheduler) will help.
> >
> > Are you going to use Volcano as the default scheduler in your cluster? I'm OK with supporting extenders in Volcano :)
>
> I mean that on AWS, users cannot use a scheduler extender since they cannot touch the default scheduler configs. If Volcano provides some scheduling support for custom resources, these users can adopt it as an alternative.

Great! We do need to support GPU sharing, and it would be great for Volcano to be an alternative for your service :)

> > > We will probably release a simple vGPU device plugin later.
> >
> > Will you open source it? If so, we'd like to leverage it in Volcano :)
>
> It's under internal discussion and I will try to move it forward ASAP. It's similar to gpushare-device-plugin but simpler. As I understand it, Volcano aims to provide a generic scheduling solution for these kinds of custom resources? Using gpushare-scheduler-extender as an example, it does some bin packing and filters out cases where the requested virtual GPU resources would span two different physical GPUs.

Volcano aims to be a batch system, similar to Slurm/YARN; we will include more components in this project, including a device plugin if necessary, as we'd like to provide an e2e solution for batch workloads :)

k82cn · 2020-01-05T03:01:16Z

There's a really long discussion in upstream: kubernetes/kubernetes#52757

Jeffwan · 2020-01-20T22:33:03Z

That's true. Our solution is based on kubernetes/kubernetes#52757 (comment)

xiaogaozi · 2020-03-13T04:15:21Z

kubernetes/kubernetes#52757 (comment) provides a new solution from the Tencent Cloud team.

stale · 2020-08-18T06:32:44Z

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale · 2020-10-17T07:17:31Z

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

volcano-sh-bot added the kind/feature, priority/important-soon, and area/scheduling labels Dec 19, 2019

k82cn added this to the v0.4 milestone Dec 19, 2019

k82cn changed the title ~~Support GPU share~~ RFVE: Support GPU share Dec 24, 2019

volcano-sh-bot added the kind/RFE label Dec 24, 2019

k82cn modified the milestones: v0.4, v1.0 Apr 26, 2020

virendrasuryavanshi mentioned this issue May 21, 2020

Volcano Scheduler - Cluster for GSoC Project [CNCF - Virtual Kubelet] cncf/cluster#136

Closed

stale bot added the lifecycle/stale label Aug 18, 2020

stale bot closed this as completed Oct 17, 2020
