
CI for GPU packages #1062

Closed
h-vetinari opened this issue May 16, 2020 · 18 comments

@h-vetinari
Member

Some packages require actual GPU devices to run their tests, which is currently not possible on conda-forge's CI - is there any way we can make this happen?

This has already been mentioned in #901 by @leofang:

Is CI set up to build and test GPU packages? If not, where is this done?

However, I think this is a much more specific question than what #901 tries to address, so I'm opening this issue. Note that the (somewhat stalled) discussion in #902 might provide a clue as to the way forward:

MS-rep: We're soon going to start private preview of elastic self-hosted pools. Basically, we'll do the elastic management side for you; you run it in your Azure subscription so you can have whatever beefy machine you want.

@mariusvniekerk: That sounds great. So essentially these will run the same build host configurations as stock ones, just with different hosting? Or do we have to build the machine images ourselves?

For more details see discussion there; but maybe there are other/better ways as well?

A non-exhaustive list of packages that are affected by this: pytorch, cupy, pyarrow, faiss, etc.
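
For concreteness, here is a minimal sketch (hypothetical, not taken from any particular feedstock) of the kind of smoke test that only passes with a real device attached - on the hosted agents we have today, the first call into the CUDA runtime already fails:

```python
# Hypothetical GPU smoke test; assumes cupy is installed in the test environment.
import cupy as cp


def test_gpu_smoke():
    # Returns 0 (or raises) on agents without a usable GPU/driver.
    assert cp.cuda.runtime.getDeviceCount() > 0

    # Tiny end-to-end check that a kernel actually launches on the device.
    x = cp.arange(1024, dtype=cp.float32)
    assert float(x.sum()) == 1024 * 1023 / 2
```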

@leofang
Member

leofang commented May 18, 2020

Thanks, @h-vetinari. This is an important request. In CuPy's case, we have hit bugs multiple times when attempting to enable some (experimental) support that upstream did not cover sufficiently in their CIs. Having CF's own GPU CI would be very helpful.

That said, in our case this could have been avoided entirely if upstream had tested it thoroughly, and I do feel that is the right way to go, especially for GPU packages. The upstream CIs should have a large build matrix (Python ver * NumPy ver * CUDA ver * OS ver * ...) to provide good coverage. In contrast, CF's CIs should focus only on "getting the packaging right" and nothing further. Taking CuPy as an example, it is impractical to run its full test suite on CF's CIs: each run takes 1~1.5 hr, and we have 12 builds from the aforementioned matrix. It simply takes too long.
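
To make the scale concrete, a rough back-of-the-envelope (the axis values below are illustrative placeholders, not the real feedstock matrix):

```python
# Rough estimate of the CI time needed to run CuPy's full suite per rebuild.
# The axes are placeholders chosen to give 12 builds, matching the matrix above.
from itertools import product

pythons = ["3.6", "3.7", "3.8"]
cudas = ["9.2", "10.1", "10.2", "11.0"]

builds = list(product(pythons, cudas))   # 3 * 4 = 12 builds
hours_per_run = 1.5                      # upper end of the 1~1.5 hr range

print(f"{len(builds)} builds * {hours_per_run} h = "
      f"{len(builds) * hours_per_run} GPU-hours per rebuild")
# -> 12 builds * 1.5 h = 18.0 GPU-hours per rebuild
```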

@leofang
Member

leofang commented May 20, 2020

cc: @jakirkham

@beckermr
Member

cc @scopatz

@h-vetinari
Member Author

I wanted to bring up something along the lines of this again:

The birds

Given the effort expended on packaging GPU packages, the role of conda(-forge) in the scientific/ML stack (including the "network effect"), and the capabilities of the existing conda(-forge) infrastructure, it would kill a lot of metaphorical birds with one stone if conda-forge CI had GPU support, because:

  • a lot of redundant efforts could be saved
  • without sacrificing CI quality
  • yielding high-quality packaging with large os-/arch-/version-coverage

The stone - A jointly sponsored build queue for conda-forge?

In the comment I referenced above, I only mentioned Microsoft (who by now power most of CF's CI) as a possible sponsor for this, but that applies even more to the companies that are more directly involved, like Nvidia (obviously), Facebook (pytorch, but also faiss, etc.), and perhaps others like NumFOCUS, Ursa Labs (arrow), Quansight or QuantStack.

I'm thinking that an opt-in build queue based on a separate, GPU-enabled Azure subscription (which is already feasible for CF) would have huge bang for the buck even just for the companies directly involved (in the form of less time spent by employees on packaging), to say nothing of the rest of the ecosystem.

I apologise for the multi-ping, but I'm hoping that by bringing different people to the same table, a way forward might emerge more quickly & easily. 🙃

Some NVidia / RAPIDS / CuPy folks
@cjnolet @dantegd @leofang @kkraus14 @mike-wendt @teju85
Some Facebook folks
@beauby @ezyang @mdouze @seemethere @soumith
Some NumFOCUS / Quansight / Ursa Labs peeps
@kszucs @pearu @scopatz @rgommers @wesm
Microsoft
@vtbassmatt
conda-forge / anaconda + other possibly relevant parties
@conda-forge/core @hadim @hmaarrfk @jph00

Happy holidays!

PS. I have had this idea for a while, but the thought was retriggered by the new GPU support for pytorch in conda-forge, which however times out (requiring manual builds) and does not test with actual GPUs (plus my having some time to write these words).

@beckermr
Member

We should all chat offline. We have other things going on around this as well and hope to make progress in the new year.

@hmaarrfk
Contributor

Another idea would be to use a self-hosted runner.

https://docs.github.com/en/free-pro-team@latest/actions/hosting-your-own-runners/about-self-hosted-runners

For Windows it is trickier, but for Linux we should be able to pool a few resources, maybe.
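
As a purely illustrative sketch (not existing feedstock tooling), a job landing on such a pooled runner could first fail fast if the machine it got has no usable GPU:

```python
# Hypothetical pre-flight check for a self-hosted GPU runner: bail out early
# if no device is visible, before starting the real test suite.
import shutil
import subprocess
import sys


def gpu_available() -> bool:
    nvidia_smi = shutil.which("nvidia-smi")
    if nvidia_smi is None:
        return False
    # `nvidia-smi -L` lists devices and exits non-zero if the driver
    # cannot reach any GPU.
    return subprocess.run([nvidia_smi, "-L"], capture_output=True).returncode == 0


if __name__ == "__main__":
    if not gpu_available():
        sys.exit("No GPU visible on this runner; refusing to run GPU tests.")
```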

@beckermr
Member

There are a bunch of practical and legal issues around this. The technical bits of setting it up are straightforward.

@kkraus14
Contributor

cc @datametrician @jakirkham from the NVIDIA side for visibility as well

@vtbassmatt

Happy New Year! 🎉

In terms of the tech, you can probably use scale set agents with an appropriate choice of VM and image to achieve this.

From a funding/sponsorship perspective, Microsoft's contribution model is free hosted agents and parallelism.

@beckermr
Member

@vtbassmatt are you saying the permissions on the conda-forge account already allow us to use agents w/ GPUs attached?

@vtbassmatt

Yes, you (whoever is the organization admin) should be able to create a scale set pool pointing to an Azure subscription. The subscription will need a scale set running on the type of VM you prefer, in this case with GPU support. Then any pipelines authorized to use that pool will have access to that GPU-enabled virtual hardware.

If this isn't clear, my apologies and I can help out more next week when I'm back at work.

@leofang
Member

leofang commented Dec 30, 2020

Sounds like a Christmas gift for free!

@beckermr
Member

That seems perfectly clear! Thank you!

@mariusvniekerk is one of our most knowledgeable azure folks.

We will give this a shot and see what we find!

@h-vetinari
Member Author

Ping @mariusvniekerk :)

@beckermr
Member

beckermr commented Jan 9, 2021

Marius and I chatted. I think I misunderstood what was being said here. We'll need a hosted pool of VMs with GPUs to do this.

@h-vetinari
Member Author

@beckermr: Marius and I chatted. I think I misunderstood what was being said here. We'll need a hosted pool of VMs with GPUs to do this.

That was my understanding when I tried to make the case for bringing together interested/affected parties to come up with (the funding for) such a hosted pool.

@beckermr
Member

We have action here on multiple fronts. I am going to close this issue in favor of #1272.

@h-vetinari
Member Author

h-vetinari commented Mar 13, 2021

So, some context for everyone subscribed to this issue (not least all the people I tagged).

In the meantime, I tried to procure some initial funding for the idea of a jointly sponsored build queue from my employer (a small-ish data consulting company based in Switzerland; only a small fraction of our work is related to conda, but we're interested in the health of the ecosystem, particularly from the testing & security side of the story), and got approval for $500/month for a year as an experiment. My hope is that by placing some initial chips on the table, some of the other (much more involved) players might be enticed to join as well. 🙃

Those $6000/year are roughly the cost of continuously running one NC6 agent on Azure (the smallest GPU instance). This likely wouldn't be enough to cover all of conda-forge's needs, but I'm guesstimating that roughly 3-4 times that much would be more than enough at the moment (e.g. the Drone queue also manages with two machines).

That amount represents around a person-week of engineer time (cf. NEP 46), which I think should be a drop in the bucket for companies that employ people who spend any non-trivial amount of time on packaging GPU-related software - especially compared to the time lost by doing this work disjointly, or conversely the time potentially saved by doing it through conda-forge.
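
For transparency, the back-of-the-envelope behind those numbers (the hourly rate is simply backed out from the ~$6000/year figure above, not a quoted Azure price):

```python
# Rough cost sketch; the hourly rate is an assumption derived from the
# ~$6000/year figure above, not an official Azure price - check current
# NC-series pricing before relying on it.
HOURLY_RATE_USD = 0.68      # assumed cost of one NC6 agent
HOURS_PER_YEAR = 24 * 365

one_agent = HOURLY_RATE_USD * HOURS_PER_YEAR          # ~ $5,957 / year
three_to_four = (3 * one_agent, 4 * one_agent)        # ~ $17,870 - $23,827 / year

print(f"1 agent:    ~${one_agent:,.0f}/year")
print(f"3-4 agents: ~${three_to_four[0]:,.0f}-${three_to_four[1]:,.0f}/year")
```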

I suggested this to @conda-forge/core, and it turns out that there is at least one other proposal along those lines currently in the pipeline. This would be great from my POV, because unifying all those efforts is IMO the ideal scenario (assuming the legalities are solvable). In any case, those discussions are now slowly unfolding, and perhaps provide some background colour to the opening of #1272 & #1273.

In short: it would be amazing if any of the involved people / companies could take this as impetus to chip in something as well. GPU computing is only ever going to get larger, and I believe that sharing some (comparatively low) CI infra costs to enable conda-forge to do the building & integration would provide huge bang for the buck for the people & companies that are building & using such packages.
