
CI for GPU packages #1062

Closed
h-vetinari opened this issue May 16, 2020 · 18 comments

@h-vetinari
Member

Some packages require actual GPU devices to run their tests, which is currently not possible on conda-forge's CI - is there any way we can make this happen?

This has already been mentioned in #901 by @leofang:

Is CI set up to build and test GPU packages? If not, where is this done?

However, I think this is a much more specific question than what #901 tries to address, so I'm opening this issue. Note that the (somewhat stalled) discussion in #902 might provide a clue as to the way forward:

MS-rep: We're soon going to start private preview of elastic self-hosted pools. Basically, we'll do the elastic management side for you; you run it in your Azure subscription so you can have whatever beefy machine you want.

@mariusvniekerk: That sounds great. So essentially these will run the same build host configurations as stock ones, just with different hosting? Or do we have to build the machine images ourselves?

For more details see discussion there; but maybe there are other/better ways as well?

A non-exhaustive list of packages that are affected by this: pytorch, cupy, pyarrow, faiss, etc.
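
For concreteness, here is a minimal sketch (hypothetical, not taken from any particular feedstock) of the kind of smoke test that only passes with a real device attached - on the hosted agents we have today, the first call into the CUDA runtime already fails:

```python
# Hypothetical GPU smoke test; assumes cupy is installed in the test environment.
import cupy as cp


def test_gpu_smoke():
    # Returns 0 (or raises) on agents without a usable GPU/driver.
    assert cp.cuda.runtime.getDeviceCount() > 0

    # Tiny end-to-end check that a kernel actually launches on the device.
    x = cp.arange(1024, dtype=cp.float32)
    assert float(x.sum()) == 1024 * 1023 / 2
```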

@leofang
Member

leofang commented May 18, 2020

Thanks, @h-vetinari. This is an important request. In CuPy's case, we have hit bugs multiple times when attempting to enable some (experimental) support that upstream did not cover sufficiently in their CIs. Having CF's own GPU CI would be very helpful.

That said, in our case this could have been avoided entirely if upstream had tested it thoroughly, and I do feel that is the right way to go, especially for GPU packages. The upstream CIs should have a large build matrix (Python ver * NumPy ver * CUDA ver * OS ver * ...) to provide good coverage. In contrast, CF's CIs should focus only on "getting the packaging right" and nothing further. Taking CuPy as an example, it is impractical to run its full test suite on CF's CIs: each run takes 1~1.5 hr, and we have 12 builds from the aforementioned matrix. It simply takes too long.
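
To make the scale concrete, a rough back-of-the-envelope (the axis values below are illustrative placeholders, not the real feedstock matrix):

```python
# Rough estimate of the CI time needed to run CuPy's full suite per rebuild.
# The axes are placeholders chosen to give 12 builds, matching the matrix above.
from itertools import product

pythons = ["3.6", "3.7", "3.8"]
cudas = ["9.2", "10.1", "10.2", "11.0"]

builds = list(product(pythons, cudas))   # 3 * 4 = 12 builds
hours_per_run = 1.5                      # upper end of the 1~1.5 hr range

print(f"{len(builds)} builds * {hours_per_run} h = "
      f"{len(builds) * hours_per_run} GPU-hours per rebuild")
# -> 12 builds * 1.5 h = 18.0 GPU-hours per rebuild
```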

@leofang
Member

leofang commented May 20, 2020

cc: @jakirkham

@beckermr
Member

cc @scopatz

@h-vetinari
Member Author

I wanted to bring up something along the lines of this again:

The birds

Given the effort expended on packaging GPU packages, the role of conda(-forge) in the scientific/ML stack (including the "network effect"), and the capabilities of the existing conda(-forge) infrastructure, it would kill a lot of metaphorical birds with one stone if conda-forge CI had GPU support, because:

  • a lot of redundant efforts could be saved
  • without sacrificing CI quality
  • yielding high-quality packaging with large os-/arch-/version-coverage

The stone - A jointly sponsored build queue for conda-forge?

In the comment I referenced above, I only mentioned Microsoft (who by now power most of CF's CI) as a possible sponsor for this, but that applies even more to the companies that are more directly involved, like Nvidia (obviously), Facebook (pytorch, but also faiss, etc.), and perhaps others like NumFOCUS, Ursa Labs (arrow), Quansight or QuantStack.

I'm thinking that an opt-in build queue based on a separate, GPU-enabled Azure subscription (which is already feasible for CF) would have huge bang for the buck even just for the companies directly involved (in the form of less time spent by employees on packaging), to say nothing of the rest of the ecosystem.

I apologise for the multi-ping, but I'm hoping that by bringing different people to the same table, a way forward might emerge more quickly & easily. 🙃

Some NVidia / RAPIDS / CuPy folks
@cjnolet @dantegd @leofang @kkraus14 @mike-wendt @teju85
Some Facebook folks
@beauby @ezyang @mdouze @seemethere @soumith
Some NumFOCUS / Quansight / Ursa Labs peeps
@kszucs @pearu @scopatz @rgommers @wesm
Microsoft
@vtbassmatt
conda-forge / anaconda + other possibly relevant parties
@conda-forge/core @hadim @hmaarrfk @jph00

Happy holidays!

PS. I have had this idea for a while, but the thought was retriggered by the new GPU support for pytorch in conda-forge, which however times out (requiring manual builds) and does not test with actual GPUs (plus my having some time to write these words).

@beckermr
Member

We should all chat offline. We have other things going on around this as well and hope to make progress in the new year.

@hmaarrfk
Contributor

Another idea would be to use a self-hosted runner.

https://docs.github.com/en/free-pro-team@latest/actions/hosting-your-own-runners/about-self-hosted-runners

For Windows it is trickier, but for Linux we should be able to pool a few resources, maybe.
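
As a purely illustrative sketch (not existing feedstock tooling), a job landing on such a pooled runner could first fail fast if the machine it got has no usable GPU:

```python
# Hypothetical pre-flight check for a self-hosted GPU runner: bail out early
# if no device is visible, before starting the real test suite.
import shutil
import subprocess
import sys


def gpu_available() -> bool:
    nvidia_smi = shutil.which("nvidia-smi")
    if nvidia_smi is None:
        return False
    # `nvidia-smi -L` lists devices and exits non-zero if the driver
    # cannot reach any GPU.
    return subprocess.run([nvidia_smi, "-L"], capture_output=True).returncode == 0


if __name__ == "__main__":
    if not gpu_available():
        sys.exit("No GPU visible on this runner; refusing to run GPU tests.")
```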

@beckermr
Member

There are a bunch of practical and legal issues around this. The technical bits of setting it up are straightforward.

@kkraus14
Contributor

cc @datametrician @jakirkham from the NVIDIA side for visibility as well

@vtbassmatt

Happy New Year! 🎉

In terms of the tech, you can probably use scale set agents with an appropriate choice of VM and image to achieve this.

From a funding/sponsorship perspective, Microsoft's contribution model is free hosted agents and parallelism.

@beckermr
Member

@vtbassmatt are you saying the permissions on the conda-forge account already allow us to use agents w/ GPUs attached?

@vtbassmatt

Yes, you (whoever is the organization admin) should be able to create a scale set pool pointing to an Azure subscription. The subscription will need a scale set running on the type of VM you prefer, in this case with GPU support. Then any pipelines authorized to use that pool will have access to that GPU-enabled virtual hardware.

If this isn't clear, my apologies and I can help out more next week when I'm back at work.

@leofang
Member

leofang commented Dec 30, 2020

Sounds like a Christmas gift for free!

@beckermr
Member

That seems perfectly clear! Thank you!

@mariusvniekerk is one of our most knowledgeable azure folks.

We will give this a shot and see what we find!

@h-vetinari
Member Author

Ping @mariusvniekerk :)

@beckermr
Member

beckermr commented Jan 9, 2021

Marius and I chatted. I think I misunderstood what was being said here. We'll need a hosted pool of VMs with GPUs to do this.

@h-vetinari
Member Author

@beckermr: Marius and I chatted. I think I misunderstood what was being said here. We'll need a hosted pool of VMs with GPUs to do this.

That was my understanding when I tried to make the case for bringing together interested/affected parties to come up with (the funding for) such a hosted pool.

@beckermr
Member

We have action here on multiple fronts. I am going to close this issue in favor of #1272.

@h-vetinari
Member Author

h-vetinari commented Mar 13, 2021

So, some context for everyone subscribed to this issue (not least all the people I tagged).

In the meantime, I tried to procure some initial funding for the idea of a jointly sponsored build queue from my employer (a small-ish data consulting company based in Switzerland; only a small fraction of our work is related to conda, but we're interested in the health of the ecosystem, particularly from the testing & security side of the story), and got approval for $500/month for a year as an experiment. My hope is that by placing some initial chips on the table, some of the other (much more involved) players might be enticed to join as well. 🙃

Those $6000/year are roughly the cost of continuously running one NC6 agent on Azure (the smallest GPU instance). This likely wouldn't be enough to cover all of conda-forge's needs, but I'm guesstimating that roughly 3-4 times that much would be more than enough at the moment (e.g. the Drone queue also manages with two machines).

That amount represents around a person-week of engineer time (cf. NEP 46), which I think should be a drop in the bucket for companies that employ people who spend any non-trivial amount of time on packaging GPU-related software - especially compared to the time lost by doing this work disjointly, or conversely the time potentially saved by doing it through conda-forge.
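
For transparency, the back-of-the-envelope behind those numbers (the hourly rate is simply backed out from the ~$6000/year figure above, not a quoted Azure price):

```python
# Rough cost sketch; the hourly rate is an assumption derived from the
# ~$6000/year figure above, not an official Azure price - check current
# NC-series pricing before relying on it.
HOURLY_RATE_USD = 0.68      # assumed cost of one NC6 agent
HOURS_PER_YEAR = 24 * 365

one_agent = HOURLY_RATE_USD * HOURS_PER_YEAR          # ~ $5,957 / year
three_to_four = (3 * one_agent, 4 * one_agent)        # ~ $17,870 - $23,827 / year

print(f"1 agent:    ~${one_agent:,.0f}/year")
print(f"3-4 agents: ~${three_to_four[0]:,.0f}-${three_to_four[1]:,.0f}/year")
```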

I suggested this to @conda-forge/core, and it turns out that there is at least one other proposal along those lines currently in the pipeline. This would be great from my POV, because unifying all those efforts is IMO the ideal scenario (assuming the legalities are solvable). In any case, those discussions are now slowly unfolding, and perhaps provide some background colour to the opening of #1272 & #1273.

In short: it would be amazing if any of the involved people / companies could take this as impetus to chip in something as well. GPU computing is only ever going to get larger, and I believe that sharing some (comparatively low) CI infra costs to enable conda-forge to do the building & integration would provide huge bang for the buck for the people & companies that are building & using such packages.
