CI for GPU packages #1062
Some packages require actual GPU devices to run their tests, and this is currently not possible - is there any way we can get this to happen?

This has already been mentioned in #901 by @leofang. However, I think this is a much more specific question than what #901 tries to address, so I'm opening this issue. Note that the (somewhat stalled) discussion in #902 might provide a clue as to the way forward; for more details see the discussion there, but maybe there are other/better ways as well?

A non-exhaustive list of packages that are affected by this: pytorch, cupy, pyarrow, faiss, etc.
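For illustration, here is a minimal sketch of what a GPU-dependent test section in a feedstock's meta.yaml could look like (the package choice and commands are illustrative, not taken from any actual feedstock):

```yaml
# Hypothetical test section for a GPU package's meta.yaml (illustrative only).
# The import check passes on a CPU-only agent, but the device query and the
# small device computation below need an actual GPU, so they cannot run on
# conda-forge's current CI agents.
test:
  imports:
    - cupy
  commands:
    - python -c "import cupy; print(cupy.cuda.runtime.getDeviceCount())"
    - python -c "import cupy; assert int((cupy.arange(5) ** 2).sum()) == 30"
```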
Comments

Thanks, @h-vetinari. This is an important request. In the case of CuPy, we have hit bugs multiple times when attempting to enable some (experimental) support that upstream did not cover well enough in its CI. Having CF's own CI would be very helpful. That said, in our case this could have been avoided entirely if upstream had tested it thoroughly, and I do feel that is the right way to go, especially for GPU packages. The upstream CIs should have a large build matrix (Python ver * NumPy ver * CUDA ver * OS ver * ...) to provide good coverage. In contrast, CF's CIs should focus only on "getting the packaging right" and nothing further. Taking CuPy as an example, it is impractical to run its full test suite on CF's CIs: each run takes 1-1.5 hr, and we have 12 builds from the aforementioned matrix. It simply takes too long.
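For a sense of scale, a hypothetical slice of such an upstream matrix, expressed as an Azure Pipelines job strategy (the version combinations are illustrative only):

```yaml
# Hypothetical slice of an upstream test matrix (illustrative versions only);
# the full Python * NumPy * CUDA * OS product is what makes running the
# complete test suite on conda-forge's packaging CI impractical.
strategy:
  matrix:
    py37_cuda101:
      python.version: '3.7'
      cuda.version: '10.1'
    py38_cuda102:
      python.version: '3.8'
      cuda.version: '10.2'
```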
cc: @jakirkham
cc @scopatz
I wanted to bring up something along the lines of this again:

The birds

Given the effort expended on packaging GPU packages, the role of conda(-forge) in the scientific/ML stack (including the "network effect"), and the capabilities of the existing conda(-forge) infrastructure, it would kill a lot of metaphorical birds with one stone if conda-forge CI had support for GPUs.
The stone - A jointly sponsored build queue for conda-forge?

In the comment I referenced above, I mentioned Microsoft (who by now power most of CF's CI) as a possible sponsor for this, but that goes even more so for the companies that are more directly involved, like Nvidia (obviously), Facebook (pytorch, but also faiss, etc.), and perhaps others like NumFOCUS, Ursa Labs (arrow), Quansight or QuantStack. I'm thinking that an opt-in build queue based on a separate, GPU-enabled Azure subscription (which is already feasible for CF) would have huge bang-for-buck even just for the companies directly involved (in the form of less time spent by employees on packaging), to say nothing of the ecosystem.

I apologise for the multi-ping, but I'm hoping that by bringing different people to the same table, a way forward might emerge more quickly & easily. 🙃

Some NVidia / RAPIDS / CuPy folks

Happy holidays!

PS. I had this idea for a while, but the thought got retriggered by the new GPU support for
We should all chat offline. We have other things going on around this as well and hope to make progress in the new year.
Another idea would be to use a self-hosted runner. For Windows it is trickier, but for Linux we should be OK to pool a few resources, maybe.
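As a sketch of what pooled self-hosted runners could look like (a GitHub Actions example with illustrative labels; conda-forge's actual setup would differ):

```yaml
# Hypothetical GitHub Actions job targeting pooled self-hosted Linux runners;
# the runner labels are illustrative. Windows would need a separate setup.
on: pull_request
jobs:
  gpu-smoke-test:
    runs-on: [self-hosted, linux, gpu]
    steps:
      - run: nvidia-smi  # confirm the runner actually exposes a GPU
```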
There are a bunch of practical and legal issues around this. The technical bits of setting it up are straightforward.
cc @datametrician @jakirkham from the NVIDIA side for visibility as well
Happy New Year! 🎉 In terms of the tech, you can probably use scale set agents with an appropriate choice of VM and image to achieve this. From a funding/sponsorship perspective, Microsoft's contribution model is free hosted agents and parallelism.
@vtbassmatt are you saying the permissions on the conda-forge account already allow us to use agents w/ GPUs attached?
Yes, you (whoever is the organization admin) should be able to create a scale set pool pointing to an Azure subscription. The subscription will need a scale set running on the type of VM you prefer, in this case with GPU support. Then any pipelines authorized to use that pool will have access to that GPU-enabled virtual hardware. If this isn't clear, my apologies and I can help out more next week when I'm back at work.
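Concretely, a pipeline authorized to use such a pool might look roughly like this (a minimal sketch; the pool name gpu-scale-set is hypothetical, and the scale set VMs are assumed to have NVIDIA drivers preinstalled):

```yaml
# Minimal sketch: a job running on a (hypothetical) scale set agent pool
# named "gpu-scale-set", created by an org admin and backed by GPU-capable
# VMs (e.g. the NC series) with NVIDIA drivers preinstalled.
jobs:
  - job: gpu_check
    pool:
      name: gpu-scale-set  # custom pool instead of a Microsoft-hosted vmImage
    steps:
      - script: nvidia-smi
        displayName: Verify the agent sees a GPU
```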
Sounds like a Christmas gift for free!
That seems perfectly clear! Thank you! @mariusvniekerk is one of our most knowledgeable Azure folks. We will give this a shot and see what we find!
Ping @mariusvniekerk :)
Marius and I chatted. I think I misunderstood what was being said here. We'll need a hosted pool of VMs with GPUs to do this.
We have action here on multiple fronts. I am going to close this issue in favor of #1272.
So, for some context for everyone subscribed to this issue (not least all those people I tagged).

In the meantime, I tried to procure some initial funding for the idea of a jointly sponsored build queue from my employer (a small-ish data consulting company based in Switzerland; only a small fraction of our work is related to conda, but we're interested in the health of the ecosystem, particularly from the testing & security side of the story), and got approval for $500/month for a year as an experiment. My hope is that by placing some initial chips on the table, some of the other (much more involved) players might be enticed to join as well. 🙃

Those $6000/year are roughly the amount needed to continuously run one NC6 agent on Azure (the smallest GPU instance). This likely wouldn't be enough to cover all of conda-forge's needs, but I'm guesstimating that roughly 3-4 times that much would be more than enough at the moment (e.g. the drone queue is also managing on two machines). That amount represents around a person-week of engineer time (cf. NEP 46), which I think should be a drop in the bucket for companies that employ people who spend any non-trivial amount of time on packaging GPU-related software - especially compared to the time lost by having to do it disjointly, or, conversely, the potential time saved by doing it through conda-forge.

I suggested this to @conda-forge/core, and it turns out that there is at least one other proposal along those lines currently in the pipeline. This would be great from my POV, because unifying all those efforts is IMO the ideal scenario (assuming the legalities are solvable). In any case, those discussions are now slowly unfolding, and perhaps provide some background colour to the opening of #1272 & #1273.

In short: it would be amazing if any of the involved people / companies could take this as an impetus to chip in as well. GPU computing is only ever going to get larger, and I believe that sharing some (comparatively low) CI infra costs to enable conda-forge to do the building & integration would provide huge bang-for-buck for the people & companies that are building & using such packages.