[CUDA] consolidate CUDA versions #5677
Conversation
We just want to make sure it still works (compiles, runs, etc.) similarly to the original version. Is there a CI building these approaches (cuda, cuda_exp, and combined)?
@ChipKerchner I don't totally understand what you mean by this question, especially "and combined". I'll try to answer, but please let me know if that's not sufficient. Every commit merged into the default branch is built and tested by CUDA CI jobs. Here's the configuration for that:

LightGBM/.github/workflows/cuda.yml, lines 30 to 55 at 3c3f79e

Build logs for the latest commit to the default branch can be found there as well. The same CI coverage is preserved in this PR, and we'll continue to block any future PRs that break the CUDA support.
I'm confused by your use of the phrase "and combined" here, so I want to be absolutely sure you understand what's being proposed. As of this PR, there will only be one CUDA implementation of LightGBM.
By "combined" I meant this "one CUDA implementation of LightGBM" approach for this PR. |
ah got it! Yes, the CI as of this PR tests that "one CUDA implementation of LightGBM" on Ubuntu 18.04 for the following combinations of CUDA versions, Python versions, and compilers:

LightGBM/.github/workflows/cuda.yml, lines 31 to 45 at 967c005
you can see the build logs by clicking "Details" next to any of the checks on this PR whose names start with "CUDA Version", e.g. https://github.com/microsoft/LightGBM/actions/runs/3945899727/jobs/6753199490
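For readers without access to those logs, here is a rough sketch of reproducing one cell of that matrix locally. This is an illustration only: the image tag, package list, and build steps below are assumptions based on LightGBM's general CUDA install instructions, not taken from the actual workflow file.

```bash
# Hypothetical local reproduction of one cell of the CUDA CI matrix.
# The CUDA version in the image tag is a placeholder; the real matrix
# lives in .github/workflows/cuda.yml.
docker run --rm --gpus all -it nvidia/cuda:11.7.1-devel-ubuntu18.04 bash

# inside the container: install build prerequisites
apt-get update && apt-get install -y git cmake build-essential

# clone with submodules and build the CUDA version
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
mkdir build && cd build
cmake -DUSE_CUDA=1 ..
make -j4
```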
@shiyu1994 @guolinke is there any other information I could provide to help with this? This type of PR will be difficult to keep up to date with the default branch.
Thanks for the great work. I've reviewed all the changes and just left a few comments about whether to keep `cuda_exp` as a valid input for compilation and the device parameter, along with a few suggested changes.
Co-authored-by: shiyu1994 <[email protected]>
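For context, a sketch of what that question covers on the CLI side. These invocations are hypothetical (`train.conf` is a placeholder config file); `device_type` is the LightGBM parameter that selects the implementation.

```bash
# the consolidated CUDA implementation
./lightgbm config=train.conf device_type=cuda

# the value under discussion: keep it as a deprecated alias, or remove it?
# (per the follow-up below, it ended up being removed entirely)
./lightgbm config=train.conf device_type=cuda_exp
```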
Thanks @shiyu1994. I just pushed some commits completely removing `cuda_exp`.
@jameslamb Thanks. The changes LGTM. It seems that we are encountering some CI issues: one R test fails, and one GPU job in the Azure DevOps CI fails. I've tried retriggering the R test, but it fails again. Maybe we should fix the CI issues first.
Thanks! @shiyu1994 I've already put up a PR to fix the R CI issues. Can you please review #5689? I've noticed that the GPU jobs on Azure DevOps have gotten flakier since I merged #5292. It's usually fixed by re-running once or twice; I'll keep doing that. We can turn the Dask tests back off on GPU builds in the future if it gets too annoying or we don't have time to investigate the issues.
Sorry for the delay. I see that PR is already merged. Thanks.
I agree. But I'll try to spare some time to investigate it if it happens frequently.
no problem, thanks @shiyu1994! I just merged the changes from #5689 into this PR (thanks to @jmoralez for reviewing!), so once CI rebuilds and you approve, I think we can merge it.
Thanks. The changes LGTM.
awesome, thank you so much @shiyu1994!!! I'm excited to have shorter CI times and for us to be able to focus on a single CUDA version 😁 And thank you so much @ChipKerchner and your teammates for getting LightGBM started on this CUDA journey back in #3160.
Thanks @ChipKerchner and your team for the contribution to the LightGBM CUDA version!
Changes
Proposes removing the CUDA implementation from #3160.
As of this PR, the only CUDA build of LightGBM would be the one we've been calling `cuda_exp`, which @shiyu1994 started in #4528 and #4630.

Specifically:

- building with `python setup.py --cuda-exp` or `cmake -DUSE_CUDA_EXP=1` raises a deprecation warning and still uses the version we've been until now calling "cuda_exp" (see the sketch after this list)
- CI coverage is preserved for the `pip`, `source`, and `wheel` builds of the CUDA-enabled Python package
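To make the consolidation concrete, here is a sketch of the before/after build commands. The `--cuda` and `USE_CUDA=1` spellings are assumed to be the consolidated equivalents of the deprecated flags named above; exact `setup.py` usage may differ.

```bash
# Consolidated spellings: build the one remaining CUDA implementation
# (the code previously called "cuda_exp")
cmake -DUSE_CUDA=1 ..
python setup.py install --cuda

# Deprecated spellings: as of this PR, these raise a deprecation warning
# and build exactly the same CUDA implementation as above
cmake -DUSE_CUDA_EXP=1 ..
python setup.py install --cuda-exp
```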
History

(please correct me if I've mischaracterized the history below)
In #3160 (merged September 2020), a team from IBM added a first CUDA implementation of LightGBM because the existing OpenCL-based build didn't support some platforms (namely, IBM Power).
About a year after that, @shiyu1994 and @guolinke (along with others at Microsoft?) started on an "experimental" CUDA implementation.
That "experimental" implementation was first merged in #4630 (March 2022), and since then we've had two CUDA implementations maintained in this repo:
cuda
= the IBM contributioncuda_exp
= the newer implementation from MicrosoftSince then, @shiyu1994 has been working actively on that
cuda_exp
version, with the plan to include it in a v4.0.0 release (#5153).The
cuda_exp
version is still missing some important features, like distributed training (#5076) and on-GPU computation of metrics and loss functions (#5163).Despite the current limitations, this PR implements the proposal from #5153 (comment) to consolidate down to only one CUDA implementation in LightGBM... the one currently called
cuda_exp
.Motivation for this change
In my opinion, LightGBM does not have enough maintainer/contributor availability to maintain two separate CUDA implementations.
Consolidating down to one allows the project to channel the limited attention of its maintainers and contributors more effectively towards improving the LightGBM-on-GPU experience, by not duplicating effort across two different builds intended to serve the same purpose.
This represents a temporary loss of functionality (e.g. multi-GPU training), but I think it'll help the project move faster, and @shiyu1994 has said that that functionality is actively under development for the `cuda_exp` implementation.

Notes for Reviewers
I know this is a large change, so tagging in others for their opinions.
@shiyu1994 @guolinke @huanzhang12 @jmoralez @StrikerRUS @btrotta @ChipKerchner @ceseo
👋 Thanks all for your consideration.