[CUDA] consolidate CUDA versions #5677

jameslamb · 2023-01-17T05:51:31Z

Changes

Proposes removing the CUDA implementation from #3160.

As of this PR, the only CUDA build of LightGBM would be the one we've been calling cuda_exp, which @shiyu1994 started in #4528 and #4630.

Specifically:

removes all mentions of "CUDA exp" or "CUDA Experimental" in docs and internal code
removes all code specific to the implementation from #3160 ("Add support for CUDA-based GPU build")
when using python setup.py --cuda-exp or cmake -DUSE_CUDA_EXP=1, raises a deprecation warning and still uses the version we've until now been calling "cuda_exp"
removes 2 CUDA CI jobs, so now there will be three, one each for pip, source, and wheel builds of the CUDA-enabled Python package
increases the minimum supported CUDA version from 9.0 to 10.0
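
The deprecation behaviour described in the list above (accept the old name, warn, then fall through to the surviving implementation) could look roughly like this in Python. This is only an illustrative sketch: `resolve_device_type` is a hypothetical helper, not a function in LightGBM, and the real handling lives in the build scripts and C++ configuration code touched by this PR.

```python
import warnings


def resolve_device_type(requested: str) -> str:
    """Hypothetical helper: map the deprecated 'cuda_exp' name onto 'cuda'.

    Not part of LightGBM; it only illustrates the "warn, then keep working"
    pattern proposed for the old option name.
    """
    if requested == "cuda_exp":
        warnings.warn(
            "'cuda_exp' is deprecated; use 'cuda' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return "cuda"
    # Any other value is passed through unchanged.
    return requested
```

Under this assumption, callers passing the old name keep working for a transition period and only see a warning.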

History

(please correct me if I've mischaracterized the history below)

In #3160 (merged September 2020), a team from IBM added a first CUDA implementation of LightGBM because the existing OpenCL-based build didn't support some platforms (namely, IBM Power).

About a year after that, @shiyu1994 and @guolinke (along with others at Microsoft?) started on an "experimental" CUDA implementation.

That "experimental" implementation was first merged in #4630 (March 2022), and since then we've had two CUDA implementations maintained in this repo:

cuda = the IBM contribution
cuda_exp = the newer implementation from Microsoft

Since then, @shiyu1994 has been working actively on that cuda_exp version, with the plan to include it in a v4.0.0 release (#5153).

The cuda_exp version is still missing some important features, like distributed training (#5076) and on-GPU computation of metrics and loss functions (#5163).

Despite the current limitations, this PR implements the proposal from #5153 (comment) to consolidate down to only one CUDA implementation in LightGBM... the one currently called cuda_exp.

Motivation for this change

In my opinion, LightGBM does not have enough maintainer/contributor availability to maintain two separate CUDA implementations.

Consolidating down to 1 allows the project to more effectively channel the limited attention of its maintainers and contributors towards improving the LightGBM-on-GPU experience, by not duplicating effort across two different builds intended to serve the same purpose.

improves development velocity by removing two costly CI jobs
reduces confusion for users wanting to run GPU-accelerated LightGBM
noticeably simplifies the codebase and reduces its size
focuses all feature requests, bug reports, code contributions, etc. on one CUDA implementation

This represents a temporary loss of functionality (e.g. multi-GPU training), but I think it'll help the project to move faster and @shiyu1994 has said that that functionality is actively under development for the cuda_exp implementation.

Notes for Reviewers

I know this is a large change, so tagging in others for their opinions.

@shiyu1994 @guolinke @huanzhang12 @jmoralez @StrikerRUS @btrotta @ChipKerchner @ceseo

👋 Thanks all for your consideration.

ChipKerchner · 2023-01-18T16:17:18Z

We just want to make sure it still works (compiles, runs, etc.) similarly to the original version. Is there a CI building these approaches (cuda, cuda_exp and combined)?

jameslamb · 2023-01-18T19:35:56Z

Is there a CI building these approaches (cuda, cuda_exp and combined)?

@ChipKerchner I don't totally understand what you mean by this question, especially "and combined". I'll try to answer but please let me know if that's not sufficient.

Every commit merged into master in this project in at least the last 6 months has seen the Python package built successfully and its unit tests pass for both the cuda version (from #3160) and cuda_exp version (from #4630 and onwards).

Here's the configuration for that:

LightGBM/.github/workflows/cuda.yml

Lines 30 to 55 in 3c3f79e

    
    include:
      - method: source
        compiler: gcc
        python_version: "3.8"
        cuda_version: "11.7.1"
        task: cuda
      - method: pip
        compiler: clang
        python_version: "3.9"
        cuda_version: "10.0"
        task: cuda
      - method: wheel
        compiler: gcc
        python_version: "3.10"
        cuda_version: "9.0"
        task: cuda
      - method: source
        compiler: gcc
        python_version: "3.8"
        cuda_version: "11.7.1"
        task: cuda_exp
      - method: pip
        compiler: clang
        python_version: "3.9"
        cuda_version: "10.0"
        task: cuda_exp

And build logs for the latest commit to master: https://github.com/microsoft/LightGBM/actions/runs/3935991924

The same CI coverage is preserved in this PR, and we'll continue to block any future PRs that break the CUDA support.

cuda, cuda_exp, and combined

I'm confused by your use of the phrase "and combined" here, so I want to be absolutely sure you understand what's being proposed here. As of this PR, there will only be one CUDA implementation of LightGBM.

ChipKerchner · 2023-01-18T19:40:39Z

By "combined" I meant this "one CUDA implementation of LightGBM" approach for this PR.

jameslamb · 2023-01-18T19:54:34Z

ah got it! Yes, the CI as of this PR tests that "one CUDA implementation of LightGBM" on Ubuntu 18.04 for the following combinations of CUDA versions, Python versions, and compilers:

LightGBM/.github/workflows/cuda.yml

Lines 31 to 45 in 967c005

    
    - method: wheel
      compiler: gcc
      python_version: "3.10"
      cuda_version: "11.7.1"
      task: cuda
    - method: source
      compiler: gcc
      python_version: "3.8"
      cuda_version: "10.0"
      task: cuda
    - method: pip
      compiler: clang
      python_version: "3.9"
      cuda_version: "11.7.1"
      task: cuda

you can see the build logs by clicking "Details" next to any of the checks with names starting like "CUDA Version" on this PR, e.g. https://github.com/microsoft/LightGBM/actions/runs/3945899727/jobs/6753199490
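
For readers comparing the two cuda.yml excerpts quoted in this thread, here is a small Python sketch that summarizes each matrix. The rows are copied from those excerpts; the tuple layout and the `version_key` and `summarize` helpers are made up for illustration and are not part of the LightGBM repository.

```python
# Each row: (method, compiler, python_version, cuda_version, task),
# copied from the cuda.yml excerpts quoted in this thread.
BEFORE = [
    ("source", "gcc", "3.8", "11.7.1", "cuda"),
    ("pip", "clang", "3.9", "10.0", "cuda"),
    ("wheel", "gcc", "3.10", "9.0", "cuda"),
    ("source", "gcc", "3.8", "11.7.1", "cuda_exp"),
    ("pip", "clang", "3.9", "10.0", "cuda_exp"),
]
AFTER = [
    ("wheel", "gcc", "3.10", "11.7.1", "cuda"),
    ("source", "gcc", "3.8", "10.0", "cuda"),
    ("pip", "clang", "3.9", "11.7.1", "cuda"),
]


def version_key(version):
    """Turn '11.7.1' into (11, 7, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def summarize(matrix):
    """Summarize job count, build methods, oldest CUDA version, and tasks."""
    return {
        "jobs": len(matrix),
        "methods": sorted({row[0] for row in matrix}),
        "min_cuda": min((row[3] for row in matrix), key=version_key),
        "tasks": sorted({row[4] for row in matrix}),
    }


before_summary = summarize(BEFORE)
after_summary = summarize(AFTER)
```

Comparing the two summaries reproduces the claims in the PR description: two fewer jobs, one job per build method, a single `cuda` task, and CUDA 10.0 as the oldest tested version.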

jameslamb · 2023-01-27T04:14:24Z

@shiyu1994 @guolinke is there any other information I could provide to help with this?

This type of PR will be difficult to keep up to date with master if other CUDA code is merged, since it touches so many files. I'd really like to do whatever I can to move it forward.

shiyu1994

Thanks for the great work. I've viewed all the changes and left a few comments about whether to keep cuda_exp as a valid value for the compilation option and the device parameter, along with a few suggested changes.

CMakeLists.txt

docs/Parameters.rst

python-package/setup.py

src/boosting/gbdt.h

src/io/config.cpp

src/treelearner/tree_learner.cpp

jameslamb · 2023-01-30T03:11:11Z

Thanks @shiyu1994 . I just pushed some commits completely removing CUDA_EXP, instead of keeping that configuration option and raising a warning. I think that'll reduce confusion, and it's ok given that it was never officially included in a release.

shiyu1994 · 2023-01-30T11:04:50Z

@jameslamb Thanks. The changes LGTM. It seems that we are encountering some CI issues. One R test fails, and one GPU job in the Azure DevOps CI fails. I've tried retriggering the R test, but it fails again. Maybe we should fix the CI issue first.

jameslamb · 2023-01-30T19:02:53Z

Thanks!

@shiyu1994 I've already put up a PR to fix the R CI issues. Can you please review #5689?

I've noticed that the GPU jobs on Azure DevOps have gotten flakier since I merged #5292. It's usually fixed by re-running once or twice. I'll keep doing that. We can turn the Dask tests back off on GPU builds in the future if it gets too annoying or we don't have time to investigate the issues.

shiyu1994 · 2023-01-31T01:59:11Z

I've already put up a PR to fix the R CI issues. Can you please review #5689?

Sorry for the delay. I see that PR is already merged. Thanks.

I've noticed that the GPU jobs on Azure DevOps have gotten flakier since I merged #5292.

I agree. But I'll try to spare some time to investigate it if it happens frequently.

jameslamb · 2023-01-31T03:29:12Z

no problem, thanks @shiyu1994 ! I just merged in the changes from #5689 (thanks to @jmoralez for reviewing!) here so once CI rebuilds on this PR, and if you approve this PR, I think we can merge it.

shiyu1994

Thanks. The changes LGTM.

jameslamb · 2023-02-01T03:50:30Z

awesome, thank you so much @shiyu1994 !!! I'm excited to have shorter CI times and for us to be able to focus on a single CUDA version 😁

And thank you so much @ChipKerchner and your teammates for getting LightGBM started on this CUDA journey back in #3160.

shiyu1994 · 2023-02-01T12:27:15Z

Thanks @ChipKerchner and your team for the contribution to LightGBM CUDA version!

github-actions · 2023-08-16T00:20:42Z

This pull request has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

jameslamb added 7 commits January 8, 2023 22:35

[ci] speed up if-else, swig, and lint conda setup

ee5bbba

add 'source activate'

12f8f55

python constraint

a981a07

merge master

6df87a7

start removing cuda v1

05ee8a1

comment out CI

a9103a3

remove more references

ae9cdb2

jameslamb added in progress breaking labels Jan 17, 2023

jameslamb added 8 commits January 17, 2023 19:39

revert some unnecessaary changes

0182205

revert a few more mistakes

07a4a92

revert another change that ignored params

0c60b71

sigh

118d32a

remove CUDATreeLearner

e4cc9d0

fix tests, docs

e734d6f

fix quoting in setup.py

0b4df93

restore all CI

967c005

jameslamb removed the in progress label Jan 18, 2023

jameslamb changed the title ~~WIP: consolidate CUDA versions~~ [CUDA] consolidate CUDA versions Jan 18, 2023

jameslamb marked this pull request as ready for review January 18, 2023 05:40

jameslamb requested review from guolinke, shiyu1994, StrikerRUS and jmoralez as code owners January 18, 2023 05:40

jameslamb mentioned this pull request Jan 18, 2023

[RFC] 4.0.0 Release #5153

Closed

60 tasks

jameslamb added the awaiting review label Jan 19, 2023

shiyu1994 reviewed Jan 29, 2023

jameslamb and others added 5 commits January 29, 2023 20:46

Apply suggestions from code review

2497ca2

Co-authored-by: shiyu1994 <[email protected]>

Apply suggestions from code review

b111ab6

Merge branch 'master' into remove-cuda-v1

a6695ca

Merge branch 'remove-cuda-v1' of github.com:microsoft/LightGBM into remove-cuda-v1

9ab7634

completely remove cuda_exp, update docs

ac7ab77

jameslamb requested a review from shiyu1994 January 30, 2023 03:08

Merge branch 'master' into remove-cuda-v1

0dc8b34

shiyu1994 approved these changes Feb 1, 2023

shiyu1994 merged commit 4f47547 into master Feb 1, 2023

shiyu1994 deleted the remove-cuda-v1 branch February 1, 2023 03:27

jameslamb mentioned this pull request Feb 12, 2023

[R-package] Accept factor labels and use their levels #5341

Merged

jameslamb removed the awaiting review label Feb 14, 2023

This was referenced Feb 14, 2023

[ci] fix flaky dask tests on GPU #5713

Closed

Minimal Varianсe Sampling (MVS) booster #5091

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023
