-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R-package] [c++] add tighter multithreading control, avoid global OpenMP side effects (fixes #4705, fixes #5102) #6226
Conversation
num_threads | ||
) | ||
return(invisible(NULL)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions are currently just setters and getters for LGBM_MAX_NUM_THREADS
threads.
- should they be named
getMaxLGBMthreads()
/setMaxLGBMthreads()
or something else with "max" in the name?- I personally like
getLGBMthreads()
/setLGBMthreads()
for consistency withdata.table::{get/set}DTthreads()
, but could be convinced
- I personally like
- Should
getLGBMthreads()
even be exported in the R package's public API?- I found it useful for testing and thought users might as well, but it'd be easier to add it later than to have to remove it later
- and
{lightgbm}
could use it in its own tests with:::
- if we do keep it in the public interface... should
getLGBMthreads()
actually be a getter forLGBM_MAX_NUM_THREADS
? Or should it return an answer to the question "how many threads will e.g.lgb.train()
use if I don't pass any thread-control parameters throughparams
"?
@@ -9,6 +9,7 @@ S3method(print,lgb.Booster) | |||
S3method(set_field,lgb.Dataset) | |||
S3method(slice,lgb.Dataset) | |||
S3method(summary,lgb.Booster) | |||
export(getLGBMthreads) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose not to add similar functions to the Python interface, since:
- it's the R package that needs this more urgently
- taking back parts of the public API for the Python package is MUCH harder than for the R package, as the Python package is used so much more widely
- this PR is already bigger than I'm comfortable with
@@ -1346,6 +1354,8 @@ lgb.save <- function(booster, filename, num_iteration = NULL) { | |||
#' @examples | |||
#' \donttest{ | |||
#' library(lightgbm) | |||
#' \dontshow{setLGBMthreads(2L)} | |||
#' \dontshow{data.table::setDTthreads(1L)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per https://cran.r-project.org/doc/manuals/R-exts.html
...
\dontshow{}
for extra commands for testing that should not be shown to users, but will be run byexample()
These \dontshow{}
blocks hide this code from users, but ensure it runs when CRAN checks the package.
Thanks to @jangorecki for the suggestion (Rdatatable/data.table#5658 (comment)).
This is ready for review. Tagging in some others who might be interested and have opinions about it: @david-cortes @mayer79 @trivialfis @simonpcouch @AlbertoEAF |
// this can only be changed by LGBM_SetMaxThreads() | ||
LIGHTGBM_EXTERN_C int LGBM_MAX_NUM_THREADS; | ||
|
||
// this is modified by OMP_SET_NUM_THREADS(), for example | ||
// by passing num_thread through params | ||
LIGHTGBM_EXTERN_C int LGBM_DEFAULT_NUM_THREADS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes these process-global variables, and therefore not thread safe.
Like @david-cortes alluded to in the description of #6152.
For example, if you created 2 Booster
objects in different threads which had different values of num_threads
in Config
, one's OMP_SET_NUM_THREADS()
call could affect code in the other.
I think that's an acceptable risk for now, in exchange for the other benefits of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can try thread_local
, but not hurry in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I did see that we have that preprocessor macro set up
LightGBM/include/LightGBM/utils/log.h
Lines 27 to 31 in e797985
#if defined(_MSC_VER) | |
#define THREAD_LOCAL __declspec(thread) | |
#else | |
#define THREAD_LOCAL thread_local | |
#endif |
but didn't test it out. Let's save it for a follow-up PR... I think it'd be ok to release this PR's changes without making this configuration thread-safe.
// this can only be changed by LGBM_SetMaxThreads() | ||
LIGHTGBM_EXTERN_C int LGBM_MAX_NUM_THREADS; | ||
|
||
// this is modified by OMP_SET_NUM_THREADS(), for example | ||
// by passing num_thread through params | ||
LIGHTGBM_EXTERN_C int LGBM_DEFAULT_NUM_THREADS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can try thread_local
, but not hurry in this PR.
LGTM, thanks so much. I would not wait too long with resubmission, to not be under pressure if something fails on the first hand. |
^ I agree with this. If there are not any other comments in the next 8 hours or so, I'd like to merge this and try to release a (NOTE: I won't cut a full LightGBM v4.2.0 release, just one to CRAN. I think we should continue with the normal process of completing all the steps at #6191 for the rest of that release. I think it's fine for those to be slightly different given the time pressure from CRAN) |
Given the approvals and no other blocking comments, I'm going to merge this as-is. My plan is as follows:
I'll post updates on #6191. Thanks so much to everyone involved for the reviews and other contributions to getting this working! |
Overview
This PR proposes a fix to the following problems, described in the links above:
> 2
threads in tests and examples (leading to CRAN rejecting the package)omp_set_num_threads()
Related Discussions
contributes to 4 issues, closes 2, replaces 1 PR (click me)
parallel
region #6135How I tested this
Ran all of the following on a
c5a.4xlarge
AWS EC2 instance (16 vCPUs, 32GiB RAM), using Ubuntu 22.04.How I set that up (click me)
Shelled in and ran the following.
To be sure I wasn't cheating, confirmed all OpenMP environment variables were unset.
Then, built the R package from this branch.
how I did that (click me)
First approach: dataset construction
Created a test R script which times construction of a
Dataset
from a numeric R matrix of shape[10_000, 10_000]
.Ran that script with environment variable
OMP_NUM_THREADS=16
. On this branch, I saw what I'd expect if multithreading is working correctly:num_threads = 1
passed to LightGBM have a{CPU}/{elapsed} <= 1
num_threads = 2
passed to LightGBM have a{CPU}/{elapsed} <= 2
On
master
, the value ofnum_threads
passed to LightGBM barely affected how much parallelism was used... even fornum_threads = 1
, I observed{CPU}/{elapsed} > 10
.details (click me)
Created this R script:
Installed the R package
Ran the script like this:
Ratio of
{CPU}/{elapsed}
on this branch:Ratio of
{CPU}/{elapsed}
with latestmaster
(f5b6bd6), I got the following:Second approach:
R CMD check
Ran
R CMD check
as follows on the built package.Rscript -e "remove.packages('lightgbm')" OMP_NUM_THREADS=16 \ _R_CHECK_EXAMPLE_TIMING_THRESHOLD_=0 \ _R_CHECK_EXAMPLE_TIMING_CPU_TO_ELAPSED_THRESHOLD_=2.0 \ R --vanilla CMD check \ --no-codoc \ --no-manual \ --no-tests \ --no-vignettes \ --run-dontrun \ --run-donttest \ --timings \ ./lightgbm_4.1.0.99.tar.gz
On this branch, no examples show
{CPU}/{elapsed} >= 2.0
.timings (click me)
On
master
, several examples show{CPU}/{elapsed} >= 2.0
, and those have ratios> 10.0
.timings (click me)
How this improves multithreading control
problem 1:
OMP_NUM_THREADS()
uses unconstrainedomp_get_num_threads()
threadsdetails (click me)
This block is problematic:
LightGBM/include/LightGBM/utils/openmp_wrapper.h
Lines 22 to 24 in f5b6bd6
With environment variable
OMP_NUM_THREADS=16
set, I think that#pragma omp parallel
creates a team of 16 threads, then runsomp_get_num_threads()
on the master thread, then presumably releases those 16 threads.For the small data sizes used in tests and examples, I think that unnecessary parallelized work happening on each call of
OMP_NUM_THREADS()
is enough to lead to ratios of{CPU}/{elapsed}
.This PR fixes that by replacing it with
#pragma omp single
, which changes this from "run on the master thread" to "run on any single thread in the current team" (docs link).problem 2: some LightGBM operations don't have any thread control, others automatically reset LightGBM to "use
omp_get_num_threads()
threads"details (click me)
For example,
GBDT::LoadModelFromString()
, which creates aBooster
from a text representation (e.g. as is read in from a model file), parallelized some operations over trees:https://github.com/microsoft/LightGBM/blob/f5b6bd60d9d752c8e5a75b11ab771d0422214bb4/src/boosting/gbdt_model_text.cpp#L555-LL556
But:
nthreads
or similar thread-control argumentsOMP_SET_NUM_THREADS()
prior to calling thatFor example, when loading a Booster from a
pickle
file:LightGBM/python-package/lightgbm/basic.py
Lines 3460 to 3469 in f5b6bd6
This PR "solves" that by providing a new mechanism in the C API and public API of the R package to set a process-wide maximum number of threads that LightGBM will use. That's inspired by
data.table::setDTthreads()
(see, for example, Rdatatable/data.table#5658 (comment)).It then proposes calling
lightgbm::setLGBMthreads(2)
in all R-package examples, vignettes, and tests. That should be sufficient to meet CRAN's requirements, while still allowing users of the package to get more parallelism by default.problem 3: some
{lightgbm}
operations use{data.table}
, but don't constrain how much multithreading it usesdetails (click me)
This is described in detail in Rdatatable/data.table#5658.
I fixed this by running
data.table::setDTthreads(1)
in all examples, vignettes, and tests.Notes for Reviewers
I'm sorry this is so large, but unfortunately it was done under considerable duress... CRAN have given us until December 12 to upload a new release (#6221).
I left comments below to call out the main points that I think might be controversial.
References
I consulted all of the following while working through this.
if
andnum_threads()
clauses are evaluatedif()
is false, only 1 thread is used