[R-package] [c++] add tighter multithreading control, avoid global OpenMP side effects (fixes #4705, fixes #5102) #6226

Merged 12 commits on Dec 7, 2023

@jameslamb (Collaborator) commented Dec 5, 2023

Overview

This PR proposes a fix to the following problems, described in the links above:

  • the R package using > 2 threads in tests and examples (leading to CRAN rejecting the package)
  • LightGBM having global side-effects on other OpenMP-using routines in the same process by calling omp_set_num_threads()

Related Discussions

contributes to 4 issues, closes 2, replaces 1 PR

How I tested this

Ran all of the following on a c5a.4xlarge AWS EC2 instance (16 vCPUs, 32GiB RAM), using Ubuntu 22.04.

How I set that up

Shelled in and ran the following.

sudo apt-get update
sudo apt-get install --no-install-recommends -y \
    software-properties-common

sudo apt-get install --no-install-recommends -y \
    apt-utils \
    build-essential \
    ca-certificates \
    clang \
    cmake \
    curl \
    git \
    iputils-ping \
    jq \
    libcurl4 \
    libicu-dev \
    libomp-dev \
    libssl-dev \
    libunwind8 \
    lldb \
    locales \
    locales-all \
    netcat \
    unzip \
    zip

# use UTF-8 locale
export LANG="en_US.UTF-8"
sudo update-locale LANG=${LANG}
export LC_ALL="${LANG}"

# set up R environment
export CRAN_MIRROR="https://cran.rstudio.com"
export MAKEFLAGS=-j8
export R_LIB_PATH=~/Rlib
export R_LIBS=$R_LIB_PATH
export PATH="$R_LIB_PATH/R/bin:$PATH"
export R_APT_REPO="jammy-cran40/"
export R_LINUX_VERSION="4.3.1-1.2204.0"

mkdir -p $R_LIB_PATH

mkdir -p ~/.gnupg
echo "disable-ipv6" >> ~/.gnupg/dirmngr.conf
sudo apt-key adv \
    --homedir ~/.gnupg \
    --keyserver keyserver.ubuntu.com \
    --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9

sudo add-apt-repository \
    "deb ${CRAN_MIRROR}/bin/linux/ubuntu ${R_APT_REPO}"

sudo apt-get update
sudo apt-get install \
    --no-install-recommends \
    -y \
        autoconf \
        automake \
        devscripts \
        r-base-core=${R_LINUX_VERSION} \
        r-base-dev=${R_LINUX_VERSION} \
        texinfo \
        texlive-latex-extra \
        texlive-latex-recommended \
        texlive-fonts-recommended \
        texlive-fonts-extra \
        tidy \
        qpdf

# install dependencies
Rscript \
    --vanilla \
    -e "install.packages(c('data.table', 'jsonlite', 'knitr', 'Matrix', 'R6', 'RhpcBLASctl', 'rmarkdown', 'testthat'), repos = '${CRAN_MIRROR}', lib = '${R_LIB_PATH}', dependencies = c('Depends', 'Imports', 'LinkingTo'), Ncpus = parallel::detectCores())"

# use clang to compile packages
mkdir -p ${HOME}/.R
cat << EOF > ${HOME}/.R/Makevars
CC=clang
CXX=clang++
CXX17=clang++
EOF

To be sure I wasn't cheating, confirmed all OpenMP environment variables were unset.

env | grep OMP
# (no results)

Then, built the R package from this branch.

How I did that:

cd ${HOME}/repos/LightGBM
git checkout master
git branch -D r/tighter-thread-control || true
git fetch origin r/tighter-thread-control
git checkout r/tighter-thread-control

sh build-cran-package.sh --no-build-vignettes

First approach: dataset construction

Created a test R script which times construction of a Dataset from a numeric R matrix of shape [10_000, 10_000].

Ran that script with environment variable OMP_NUM_THREADS=16. On this branch, I saw what I'd expect if multithreading is working correctly:

  • more threads results in a higher ratio of CPU time to elapsed time
  • runs with num_threads = 1 passed to LightGBM have a {CPU}/{elapsed} <= 1
  • runs with num_threads = 2 passed to LightGBM have a {CPU}/{elapsed} <= 2

On master, the value of num_threads passed to LightGBM barely affected how much parallelism was used... even for num_threads = 1, I observed {CPU}/{elapsed} > 10.
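
The {CPU}/{elapsed} ratio used throughout can be reproduced outside R. Below is a minimal Python sketch, assuming the process-wide CPU clock is a fair stand-in for the user + system time that proc.time() reports. The helper name cpu_to_elapsed_ratio is hypothetical and is not part of LightGBM.

```python
import time


def cpu_to_elapsed_ratio(workload):
    """Run workload() and return (CPU seconds) / (wall-clock seconds).

    time.process_time() sums user + system CPU time over all threads of
    the process, so multithreaded native code can push the ratio above 1.
    """
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    workload()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall


# A workload that only sleeps burns almost no CPU, so its ratio is near 0.
sleep_ratio = cpu_to_elapsed_ratio(lambda: time.sleep(0.2))
print(f"sleep ratio: {sleep_ratio:.3f}")
```

A ratio well above the requested num_threads is the signal that thread limits are not being respected.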

Created this R script:

cat << EOF > check-multithreading.R
library(data.table)
library(lightgbm)

LGBM_NUM_THREADS <- as.integer(
    commandArgs(trailingOnly = TRUE)
)
if (length(LGBM_NUM_THREADS) != 1L || is.na(LGBM_NUM_THREADS)) {
    stop("invoke this script with an integer, like 'Rscript check-multithreading.R 6'")
}

# ensure data.table multithreading isn't used
data.table::setDTthreads(1L)

# 10,000 x 10,000 numeric matrix, as described above
X <- matrix(rnorm(1e8), ncol = 1e4)
y <- rnorm(nrow(X))

tic <- proc.time()
print(tic)
dtrain <- lightgbm::lgb.Dataset(
    data = X
    , label = y
    , params = list(
        max_bins = 128L
        , min_data_in_bin = 5L
        , num_threads = LGBM_NUM_THREADS
        , verbosity = -1L
    )
)
dtrain\$construct()
toc <- proc.time() - tic
print(toc)

ratio <- toc[[1]] / toc[[3]]
print(sprintf("ratio: %f", ratio))

# append to file of traces
cat(
    paste0("  ", LGBM_NUM_THREADS, "  -  ", round(ratio, 4))
    , file = "traces.out"
    , append = TRUE
    , sep = "\n"
)
EOF

Installed the R package

R CMD INSTALL \
  --with-keep.source \
  lightgbm_4.1.0.99.tar.gz

Ran the script like this:

rm -f ./traces.out
for i in 1 1 1 1 1 2 2 2 2 2 6 8 16; do
    OMP_NUM_THREADS=16 \
        Rscript --vanilla ./check-multithreading.R ${i}
done
cat ./traces.out

Ratio of {CPU}/{elapsed} on this branch:

  1  -  0.7534
  1  -  0.7755
  1  -  0.863
  1  -  0.7847
  1  -  0.9388
  2  -  1.349
  2  -  1.4595
  2  -  1.3784
  2  -  1.2886
  2  -  1.2867
  6  -  2.7029
  8  -  3.2346
  16  -  6.9559

Ratio of {CPU}/{elapsed} on latest master (f5b6bd6):

  1  -  11.7955
  1  -  11.9402
  1  -  11.4011
  1  -  12.4866
  1  -  13.1563
  2  -  11.3018
  2  -  11.1033
  2  -  13
  2  -  9.7208
  2  -  11.2056
  6  -  10.3423
  8  -  10.3786
  16  -  9.7287
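
The expectation stated earlier (a run with num_threads = n should not exceed a ratio of n) can be checked mechanically against the traces. The following is a sketch only: the function names are hypothetical, and the data is a subset of the values reported above.

```python
def parse_traces(text):
    """Parse lines like '  2  -  1.349' into (num_threads, ratio) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        threads, ratio = line.split("-")
        pairs.append((int(threads), float(ratio)))
    return pairs


def violations(pairs):
    """Return runs whose CPU/elapsed ratio exceeds the requested threads."""
    return [(n, r) for n, r in pairs if r > n]


# A subset of the traces reported above.
BRANCH = """
  1  -  0.7534
  1  -  0.9388
  2  -  1.349
  2  -  1.4595
  16  -  6.9559
"""
MASTER = """
  1  -  11.7955
  1  -  13.1563
  2  -  11.3018
  2  -  13
  16  -  9.7287
"""

branch_bad = violations(parse_traces(BRANCH))
master_bad = violations(parse_traces(MASTER))
print(len(branch_bad), len(master_bad))  # prints "0 4"
```

On this subset, no run from this branch exceeds its requested thread count, while every master run with num_threads of 1 or 2 does.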

Second approach: R CMD check

Ran R CMD check as follows on the built package.

Rscript -e "remove.packages('lightgbm')"

OMP_NUM_THREADS=16 \
_R_CHECK_EXAMPLE_TIMING_THRESHOLD_=0 \
_R_CHECK_EXAMPLE_TIMING_CPU_TO_ELAPSED_THRESHOLD_=2.0 \
R --vanilla CMD check \
    --no-codoc \
    --no-manual \
    --no-tests \
    --no-vignettes \
    --run-dontrun \
    --run-donttest \
    --timings \
    ./lightgbm_4.1.0.99.tar.gz

On this branch, no examples show {CPU}/{elapsed} >= 2.0.

timings
* checking examples ... OK
Examples with CPU (user + system) or elapsed time > 0s
                             user system elapsed
lgb.plot.interpretation     0.476  0.015   0.245
lgb.interprete              0.338  0.011   0.176
lgb.importance              0.237  0.012   0.125
lgb.model.dt.tree           0.215  0.004   0.110
lgb.cv                      0.201  0.008   0.104
saveRDS.lgb.Booster         0.171  0.008   0.090
lgb.plot.importance         0.166  0.000   0.083
lgb.Dataset.create.valid    0.118  0.036   0.076
lgb.load                    0.105  0.020   0.063
readRDS.lgb.Booster         0.119  0.004   0.061
lgb.restore_handle          0.114  0.005   0.062
predict.lgb.Booster         0.106  0.004   0.055
lgb.dump                    0.105  0.004   0.054
lgb.save                    0.105  0.003   0.054
lgb.train                   0.101  0.005   0.053
lgb.configure_fast_predict  0.096  0.008   0.053
lgb.get.eval.result         0.099  0.003   0.051
lgb.Dataset                 0.078  0.020   0.049
get_field                   0.097  0.000   0.049
set_field                   0.092  0.003   0.048
slice                       0.094  0.001   0.047
lgb.Dataset.set.categorical 0.073  0.005   0.039
lgb.Dataset.save            0.073  0.004   0.039
lgb.Dataset.construct       0.072  0.004   0.037
lgb.Dataset.set.reference   0.068  0.003   0.036
dimnames.lgb.Dataset        0.045  0.008   0.043
dim                         0.039  0.012   0.051
lgb.convert_with_rules      0.028  0.004   0.016

On master, several examples show {CPU}/{elapsed} >= 2.0, and those have ratios > 10.0.

timings
* checking examples ... OK
Examples with CPU (user + system) or elapsed time > 0s
                             user system elapsed
lgb.plot.interpretation     3.536  0.214   0.274
lgb.interprete              2.346  0.114   0.189
lgb.cv                      1.866  0.133   0.125
saveRDS.lgb.Booster         1.542  0.065   0.100
lgb.Dataset.create.valid    1.225  0.132   0.085
lgb.load                    1.043  0.078   0.070
readRDS.lgb.Booster         1.040  0.078   0.070
lgb.configure_fast_predict  0.914  0.065   0.061
lgb.model.dt.tree           0.886  0.065   0.109
lgb.dump                    0.835  0.038   0.058
lgb.Dataset                 0.745  0.058   0.050
slice                       0.745  0.029   0.048
get_field                   0.711  0.046   0.051
set_field                   0.644  0.050   0.050
lgb.Dataset.save            0.570  0.042   0.039
lgb.Dataset.set.reference   0.572  0.037   0.038
lgb.Dataset.set.categorical 0.578  0.022   0.038
lgb.Dataset.construct       0.553  0.038   0.037
lgb.importance              0.520  0.032   0.121
lgb.restore_handle          0.490  0.032   0.061
lgb.save                    0.401  0.024   0.053
lgb.train                   0.395  0.016   0.051
dimnames.lgb.Dataset        0.286  0.037   0.061
lgb.convert_with_rules      0.292  0.015   0.020
lgb.plot.importance         0.273  0.013   0.081
lgb.get.eval.result         0.218  0.036   0.051
predict.lgb.Booster         0.245  0.007   0.054
dim                         0.049  0.000   0.050
Examples with CPU time > 2 times elapsed time
                          user system elapsed  ratio
saveRDS.lgb.Booster      1.542  0.065   0.100 16.070
lgb.load                 1.043  0.078   0.070 16.014
lgb.cv                   1.866  0.133   0.125 15.992
readRDS.lgb.Booster      1.040  0.078   0.070 15.971
lgb.Dataset.create.valid 1.225  0.132   0.085 15.965
lgb.plot.interpretation  3.536  0.214   0.274 13.686
lgb.interprete           2.346  0.114   0.189 13.016
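
The ratio column above is consistent with (user + system) / elapsed. As a rough sketch of that arithmetic (not R's actual implementation; the function name is hypothetical), using a few rows copied from the master timings:

```python
THRESHOLD = 2.0  # matches _R_CHECK_EXAMPLE_TIMING_CPU_TO_ELAPSED_THRESHOLD_


def cpu_elapsed_ratio(user, system, elapsed):
    """CPU time (user + system) divided by wall-clock time."""
    return (user + system) / elapsed


# (example, user, system, elapsed) copied from the master timings above
ROWS = [
    ("saveRDS.lgb.Booster", 1.542, 0.065, 0.100),
    ("lgb.load", 1.043, 0.078, 0.070),
    ("lgb.plot.interpretation", 3.536, 0.214, 0.274),
    ("dim", 0.049, 0.000, 0.050),
]

flagged = {}
for name, user, system, elapsed in ROWS:
    ratio = cpu_elapsed_ratio(user, system, elapsed)
    if ratio > THRESHOLD:
        flagged[name] = round(ratio, 3)

for name, ratio in flagged.items():
    print(f"{name:<26}{ratio:>8.3f}")
```

Only the first three rows are flagged. The dim example stays just under 1.0, which is what a single-threaded example looks like.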

How this improves multithreading control

problem 1: OMP_NUM_THREADS() uses unconstrained omp_get_num_threads() threads

This block is problematic:

#pragma omp parallel
#pragma omp master
{ ret = omp_get_num_threads(); }

With environment variable OMP_NUM_THREADS=16 set, I think that #pragma omp parallel creates a team of 16 threads, then runs omp_get_num_threads() on the master thread, then presumably releases those 16 threads.

For the small data sizes used in tests and examples, I think that the unnecessary parallelized work happening on each call of OMP_NUM_THREADS() is enough to lead to high ratios of {CPU}/{elapsed}.

This PR fixes that by replacing #pragma omp master with #pragma omp single, which changes this from "run on the master thread" to "run on any single thread in the current team" (docs link).

problem 2: some LightGBM operations don't have any thread control, others automatically reset LightGBM to "use omp_get_num_threads() threads"

For example, GBDT::LoadModelFromString(), which creates a Booster from a text representation (e.g. as read in from a model file), parallelizes some operations over trees:

https://github.com/microsoft/LightGBM/blob/f5b6bd60d9d752c8e5a75b11ab771d0422214bb4/src/boosting/gbdt_model_text.cpp#L555-LL556

But:

  • it doesn't accept nthreads or similar thread-control arguments
  • code paths from wrappers like the R and Python packages aren't guaranteed to have hit OMP_SET_NUM_THREADS() prior to calling that

For example, when loading a Booster from a pickle file:

def __setstate__(self, state: Dict[str, Any]) -> None:
    model_str = state.get('_handle', state.get('handle', None))
    if model_str is not None:
        handle = ctypes.c_void_p()
        out_num_iterations = ctypes.c_int(0)
        _safe_call(_LIB.LGBM_BoosterLoadModelFromString(
            _c_str(model_str),
            ctypes.byref(out_num_iterations),
            ctypes.byref(handle)))
        state['_handle'] = handle

This PR "solves" that by providing a new mechanism in the C API and public API of the R package to set a process-wide maximum number of threads that LightGBM will use. That's inspired by data.table::setDTthreads() (see, for example, Rdatatable/data.table#5658 (comment)).

It then proposes calling lightgbm::setLGBMthreads(2) in all R-package examples, vignettes, and tests. That should be sufficient to meet CRAN's requirements, while still allowing users of the package to get more parallelism by default.
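
One way to picture how a process-wide cap interacts with a per-call num_threads is the sketch below. It is a simplified Python model, not LightGBM's implementation: the ThreadSettings class is hypothetical, and the rule that values <= 0 mean "unset" is an assumption made for illustration.

```python
import os


class ThreadSettings:
    """Hypothetical model of a process-wide cap plus a per-call request."""

    def __init__(self, hardware_threads=None):
        self.max_threads = -1      # process-wide cap; <= 0 means "no cap"
        self.default_threads = -1  # latest per-call request; <= 0 means "unset"
        self.hardware_threads = hardware_threads or os.cpu_count() or 1

    def effective_threads(self):
        """Threads a parallel region would use under this model."""
        if self.default_threads > 0:
            requested = self.default_threads
        else:
            requested = self.hardware_threads
        if self.max_threads > 0:
            return min(requested, self.max_threads)
        return requested


settings = ThreadSettings(hardware_threads=16)
uncapped = settings.effective_threads()   # nothing set: all 16 threads

settings.max_threads = 2                  # analogous to setLGBMthreads(2)
capped = settings.effective_threads()     # cap applies even with no request

settings.default_threads = 8              # analogous to num_threads = 8
over_cap = settings.effective_threads()   # request is clamped to the cap

settings.default_threads = 1
under_cap = settings.effective_threads()  # smaller requests are honored

print(uncapped, capped, over_cap, under_cap)  # prints "16 2 2 1"
```

Under this model, code paths that never receive a num_threads value still respect the cap, which is the property the examples and tests rely on.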

problem 3: some {lightgbm} operations use {data.table}, but don't constrain how much multithreading it uses

This is described in detail in Rdatatable/data.table#5658.

I fixed this by running data.table::setDTthreads(1) in all examples, vignettes, and tests.

Notes for Reviewers

I'm sorry this is so large, but unfortunately it was done under considerable duress... CRAN have given us until December 12 to upload a new release (#6221).

I left comments below to call out the main points that I think might be controversial.

References

I consulted all of the following while working through this.

@jameslamb jameslamb added the fix label Dec 5, 2023
@jameslamb jameslamb changed the title WIP: [R-package] [c++] add tighter multithreading control, avoid global OpenMP side effects (fixes #4705, fixes #5102) WIP: [R-package] [c++] add tighter multithreading control, avoid global OpenMP side effects (fixes #4705, fixes #5102) Dec 5, 2023
num_threads
)
return(invisible(NULL))
}
Collaborator Author

These functions are currently just setters and getters for LGBM_MAX_NUM_THREADS.

  1. should they be named getMaxLGBMthreads() / setMaxLGBMthreads() or something else with "max" in the name?
    • I personally like getLGBMthreads() / setLGBMthreads() for consistency with data.table::{get/set}DTthreads(), but could be convinced
  2. Should getLGBMthreads() even be exported in the R package's public API?
    • I found it useful for testing and thought users might as well, but it'd be easier to add it later than to have to remove it later
    • and {lightgbm} could use it in its own tests with :::
  3. if we do keep it in the public interface... should getLGBMthreads() actually be a getter for LGBM_MAX_NUM_THREADS? Or should it return an answer to the question "how many threads will e.g. lgb.train() use if I don't pass any thread-control parameters through params"?

@@ -9,6 +9,7 @@ S3method(print,lgb.Booster)
S3method(set_field,lgb.Dataset)
S3method(slice,lgb.Dataset)
S3method(summary,lgb.Booster)
export(getLGBMthreads)
Collaborator Author

I chose not to add similar functions to the Python interface, since:

  • it's the R package that needs this more urgently
  • taking back parts of the public API for the Python package is MUCH harder than for the R package, as the Python package is used so much more widely
  • this PR is already bigger than I'm comfortable with

@jameslamb jameslamb changed the title WIP: [R-package] [c++] add tighter multithreading control, avoid global OpenMP side effects (fixes #4705, fixes #5102) [R-package] [c++] add tighter multithreading control, avoid global OpenMP side effects (fixes #4705, fixes #5102) Dec 6, 2023
@@ -1346,6 +1354,8 @@ lgb.save <- function(booster, filename, num_iteration = NULL) {
#' @examples
#' \donttest{
#' library(lightgbm)
#' \dontshow{setLGBMthreads(2L)}
#' \dontshow{data.table::setDTthreads(1L)}
Collaborator Author

Per https://cran.r-project.org/doc/manuals/R-exts.html

... \dontshow{} for extra commands for testing that should not be shown to users, but will be run by example()

These \dontshow{} blocks hide this code from users, but ensure it runs when CRAN checks the package.

Thanks to @jangorecki for the suggestion (Rdatatable/data.table#5658 (comment)).

@jameslamb jameslamb marked this pull request as ready for review December 6, 2023 04:44
@jameslamb
Collaborator Author

This is ready for review.

Tagging in some others who might be interested and have opinions about it: @david-cortes @mayer79 @trivialfis @simonpcouch @AlbertoEAF

Comment on lines +10 to +15
// this can only be changed by LGBM_SetMaxThreads()
LIGHTGBM_EXTERN_C int LGBM_MAX_NUM_THREADS;

// this is modified by OMP_SET_NUM_THREADS(), for example
// by passing num_thread through params
LIGHTGBM_EXTERN_C int LGBM_DEFAULT_NUM_THREADS;
Collaborator Author

This makes these process-global variables, and therefore not thread safe, as @david-cortes alluded to in the description of #6152.

For example, if you created 2 Booster objects in different threads which had different values of num_threads in Config, one's OMP_SET_NUM_THREADS() call could affect code in the other.

I think that's an acceptable risk for now, in exchange for the other benefits of this PR.

Collaborator

maybe you can try thread_local, but not hurry in this PR.

Collaborator Author

Thank you! I did see that we have that preprocessor macro set up

#if defined(_MSC_VER)
#define THREAD_LOCAL __declspec(thread)
#else
#define THREAD_LOCAL thread_local
#endif

but didn't test it out. Let's save it for a follow-up PR... I think it'd be ok to release this PR's changes without making this configuration thread-safe.
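
To illustrate the trade-off being discussed, here is a small Python analogy (not LightGBM code) contrasting a process-global setting with a thread-local one. threading.local plays the role that thread_local would play in C++, and all names are hypothetical.

```python
import threading

shared = {"num_threads": 0}      # process-global setting, seen by every thread
per_thread = threading.local()   # each thread gets its own attributes


def worker(name, requested, barrier, results):
    shared["num_threads"] = requested    # last writer wins for everyone
    per_thread.num_threads = requested   # only visible in this thread
    barrier.wait()                       # wait until both threads have written
    results[name] = (shared["num_threads"], per_thread.num_threads)


barrier = threading.Barrier(2)
results = {}
threads = [
    threading.Thread(target=worker, args=("a", 2, barrier, results)),
    threading.Thread(target=worker, args=("b", 8, barrier, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Both workers read back the same global value (whichever thread wrote last), so one of them sees a setting it did not ask for, while each thread-local value matches what that thread requested.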

@mayer79
Contributor

mayer79 commented Dec 6, 2023

LGTM, thanks so much. I would not wait too long with resubmission, to not be under pressure if something fails on the first hand.

@jameslamb
Collaborator Author

I would not wait too long with resubmission

^ I agree with this.

If there are not any other comments in the next 8 hours or so, I'd like to merge this and try to release a v4.2.0 to CRAN.

(NOTE: I won't cut a full LightGBM v4.2.0 release, just one to CRAN. I think we should continue with the normal process of completing all the steps at #6191 for the rest of that release. I think it's fine for those to be slightly different given the time pressure from CRAN)

@jameslamb
Collaborator Author

Given the approvals and no other blocking comments, I'm going to merge this as-is. My plan is as follows:

  1. merge this right now
  2. update release v4.2.0 #6191 to include it and everything else on master
  3. re-trigger the valgrind checks (these take around 5 hours to complete)
  4. build a v4.2.0 release of the CRAN-style R package only from that branch, and submit it to CRAN

I'll post updates on #6191.

Thanks so much to everyone involved for the reviews and other contributions to getting this working!
