Dataset construction uses all threads on the machine #5124
Thanks for using LightGBM! We need some more information from you before we can help.
I think this issue and #4598 have the same root cause.
Investigating #4598 turned up some evidence, but I really think we need a reproducible example to be able to investigate this report further. Otherwise, solving this conclusively will require significant research and guessing to figure out what combination of parameters, LightGBM version, and Python code reproduces this behavior.
#4598 investigates whether or not parallelism is enabled at all. The claim of this issue is that during some stages of dataset construction ALL threads on the machine are used, ignoring the configured num_threads. The dataset itself doesn't matter much; it's the behavior of the parallelism. At best, I can provide you with a screenshot of htop during the dataset construction.
I believe this is fixed in newer versions of LightGBM. Specifically, I think that #6226 fixed this.

Built LightGBM like this:

```shell
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
sh build-python.sh bdist_wheel install
```

Created a fairly expensive Dataset construction task:

```shell
cat << EOF > make-data.py
import numpy as np
X = np.random.random(size=(1_000_000, 100))
y = np.random.random(size=(X.shape[0],))
np.save("X.npy", X)
np.save("y.npy", y)
EOF

python ./make-data.py
```

Then created a script that times `Dataset.construct()`:

```shell
cat << EOF > check-multithreading.py
import lightgbm as lgb
import numpy as np
import time
import os

X = np.load("X.npy")
y = np.load("y.npy")
ds = lgb.Dataset(
    X,
    y,
    params={
        "verbose": -1,
        "min_data_in_bin": 1,
        "max_bin": 10000
    }
)
tic = time.time()
ds.construct()
toc = time.time()
num_threads = os.environ.get("OMP_NUM_THREADS", None)
print(f"threads: {num_threads} | execution time (s): {round(toc - tic, 3)}")
EOF
```

Tested with `OMP_NUM_THREADS=1`:

```shell
OMP_NUM_THREADS=1 \
python ./check-multithreading.py
# threads: 1 | execution time (s): 22.849
```

... and `OMP_NUM_THREADS=4`:

```shell
OMP_NUM_THREADS=4 \
python ./check-multithreading.py
# threads: 4 | execution time (s): 6.156
```

... and with `OMP_NUM_THREADS` unset:

```shell
python ./check-multithreading.py
# threads: None | execution time (s): 2.396
```

For completeness, I repeated this same exercise but with the environment variable
Description
Passing `nthreads` to the `lightgbm.Dataset` constructor (via the `params` parameter) doesn't seem to be taken into account: `construct()` appears to use all cores on the machine in some phases. I would expect `construct()` to be bound by the specified maximum number of threads.
Reproducible example
Loading a large dataset via a hand-crafted `Sequence` object.
Environment info
LightGBM version or commit hash: 3.2.1