# [dask] Dask estimators sometimes return an incomplete booster #3918
## Comments
I ran a few more experiments tonight and found some interesting things. Basically, I don't think there is a bug; I think this is just a problem that can arise when using very, very small training data (relative to […]). So I think this can be closed. For binary classification, increasing the dataset size from 100 to 1000 observations makes the problem go away.

```python
# ----- binary classification ----- #
# Assumes the setup from the earlier experiments: a running dask `client`,
# `_create_data` from lightgbm's tests, and `dask_classifier` /
# `local_classifier` constructed with matching parameters.
from distributed import wait

client.restart()

X, y, w, dX, dy, dw = _create_data(n_samples=1000, objective="classification")
dX = dX.persist()
dy = dy.persist()
dw = dw.persist()
_ = wait([dX, dy, dw])

# Count how many trees each repeated fit produces.
dask_summaries = {}
local_summaries = {}
for i in range(100):
    if i % 5 == 0:
        print(i)
    dask_classifier.fit(dX, dy, sample_weight=dw)
    num_trees = dask_classifier.booster_.num_trees()
    dask_summaries[num_trees] = dask_summaries.get(num_trees, 0) + 1
    local_classifier.fit(X, y, sample_weight=w)
    num_trees = local_classifier.booster_.num_trees()
    local_summaries[num_trees] = local_summaries.get(num_trees, 0) + 1

print("   dask: " + str(dask_summaries))
print("sklearn: " + str(local_summaries))
```

My working theory right now is that with very small data plus mostly random features, and then splitting that data up into two pieces, it's easy to randomly get into a situation where it's not possible to boost for the desired number of rounds and find splits that satisfy the default conditions like […]. I found that when I cut […]
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
## Short description of the problem

`DaskLGBMRanker` and `DaskLGBMRegressor` sometimes return a model with an incomplete booster. Training with `num_iterations = 50` and no early stopping, `.fit()` from these classes sometimes returns a booster with fewer than 50 trees.
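To state the failing expectation directly, here is a minimal sketch with synthetic data on a local cluster (an assumption for illustration; the actual repro below uses the helpers from `lightgbm`'s tests instead):

```python
import dask.array as da
import numpy as np
from distributed import Client
from lightgbm.dask import DaskLGBMRegressor

client = Client()  # local cluster, for illustration

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)

# Two chunks, so training data is split across two partitions.
dX = da.from_array(X, chunks=(50, 10))
dy = da.from_array(y, chunks=(50,))

reg = DaskLGBMRegressor(n_estimators=50)  # no early stopping configured
reg.fit(dX, dy)

# Expected to always print 50; intermittently it comes back smaller.
print(reg.booster_.num_trees())
```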
## Reproducible Example

The example code below uses the functions from `lightgbm`'s tests to create datasets. I'll try to come back and simplify it further.

Running that example, I got results like this:

[…]

So it seems like the problem is mostly specific to distributed training, although it's confusing to see 48 trees for `LGBMRegressor`.

## Environment Info
* Operating System: Ubuntu 18.04
* C++ compiler version: `gcc` 9.3.0
* CMake version: 3.16.3
* Python version: 3.8.5 (from `conda info`)
* LightGBM version or commit hash: https://github.com/microsoft/LightGBM/tree/ffebc43fea44ba95a0bc2b4366fe9b4ff8275c22 (latest `master`)

## Other Notes
I'm writing this up now that I've noticed it and have a reproducible example. I don't know yet if this is specific to `LGBMRanker`, or if the same problem affects `LGBMClassifier` and `LGBMRegressor`.

LightGBM/python-package/lightgbm/dask.py, lines 339 to 341 in ffebc43
## Edits

EDIT 1: Added a check that the problem doesn't exist with `lgb.sklearn.LGBMRanker`. It doesn't, so it seems the problem is specific to distributed training.

EDIT 2: Added tests for the regressor and classifier. Binary and multi-class classification seem to also suffer from this problem.