[dask] Make output of feature contribution predictions for sparse matrices match those from sklearn estimators (fixes #3881) #4378
@jmoralez whenever you have time, could you please take a look at this and let me know what you think? Here's some simpler code you could use to test that might be a little easier to experiment with interactively than the unit tests.
import dask.array as da
import numpy as np
import lightgbm as lgb
from dask import delayed
from dask.distributed import Client, LocalCluster
from lightgbm.dask import DaskLGBMClassifier
from lightgbm.sklearn import LGBMClassifier
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.datasets import make_blobs
n_workers = 3
cluster = LocalCluster(n_workers=n_workers)
client = Client(cluster)
client.wait_for_workers(n_workers)
print(f"View the dashboard: {cluster.dashboard_link}")
chunk_size = 50
X, y = make_blobs(n_samples=100, centers=3, random_state=42)
rnd = np.random.RandomState(42)
dX = da.from_array(X, chunks=(chunk_size, X.shape[1])).map_blocks(csc_matrix)
dy = da.from_array(y, chunks=chunk_size)
dask_clf = DaskLGBMClassifier(n_estimators=5, num_leaves=2, tree_learner="data")
dask_clf.fit(dX, dy)
preds = dask_clf.predict(dX, pred_contrib=True)
preds_computed = preds.compute()
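When experimenting with the script above, it can help to compare the computed Dask output against a non-Dask reference. The following is a minimal sketch, assuming both results are lists of scipy sparse matrices (one per class). The helper name `assert_contribs_match` and the toy inputs are hypothetical and are not part of LightGBM or its tests.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def assert_contribs_match(actual, expected, rtol=1e-7, atol=0.0):
    """Compare two lists of per-class sparse matrices element-wise.

    Hypothetical helper for illustration only.
    """
    assert len(actual) == len(expected), "different number of classes"
    for a, e in zip(actual, expected):
        assert issparse(a) and issparse(e), "expected scipy sparse matrices"
        assert a.shape == e.shape, "shape mismatch"
        # Densify before comparing; acceptable for small test matrices only.
        np.testing.assert_allclose(a.toarray(), e.toarray(), rtol=rtol, atol=atol)


# Toy stand-ins for the Dask result and the scikit-learn result.
values = np.array([[0.0, 1.5], [2.0, 0.0]])
dask_like = [csr_matrix(values) for _ in range(3)]
local_like = [csr_matrix(values) for _ in range(3)]
assert_contribs_match(dask_like, local_like)
```

Converting to dense arrays before comparing sidesteps any dependence on how numpy's testing helpers treat sparse inputs, at the cost of extra memory.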
Nice idea using bags! Although I'm not 100% convinced on the type of I believe we can maybe get I think it's really up to deciding which consistency to have, with the result type of
hmmm, I couldn't think of a way to do this that didn't involve having If you have an idea about the chain of calls that could produce such an output (even if it's just pseudocode), that would help me. I'll experiment a bit more.
I was experimenting with this in the if statement for the pred_contrib, definitely not pretty (I imported dask, dask.array and dask.bag without using compat) and
# names used below: dask, dask.array as da, dask.bag as db, functools.partial
delayed_chunks = data.to_delayed()
bag = db.from_delayed(delayed_chunks[:, 0])
predict_function = partial(
    _predict_part,
    model=model,
    raw_score=False,
    pred_proba=False,
    pred_leaf=False,
    pred_contrib=True,
    **kwargs
)

@dask.delayed
def extract(l, i):
    return l[i]

preds = bag.map_partitions(predict_function)
chunks = data.chunks[0]
out = [[] for _ in range(model.n_classes_)]
for j, partition in enumerate(preds.to_delayed()):
    for i in range(model.n_classes_):
        part = da.from_delayed(
            extract(partition, i),
            shape=(chunks[j], model.n_classes_),
            meta=data._meta
        )
        out[i].append(part)
for i in range(model.n_classes_):
    out[i] = da.concatenate(out[i])
return out
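The regrouping step in the snippet above (per-chunk results, each holding one entry per class, turned into one output per class) can be sketched without Dask. This is only an illustration under stated assumptions: `regroup_by_class` and `fake_predict` are hypothetical names, and `fake_predict` stands in for the real per-chunk prediction call.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def regroup_by_class(per_chunk_results, n_classes):
    """Turn [chunk][class] sparse results into one row-stacked matrix per class."""
    out = []
    for class_idx in range(n_classes):
        # Collect this class's piece from every chunk, preserving chunk order.
        class_parts = [chunk_result[class_idx] for chunk_result in per_chunk_results]
        out.append(vstack(class_parts, format="csr"))
    return out


def fake_predict(chunk, n_classes):
    # Hypothetical stand-in for per-chunk prediction:
    # class k simply gets the chunk scaled by (k + 1).
    return [csr_matrix(chunk * (k + 1)) for k in range(n_classes)]


X = np.arange(12, dtype=float).reshape(6, 2)
chunks = [X[:4], X[4:]]  # two row-wise chunks, like a chunked Dask Array
per_chunk = [fake_predict(c, 3) for c in chunks]
by_class = regroup_by_class(per_chunk, 3)
```

Each entry of `by_class` covers all rows of `X`, which mirrors the way the snippet concatenates the per-class pieces across partitions.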
oooo nice, thank you! That's very helpful. I'm going to keep experimenting.
I'll move this back to a draft for now
All CI jobs except one passed.
Looks like that is coming from
I think only that QEMU job was failing because different versions of I fixed that issue in 95ae45f.
ah! This test just failed again! It seems the problem is just about the use of I'm able to reproduce the problem on my Mac by upgrading to the newest
pip install --upgrade numpy scipy
cd python-package
python setup.py install
cd ../tests/python_package_test
# manually comment out "if not platform.startswith('linux')"
pytest test_dask.py::test_classifier_pred_contrib
I'll investigate this later tonight.
@StrikerRUS I was able to reproduce this tonight! I believe there was a breaking change in I've opened numpy/numpy#19405 to report it. That issue is specific to using We didn't hit this issue before in LightGBM's tests because of the uses of Given that, I think we should just restore those uses of I restored them in 831927b.
As mentioned on the numpy issue, this is not so much of a regression.
Consider using
Thanks very much for taking the time to come visit this PR and share some advice with us, @rkern! We'll definitely do that here.
There's also
ah, even better! I'll make that change here. Thanks again for your help.
9484ba7
to
57edeaa
Compare
ok switching from
LGTM!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Fixes #3881.
Resolves this comment from #4351 (comment).
Changes in this PR
changes lgb.dask.DaskLGBMClassifier.predict() to return a list of Dask Arrays (originally a Dask Bag) when input data is composed of sparse matrices and pred_contrib=True
lightgbm.dask so that it now correctly sets local data to a scipy CSR matrix when asked to do so (previously, it would return a dense numpy array)
How this improves lightgbm.dask
makes lgb.dask.DaskLGBMClassifier.predict()'s behavior more consistent with the equivalent behavior from lgb.sklearn.LGBMClassifier
(lightgbm.dask not working with dask>2021.4.0)
dask versions in LightGBM's tests (WIP: [ci] remove pin on dask and distributed in CI (fixes #4285) #4307, [ci] Dask tests on Linux failing #4285)
lightgbm.dask, where the local array data was always an np.ndarray even in cases where it should have been a scipy CSR matrix
Background
#3000 added support for getting feature contributions (using .predict(pred_contrib=True)) for input data stored in sparse matrices. The design decision was made in that PR that for one specific case (multiclass classification + sparse input data + pred_contrib=True), LightGBM should return a list of sparse matrices (one per class) instead of a single concatenated-together new matrix. Maybe this is because of #3000 (comment), not sure.
Notes for Reviewers
When given a Dask Array, lightgbm.dask._predict() can use Dask's built-in .map_blocks() to say "call predict(**kwargs) on each chunk of this array, then return me a new Dask Array of all those results". However, operations where you want to call a function that does not return an array on each chunk of a Dask Array and then combine those results are not well supported. See the discussion in dask/dask#7589 for an explanation of why this is a very difficult problem.
The best I could come up with was this approach, which produces a Dask Bag where, if you call .compute(), you'll get a result that is identical to the equivalent non-Dask code with lgb.sklearn.LGBMClassifier.predict().
I am ok with this change going into the 3.3.0 release (#4310) even though it is a breaking change, since other things are reliant on it, since it is such a narrow part of the API (only applies if using Dask and using sparse Dask Arrays and using multiclass classification and looking for feature contributions), and since it makes the Dask interface more consistent with the scikit-learn interface. The Dask community has, generally speaking, been comfortable with breaking changes if they're made in pursuit of better consistency with scikit-learn / numpy. But will defer to your opinion on that, @StrikerRUS.
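To make the Background's distinction concrete (a list of sparse matrices, one per class, versus a single concatenated matrix), here is a small sketch using only numpy and scipy. It assumes, as my reading of LightGBM's behavior rather than something stated in this PR, that the concatenated layout places each class's block of n_features + 1 columns side by side, with the extra column holding the expected value. The random data is a placeholder, not real model output.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

n_samples, n_features, n_classes = 4, 2, 3
# Assumption: one column per feature plus one for the expected value.
width = n_features + 1

# Placeholder for a dense, concatenated contributions array.
rng = np.random.RandomState(42)
dense = rng.rand(n_samples, n_classes * width)

# Split into a list of per-class sparse matrices, each (n_samples, width).
per_class = [
    csr_matrix(dense[:, k * width:(k + 1) * width])
    for k in range(n_classes)
]

# Stacking the per-class matrices side by side recovers the concatenated form.
roundtrip = hstack(per_class).toarray()
```

Under that assumption the two representations carry the same values, so the choice between them is about API consistency with the scikit-learn interface and not about information content.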