
[dask] Make output of feature contribution predictions for sparse matrices match those from sklearn estimators (fixes #3881) #4378

Merged
merged 39 commits into master from fix/dask-multiclass-sparse-predict
Jul 7, 2021

Conversation

jameslamb
Collaborator

@jameslamb jameslamb commented Jun 14, 2021

Fixes #3881.

Resolves this comment from #4351 (comment).

however this is failing for test_classifier_pred_contrib[multiclass-classification-scipy_csr_matrix]

Changes in this PR

  • changes lgb.dask.DaskLGBMClassifier.predict() to return a list of Dask Arrays (one per class; an earlier iteration of this PR returned a Dask Bag) when input data is composed of sparse matrices and pred_contrib=True
  • changes data creation function in the unit tests for lightgbm.dask so that it now correctly sets local data to a scipy CSR matrix when asked to do so (previously, it would return a dense numpy array)

How this improves lightgbm.dask

Background

#3000 added support for getting feature contributions (using .predict(pred_contrib=True)) for input data stored in sparse matrices. That PR made the design decision that, for one specific case (multiclass classification + sparse input data + pred_contrib=True), LightGBM should return a list of sparse matrices (one per class) instead of a single concatenated matrix. This may be because of #3000 (comment), but I'm not sure.

I think concatenating multiple sparse matrices into one is a bit more expensive than concatenating dense data, e.g. you have to recalculate the indptr (row pointer) array.
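A small standalone sketch of that cost (illustrative only, not LightGBM code): stacking CSR matrices forces the result's indptr to be rebuilt, with each block's row pointers offset by the nonzero counts of the blocks above it.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

a = csr_matrix(np.eye(3))
b = csr_matrix(np.eye(3))

# a.indptr and b.indptr are both [0 1 2 3]; the stacked matrix needs a
# freshly computed indptr where b's row pointers are shifted by a.nnz
stacked = vstack([a, b], format="csr")

print(stacked.indptr)
```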

Notes for Reviewers

When given a Dask Array, lightgbm.dask._predict() can use Dask's built-in .map_blocks() to say "call predict(**kwargs) on each chunk of this array, then return me a new Dask Array of all those results". However, operations where you want to call a function that does not return an array on each chunk of a Dask Array and then combine those results are not well supported. See the discussion in dask/dask#7589 for an explanation of why this is a very difficult problem.
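For contrast, here is a minimal standalone sketch (unrelated to the LightGBM code itself) of the well-supported array-in/array-out case that .map_blocks() handles:

```python
import dask.array as da

x = da.ones((4, 2), chunks=(2, 2))

# each chunk is a numpy array and the function returns an array of the same
# shape, so Dask can assemble the per-chunk results into a new Dask Array
doubled = x.map_blocks(lambda block: block * 2)

total = doubled.sum().compute()
```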

The best I could come up with was this approach, which produces a Dask Bag where, if you call .compute(), you'll get a result that is identical to the equivalent non-Dask code with lgb.sklearn.LGBMClassifier.predict().
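As an illustration of the Bag pattern (a toy sketch, not the actual lightgbm.dask code; the chunks here are invented for the example): each partition runs a function whose per-chunk result is a plain Python object, and .compute() assembles those results into an ordinary list.

```python
import dask.bag as db
import numpy as np
from scipy.sparse import csr_matrix

# two toy input "chunks"; real code would build partitions from the Dask collection
bag = db.from_sequence([np.eye(2), np.eye(2) * 2], npartitions=2)

# the per-chunk function returns a non-array object (a sparse matrix),
# which a Bag can hold even though a Dask Array could not
contribs = bag.map(csr_matrix)

result = contribs.compute()  # a plain Python list of csr_matrix objects
```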

I am ok with this change going into the 3.3.0 release (#4310) even though it is a breaking change: other things rely on it, it affects such a narrow part of the API (only if using Dask, using sparse Dask Arrays, using multiclass classification, and looking for feature contributions), and it makes the Dask interface more consistent with the scikit-learn interface. The Dask community has, generally speaking, been comfortable with breaking changes if they're made in pursuit of better consistency with scikit-learn / numpy. But I'll defer to your opinion on that @StrikerRUS .

@jameslamb
Collaborator Author

@jmoralez whenever you have time, could you please take a look at this and let me know what you think?

Here's some simpler code you could use to test that might be a little easier to experiment with interactively than the unit tests.

import dask.array as da
from dask.distributed import Client, LocalCluster
from lightgbm.dask import DaskLGBMClassifier
from scipy.sparse import csc_matrix
from sklearn.datasets import make_blobs

n_workers = 3
cluster = LocalCluster(n_workers=n_workers)
client = Client(cluster)
client.wait_for_workers(n_workers)

print(f"View the dashboard: {cluster.dashboard_link}")

chunk_size = 50
X, y = make_blobs(n_samples=100, centers=3, random_state=42)
dX = da.from_array(X, chunks=(chunk_size, X.shape[1])).map_blocks(csc_matrix)
dy = da.from_array(y, chunks=chunk_size)

dask_clf = DaskLGBMClassifier(n_estimators=5, num_leaves=2, tree_learner="data")
dask_clf.fit(dX, dy)

preds = dask_clf.predict(dX, pred_contrib=True)
preds_computed = preds.compute()

@jameslamb jameslamb changed the title [dask] Make output of feature contribution predictions for sparse matrices match those from sklearn estimators [dask] Make output of feature contribution predictions for sparse matrices match those from sklearn estimators (fixes #3881) Jun 14, 2021
@jmoralez
Collaborator

Nice idea using bags! Although I'm not 100% convinced about the type of preds (dask.bag.core.Item), and I think building the bag from delayeds instead of from a sequence might be a better idea, since I think we're nesting delayeds here (that .compute() in the function looks suspicious).

I believe we can maybe get preds to be List[dask.array.Array] where each array has the same chunks as dX and such that client.compute(preds) returns List[Future] where each future's result type is scipy.sparse.csr_matrix.

I think it's really up to deciding which consistency to have, with the result type of predict or with the result type of preds.compute()

@jameslamb
Collaborator Author

I believe we can maybe get preds to be List[dask.array.Array] where each array has the same chunks as dX

hmmm, I couldn't think of a way to do this that didn't involve n_classes_ duplicate calls of predict(), since each call of predict() on one chunk of dX returns a list with n_classes_ elements.

If you have an idea about the chain of calls that could produce such an output (even if it's just pseudocode), that would help me.

I'll experiment a bit more.

@jmoralez
Collaborator

I was experimenting with this in the if statement for pred_contrib. It's definitely not pretty (I imported dask, dask.array and dask.bag without using compat), and client.compute(preds) returns futures of type COO instead of CSR, but maybe it can help you.

# assumes these imports at the top of lightgbm/dask.py (added without compat):
# import dask
# import dask.array as da
# import dask.bag as db
# from functools import partial

delayed_chunks = data.to_delayed()
bag = db.from_delayed(delayed_chunks[:, 0])

predict_function = partial(
    _predict_part,
    model=model,
    raw_score=False,
    pred_proba=False,
    pred_leaf=False,
    pred_contrib=True,
    **kwargs
)

@dask.delayed
def extract(l, i):
    return l[i]

preds = bag.map_partitions(predict_function)
chunks = data.chunks[0]
out = [[] for _ in range(model.n_classes_)]
for j, partition in enumerate(preds.to_delayed()):
    for i in range(model.n_classes_):
        part = da.from_delayed(
            extract(partition, i),
            shape=(chunks[j], model.n_classes_),
            meta=data._meta
        )
        out[i].append(part)
for i in range(model.n_classes_):
    out[i] = da.concatenate(out[i])
return out

@jameslamb
Collaborator Author

oooo nice, thank you! That's very helpful. I'm going to keep experimenting.

@jameslamb
Collaborator Author

I'll move this back to a draft for now

@jameslamb jameslamb marked this pull request as draft June 15, 2021 03:27
@jameslamb
Collaborator Author

All CI jobs except one passed. QEMU_multiarch bdist failed with this error:

E ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10497&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=bf5b33f9-9072-549e-d8b1-cab79a1dd2a9

Looks like that is coming from

assert len(np.unique(preds_with_contrib[:, num_features]) == 1)

I think only that QEMU job was failing because different versions of numpy will treat that case slightly differently. Anyway, it's a mistake in the test (a misplaced parenthesis).
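A minimal illustration of that misplaced parenthesis (toy data, not the actual test): with len() wrapping the comparison, the assertion passes whenever the boolean array is non-empty, regardless of the values.

```python
import numpy as np

col = np.array([3.0, 3.0, 3.0])

# buggy form: the comparison happens first, so len() measures a boolean array
buggy = len(np.unique(col) == 1)   # length of [False], i.e. 1 -- truthy no matter what

# intended form: check that the column contains exactly one unique value
fixed = len(np.unique(col)) == 1   # True
```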

I fixed that issue in 95ae45f.

@jameslamb
Collaborator Author

ah! This test just failed again!

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10497&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=bf5b33f9-9072-549e-d8b1-cab79a1dd2a9

It seems the problem is not about the use of len() on a numpy array, but about how np.unique() works in the newest version of numpy.

I'm able to reproduce the problem on my Mac by upgrading to the newest numpy (v1.21.0) and scipy (v1.7.0).

pip install --upgrade numpy scipy
cd python-package
python setup.py install
cd ../tests/python_package_test
# manually comment out "if not platform.startswith('linux')"
pytest test_dask.py::test_classifier_pred_contrib

I'll investigate this later tonight.

@jameslamb
Collaborator Author

@StrikerRUS I was able to reproduce this tonight! I believe there was a breaking change in numpy 1.21.0 (released June 22, 2021), which causes the error I saw in #4378 (comment).

I've opened numpy/numpy#19405 to report it.

That issue is specific to using np.unique() on a np.matrix object. I learned that calling .todense() on a sparse matrix returns a np.matrix, not an np.ndarray.

We didn't hit this issue before in LightGBM's tests because of the uses of np.array() to wrap results of .todense() (removed in #4378 (comment)).

Given that, I think we should just restore those uses of np.array(). They don't introduce that much overhead and they allow these tests to not be blocked by this possible regression in numpy.

I restored them in 831927b.
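A quick standalone demonstration of the np.matrix quirk described above (illustrative, not the test code):

```python
import numpy as np
from scipy.sparse import csr_matrix

sp = csr_matrix(np.eye(3))

dense = sp.todense()       # returns np.matrix, a 2-D-only ndarray subclass
wrapped = np.array(dense)  # the wrapping restored in the tests: a true ndarray

print(type(dense).__name__, type(wrapped).__name__)
```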

@rkern

rkern commented Jul 5, 2021

As mentioned on the numpy issue, this is not so much of a regression. np.unique() never worked correctly on np.matrix objects. The fact that it raises an exception in one version and doesn't in an older version is not a regression (arguably, an enhancement, though both are basically accidents of never thinking about np.matrix in the implementation). Coercing to true ndarrays was always the right idea.

@rkern

rkern commented Jul 5, 2021

Consider using np.asarray(), which will not copy the underlying memory.

@jameslamb
Collaborator Author

Coercing to true ndarrays was always the right idea.
Consider using np.asarray(), which will not copy the underlying memory.

Thanks very much for taking the time to come visit this PR and share some advice with us, @rkern! We'll definitely do that here.

@rkern

rkern commented Jul 5, 2021

There's also toarray() on the scipy.sparse matrix objects, which bypasses np.matrix entirely. That would be the most efficient and idiomatic way to write that.
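To summarize the three sparse-to-dense routes discussed in this thread (a standalone sketch):

```python
import numpy as np
from scipy.sparse import csr_matrix

sp = csr_matrix(np.eye(3))

as_matrix = sp.todense()           # np.matrix: the problematic type
as_array = np.asarray(as_matrix)   # true ndarray, shares memory with as_matrix
direct = sp.toarray()              # ndarray directly, bypassing np.matrix entirely

assert np.shares_memory(as_matrix, as_array)
```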

@StrikerRUS StrikerRUS self-requested a review July 5, 2021 11:12
@jameslamb
Collaborator Author

There's also toarray() on the scipy.sparse matrix objects, which bypasses np.matrix entirely. That would be the most efficient and idiomatic way to write that.

ah, even better! I'll make that change here. Thanks again for your help.

@jameslamb
Collaborator Author

ok switching from .todense() to .toarray() worked and all checks are passing 🙌

Collaborator

@StrikerRUS StrikerRUS left a comment


LGTM!

@StrikerRUS StrikerRUS merged commit b09da43 into master Jul 7, 2021
@StrikerRUS StrikerRUS deleted the fix/dask-multiclass-sparse-predict branch July 7, 2021 11:27
@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023