ENH: ColumnTransformer #315

Merged (19 commits) on Sep 4, 2018

@TomAugspurger TomAugspurger commented Jul 25, 2018

This PR implements a daskified column transformer.

Two issues prevent us from using scikit-learn's ColumnTransformer directly:

  1. dask DataFrame doesn't implement .shape (see "Add (lazy) shape property to dataframe and series", dask/dask#3212).
  2. sklearn.compose._column_transformer._hstack doesn't handle dask objects (or pandas dataframes), only sparse objects and ndarrays. The _hstack implemented here handles arrays (dask or numpy), dataframes (dask or pandas), and sparse matrices.
In [1]: import dask.dataframe as dd
   ...: import pandas as pd
   ...: import sklearn.compose
   ...: import sklearn.preprocessing
   ...: from sklearn.base import clone
   ...:
   ...: import dask_ml.compose
   ...: import dask_ml.preprocessing
   ...:
   ...: df = pd.DataFrame({"A": pd.Categorical(["a", "a", "b", "a"]), "B": [1.0, 2, 4, 5]})
   ...: ddf = dd.from_pandas(df, npartitions=2)
   ...:
   ...:

In [2]:     b = dask_ml.compose.make_column_transformer(
   ...:         (["A"], dask_ml.preprocessing.OneHotEncoder(sparse=False)),
   ...:         (["B"], dask_ml.preprocessing.StandardScaler()),
   ...:     )
   ...:
   ...:

In [3]: result = b.fit_transform(ddf)

In [4]: result
Out[4]:
Dask DataFrame Structure:
                   A_a      A_b        B
npartitions=2
0              float64  float64  float64
2                  ...      ...      ...
3                  ...      ...      ...
Dask Name: concat-indexed, 22 tasks

In [5]: result.compute()
Out[5]:
   A_a  A_b         B
0  1.0  0.0 -1.264911
1  1.0  0.0 -0.632456
2  0.0  1.0  0.632456
3  1.0  0.0  1.264911
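
As a sanity check on the `B` column above, the scaled values can be reproduced by hand with NumPy. This is only a sketch; it assumes the scaler divides by the population standard deviation (`ddof=0`), which is what the numbers in `Out[5]` are consistent with.

```python
import numpy as np

b = np.array([1.0, 2.0, 4.0, 5.0])

# Centre on the mean (3.0) and divide by the population standard
# deviation, sqrt(2.5). np.std uses ddof=0 by default.
scaled = (b - b.mean()) / b.std()

# Values shown in Out[5] for column B.
expected = np.array([-1.264911, -0.632456, 0.632456, 1.264911])
assert np.allclose(scaled, expected, atol=1e-6)
```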

[image: mydask]

Long-term, it'd be nice to remove this class entirely, but that'll probably require a lot of work upstream (scipy adopting pydata/sparse, NumPy implementing and libraries adopting __array_function__).

Medium-term, _hstack could become a staticmethod on ColumnTransformer. Then this subclass would just override _hstack, and everything else could be removed.
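
The override pattern proposed here can be sketched in plain Python. Both classes below are hypothetical toys, not scikit-learn or dask-ml classes; they only show how a subclass can replace the stacking step while inheriting everything else.

```python
class ToyColumnTransformer:
    """Hypothetical stand-in for a column transformer."""

    def __init__(self, transformers):
        # transformers: list of (column_index, function) pairs
        self.transformers = transformers

    @staticmethod
    def _hstack(blocks):
        # Default stacking: join the per-transformer blocks row by row
        # into lists.
        return [[v for part in rows for v in part] for rows in zip(*blocks)]

    def transform(self, rows):
        # Apply each function to its column, producing one block
        # (a list of single-value rows) per transformer.
        blocks = [
            [[func(row[col])] for row in rows]
            for col, func in self.transformers
        ]
        # The only container-specific step is delegated to _hstack.
        return self._hstack(blocks)


class TupleColumnTransformer(ToyColumnTransformer):
    """Subclass that overrides nothing but the stacking step."""

    @staticmethod
    def _hstack(blocks):
        return [tuple(v for part in rows for v in part) for rows in zip(*blocks)]


rows = [[1, 10], [2, 20]]
steps = [(0, lambda v: v + 1), (1, lambda v: v * 2)]
base_result = ToyColumnTransformer(steps).transform(rows)
sub_result = TupleColumnTransformer(steps).transform(rows)
```

With this layout the subclass contains no transformation logic of its own, which is the reduction in duplicated code that the proposal is after.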

cc @jorisvandenbossche @ogrisel for that last point. Should I open an issue on scikit-learn to discuss that further?


ogrisel commented Jul 26, 2018

> Medium-term, _hstack could become a staticmethod on ColumnTransformer. Then this subclass would just override _hstack, and everything else could be removed.
>
> cc @jorisvandenbossche @ogrisel for that last point. Should I open an issue on scikit-learn to discuss that further?

+1 for opening an issue on scikit-learn to discuss your suggestion (or even a pull request).


ogrisel commented Jul 26, 2018

If you do a quick sklearn PR it could be part of the 0.20 release.


TomAugspurger commented Jul 26, 2018 via email

TomAugspurger added a commit to TomAugspurger/scikit-learn that referenced this pull request Jul 26, 2018
This lets subclasses re-use more of sklearn.compose._column_transformer.

xref dask/dask-ml#315
@TomAugspurger

Scikit-Learn PR at scikit-learn/scikit-learn#11689

This passes locally for me, but won't pass here until dask/dask#3212 and scikit-learn/scikit-learn#11689 are merged.

qinhanmin2014 pushed a commit to scikit-learn/scikit-learn that referenced this pull request Jul 27, 2018
This lets subclasses re-use more of sklearn.compose._column_transformer.
xref dask/dask-ml#315
commit 764872c
Author: Tom Augspurger <[email protected]>
Date:   Mon Jul 30 14:59:12 2018 -0500

    Handle ndarrays gracefully

commit 3f9ba71
Author: Tom Augspurger <[email protected]>
Date:   Mon Jul 30 14:37:00 2018 -0500

    Removed ndarray special casing

commit ce632b7
Author: Tom Augspurger <[email protected]>
Date:   Mon Jul 30 14:20:09 2018 -0500

    fix shape

commit e570321
Author: Tom Augspurger <[email protected]>
Date:   Mon Jul 30 14:12:10 2018 -0500

    fix shape
@TomAugspurger

The upstream PRs are in. Merging later today.

@TomAugspurger

Ignoring the coverage failure, since coverage isn't run against sklearn dev.

@TomAugspurger TomAugspurger merged commit 17f4dea into dask:master Sep 4, 2018
@TomAugspurger TomAugspurger deleted the column-transformer branch September 4, 2018 20:27