
[dask] Support Dask dataframes with 'category' columns (fixes #3861) #3908

Merged · 13 commits · Feb 6, 2021

Conversation

@jameslamb (Collaborator)

This PR fixes #3861.

Root Cause

Thanks to @jmoralez for figuring out that the root issue was pandas "category" columns in data frames! lightgbm.dask._predict_part() did some conversions that did not handle these columns correctly.

Changes in this PR

  1. Removes code in _predict_part() that converts a chunk of data to an array before passing it through to .predict() on the underlying scikit-learn model.

    • LightGBM's scikit-learn estimators already support scoring directly on pandas data frames, so this conversion was unnecessary. And it was the root of the problem! When you call .values on a data frame that contains a pandas "category" column, those values come through as strings, and passing a numpy array with string data into LightGBM's .predict() methods leads to errors.
  2. Adds data frames with a "category" column to many of the Dask unit tests.
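The failure mode described in point 1 can be illustrated with plain pandas (a minimal sketch; the column names here are made up, not from the PR):

```python
import pandas as pd

# a frame mixing a float column with a pandas "category" column,
# similar in shape to the chunks _predict_part() receives
df = pd.DataFrame({
    "num": [1.0, 2.0, 3.0],
    "cat": pd.Series(["a", "b", "a"], dtype="category"),
})

# mixed dtypes collapse into a single object-dtype numpy array,
# and the category values survive as raw Python strings
arr = df.values
print(arr.dtype)        # object
print(type(arr[0, 1]))  # <class 'str'>
```

An object array containing strings is exactly what the old array conversion handed to .predict(), which is why it errored.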

Reproducible example

I've added a reproducible example in #3861 (comment). If you run the code on current master, it will fail with the error described in #3861. If you run that code on this branch, it will succeed 😀 .

References

I found a few issues and PRs related to "category" columns in pandas being supported in the scikit-learn interface. Putting them here for anyone who finds this PR from search in the future.

@StrikerRUS (Collaborator) left a comment

@jameslamb Thanks a lot! I love how a one-line simplification fixed a critical issue!
However, I'm afraid the tests are not very reliable. I remember the same problem with the pure pandas tests, where I had to fix random_state to force LightGBM to use the category column for splits:

np.random.seed(42)  # sometimes there is no difference in how cols are treated (cat or not cat)

@jameslamb (Collaborator, Author)

Sorry, I don't understand your comment. How does the random state impact how pandas handles category columns?

Is it that in that test, you were using such a small data size that sometimes the column was mostly a single categorical level, and so it got dropped by pre-filtering? (https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#enable-feature-pre-filtering-when-creating-dataset)

@StrikerRUS (Collaborator)

Sorry, I don't understand your comment. How does the random state impact how pandas handles category columns?

I mean, with some seeds LightGBM models produce identical output whether one particular column is treated as categorical or not. I think it would be good to ensure that the categorical column in the Dask test has some impact on the LightGBM model.

@jameslamb (Collaborator, Author)


I mean, with some seeds LightGBM models produce identical output whether one particular column is treated as categorical or not. I think it would be good to ensure that the categorical column in the Dask test has some impact on the LightGBM model.

Ok. I could look at the output of .trees_to_dataframe() or at the feature importances to check that the column was chosen for splits. Is that sufficient?

@StrikerRUS (Collaborator)

I think it will be enough to check that categorical_feature='auto' and categorical_feature=[] result in different outputs of predict(raw_score=True).

@jameslamb (Collaborator, Author)

I think it will be enough to check that categorical_feature='auto' and categorical_feature=[] result in different outputs of predict(raw_score=True).

OHHHH now I understand what you mean. You're saying it's random whether or not LightGBM even decides to treat it as categorical vs. continuous data? That's surprising to me 😬

@StrikerRUS (Collaborator)

You're saying it's random whether or not LightGBM decides to even treat it as categorical vs. continuous data?

I don't think so. I believe some "unlucky" distributions in the cat column were treated as a constant column, so LightGBM dropped it. Or maybe something else caused this behaviour; I haven't investigated these cases. I just remember that my asserts in tests for different outputs with/without the cat column were failing until I fixed the random seed.

@jameslamb (Collaborator, Author)

I don't think so. I believe some "unlucky" distributions in the cat column were treated as a constant column, so LightGBM dropped it.

This is exactly what I meant in #3908 (comment). And if that's the case, then checking that the column was used for splits (#3908 (comment)) would be enough.

@StrikerRUS (Collaborator)

Yeah, if you think it will be easier, sure, please try with trees_to_dataframe().

@jameslamb (Collaborator, Author)

Thanks! I think they're both equally easy, but the trees_to_dataframe() one is a more direct way to test for the thing we're worried about.

The results of predict(raw_score=True) could be different for reasons other than "this categorical column was not used by LightGBM".

I'll add the test with trees_to_dataframe() shortly. I'll also update the test models to use min_data_in_leaf=1, which should reduce the risk of getting into the situation where a feature is filtered out because it's unsplittable.

@jameslamb (Collaborator, Author)

Ok I think I got this working! I ended up not needing the min_data_in_leaf trick and went with something more direct.

I was really really struggling with getting LightGBM to use a single categorical feature every test run because of the small data size, so I changed the approach to "just add multiple categorical columns and make sure that at least one of them was used". This seemed to work! I re-ran the tests several times and I think this approach is reliable.

For classification and ranking: add 5 "category" columns, each of which is a random draw from ["a", "b"] (based on the np.random.RandomState object we already set up in _create_data()).

For regression: make sure that the training data from _create_data() contains ONLY "category" columns.

For regression tasks with just 100 observations, the random continuous features always overwhelmed the categorical columns and were chosen for splits. I saw that even when setting n_features=1 in make_regression().
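The classification / ranking setup above can be sketched with pandas alone (a rough approximation; the real _create_data() in the test suite differs in its details):

```python
import numpy as np
import pandas as pd

rnd = np.random.RandomState(42)
n = 100

# five "category" columns, each a random draw from ["a", "b"],
# mimicking what the classification / ranking tests add
dX = pd.DataFrame({
    f"cat_{i}": pd.Series(rnd.choice(["a", "b"], size=n), dtype="category")
    for i in range(5)
})
print(dX.shape)  # (100, 5)
print(all(str(dt) == "category" for dt in dX.dtypes))  # True
```

With several such columns, the test only needs at least one of them to be picked for a split, which is much more reliable on small data than demanding a specific single column be used.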

@StrikerRUS (Collaborator) left a comment

Thanks for enhancing the test! Please check my new comment below.

cat_cols = [
    col for col in dX.columns
    if dX.dtypes[col].name == 'category'
]
tree_df = dask_classifier.booster_.trees_to_dataframe()
assert tree_df['split_feature'].isin(cat_cols).sum() > 0
@StrikerRUS (Collaborator) commented Feb 6, 2021

I'm not sure this assert is enough to say that at least one of cat_cols was categorical. I believe you should check the split's comparison operator as well, to ensure these features were treated as categorical rather than numerical.

@jameslamb (Collaborator, Author)

My goal was not to check that the columns are treated as categorical features by LightGBM. My goal was to ensure that having pandas "category" columns in your training data does not break LightGBM's Dask estimators (as it currently does on master, documented in #3861).

As far as I can tell, "category" columns will only be automatically treated as categorical features if you set categorical_feature='auto':

    if categorical_feature == 'auto':  # use cat cols from DataFrame
        categorical_feature = cat_cols_not_ordered

which is not the default, and of course you can force it to happen by setting categorical_feature to the column names / indices in the parameters.

The tests like assert_eq(p1_local, p2) already test that the Dask interface is producing a model that is similar to the one produced by the scikit-learn interface.

I can add categorical_feature = [col for col in dX.columns if col.startswith('cat_')] and test that; that's fine. I will try to do it tonight. But it wasn't the intention of this PR or these tests.
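That selection is a plain list comprehension over the column names; for example, with a toy frame (column names here are illustrative, matching the 'cat_' prefix convention mentioned above):

```python
import pandas as pd

dX = pd.DataFrame({
    "cat_1": pd.Series(["a", "b"], dtype="category"),
    "cat_2": pd.Series(["b", "a"], dtype="category"),
    "num_1": [0.1, 0.2],
})

# keep only columns whose names start with the 'cat_' prefix
categorical_feature = [col for col in dX.columns if col.startswith("cat_")]
print(categorical_feature)  # ['cat_1', 'cat_2']
```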

@jameslamb (Collaborator, Author)

I've added these new checks in 07c9dca.

@StrikerRUS (Collaborator)

Sorry, I didn't get the purpose of this PR!

@StrikerRUS (Collaborator) left a comment

Thanks a lot for the fix and enhancements of tests!

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Development

Successfully merging this pull request may close these issues.

[dask] DaskRegressor.predict() fails on DataFrame / Series input
2 participants