[dask] Support Dask dataframes with 'category' columns (fixes #3861) #3908

jameslamb · 2021-02-04T04:07:21Z

This PR fixes #3861.

Root Cause

Thanks to @jmoralez for figuring out that the root issue was pandas "category" columns for data frames! lightgbm.dask._predict_part() did some conversions thatt did not handle these columns correctly.

Changes in this PR

Removes code in _predict_part() that converts a chunk of data to array before passing it through to .predict() on the underlying scikit-learn model.
- LightGBM's scikit-learn estimators already support scoring directly on pandas data frames, so this was useless. And it was the root of the problem! Because when you use .values on a pandas "category" column, the results are strings. Passing a numpy array with string data into .predict() methods in LightGBM leads to errors.
Adds data frames with a "category" column to many of the Dask unit tests

Reproducible example

I've added a reproducible example in #3861 (comment). If you run the code on current master, it will fail with the error described in #3861. If you run that code on this branch, it will succeed 😀 .

References

I found a few issues and PRs related to "category" columns in pandas being supported in the scikit-learn interface. Putting them here for anyone who finds this PR from search in the future.

[python] refined pandas categorical field loading #1979
[python] save pandas_categorical to model string and JSON #1766
Only copy pandas categorical attribute if it exists #1764

LightGBM/python-package/lightgbm/basic.py

Lines 497 to 500 in 8ef874b

    
           cat_cols = list(data.select_dtypes(include=['category']).columns) 
        
           cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered] 
        
           if pandas_categorical is None:  # train dataset 
        
               pandas_categorical = [list(data[col].cat.categories) for col in cat_cols]

StrikerRUS

@jameslamb Thanks a lot! I loved how one-line simplification fixed critical issue!
However, I'm afraid that tests are not very reliable. I remember, the same problem was with pure Pandas tests where I have to fix random_state to force LightGBM to use category column for splits.

LightGBM/tests/python_package_test/test_engine.py

Line 884 in fc2da2e

    
           np.random.seed(42)  # sometimes there is no difference how cols are treated (cat or not cat)

tests/python_package_test/test_dask.py

Co-authored-by: Nikita Titov <[email protected]>

jameslamb · 2021-02-04T15:37:24Z

Sorry, I don't understand your comment. How does the random state impact how pandas handles category columns?

Is it that in that test, you were using such a small data size that sometimes the column was mostly a single categorical level, and so it got dropped by pre-filtering? (https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#enable-feature-pre-filtering-when-creating-dataset)

tests/python_package_test/test_dask.py

StrikerRUS · 2021-02-04T16:10:23Z

Sorry, I don't understand your comment. How does the random state impact how pandas handles category columns?

I mean, with some seeds LightGBM models have identical output for cases when one particular column is treated as categorical and when it is not. I think it will be good to ensure that categorical column in Dask test has some impact on LightGBM model.

jameslamb · 2021-02-04T16:17:37Z

Sorry, I don't understand your comment. How does the random state impact how pandas handles category columns?

I mean, with some seeds LightGBM models have identical output for cases when one particular column is treated as categorical and when it is not. I think it will be good to ensure that categorical column in Dask test has some impact on LightGBM model.

Ok. I could check the output of .trees_to_dataframe() or the feature importances to check that the column was chosen for splits, is that sufficient?

StrikerRUS · 2021-02-04T16:25:28Z

I think it will be enough to check that categorical_feature='auto' and categorical_feature=[] result in different outputs of predict(raw_score=True).

jameslamb · 2021-02-04T16:27:18Z

I think it will be enough to check that categorical_feature='auto' and categorical_feature=[] result in different outputs of predict(raw_score=True).

OHHHH now I understand what you mean. You're saying it's random whether or not LightGBM decides to even treat it as categorical vs. continuous data? That's suprising to me 😬

StrikerRUS · 2021-02-04T17:15:27Z

You're saying it's random whether or not LightGBM decides to even treat it as categorical vs. continuous data?

I don't think so. I believe some "unlucky" distributions in cat column were treated as constant column, hence LightGBM drops it. Or maybe some another thing caused such behaviour. I haven't investigated these cases. I just remember my asserts in tests for different outputs w/wo cat column were failing without fixing random seed.

jameslamb · 2021-02-04T17:31:43Z

I don't think so. I believe some "unlucky" distributions in cat column were treated as constant column, hence LightGBM drops it.

This is exactly what I meant in #3908 (comment). And if that's the case, then checking that the column was used for splits (#3908 (comment)) would be enough.

StrikerRUS · 2021-02-04T18:11:21Z

Yeah, if you think it will be easier, sure, please try with trees_to_dataframe().

jameslamb · 2021-02-04T18:17:19Z

Thanks! I think they're both equally easy, but the trees_to_dataframe() one is a more direct way to test for the thing we're worried about.

The results of predict(raw_score=True) could be different for reasons other than "this categorical column was not used by LightGBM".

I'll add the test with trees_to_dataframe() shortly. I'll also update the test models to use min_data_in_leaf=1, which should reduce the risk of getting into the situation where a feature is filtered out because it's unsplittable.

jameslamb · 2021-02-05T06:11:05Z

Ok I think I got this working! I ended up not needing the min_data_in_leaf trick and went with something more direct.

I was really really struggling with getting LightGBM to use a single categorical feature every test run because of the small data size, so I changed the approach to "just add multiple categorical columns and make sure that at least one of them was used". This seemed to work! I re-ran the tests several times and I think this approach is reliable.

For classification and ranking --> add 5 category columns, each of which is a random draw from ["a", "b"] (based on the np.RandomState object we already set up in _create_data())

For regression --> make sure that the training data from _create_data() is ONLY category columns

For regression tasks with just 100 observations, the random continuous features always overwhelm the categorical columns and get chosen for splits. I saw that even when setting n_features=1 in make_regression().

StrikerRUS

Thanks for enhancing the test! Please check my new comment below

StrikerRUS · 2021-02-06T00:34:41Z

tests/python_package_test/test_dask.py

+            if dX.dtypes[col].name == 'category'
+        ]
+        tree_df = dask_classifier.booster_.trees_to_dataframe()
+        assert tree_df['split_feature'].isin(cat_cols).sum() > 0


I'm not sure this assert is enough to say that at least one of cat_cols was categorical. I believe you should check for comparison sign as well to ensure these features were treated as categorical but not as numerical.

My goal was not to check that the columns are treated as categorical features by LightGBM. My goal was to ensure that having pandas "category" columns in your training data does not break LightGBM's Dask estimators (as it currently does on master, documented in #3861).

As far as I can tell, "category" columns will only be automatically treated as categorical features if you set categorical_feature = 'auto':

LightGBM/python-package/lightgbm/basic.py

Lines 513 to 514 in 8ef874b

if categorical_feature == 'auto': # use cat cols from DataFrame

categorical_feature = cat_cols_not_ordered

, which is not the default, and of course you can force it to happen by setting categorical_featue to the column names / indices in the parameters.

The tests like assert_eq(p1_local, p2) already test that the Dask interface is producing a model that is similar to the one produced by the scikit-learn interface.

I can add categorical_feature = [col for col in dX.columns if col.startswith('cat_')] and test, that's fine. I will try to do that tonight. But it wasn't the intention of this PR or these tests.

I've added these new checks in 07c9dca

Sorry for I didn't get the purpose of this PR!

StrikerRUS

Thanks a lot for the fix and enhancements of tests!

github-actions · 2023-08-24T01:26:55Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jameslamb added 5 commits February 3, 2021 19:01

add support for pandas categorical columns

a399763

remove commented code

61c1bed

quotes

64d60ba

syntax error

55f4c3c

Merge branch 'master' into fix/pandas-categoricals

1f8610f

jameslamb added the maintenance label Feb 4, 2021

jameslamb requested a review from StrikerRUS February 4, 2021 04:07

jameslamb mentioned this pull request Feb 4, 2021

[dask] DaskRegressor.predict() fails on DataFrame / Series input #3861

Closed

fix shape for ranker test

d513b61

StrikerRUS reviewed Feb 4, 2021

View reviewed changes

Apply suggestions from code review

ba68288

Co-authored-by: Nikita Titov <[email protected]>

jameslamb commented Feb 4, 2021

View reviewed changes

tests/python_package_test/test_dask.py Outdated Show resolved Hide resolved

Update tests/python_package_test/test_dask.py

a77fbd7

jameslamb mentioned this pull request Feb 4, 2021

[dask] Cannot use a mix of Array and DataFrame types in training #3909

Closed

jameslamb added 4 commits February 4, 2021 17:50

trying

4b57d10

fix tests

cf84c13

remove unnecessary debugging stuff

6c4ac72

skip accuracy checks on categorical

2cf2f4c

StrikerRUS requested changes Feb 6, 2021

View reviewed changes

use category columns as categorical features

07c9dca

jameslamb requested a review from StrikerRUS February 6, 2021 03:39

StrikerRUS approved these changes Feb 6, 2021

View reviewed changes

StrikerRUS merged commit fc6b71e into microsoft:master Feb 6, 2021

jameslamb deleted the fix/pandas-categoricals branch February 6, 2021 22:20

StrikerRUS mentioned this pull request Feb 9, 2021

[dask] [docs] Fix inaccuracies in API docs for Dask module (fixes #3871) #3930

Merged

jameslamb mentioned this pull request Feb 9, 2021

[dask] test that Dask automatically treats 'category' columns as categorical features #3932

Merged

github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dask] Support Dask dataframes with 'category' columns (fixes #3861) #3908

[dask] Support Dask dataframes with 'category' columns (fixes #3861) #3908

jameslamb commented Feb 4, 2021

StrikerRUS left a comment

jameslamb commented Feb 4, 2021

StrikerRUS commented Feb 4, 2021

jameslamb commented Feb 4, 2021

StrikerRUS commented Feb 4, 2021

jameslamb commented Feb 4, 2021

StrikerRUS commented Feb 4, 2021

jameslamb commented Feb 4, 2021

StrikerRUS commented Feb 4, 2021

jameslamb commented Feb 4, 2021

jameslamb commented Feb 5, 2021

StrikerRUS left a comment

StrikerRUS Feb 6, 2021 •

edited

Loading

jameslamb Feb 6, 2021

jameslamb Feb 6, 2021

StrikerRUS Feb 6, 2021

StrikerRUS left a comment

github-actions bot commented Aug 24, 2023

	cat_cols = list(data.select_dtypes(include=['category']).columns)
	cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered]
	if pandas_categorical is None: # train dataset
	pandas_categorical = [list(data[col].cat.categories) for col in cat_cols]

	if categorical_feature == 'auto': # use cat cols from DataFrame
	categorical_feature = cat_cols_not_ordered

[dask] Support Dask dataframes with 'category' columns (fixes #3861) #3908

[dask] Support Dask dataframes with 'category' columns (fixes #3861) #3908

Conversation

jameslamb commented Feb 4, 2021

Root Cause

Changes in this PR

Reproducible example

References

StrikerRUS left a comment

Choose a reason for hiding this comment

jameslamb commented Feb 4, 2021

StrikerRUS commented Feb 4, 2021

jameslamb commented Feb 4, 2021

StrikerRUS commented Feb 4, 2021

jameslamb commented Feb 4, 2021

StrikerRUS commented Feb 4, 2021

jameslamb commented Feb 4, 2021

StrikerRUS commented Feb 4, 2021

jameslamb commented Feb 4, 2021

jameslamb commented Feb 5, 2021

StrikerRUS left a comment

Choose a reason for hiding this comment

StrikerRUS Feb 6, 2021 • edited Loading

Choose a reason for hiding this comment

jameslamb Feb 6, 2021

Choose a reason for hiding this comment

jameslamb Feb 6, 2021

Choose a reason for hiding this comment

StrikerRUS Feb 6, 2021

Choose a reason for hiding this comment

StrikerRUS left a comment

Choose a reason for hiding this comment

github-actions bot commented Aug 24, 2023

StrikerRUS Feb 6, 2021 •

edited

Loading