[python-package] adding max_category_values parameter to create_tree_digraph method (fixes #5687) #5818
Conversation
Thanks for your contribution! I've left some minor comments.
I can't see why the R-package checks are failing.
Please merge the latest master.
LGTM. Thanks a lot for your contribution!
@jameslamb do you want to review this as well?
Thanks very much for this!
Can you please just add unit tests covering this functionality? That would give us more confidence that this is working and prevent it from being broken in the future.
Please add two tests, both with datasets that have at least one categorical feature that is informative and used in splits:

- one where the `len(category_values) > max_category_values` condition you've added is True
- one where that condition is False

You could, for example, copy this test: `def test_create_tree_digraph(breast_cancer_split):`
After looking at the sklearn datasets, I did not find any classification dataset that contains categorical features. I am thinking of creating multiple base datasets with `sklearn.datasets.make_classification`, each with a different distribution, then merging them into one dataset and using the label of each base dataset as a categorical feature. What do you think?
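The idea above could be sketched like this (sizes and the label-flipping trick are illustrative assumptions; the point is that the dataset index becomes a categorical feature the target actually depends on):

```python
import numpy as np
from sklearn.datasets import make_classification

parts_X, parts_y, cat_col = [], [], []
for i in range(4):
    # Each base dataset gets a different random_state, hence a
    # different distribution.
    X_i, y_i = make_classification(n_samples=100, n_features=5, random_state=i)
    # Couple the target to the dataset index (flip labels for odd
    # datasets) so the categorical feature is informative for splits.
    if i % 2 == 1:
        y_i = 1 - y_i
    parts_X.append(X_i)
    parts_y.append(y_i)
    cat_col.append(np.full(y_i.shape[0], i))

# Column 0 is the categorical feature: the index of the base dataset.
X = np.column_stack([np.concatenate(cat_col), np.vstack(parts_X)])
y = np.concatenate(parts_y)
```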
I think that could work; you just need to make sure that the target depends on those features somehow (so that they're chosen for some splits), maybe something like what we do here: LightGBM/tests/python_package_test/test_dask.py, lines 185 to 205 at commit 5989405.
I have created a quantized version of the breast cancer dataset and used the binned features as categorical features. I made only one test case, where the condition is False (categorical values should not be compressed). The problem is that the way LightGBM splits on categorical features is not deterministic: for example, a feature could have 30 different categorical values, but a tree at a random index may split on only one categorical value. What do you suggest? I also need your review.
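The quantization described above could look something like this (a sketch; the bin count and quantile-based binning are illustrative choices, not necessarily what the PR's test uses):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Bin every continuous feature into quartiles so each column becomes a
# small-cardinality integer code usable as a categorical feature.
n_bins = 4
X_cat = np.empty_like(X, dtype=np.int64)
for j in range(X.shape[1]):
    # Interior quantile edges (0.25, 0.5, 0.75); digitize maps each
    # value to a bin index in [0, n_bins - 1].
    edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
    X_cat[:, j] = np.digitize(X[:, j], edges)
```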
Nice idea binning the `breast_cancer` dataset to create an informative dataset of categorical features! I think that's a great approach for this PR's tests.
Please see my suggestions on how to proceed, and please add at least one test where the number of categories in a feature is greater than the value of `max_category_values` as well.
@jameslamb are there any updates?
Looks good to me, thanks for the contribution!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
fixes #5687