[python-package] adding max_category_values parameter to create_tree_digraph method (fixes #5687) #5818
Conversation
Thanks for your contribution! I've left some minor comments.
I can't see why the R-package checks are failing.
Please merge the latest master.
LGTM. Thanks a lot for your contribution!
@jameslamb do you want to review this as well?
Thanks very much for this!
Can you please just add unit tests covering this functionality? That would give us more confidence that this is working and prevent it from being broken in the future.
Please add two tests, both with datasets that have at least one categorical feature that is informative and used in splits:

- one where the `len(category_values) > max_category_values` condition you've added is True
- one where that condition is False

You could, for example, copy this test: `def test_create_tree_digraph(breast_cancer_split):`
After looking at the sklearn datasets, I did not find any classification dataset that contains categorical features. I am thinking of creating multiple base datasets with `sklearn.datasets.make_classification`, each with a different distribution, then merging them into one dataset and using the label of each base dataset as a categorical feature. What do you think?
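The idea above could be sketched like this (sizes and the label-flipping trick are illustrative assumptions; the point is that the dataset index becomes a categorical feature the target actually depends on):

```python
import numpy as np
from sklearn.datasets import make_classification

parts_X, parts_y, cat_col = [], [], []
for i in range(4):
    # Each base dataset gets a different random_state, hence a
    # different distribution.
    X_i, y_i = make_classification(n_samples=100, n_features=5, random_state=i)
    # Couple the target to the dataset index (flip labels for odd
    # datasets) so the categorical feature is informative for splits.
    if i % 2 == 1:
        y_i = 1 - y_i
    parts_X.append(X_i)
    parts_y.append(y_i)
    cat_col.append(np.full(y_i.shape[0], i))

# Column 0 is the categorical feature: the index of the base dataset.
X = np.column_stack([np.concatenate(cat_col), np.vstack(parts_X)])
y = np.concatenate(parts_y)
```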
I think that could work; you just need to make sure that the target depends on those features somehow (so that they're chosen for some splits), maybe something like what we do here: LightGBM/tests/python_package_test/test_dask.py, lines 185 to 205 at commit 5989405.
I have created a quantized version of the breast cancer dataset and used the binned features as categorical features. I made only one test case, where the condition is False (categorical values should not be compressed). The problem is that the way LightGBM splits on categorical features is not deterministic: for example, a feature could have 30 different categorical values, but a tree at a random index may split on only one categorical value. What do you suggest? I also need your review.
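The quantization described above could look something like this (a sketch; the bin count and quantile-based binning are illustrative choices, not necessarily what the PR's test uses):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Bin every continuous feature into quartiles so each column becomes a
# small-cardinality integer code usable as a categorical feature.
n_bins = 4
X_cat = np.empty_like(X, dtype=np.int64)
for j in range(X.shape[1]):
    # Interior quantile edges (0.25, 0.5, 0.75); digitize maps each
    # value to a bin index in [0, n_bins - 1].
    edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
    X_cat[:, j] = np.digitize(X[:, j], edges)
```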
Nice idea binning the `breast_cancer` dataset to create an informative dataset of categorical features! I think that's a great approach for this PR's tests.
Please see my suggestions on how to proceed, and please add at least one test where the number of categories in a feature is greater than the value of `max_category_values` as well.
@jameslamb are there any updates?
Looks good to me, thanks for the contribution!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
fixes #5687