-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial support for one hot split. #5949
Conversation
@hcho3 Right now I'm reusing the split cond in |
@trivialfis We will want to add additional fields to the JSON schema to indicate categorical splits. For example, LightGBM stores It should not be too difficult to add new fields to the current JSON schema. (I'm assuming that vector with multiple categories will be JSON only and won't support legacy binary serialization.) |
You are right. But at the same time we can't break the binary format. For example, we can't add anything to
I think so. Just to make sure that we have considered enough different cases. |
Got it. Let's discuss about how to add the necessary fields without touching Another idea is to set the |
Codecov Report
@@ Coverage Diff @@
## master #5949 +/- ##
=======================================
Coverage 78.49% 78.49%
=======================================
Files 12 12
Lines 3013 3018 +5
=======================================
+ Hits 2365 2369 +4
- Misses 648 649 +1
Continue to review full report at Codecov.
|
81e2a0e
to
49bc2ab
Compare
@hcho3 I changed the categories in tree into bitfield. |
bbb81dd
to
d8ac122
Compare
I reverted change of min value to reduce the size of this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general approach looks good. I am looking forward to reviewing specifics once this PR gets broken up into smaller PRs.
|
||
auto is_cat = candidate.split.is_cat; | ||
if (is_cat) { | ||
auto cat = common::AsCat(candidate.split.fvalue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to myself: in one-hot encoded setting, there is only one matching category in every categorical split. However, the split_categories_
structure can later store multiple matching categories per split.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reminder to myself: Treelite must support JSON format of XGBoost.
a4795d1
to
d86f917
Compare
d86f917
to
cb77e3a
Compare
cb77e3a
to
18f734a
Compare
All merged. |
This PR aims to have a working pipeline for categorical data, but the support is very limited at current form. To run tests, one needs to:
gbtree
.gpu_hist
tree method. Other tree methods are coming.DMatrix
,DeviceQuantileDMatrix
is not yet supported.enable_categorical
forDMatrix
.gpu_predictor
explicitly.enable_experimental_json_serialization
even if you don't use pickle.Limitations