[breaking] Save booster feature info in JSON, remove feature name generation. #6605
Conversation
I'm skeptical about merging FeatureMap::Type into FeatureType. They were designed for different purposes: the first for dumping the model, the second for tweaking the algorithm (categorical vs. numerical). Blurring the distinction will subtly break many things. See my comment about serialization in particular.
The linked issue only suggests saving the feature names in the Booster object, not this kind of refactoring.
I agree with your concern, but I also want some suggestions on the design. When I was creating the refactoring, a few things were considered:
Does it change anything?
Yes, the deserializer assigns …
Given the experimental status of categorical data support, we can be more lenient about breaking backward compatibility. My suggestion is to explicitly serialize …
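The explicit-serialization suggestion above can be sketched as follows. This is a hypothetical minimal JSON layout for illustration only, not XGBoost's actual model schema; the field names are assumptions:

```python
import json

# Hypothetical minimal model blob: feature names and types are stored
# explicitly per feature, so nothing has to be inferred at load time.
model = {
    "learner": {
        "feature_names": ["age", "income", "city"],
        "feature_types": ["float", "float", "categorical"],
        "trees": [],  # tree payload omitted in this sketch
    }
}

blob = json.dumps(model)
restored = json.loads(blob)
assert restored["learner"]["feature_types"] == ["float", "float", "categorical"]
```

Storing the types explicitly means a loader never has to guess whether a split was categorical, which is the ambiguity the comments below discuss.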
Em, thanks for catching that. There is an inconsistency here. XGBoost supports splitting only on …
@trivialfis There are two concepts of feature types in use:
(see xgboost/src/tree/tree_model.cc, Line 1002 in f0fd762)
@trivialfis Sorry I missed that. I took another look at the serializer function, and the issue I pointed out only happens when you load a model from an older version of XGBoost, which lacks the split type information. Right now, XGBoost handles legacy models by assigning …
That's a good suggestion. One way to do it is to put …
@trivialfis Or you can explicitly fill the …
@hcho3 Let's say we are loading an old model with both numerical and categorical data, with #6605 (comment). Backward compatibility: …
When using a new XGBoost to load the new model, forward compatibility: …
Ah, so there is an argument for assigning 0 to …
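The backward-compatibility rule debated above can be sketched like this: when a legacy model carries no split-type information, every split defaults to numerical (0), so old numerical-only models keep working. The function and field names here are hypothetical, not XGBoost's actual deserializer:

```python
# Sketch of the defaulting rule for legacy models (names are made up).
NUMERICAL, CATEGORICAL = 0, 1

def load_split_types(model: dict, num_splits: int) -> list:
    # Old models (pre feature-type serialization) lack the field entirely,
    # so fall back to "everything is a numerical split".
    return model.get("split_type", [NUMERICAL] * num_splits)

legacy = {}                                   # old model: field absent
new = {"split_type": [NUMERICAL, CATEGORICAL, NUMERICAL]}
assert load_split_types(legacy, 3) == [NUMERICAL] * 3
assert load_split_types(new, 3) == [NUMERICAL, CATEGORICAL, NUMERICAL]
```

The default is safe for genuinely numerical legacy models, but silently wrong for a legacy model that actually contained categorical splits, which is why explicit serialization is preferred going forward.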
I have reverted most of the changes.
In the future, we can move the feature validation into C++.
Codecov Report
@@            Coverage Diff             @@
##           master    #6605      +/-   ##
==========================================
+ Coverage   81.55%   81.69%   +0.13%
==========================================
  Files          13       13
  Lines        3719     3791      +72
==========================================
+ Hits         3033     3097      +64
- Misses        686      694       +8
Continue to review full report at Codecov.
This is looking pretty good to me. I've gone through all of the code, and it looks to be in fine shape modulo some small tweaks. I may not have the best context for understanding the larger implications here, so I've also asked a couple questions that will help me better assess the impact of this.
The one thing I'm trying to make sure I understand is whether it's strictly necessary for this to be a breaking change. Must we drop feature name generation if we want to support JSON serialization of booster feature info? The answer is probably yes, but I wasn't quite able to convince myself of that without additional context.
The only other general thing I'll mention is a comment that I've made before: it's helpful to avoid unrelated formatting changes as part of a PR, since it makes the diffs easier to review.
Sorry, they don't have to be tied together. This PR addresses the issue of unreliable feature name validation. During prediction, the Python package validates the test dataset to ensure it matches the training dataset. The feature names from the training dataset are copied into the booster, so later predictions can compare the feature names from the test dataset against those in the booster (which come from the training dataset). There are 2 issues in there:
This PR addresses both issues to make feature name validation useful. So yes, it can be split into 2 PRs, but that seems unnecessary? If you think that's better for maintenance, I can follow up with split PRs.
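The validation flow described above can be sketched roughly as follows: names captured at training time are kept on the booster and compared against the test data at prediction time. This is an illustrative standalone function, not the actual XGBoost implementation; the names are assumptions:

```python
# Rough sketch of Python-side feature name validation (hypothetical names).
def validate_features(booster_names, data_names):
    """Raise if the prediction data's feature names differ from training's."""
    if booster_names is None or data_names is None:
        return  # nothing recorded -> validation is silently skipped
    if list(booster_names) != list(data_names):
        raise ValueError(
            f"feature_names mismatch: {booster_names} vs {data_names}"
        )

validate_features(["f0", "f1"], ["f0", "f1"])  # matching names: no error
try:
    validate_features(["f0", "f1"], ["f1", "f0"])
except ValueError:
    pass  # reordered names are caught as a mismatch
```

Note the first branch: if names were never stored on the booster (e.g. they did not survive serialization), the check silently does nothing, which is exactly the unreliability this PR is fixing by persisting the names in the JSON model.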
Let me look harder. Took 2 formatting changes out. Sorry for the noise in the change.
No, no, I think that's totally fine! I just wanted to make sure there wasn't some inherent cross-dependency that I was missing. I think a single PR is no problem here.
Really don't sweat it! I was just mentioning that in case your dev environment still had some autoformatting going on that you weren't aware of. No need to pull those changes out now that they're already in and I've already reviewed.
The linting issue is addressed in #6726.
With the latest changes, this looks good to me! The implementation is sensible, I've got a better understanding of the context now, and I'll be glad to see this feature added.
* Pass feature names and types into libxgboost.
* Save them in JSON model.
The feature name generation is still preserved, as many other places like feature_importance_ and plotting depend on it.
One concern of this PR is that it might introduce overhead on wide datasets; some sparse datasets can scale to millions of features.
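The overhead concern can be made concrete: eagerly auto-generating "f0", "f1", … names materializes one string per column up front, even if the caller never reads them, whereas a lazy scheme pays only on demand. A small sketch (the helper names are hypothetical):

```python
# Eager vs. lazy feature-name generation for wide datasets.
def generated_names(n):
    # Eager: builds n strings immediately -- O(n) memory even if unused.
    return [f"f{i}" for i in range(n)]

def name_at(i):
    # Lazy alternative: construct a single name only when it is needed.
    return f"f{i}"

assert generated_names(3) == ["f0", "f1", "f2"]
assert name_at(1_000_000) == "f1000000"
```

For a dataset with millions of sparse features, the eager list is millions of short Python strings, which is the cost being weighed here.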
Close #6520