
[breaking] Save booster feature info in JSON, remove feature name generation. #6605

Merged: 4 commits, Feb 25, 2021

Conversation

@trivialfis (Member) commented Jan 14, 2021

  • [breaking] Remove feature name generation.
  • Pass feature names and types into libxgboost.
  • Save them in JSON model.

The feature name generation is still preserved, as many other places, like feature_importance_ and plotting, depend on it.

One concern with this PR is that it might introduce overhead on wide datasets; some sparse datasets can scale to millions of features.

Closes #6520

@hcho3 (Collaborator) left a comment

I'm skeptical about merging FeatureMap::Type into FeatureType. They were designed for different purposes: the first for dumping the model, the second for tweaking the algorithm (categorical vs. numerical). Blurring the distinction will subtly break many things. See my comment about serialization in particular.

The linked issue only suggests saving the feature names in the Booster object, not this kind of refactoring.

@trivialfis (Member, Author) commented Jan 21, 2021

> The linked issue only suggests saving the feature names in the Booster object, not this kind of refactoring.

> Blurring the distinction will subtly break many things. See my comment about serialization in particular.

I agree with your concern, but I also want some suggestions on the design.

When I was working on the refactoring, a few things were considered:

  • Does the categorical enum need to be represented in tree dumps? If so, it might have to be part of FeatureMap::Type, and I think that's the case.
  • How do we represent feature types: as full string names or as a mere integer indicator (enum)? I chose the latter for efficiency. But we didn't have a formally defined enum for feature types, so I merged them.
  • Am I making it harder to stay backward compatible? Maybe. There is definitely overhead.

> We don't want to initialize split_types_ to kIndicator.

Does it change anything?

@hcho3 (Collaborator) commented Jan 21, 2021

> Does it change anything?

Yes, the deserializer assigns the kIndicator value to all non-categorical features, which is incorrect. The current deserializer function doesn't have enough information to tell apart different kinds of numerical features; it only knows whether a feature is categorical or not. So if you want to refactor the feature type enum, you will need to update the serializer/deserializer as well.
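
A minimal sketch of this failure mode. The enum and the zero-filled loading path are illustrative assumptions rather than the actual XGBoost code:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical merged enum with kIndicator occupying value 0.
enum class FeatureType : std::uint8_t {
  kIndicator = 0, kQuantitative, kInteger, kFloat, kCategorical
};

int main() {
  // A model saved by an older XGBoost has no per-feature type array, so a
  // loader that value-initializes the vector ends up with all zeros...
  std::vector<FeatureType> split_types(3);
  for (auto t : split_types) {
    // ...and every legacy numerical feature silently reads back as
    // kIndicator instead of a generic numerical type.
    std::cout << static_cast<int>(t) << '\n';  // prints 0 == kIndicator
  }
  return 0;
}
```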

@hcho3 (Collaborator) commented Jan 21, 2021

Given the experimental status of categorical data support, we can be more lenient about breaking backward compatibility. My suggestion is to explicitly serialize the split_types_ array so that the deserializer doesn't have to second-guess what type each feature is.

@trivialfis (Member, Author) commented Jan 21, 2021

> Yes, the deserializer assigns the kIndicator value to all non-categorical features, which is incorrect.

Hmm, thanks for catching that. There is an inconsistency here: XGBoost supports splitting only on numerical/categorical data, but we accept more than two feature types. What would be your suggestion?

@hcho3 (Collaborator) commented Jan 21, 2021

@trivialfis There are two concepts of feature types in use:

  1. Numerical vs. categorical. This affects the core tree fitting algorithm.
  2. Indicator / Quantitative / Integer / Float. This distinction is meaningless to the core tree fitting algorithm; it's only useful for dumping trees in a human-readable format (see the sketch below).
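
A simplified sketch of the two enums being conflated. The member names follow the discussion above; everything else (exact values, underlying types) is an illustrative assumption, not the actual XGBoost definitions:

```cpp
#include <cstdint>

// 1. Drives the core tree fitting algorithm: is a split on this feature
//    handled as numerical or categorical?
enum class FeatureType : std::uint8_t { kNumerical = 0, kCategorical = 1 };

// 2. Only consulted when dumping a trained model in human-readable form;
//    the tree fitting algorithm never looks at this distinction.
struct FeatureMap {
  enum Type { kIndicator = 0, kQuantitative = 1, kInteger = 2, kFloat = 3 };
};
```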

@trivialfis (Member, Author) commented

> My suggestion is to explicitly serialize the split_types_ array so that the deserializer doesn't have to second-guess what type each feature is.

The split_type is explicitly serialized:

split_type[i] = static_cast<I>(this->NodeSplitType(i));

@hcho3 (Collaborator) commented Jan 21, 2021

@trivialfis Sorry, I missed that. I took another look at the serializer function, and the issue I pointed out only happens when you load a model from an older version of XGBoost, which lacks the split type information. Right now, XGBoost handles legacy models by assigning kNumerical to all features. So we can salvage this PR by assigning kFloat to all features in legacy models. WDYT?

@trivialfis (Member, Author) commented

> So we can salvage this PR by assigning kFloat to all features in legacy models. WDYT?

That's a good suggestion. One way to do it is to put kFloat at 0 and kCategorical at 1, followed by the rest of the enum values. WDYT?
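
A hedged sketch of that ordering (member names and the trailing values are illustrative, not the final definition); the point is that the first two values coincide with the old numerical/categorical enum:

```cpp
#include <cstdint>

enum class FeatureType : std::uint8_t {
  kFloat = 0,        // legacy "numerical" (old value 0) lands here by default
  kCategorical = 1,  // same value as in the old enum, so nothing changes
  kIndicator = 2,    // dump-only kinds follow after the compatible prefix
  kQuantitative = 3,
  kInteger = 4
};
```

With this layout, a zero-filled legacy split-type array decodes as all-Float, and an old xgboost reading a new model still sees 0/1 where it expects numerical/categorical.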

@hcho3 (Collaborator) commented Jan 21, 2021

@trivialfis Or you can explicitly fill the split_type_ vector with a nonzero value.

@trivialfis (Member, Author) commented Jan 21, 2021

@hcho3 Let's say we are loading an old model with both numerical and categorical data, with #6605 (comment):

Backward compatibility:

When a new xgboost loads an old model:
  • If the split type is numerical (the first field in the old feature type enum), the loaded model becomes Float.
  • If the split type is categorical (the second field in the old feature type enum), the loaded model stays categorical; nothing changes.

When a new xgboost loads a new model (after this PR), it works out of the box: all split types are either Float or Categorical.

Forward compatibility:

When an old xgboost loads a new model:
  • If the split type is Float (== 0), the loaded model is numerical. No breakage.
  • If the split type is Categorical (== 1), nothing changes.

@hcho3 (Collaborator) commented Jan 21, 2021

Ah, so there is an argument for assigning 0 to kFloat. Got it.

@trivialfis force-pushed the booster-feature-names branch from afb650d to de5a021 on January 21, 2021
@trivialfis (Member, Author) commented Jan 21, 2021

I think I can revert some changes to avoid these altogether.

@trivialfis changed the title from "Save booster feature info in JSON." to "[WIP] Save booster feature info in JSON." on Jan 21, 2021
@trivialfis force-pushed the booster-feature-names branch from de5a021 to 9287a83 on January 21, 2021
@trivialfis force-pushed the booster-feature-names branch 2 times, most recently from 279690b to 2ae3fd2, on February 18, 2021
@trivialfis (Member, Author) commented

I have reverted most of the changes.

@trivialfis changed the title from "[WIP] Save booster feature info in JSON." to "Save booster feature info in JSON." on Feb 18, 2021
@trivialfis (Member, Author) commented

In the future, we can move the feature validation into C++.

@trivialfis changed the title from "Save booster feature info in JSON." to "[breaking] Save booster feature info in JSON, remove feature name generation." on Feb 19, 2021
@codecov-io commented Feb 19, 2021

Codecov Report

Merging #6605 (b67c338) into master (a4101de) will increase coverage by 0.13%.
The diff coverage is 90.64%.


@@            Coverage Diff             @@
##           master    #6605      +/-   ##
==========================================
+ Coverage   81.55%   81.69%   +0.13%     
==========================================
  Files          13       13              
  Lines        3719     3791      +72     
==========================================
+ Hits         3033     3097      +64     
- Misses        686      694       +8     
Impacted Files Coverage Δ
python-package/xgboost/data.py 62.63% <77.77%> (+0.07%) ⬆️
python-package/xgboost/sklearn.py 90.61% <86.66%> (-0.10%) ⬇️
python-package/xgboost/dask.py 82.57% <90.54%> (-0.10%) ⬇️
python-package/xgboost/core.py 82.39% <93.18%> (+0.54%) ⬆️
python-package/xgboost/training.py 95.60% <100.00%> (+0.28%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9f15b9e...b67c338.

@wphicks (Contributor) left a comment

This is looking pretty good to me. I've gone through all of the code, and it looks to be in fine shape modulo some small tweaks. I may not have the best context for understanding the larger implications here, so I've also asked a couple of questions that will help me better assess the impact of this.

The one thing I'm trying to make sure I understand is whether it's strictly necessary for this to be a breaking change. Must we drop feature name generation if we want to support JSON serialization of booster feature info? The answer is probably yes, but I wasn't quite able to convince myself of that without additional context.

The only other general thing I'll mention is a comment I've made before: it's helpful to avoid unrelated formatting changes in a PR, since that makes the diff easier to review.

@trivialfis (Member, Author) commented Feb 23, 2021

> The one thing I'm trying to make sure I understand is whether it's strictly necessary for this to be a breaking change. Must we drop feature name generation if we want to support JSON serialization of booster feature info?

Sorry, they don't have to be tied together. This PR addresses unreliable feature name validation. During prediction, the Python package validates the test dataset to ensure it matches the training dataset. The feature names from the training dataset are copied into the booster, so a later prediction can compare the feature names from the test dataset with the ones stored in the booster (which come from the training dataset). There are two issues there:

  • First, the feature names are not saved in the model, so once a model is reloaded the validation is no longer useful.
  • Second, many data types, like np.ndarray, don't carry feature names, so xgboost generates them automatically. The names generated by xgboost don't match those produced by libraries like pandas and cuDF, leading to false positives during validation (see the sketch below).

This PR addresses both issues to make feature name validation useful. So yes, it could be split into two PRs, but that seems unnecessary? If you think that's better for maintenance, I can follow up with split PRs.
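
To make the second issue concrete, here is a small sketch of the mismatch; it is written in C++ only for illustration (the real validation lives in the Python package), and the variable names are assumptions:

```cpp
#include <iostream>
#include <string>
#include <vector>

int main() {
  // Names xgboost auto-generates when training on an unnamed np.ndarray.
  std::vector<std::string> booster_names = {"f0", "f1", "f2"};
  // Names derived from a pandas/cuDF DataFrame with integer column labels.
  std::vector<std::string> test_names = {"0", "1", "2"};
  // Identical data layout, yet the name comparison fails: a false positive.
  std::cout << std::boolalpha << (booster_names == test_names) << '\n';  // false
  return 0;
}
```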

@trivialfis (Member, Author) commented Feb 23, 2021

> The only other general thing I'll mention is a comment I've made before: it's helpful to avoid unrelated formatting changes in a PR, since that makes the diff easier to review.

Let me look harder. Took 2 formatting changes out. Sorry for the noise in the change.

@trivialfis force-pushed the booster-feature-names branch from b67c338 to f5dca5e on February 23, 2021
@wphicks (Contributor) commented Feb 23, 2021

> This PR addresses both issues to make feature name validation useful. So yes, it could be split into two PRs, but that seems unnecessary? If you think that's better for maintenance, I can follow up with split PRs.

No, no, I think that's totally fine! I just wanted to make sure there wasn't some inherent cross-dependency that I was missing. I think a single PR is no problem here.

> Let me look harder. Took 2 formatting changes out. Sorry for the noise in the change.

Really don't sweat it! I was just mentioning that in case your dev environment still had some autoformatting going on that you weren't aware of. No need to pull those changes out now that they're already in and I've already reviewed.

@trivialfis (Member, Author) commented

The linting issue is addressed in #6726.

@wphicks (Contributor) left a comment

With the latest changes, this looks good to me! The implementation is sensible, I've got a better understanding of the context now, and I'll be glad to see this feature added.

Development

Successfully merging this pull request may close these issues.

Save feature names in the C++ booster object