
[breaking] Save booster feature info in JSON, remove feature name generation. #6605

Merged: 4 commits, Feb 25, 2021

Conversation

@trivialfis (Member) commented Jan 14, 2021

  • [breaking] Remove feature name generation.
  • Pass feature names and types into libxgboost.
  • Save them in JSON model.

The feature name generation is still preserved, as many other places, like feature_importance_ and plotting, depend on it.

One concern with this PR is that it might introduce overhead on wide datasets; some sparse datasets can scale to millions of features.

Closes #6520

@hcho3 (Collaborator) left a comment

I'm skeptical about merging FeatureMap::Type into FeatureType. They were designed for different purposes: the first for dumping the model, the second for tweaking the algorithm (categorical vs. numerical). Blurring the distinction will subtly break many things. See my comment about serialization in particular.

The linked issue only suggests saving the feature names in the Booster object, not this kind of refactoring.

@trivialfis (Member, Author) commented Jan 21, 2021

> The linked issue only suggests saving the feature names in the Booster object, not this kind of refactoring.

> Blurring the distinction will subtly break many things. See my comment about serialization in particular.

I agree with your concern, but I also want some suggestions on the design.

When I was working on the refactoring, a few things were considered:

  • Does the categorical enum need to be represented in tree dumps? If so, it might have to be part of FeatureMap::Type, and I think that's the case.
  • How do we represent feature types: as full string names or as a mere integer indicator (enum)? I chose the latter for efficiency. But we didn't have a formally defined enum for feature types, so I merged them.
  • Am I making it harder to stay backward compatible? Maybe. There is definitely overhead.

> We don't want to initialize split_types_ to kIndicator.

Does it change anything?

@hcho3 (Collaborator) commented Jan 21, 2021

> Does it change anything?

Yes, the deserializer assigns the kIndicator value to all non-categorical features, which is incorrect. The current deserializer function doesn't have enough information to tell apart different kinds of numerical features; it only knows whether a feature is categorical or not. So if you want to refactor the feature type enum, you will need to update the serializer/deserializer as well.
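
A minimal sketch of this failure mode. The enum and the zero-filled loading path are illustrative assumptions rather than the actual XGBoost code:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical merged enum with kIndicator occupying value 0.
enum class FeatureType : std::uint8_t {
  kIndicator = 0, kQuantitative, kInteger, kFloat, kCategorical
};

int main() {
  // A model saved by an older XGBoost has no per-feature type array, so a
  // loader that value-initializes the vector ends up with all zeros...
  std::vector<FeatureType> split_types(3);
  for (auto t : split_types) {
    // ...and every legacy numerical feature silently reads back as
    // kIndicator instead of a generic numerical type.
    std::cout << static_cast<int>(t) << '\n';  // prints 0 == kIndicator
  }
  return 0;
}
```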

@hcho3 (Collaborator) commented Jan 21, 2021

Given the experimental status of categorical data support, we can be more lenient about breaking backward compatibility. My suggestion is to explicitly serialize the split_types_ array so that the deserializer doesn't have to second-guess what type each feature is.

@trivialfis (Member, Author) commented Jan 21, 2021

> Yes, the deserializer assigns the kIndicator value to all non-categorical features, which is incorrect.

Hmm, thanks for catching that. There is an inconsistency here: XGBoost supports splitting only on numerical/categorical data, but we accept more than two feature types. What would be your suggestion?

@hcho3 (Collaborator) commented Jan 21, 2021

@trivialfis There are two concepts of feature types in use:

  1. Numerical vs. categorical. This affects the core tree fitting algorithm.
  2. Indicator / Quantitative / Integer / Float. This distinction is meaningless to the core tree fitting algorithm; it's only useful for dumping trees in a human-readable format (see the sketch below).
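
A simplified sketch of the two enums being conflated. The member names follow the discussion above; everything else (exact values, underlying types) is an illustrative assumption, not the actual XGBoost definitions:

```cpp
#include <cstdint>

// 1. Drives the core tree fitting algorithm: is a split on this feature
//    handled as numerical or categorical?
enum class FeatureType : std::uint8_t { kNumerical = 0, kCategorical = 1 };

// 2. Only consulted when dumping a trained model in human-readable form;
//    the tree fitting algorithm never looks at this distinction.
struct FeatureMap {
  enum Type { kIndicator = 0, kQuantitative = 1, kInteger = 2, kFloat = 3 };
};
```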

@trivialfis (Member, Author) commented

> My suggestion is to explicitly serialize the split_types_ array so that the deserializer doesn't have to second-guess what type each feature is.

The split_type is explicitly serialized:

split_type[i] = static_cast<I>(this->NodeSplitType(i));

@hcho3 (Collaborator) commented Jan 21, 2021

@trivialfis Sorry, I missed that. I took another look at the serializer function, and the issue I pointed out only happens when you load a model from an older version of XGBoost, which lacks the split type information. Right now, XGBoost handles legacy models by assigning kNumerical to all features. So we can salvage this PR by assigning kFloat to all features in legacy models. WDYT?

@trivialfis (Member, Author) commented

> So we can salvage this PR by assigning kFloat to all features in legacy models. WDYT?

That's a good suggestion. One way to do it is to put kFloat at 0 and kCategorical at 1, followed by the rest of the enum values. WDYT?
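
A hedged sketch of that ordering (member names and the trailing values are illustrative, not the final definition); the point is that the first two values coincide with the old numerical/categorical enum:

```cpp
#include <cstdint>

enum class FeatureType : std::uint8_t {
  kFloat = 0,        // legacy "numerical" (old value 0) lands here by default
  kCategorical = 1,  // same value as in the old enum, so nothing changes
  kIndicator = 2,    // dump-only kinds follow after the compatible prefix
  kQuantitative = 3,
  kInteger = 4
};
```

With this layout, a zero-filled legacy split-type array decodes as all-Float, and an old xgboost reading a new model still sees 0/1 where it expects numerical/categorical.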

@hcho3 (Collaborator) commented Jan 21, 2021

@trivialfis Or you can explicitly fill the split_type_ vector with a nonzero value.

@trivialfis (Member, Author) commented Jan 21, 2021

@hcho3 Let's say we are loading an old model with both numerical and categorical data, with #6605 (comment):

Backward compatibility:

When a new xgboost loads an old model:
  • If the split type is numerical (the first field in the old feature type enum), the loaded model becomes Float.
  • If the split type is categorical (the second field in the old feature type enum), the loaded model stays categorical; nothing changes.

When a new xgboost loads a new model (after this PR), it works out of the box: all split types are either Float or Categorical.

Forward compatibility:

When an old xgboost loads a new model:
  • If the split type is Float (== 0), the loaded model is numerical. No breakage.
  • If the split type is Categorical (== 1), nothing changes.

@hcho3 (Collaborator) commented Jan 21, 2021

Ah, so there is an argument for assigning 0 to kFloat. Got it.

@trivialfis force-pushed the booster-feature-names branch from afb650d to de5a021 on January 21, 2021
@trivialfis (Member, Author) commented Jan 21, 2021

I think I can revert some changes to avoid these altogether.

@trivialfis changed the title from "Save booster feature info in JSON." to "[WIP] Save booster feature info in JSON." on Jan 21, 2021
@trivialfis force-pushed the booster-feature-names branch from de5a021 to 9287a83 on January 21, 2021
@trivialfis force-pushed the booster-feature-names branch 2 times, most recently from 279690b to 2ae3fd2, on February 18, 2021
@trivialfis (Member, Author) commented

I have reverted most of the changes.

@trivialfis changed the title from "[WIP] Save booster feature info in JSON." to "Save booster feature info in JSON." on Feb 18, 2021
@trivialfis (Member, Author) commented

In the future, we can move the feature validation into C++.

@trivialfis changed the title from "Save booster feature info in JSON." to "[breaking] Save booster feature info in JSON, remove feature name generation." on Feb 19, 2021
@codecov-io commented Feb 19, 2021

Codecov Report

Merging #6605 (b67c338) into master (a4101de) will increase coverage by 0.13%.
The diff coverage is 90.64%.


@@            Coverage Diff             @@
##           master    #6605      +/-   ##
==========================================
+ Coverage   81.55%   81.69%   +0.13%     
==========================================
  Files          13       13              
  Lines        3719     3791      +72     
==========================================
+ Hits         3033     3097      +64     
- Misses        686      694       +8     
Impacted Files Coverage Δ
python-package/xgboost/data.py 62.63% <77.77%> (+0.07%) ⬆️
python-package/xgboost/sklearn.py 90.61% <86.66%> (-0.10%) ⬇️
python-package/xgboost/dask.py 82.57% <90.54%> (-0.10%) ⬇️
python-package/xgboost/core.py 82.39% <93.18%> (+0.54%) ⬆️
python-package/xgboost/training.py 95.60% <100.00%> (+0.28%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9f15b9e...b67c338.

@wphicks (Contributor) left a comment

This is looking pretty good to me. I've gone through all of the code, and it looks to be in fine shape modulo some small tweaks. I may not have the best context for understanding the larger implications here, so I've also asked a couple of questions that will help me better assess the impact of this.

The one thing I'm trying to make sure I understand is whether it's strictly necessary for this to be a breaking change. Must we drop feature name generation if we want to support JSON serialization of booster feature info? The answer is probably yes, but I wasn't quite able to convince myself of that without additional context.

The only other general thing I'll mention is a comment I've made before: it's helpful to avoid unrelated formatting changes in a PR, since that makes the diff easier to review.

@trivialfis (Member, Author) commented Feb 23, 2021

> The one thing I'm trying to make sure I understand is whether it's strictly necessary for this to be a breaking change. Must we drop feature name generation if we want to support JSON serialization of booster feature info?

Sorry, they don't have to be tied together. This PR addresses unreliable feature name validation. During prediction, the Python package validates the test dataset to ensure it matches the training dataset. The feature names from the training dataset are copied into the booster, so a later prediction can compare the feature names from the test dataset with the ones stored in the booster (which come from the training dataset). There are two issues there:

  • First, the feature names are not saved in the model, so once a model is reloaded the validation is no longer useful.
  • Second, many data types, like np.ndarray, don't carry feature names, so xgboost generates them automatically. The names generated by xgboost don't match those produced by libraries like pandas and cuDF, leading to false positives during validation (see the sketch below).

This PR addresses both issues to make feature name validation useful. So yes, it could be split into two PRs, but that seems unnecessary? If you think that's better for maintenance, I can follow up with split PRs.
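
To make the second issue concrete, here is a small sketch of the mismatch; it is written in C++ only for illustration (the real validation lives in the Python package), and the variable names are assumptions:

```cpp
#include <iostream>
#include <string>
#include <vector>

int main() {
  // Names xgboost auto-generates when training on an unnamed np.ndarray.
  std::vector<std::string> booster_names = {"f0", "f1", "f2"};
  // Names derived from a pandas/cuDF DataFrame with integer column labels.
  std::vector<std::string> test_names = {"0", "1", "2"};
  // Identical data layout, yet the name comparison fails: a false positive.
  std::cout << std::boolalpha << (booster_names == test_names) << '\n';  // false
  return 0;
}
```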

@trivialfis (Member, Author) commented Feb 23, 2021

> The only other general thing I'll mention is a comment I've made before: it's helpful to avoid unrelated formatting changes in a PR, since that makes the diff easier to review.

Let me look harder. Took 2 formatting changes out. Sorry for the noise in the change.

@trivialfis force-pushed the booster-feature-names branch from b67c338 to f5dca5e on February 23, 2021
@wphicks (Contributor) commented Feb 23, 2021

> This PR addresses both issues to make feature name validation useful. So yes, it could be split into two PRs, but that seems unnecessary? If you think that's better for maintenance, I can follow up with split PRs.

No, no, I think that's totally fine! I just wanted to make sure there wasn't some inherent cross-dependency that I was missing. I think a single PR is no problem here.

> Let me look harder. Took 2 formatting changes out. Sorry for the noise in the change.

Really don't sweat it! I was just mentioning that in case your dev environment still had some autoformatting going on that you weren't aware of. No need to pull those changes out now that they're already in and I've already reviewed.

@trivialfis (Member, Author) commented

The linting issue is addressed in #6726.

@wphicks (Contributor) left a comment

With the latest changes, this looks good to me! The implementation is sensible, I've got a better understanding of the context now, and I'll be glad to see this feature added.

Development

Successfully merging this pull request may close these issues.

Save feature names in the C++ booster object