Fix number of features mismatching in continual training (fix #5156) #5157
base: master
Conversation
I found that passing the number of features in the loaded model to dataset constructors requires touching a lot of interfaces in Python/R/C++. I'll switch to the strategy that directly checks the length of …
A branch …
Could you please add a regression test for this issue?
This is an open source project, not a Microsoft commercial product. I feel strongly that choices we make in this repo should be based on what is best for the project's large community of users, not Microsoft's product priorities. Instead of introducing a long-lived branch in this repo just for use by a product team at Microsoft, I think it would be preferable to just get this PR merged, and to have whatever product teams you're referring to pin to a specific commit of LightGBM in their builds. If the fix is needed so urgently that that team cannot wait for it to go through the normal testing and review process, you should tell them to temporarily build from an internal fork of this project that contains this fix, and then later replace builds from that fork with builds pinned to a commit in this public repository once this PR is merged.
Done via bda5bdb
np.random.seed(0)
train_X = np.random.randn(100, 10)
train_y = np.sum(train_X, axis=1)
Hmmm, strange. With version 3.3.2 this test doesn't fail on my local machine every time; it only fails sometimes.
How about using not a random but a concrete trivial dataset, something like
train_X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
train_y = np.array([0, 1, 2])
?
Now a warning is provided, see https://github.com/microsoft/LightGBM/pull/5157/files#r868752144.
And the test case just checks whether the warning message is provided, so it should now always fail in 3.3.2.
Now it should always fail in 3.3.2.

That isn't true for me 😕 I can sometimes successfully run the test on my Windows machine with LightGBM 3.3.2 installed from PyPI.
Output:
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000004 seconds, init for row-wise cost 0.000180 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000209 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
D:\Miniconda3\lib\site-packages\lightgbm\engine.py:177: UserWarning: Found `num_trees` in params. Will use it instead of argument
_log_warning(f"Found `{alias}` in params. Will use it instead of argument")
<lightgbm.basic.Booster at 0x2b6dc985760>
Sorry about that and thanks for checking.
I've updated the test case so that the initial model will surely use feature 8 or 9, and the continual training dataset contains only 5 features. Now the test case should always fail in previous versions.
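For reference, here is a minimal sketch of the idea behind that construction (the exact test case appears later in this thread; the variable names below are only illustrative). Because the first 8 columns are constant they can never be split on, so every split in the initial model is forced onto feature 8 or 9, while the continuation dataset has only 5 columns:

import numpy as np
import lightgbm as lgb

# Columns 0-7 are constant (a single bin each), so only features 8 and 9
# carry signal and every split in the initial model must use one of them.
train_X = np.hstack([np.ones((100, 8)), np.arange(200, 0, -1).reshape(-1, 2)])
train_y = np.sum(train_X[:, -2:], axis=1)
initial_model = lgb.train(
    {"objective": "regression", "verbose": -1},
    lgb.Dataset(train_X, label=train_y),
    num_boost_round=10,
)

# Continuation dataset with only 5 features, fewer than the 10 used above.
cont_X = np.random.rand(100, 5)
cont_y = np.sum(cont_X, axis=1)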
Thank you very much for working on the test!
However, now this test never fails (it was run successfully 50 times in a row) on my machine with LightGBM 3.3.2 😄
Log:
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000003 seconds, init for row-wise cost 0.000161 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000187 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
<lightgbm.basic.Booster at 0x1dadc11aa90>
Model:
tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=4
objective=regression
feature_names=Column_0 Column_1 Column_2 Column_3 Column_4
feature_infos=[0.015606064446828216:0.99033894739670436] [0.011714084185001972:0.99201124341447411] [0.01203622289765427:0.98837383805922618] [0.01323685775889949:0.99440078964767942] [0.0046954761925470656:0.99884700656786651]
tree_sizes=433 456 454 456 454 457 452 457 458 382 463 387 464 463 462 463 387 465 464 387
Tree=0
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=999600 132192 117600
threshold=103.00000000000001 55.000000000000007 151.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=186.40000000000001 206.19999999999999 196.59999999999999 216
leaf_weight=27 24 24 25
leaf_count=27 24 24 25
internal_value=201 191.2 211.2
internal_weight=0 51 49
internal_count=100 51 49
is_linear=0
shrinkage=1
Tree=1
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=813246 114310 91614.6
threshold=97.000000000000014 145.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-14.339999643961589 3.6000000285605589 13.005000032697405 -5.5333332962459991
leaf_weight=21 24 28 27
leaf_count=21 24 28 27
internal_value=0 8.66423 -9.38625
internal_weight=0 52 48
internal_count=100 52 48
is_linear=0
shrinkage=0.1
Tree=2
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=661912 106022 66947.7
threshold=109.00000000000001 61.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-11.472199859619142 5.1623749971389774 -2.5549999723210934 12.799499893188477
leaf_weight=30 24 24 22
leaf_count=30 24 24 22
internal_value=0 -7.509 8.81491
internal_weight=0 54 46
internal_count=100 54 46
is_linear=0
shrinkage=0.1
Tree=3
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=541286 90395.6 51743.2
threshold=91.000000000000014 139.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-11.758779907226563 2.0473306971912582 10.221966577345327 -4.961771702766419
leaf_weight=21 24 31 24
leaf_count=21 24 31 24
internal_value=0 6.65485 -8.13371
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=4
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=439960 83077 36862.1
threshold=115.00000000000001 67.000000000000014 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-9.0168283057935312 4.6399736268179756 -1.2844118246187768 10.49735318964178
leaf_weight=33 21 24 22
leaf_count=33 21 24 22
internal_value=0 -5.76107 7.63677
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1
Tree=5
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=359523 70011.3 29162.8
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-9.6812194460914256 0.9671219994624457 8.0214061737060547 -4.4111016545976911
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 5.10239 -7.04616
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=6
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=291965 51781 29631.8
threshold=91.000000000000014 157.00000000000003 49.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.374019781748455 2.382246724854816 8.645477294921875 -3.2303891863141745
leaf_weight=24 33 22 21
leaf_count=24 33 22 21
internal_value=0 4.88754 -5.97366
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=7
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=241666 44241.4 20265.2
threshold=115.00000000000001 49.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-7.5366175810496019 3.4379374277024044 -1.8938883086045584 7.7809295654296875
leaf_weight=24 21 33 22
leaf_count=24 21 33 22
internal_value=0 -4.26977 5.65993
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1
Tree=8
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=196769 38913.2 15310.5
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.1220336823236385 0.69178846441209318 5.9509620666503906 -3.3034729072025848
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 3.77475 -5.21275
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=9
num_leaves=3
num_cat=0
split_feature=8 8
split_gain=160757 35733.3
threshold=121.00000000000001 73.000000000000014
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.2662804179721405 4.9105544734001159 -0.28483681498716273
leaf_weight=36 40 24
leaf_count=36 40 24
internal_value=0 -3.2737
internal_weight=0 60
internal_count=100 60
is_linear=0
shrinkage=0.1
Tree=10
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=13.801 10.6247 5.30422
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-10.008088938395183 -9.9444599969046461 -9.8565416689272283 -9.9392709641229544
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -9.9013 -9.97597
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=11
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=11.5297 8.62398
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-8.9968132420590052 -8.8941043404971847 -8.9236732210431793
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -8.96578
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1
Tree=12
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=11.1402 5.70155 5.37138
threshold=0.4329138381487897 0.31488953069403935 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.1240014648437491 -8.0604864211309515 -7.9952484955658791 -8.0523968436501256
leaf_weight=20 21 37 22
leaf_count=20 21 37 22
internal_value=0 -8.01887 -8.08649
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=13
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=9.77513 6.994 3.37145
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.3027134259541828 -7.2492815290178569 -7.1779496934678821 -7.2478478931245354
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -7.21426 -7.27711
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=14
num_leaves=4
num_cat=0
split_feature=4 3 2
split_gain=8.72052 6.28816 2.9544
threshold=0.38168423736294915 0.39970451119142136 0.57474024719752059
decision_type=2 2 2
left_child=-1 -2 -3
right_child=1 2 -4
leaf_value=-6.5552221249311406 -6.540647659301758 -6.499760017395019 -6.4460565839494981
leaf_weight=39 20 20 21
leaf_count=39 20 20 21
internal_value=0 -6.49468 -6.47225
internal_weight=0 61 41
internal_count=100 61 41
is_linear=0
shrinkage=0.1
Tree=15
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=7.48096 4.12722 3.46943
threshold=0.4329138381487897 0.39974028150985036 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-5.9287466812133793 -5.8698616948621023 -5.8165104438518656 -5.8711990703235983
leaf_weight=20 29 29 22
leaf_count=20 29 29 22
internal_value=0 -5.84319 -5.8986
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=16
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=7.09244 4.97742
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.3225026983963817 -5.2427101247450887 -5.2669374602181573
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -5.29893
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1
Tree=17
num_leaves=4
num_cat=0
split_feature=2 3 0
split_gain=6.14628 3.95565 2.09787
threshold=0.39124460793750004 0.57209255545676052 0.31488953069403935
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.8033817863464359 -4.7557433400835309 -4.7020983519377531 -4.7599296722412108
leaf_weight=20 28 27 25
leaf_count=20 28 27 25
internal_value=0 -4.72941 -4.77924
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=18
num_leaves=4
num_cat=0
split_feature=1 4 3
split_gain=5.78174 3.21241 2.42224
threshold=0.4329138381487897 0.52674300595287049 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.3300938415527339 -4.2781520597396359 -4.2309710891158501 -4.2820091941139919
leaf_weight=20 31 27 22
leaf_count=20 31 27 22
internal_value=0 -4.25619 -4.30491
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=19
num_leaves=3
num_cat=0
split_feature=4 2
split_gain=5.06465 3.34179
threshold=0.38168423736294915 0.60984868647320056
decision_type=2 2
left_child=-1 -2
right_child=1 -3
leaf_value=-3.8771305084228516 -3.8485698895576679 -3.7998270381580705
leaf_weight=39 39 22
leaf_count=39 39 22
internal_value=0 -3.83099
internal_weight=0 61
internal_count=100 61
is_linear=0
shrinkage=0.1
end of trees
feature_importances:
Column_3=7
Column_0=5
Column_1=5
Column_2=5
Column_4=5
parameters:
[boosting: gbdt]
[objective: regression]
[metric: l2]
[tree_learner: serial]
[device_type: cpu]
[data: ]
[valid: ]
[num_iterations: 10]
[learning_rate: 0.1]
[num_leaves: 31]
[num_threads: 0]
[deterministic: 0]
[force_col_wise: 0]
[force_row_wise: 0]
[histogram_pool_size: -1]
[max_depth: -1]
[min_data_in_leaf: 20]
[min_sum_hessian_in_leaf: 0.001]
[bagging_fraction: 1]
[pos_bagging_fraction: 1]
[neg_bagging_fraction: 1]
[bagging_freq: 0]
[bagging_seed: 3]
[feature_fraction: 1]
[feature_fraction_bynode: 1]
[feature_fraction_seed: 2]
[extra_trees: 0]
[extra_seed: 6]
[early_stopping_round: 0]
[first_metric_only: 0]
[max_delta_step: 0]
[lambda_l1: 0]
[lambda_l2: 0]
[linear_lambda: 0]
[min_gain_to_split: 0]
[drop_rate: 0.1]
[max_drop: 50]
[skip_drop: 0.5]
[xgboost_dart_mode: 0]
[uniform_drop: 0]
[drop_seed: 4]
[top_rate: 0.2]
[other_rate: 0.1]
[min_data_per_group: 100]
[max_cat_threshold: 32]
[cat_l2: 10]
[cat_smooth: 10]
[max_cat_to_onehot: 4]
[top_k: 20]
[monotone_constraints: ]
[monotone_constraints_method: basic]
[monotone_penalty: 0]
[feature_contri: ]
[forcedsplits_filename: ]
[refit_decay_rate: 0.9]
[cegb_tradeoff: 1]
[cegb_penalty_split: 0]
[cegb_penalty_feature_lazy: ]
[cegb_penalty_feature_coupled: ]
[path_smooth: 0]
[interaction_constraints: ]
[verbosity: 2]
[saved_feature_importance_type: 0]
[linear_tree: 0]
[max_bin: 255]
[max_bin_by_feature: ]
[min_data_in_bin: 3]
[bin_construct_sample_cnt: 200000]
[data_random_seed: 1]
[is_enable_sparse: 1]
[enable_bundle: 1]
[use_missing: 1]
[zero_as_missing: 0]
[feature_pre_filter: 1]
[pre_partition: 0]
[two_round: 0]
[header: 0]
[label_column: ]
[weight_column: ]
[group_column: ]
[ignore_column: ]
[categorical_feature: ]
[forcedbins_filename: ]
[precise_float_parser: 0]
[objective_seed: 5]
[num_class: 1]
[is_unbalance: 0]
[scale_pos_weight: 1]
[sigmoid: 1]
[boost_from_average: 1]
[reg_sqrt: 0]
[alpha: 0.9]
[fair_c: 1]
[poisson_max_delta_step: 0.7]
[tweedie_variance_power: 1.5]
[lambdarank_truncation_level: 30]
[lambdarank_norm: 1]
[label_gain: ]
[eval_at: ]
[multi_error_top_k: 1]
[auc_mu_weights: ]
[num_machines: 1]
[local_listen_port: 12400]
[time_out: 120]
[machine_list_filename: ]
[machines: ]
[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]
[num_gpu: 1]
end of parameters
pandas_categorical:null
@StrikerRUS Did you test the test case in the latest commit of this PR?
I've tested with v3.3.2 using the test case below:
def test_continue_train_different_feature_size(capsys):
    np.random.seed(0)
    train_X = np.hstack([np.ones(800).reshape(-1, 8), np.arange(200, 0, -1).reshape(-1, 2)])
    train_y = np.sum(train_X[:, -2:], axis=1)
    train_data = lgb.Dataset(train_X, label=train_y)
    params = {
        "objective": "regression",
        "num_trees": 10,
        "num_leaves": 31,
        "verbose": -1,
        'predict_disable_shape_check': True,
    }
    model = lgb.train(train_set=train_data, params=params)
    train_X_cont = np.random.rand(100, 5)
    train_y_cont = np.sum(train_X_cont, axis=1)
    train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont)
    params.update({"verbose": 2})
    lgb.train(train_set=train_data_cont, params=params, init_model=model)
    captured = capsys.readouterr()
    assert captured.out.find("features found in dataset for continual training, but at least") != -1
And the output is
FAILED test_engine.py::test_continue_train_different_feature_size - AssertionError: assert -1 != -1
which shows that the test case fails in v3.3.2.
@shiyu1994 This test should segfault or something like that (I don't remember exactly) in v3.3.2, but it doesn't.
The warning was added after this review thread was started, and we are not talking about it here.
@StrikerRUS Thanks. With the latest commit, it should now fail in v3.3.2. It will either segfault or raise the AssertionError at the line
assert captured.out.find("features found in dataset for continual training, but at least") != -1
in the latest version of the test case.
With the latest commit, it should now fail in v3.3.2.

It still doesn't fail in v3.3.2.
Please use a fixed dataset, not a random one, so that the segfault can be reproduced consistently in versions without the workaround from this PR.
Co-authored-by: Nikita Titov <[email protected]>
#endif
feature_importances[models_[iter]->split_feature(split_idx)] += 1.0;
if (static_cast<size_t>(real_feature_index) >= feature_importances.size()) {
  warn_about_feature_number = true;
Provide a warning instead of enlarging the feature importance vector. Since GBDT::FeatureImportance is used by the Python API, and the Python code doesn't know the final size of the feature importance vector, dynamically enlarging the vector would cause out-of-bound access.
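To illustrate the constraint from the Python side (a small sketch, not part of this PR's diff): the Python wrapper sizes the result array from the booster's own feature count before calling into the C API, so the C++ side cannot safely hand back more entries than that.

import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 10)
booster = lgb.train({"objective": "regression", "verbose": -1},
                    lgb.Dataset(X, label=X.sum(axis=1)), num_boost_round=5)

# The importance buffer is preallocated from num_feature() on the Python side,
# so its length is fixed regardless of how many features the trees reference.
importances = booster.feature_importance(importance_type="split")
assert len(importances) == booster.num_feature()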
I'm very confused by these changes, sorry.
My understanding of what was written in #5156 is that, given an initial model trained on a Dataset with n features, if you attempt to perform continued training on a Dataset with m features (m < n), LightGBM will not complain but there might be some out-of-bound access.
I also believe that #5156 mentions LibSVM files as an example because construction of a Dataset from a file might drop unsplittable columns (like those with all missing values) if feature_pre_filter=True.
Is my interpretation correct? If it is, that'll help me review this more effectively.
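If it helps the review, here is a condensed sketch of that scenario as described above (it mirrors the test case quoted earlier in this thread; as discussed there, random data is not a reliable reproducer because the splits may not land on the high feature indices):

import numpy as np
import lightgbm as lgb

# Initial model trained on a Dataset with 10 features.
X10 = np.random.rand(100, 10)
model = lgb.train({"objective": "regression", "verbose": -1},
                  lgb.Dataset(X10, label=X10.sum(axis=1)), num_boost_round=5)

# Continued training on a Dataset with only 5 features. Before this PR the
# library did not complain here, even though the initial model may reference
# feature indices >= 5, which is where the out-of-bound access can happen.
X5 = np.random.rand(100, 5)
params = {"objective": "regression", "verbose": -1,
          "predict_disable_shape_check": True}
lgb.train(params, lgb.Dataset(X5, label=X5.sum(axis=1)),
          num_boost_round=5, init_model=model)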
"num_trees": 10, | ||
"num_leaves": 31, | ||
"verbose": -1, | ||
'predict_disable_shape_check': True, |
Why is it necessary to pass predict_disable_shape_check? This test isn't generating any predictions. Can this be removed?
Ah, good observation! But predict() is called internally here when you set init scores from a previously trained model:
LightGBM/python-package/lightgbm/engine.py, lines 153 to 154 in 0018206:
elif isinstance(init_model, Booster):
    predictor = init_model._to_predictor(dict(init_model.params, **params))
LightGBM/python-package/lightgbm/engine.py, lines 160 to 161 in 0018206:
train_set._update_params(params) \
    ._set_predictor(predictor) \
LightGBM/python-package/lightgbm/basic.py, lines 1519 to 1522 in 0018206:
if isinstance(predictor, _InnerPredictor):
    if self._predictor is None and init_score is not None:
        _log_warning("The init_score will be overridden by the prediction of init_model.")
    self._set_init_score_by_predictor(predictor, data)
LightGBM/python-package/lightgbm/basic.py, line 1393 in 0018206:
def _set_init_score_by_predictor(self, predictor, data, used_indices=None):
LightGBM/python-package/lightgbm/basic.py, lines 1399 to 1402 in 0018206:
if predictor is not None:
    init_score = predictor.predict(data,
                                   raw_score=True,
                                   data_has_header=data_has_header)
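Put differently, here is a rough sketch of what happens (reusing the variable names from the test case quoted above; the real code path goes through _InnerPredictor rather than Booster.predict, so this is only an approximation): passing init_model is roughly equivalent to computing the init score yourself, and the shape check fires during that implicit prediction.

# Approximately what lgb.train(..., init_model=model) does with the new data:
init_score = model.predict(train_X_cont, raw_score=True)  # shape check happens here
train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont, init_score=init_score)
booster = lgb.train(params, train_data_cont)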
Ah, I see! So then is the bug addressed by this PR only present when you've also set predict_disable_shape_check = True? I expect that most LightGBM users will not be aware that that prediction parameter is relevant for training continuation.
Yes, without setting predict_disable_shape_check = True users get the following error:
LightGBMError: The number of features in data (5) is not the same as it was in training data (10).
You can set ``predict_disable_shape_check=true`` to discard this error, but please be aware what you are doing.
I think the second sentence of the error ("You can set predict_disable_shape_check=true to discard this error ...") helps users understand this complicated relationship.
predict_disable_shape_check is necessary because a numpy matrix is used in the test case here. If we used a LibSVM file as the dataset in the test case, we wouldn't need predict_disable_shape_check=True. I use a numpy matrix here because I think it is more convenient than LibSVM in that it does not require generating files in test cases.
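For illustration, a sketch of that file-based variant (dump_svmlight_file is just a convenient way to produce a LibSVM file and is not part of this PR; it reuses model and params from the test case above, and assumes, as stated in this thread, that a Dataset built from a file path does not go through the in-memory shape check):

import numpy as np
from sklearn.datasets import dump_svmlight_file
import lightgbm as lgb

train_X_cont = np.random.rand(100, 5)
train_y_cont = np.sum(train_X_cont, axis=1)
# Depending on how feature indices should be numbered, zero_based may need adjusting.
dump_svmlight_file(train_X_cont, train_y_cont, "cont.svm", zero_based=True)

# Build the continuation Dataset from the file path instead of a numpy matrix;
# predict_disable_shape_check is not needed in this form.
train_data_cont = lgb.Dataset("cont.svm")
booster = lgb.train(params, train_data_cont, init_model=model)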
}
}
}
} else {
Log::Fatal("Unknown importance type: only support split=0 and gain=1");
}
if (warn_about_feature_number) {
Log::Warning("Only %d features found in dataset for continual training, but at least %d features found in initial model.",
Is it possible to add some information here that can help users understand what they should do about this warning? Or is this warning just being raised for the purpose of the unit test?
I've added
Log::Warning("Please check the number of features used in continual training.");
in d719d78.
Do you think it is appropriate?
Thanks! But I don't think "please check" will be any more informative for users, and I think that providing two separate WARN-level log messages for the same problem might be confusing.
Would you consider this?
Log::Warning(
  "Only %d features found in dataset, but at least %d features found in initial model. "
  "If this is intentional and the features in dataset match those used to train the initial model, no action is required. "
  "If not, training continuation may fail or produce bad results. "
  "To suppress this warning, provide a dataset with at least %d features.",
  static_cast<int>(feature_importances.size()),
  max_feature_index_found + 1,
  max_feature_index_found + 1);
Here we only focus on the original features, not the features preprocessed by LightGBM. This is to fix cases such as the following: first, a model A is trained with …
Just a reminder: the merge-blocking thread is #5157 (comment).
Ah ok, thanks @shiyu1994, now I understand. I guess the phrase "When the input dataset is from LibSVM file" was used in #5156 because loading from a file bypasses …
Exactly.
Close and reopen to retrigger CI tests.
Just to prevent accidental merge.
#5157 (comment)
@guolinke @jameslamb @StrikerRUS Could you please help to verify again whether the latest test case fails with v3.3.2? I've tested on my Linux machine and it fails every time. Note that now we use a fixed dataset, without any randomness. Thanks!
Just checked again, the latest test doesn't fail on my Windows machine with …
@shiyu1994 could you update this PR? Let me know if you need help with any part of it.
This is to fix #5156.