Fix number of features mismatching in continual training (fix #5156) #5157
base: master
Conversation
I found that passing the number of features in the loaded model to dataset constructors requires touching a lot of interfaces in Python/R/C++. I'll switch to the strategy that directly checks the length of …
A branch …
Could you please add a regression test for this issue?
This is an open source project, not a Microsoft commercial product. I feel strongly that choices we make in this repo should be based on what is best for the project's large community of users, not Microsoft's product priorities. Instead of introducing a long-lived branch in this repo just for use by a product team at Microsoft, I think it would be preferable to just get this PR merged, and to have whatever product teams you're referring to pin to a specific commit of LightGBM in their builds. If the fix is needed so urgently that that team cannot wait for it to go through the normal testing and review process, you should tell them to temporarily build from an internal fork of this project that contains this fix, and then later replace builds from that fork with builds pinned to a commit in this public repository once this PR is merged.
Done via bda5bdb
np.random.seed(0)
train_X = np.random.randn(100, 10)
train_y = np.sum(train_X, axis=1)
Hmmm, strange. With version 3.3.2 this test doesn't fail on my local machine every time; it only fails sometimes.
How about using not a random but a concrete trivial dataset, something like
train_X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
train_y = np.array([0, 1, 2])
?
Now a warning is provided, see https://github.com/microsoft/LightGBM/pull/5157/files#r868752144.
And the test case just checks whether the warning message is provided, so it should now always fail in 3.3.2.
Now it should always fail in 3.3.2.

That isn't true for me 😕 I can sometimes successfully run the test on my Windows machine with LightGBM 3.3.2 installed from PyPI.
Output:
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000004 seconds, init for row-wise cost 0.000180 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000209 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
D:\Miniconda3\lib\site-packages\lightgbm\engine.py:177: UserWarning: Found `num_trees` in params. Will use it instead of argument
_log_warning(f"Found `{alias}` in params. Will use it instead of argument")
<lightgbm.basic.Booster at 0x2b6dc985760>
Sorry about that and thanks for checking.
I've updated the test case so that the initial model will surely use feature 8 or 9, and the continual training dataset contains only 5 features. Now the test case should always fail in previous versions.
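For reference, here is a minimal sketch of the idea behind that construction (the exact test case appears later in this thread; the variable names below are only illustrative). Because the first 8 columns are constant they can never be split on, so every split in the initial model is forced onto feature 8 or 9, while the continuation dataset has only 5 columns:

import numpy as np
import lightgbm as lgb

# Columns 0-7 are constant (a single bin each), so only features 8 and 9
# carry signal and every split in the initial model must use one of them.
train_X = np.hstack([np.ones((100, 8)), np.arange(200, 0, -1).reshape(-1, 2)])
train_y = np.sum(train_X[:, -2:], axis=1)
initial_model = lgb.train(
    {"objective": "regression", "verbose": -1},
    lgb.Dataset(train_X, label=train_y),
    num_boost_round=10,
)

# Continuation dataset with only 5 features, fewer than the 10 used above.
cont_X = np.random.rand(100, 5)
cont_y = np.sum(cont_X, axis=1)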
Thank you very much for working on the test!
However, now this test never fails (it was run successfully 50 times in a row) on my machine with LightGBM 3.3.2 😄
Log:
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000003 seconds, init for row-wise cost 0.000161 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000187 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 175
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 5
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 3
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4 and depth = 2
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3 and depth = 2
<lightgbm.basic.Booster at 0x1dadc11aa90>
Model:
tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=4
objective=regression
feature_names=Column_0 Column_1 Column_2 Column_3 Column_4
feature_infos=[0.015606064446828216:0.99033894739670436] [0.011714084185001972:0.99201124341447411] [0.01203622289765427:0.98837383805922618] [0.01323685775889949:0.99440078964767942] [0.0046954761925470656:0.99884700656786651]
tree_sizes=433 456 454 456 454 457 452 457 458 382 463 387 464 463 462 463 387 465 464 387
Tree=0
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=999600 132192 117600
threshold=103.00000000000001 55.000000000000007 151.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=186.40000000000001 206.19999999999999 196.59999999999999 216
leaf_weight=27 24 24 25
leaf_count=27 24 24 25
internal_value=201 191.2 211.2
internal_weight=0 51 49
internal_count=100 51 49
is_linear=0
shrinkage=1
Tree=1
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=813246 114310 91614.6
threshold=97.000000000000014 145.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-14.339999643961589 3.6000000285605589 13.005000032697405 -5.5333332962459991
leaf_weight=21 24 28 27
leaf_count=21 24 28 27
internal_value=0 8.66423 -9.38625
internal_weight=0 52 48
internal_count=100 52 48
is_linear=0
shrinkage=0.1
Tree=2
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=661912 106022 66947.7
threshold=109.00000000000001 61.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-11.472199859619142 5.1623749971389774 -2.5549999723210934 12.799499893188477
leaf_weight=30 24 24 22
leaf_count=30 24 24 22
internal_value=0 -7.509 8.81491
internal_weight=0 54 46
internal_count=100 54 46
is_linear=0
shrinkage=0.1
Tree=3
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=541286 90395.6 51743.2
threshold=91.000000000000014 139.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-11.758779907226563 2.0473306971912582 10.221966577345327 -4.961771702766419
leaf_weight=21 24 31 24
leaf_count=21 24 31 24
internal_value=0 6.65485 -8.13371
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=4
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=439960 83077 36862.1
threshold=115.00000000000001 67.000000000000014 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-9.0168283057935312 4.6399736268179756 -1.2844118246187768 10.49735318964178
leaf_weight=33 21 24 22
leaf_count=33 21 24 22
internal_value=0 -5.76107 7.63677
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1
Tree=5
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=359523 70011.3 29162.8
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-9.6812194460914256 0.9671219994624457 8.0214061737060547 -4.4111016545976911
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 5.10239 -7.04616
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=6
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=291965 51781 29631.8
threshold=91.000000000000014 157.00000000000003 49.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.374019781748455 2.382246724854816 8.645477294921875 -3.2303891863141745
leaf_weight=24 33 22 21
leaf_count=24 33 22 21
internal_value=0 4.88754 -5.97366
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=7
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=241666 44241.4 20265.2
threshold=115.00000000000001 49.000000000000007 157.00000000000003
decision_type=2 2 2
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=-7.5366175810496019 3.4379374277024044 -1.8938883086045584 7.7809295654296875
leaf_weight=24 21 33 22
leaf_count=24 21 33 22
internal_value=0 -4.26977 5.65993
internal_weight=0 57 43
internal_count=100 57 43
is_linear=0
shrinkage=0.1
Tree=8
num_leaves=4
num_cat=0
split_feature=8 8 8
split_gain=196769 38913.2 15310.5
threshold=85.000000000000014 133.00000000000003 43.000000000000007
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.1220336823236385 0.69178846441209318 5.9509620666503906 -3.3034729072025848
leaf_weight=21 24 34 21
leaf_count=21 24 34 21
internal_value=0 3.77475 -5.21275
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=9
num_leaves=3
num_cat=0
split_feature=8 8
split_gain=160757 35733.3
threshold=121.00000000000001 73.000000000000014
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.2662804179721405 4.9105544734001159 -0.28483681498716273
leaf_weight=36 40 24
leaf_count=36 40 24
internal_value=0 -3.2737
internal_weight=0 60
internal_count=100 60
is_linear=0
shrinkage=0.1
Tree=10
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=13.801 10.6247 5.30422
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-10.008088938395183 -9.9444599969046461 -9.8565416689272283 -9.9392709641229544
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -9.9013 -9.97597
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=11
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=11.5297 8.62398
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-8.9968132420590052 -8.8941043404971847 -8.9236732210431793
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -8.96578
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1
Tree=12
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=11.1402 5.70155 5.37138
threshold=0.4329138381487897 0.31488953069403935 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-8.1240014648437491 -8.0604864211309515 -7.9952484955658791 -8.0523968436501256
leaf_weight=20 21 37 22
leaf_count=20 21 37 22
internal_value=0 -8.01887 -8.08649
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=13
num_leaves=4
num_cat=0
split_feature=2 3 1
split_gain=9.77513 6.994 3.37145
threshold=0.39124460793750004 0.57209255545676052 0.51007976537911426
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-7.3027134259541828 -7.2492815290178569 -7.1779496934678821 -7.2478478931245354
leaf_weight=24 28 27 21
leaf_count=24 28 27 21
internal_value=0 -7.21426 -7.27711
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=14
num_leaves=4
num_cat=0
split_feature=4 3 2
split_gain=8.72052 6.28816 2.9544
threshold=0.38168423736294915 0.39970451119142136 0.57474024719752059
decision_type=2 2 2
left_child=-1 -2 -3
right_child=1 2 -4
leaf_value=-6.5552221249311406 -6.540647659301758 -6.499760017395019 -6.4460565839494981
leaf_weight=39 20 20 21
leaf_count=39 20 20 21
internal_value=0 -6.49468 -6.47225
internal_weight=0 61 41
internal_count=100 61 41
is_linear=0
shrinkage=0.1
Tree=15
num_leaves=4
num_cat=0
split_feature=1 0 3
split_gain=7.48096 4.12722 3.46943
threshold=0.4329138381487897 0.39974028150985036 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-5.9287466812133793 -5.8698616948621023 -5.8165104438518656 -5.8711990703235983
leaf_weight=20 29 29 22
leaf_count=20 29 29 22
internal_value=0 -5.84319 -5.8986
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=16
num_leaves=3
num_cat=0
split_feature=4 0
split_gain=7.09244 4.97742
threshold=0.67098217697241236 0.49642413305341432
decision_type=2 2
left_child=1 -1
right_child=-2 -3
leaf_value=-5.3225026983963817 -5.2427101247450887 -5.2669374602181573
leaf_weight=38 34 28
leaf_count=38 34 28
internal_value=0 -5.29893
internal_weight=0 66
internal_count=100 66
is_linear=0
shrinkage=0.1
Tree=17
num_leaves=4
num_cat=0
split_feature=2 3 0
split_gain=6.14628 3.95565 2.09787
threshold=0.39124460793750004 0.57209255545676052 0.31488953069403935
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.8033817863464359 -4.7557433400835309 -4.7020983519377531 -4.7599296722412108
leaf_weight=20 28 27 25
leaf_count=20 28 27 25
internal_value=0 -4.72941 -4.77924
internal_weight=0 55 45
internal_count=100 55 45
is_linear=0
shrinkage=0.1
Tree=18
num_leaves=4
num_cat=0
split_feature=1 4 3
split_gain=5.78174 3.21241 2.42224
threshold=0.4329138381487897 0.52674300595287049 0.48185106917748455
decision_type=2 2 2
left_child=2 -2 -1
right_child=1 -3 -4
leaf_value=-4.3300938415527339 -4.2781520597396359 -4.2309710891158501 -4.2820091941139919
leaf_weight=20 31 27 22
leaf_count=20 31 27 22
internal_value=0 -4.25619 -4.30491
internal_weight=0 58 42
internal_count=100 58 42
is_linear=0
shrinkage=0.1
Tree=19
num_leaves=3
num_cat=0
split_feature=4 2
split_gain=5.06465 3.34179
threshold=0.38168423736294915 0.60984868647320056
decision_type=2 2
left_child=-1 -2
right_child=1 -3
leaf_value=-3.8771305084228516 -3.8485698895576679 -3.7998270381580705
leaf_weight=39 39 22
leaf_count=39 39 22
internal_value=0 -3.83099
internal_weight=0 61
internal_count=100 61
is_linear=0
shrinkage=0.1
end of trees
feature_importances:
Column_3=7
Column_0=5
Column_1=5
Column_2=5
Column_4=5
parameters:
[boosting: gbdt]
[objective: regression]
[metric: l2]
[tree_learner: serial]
[device_type: cpu]
[data: ]
[valid: ]
[num_iterations: 10]
[learning_rate: 0.1]
[num_leaves: 31]
[num_threads: 0]
[deterministic: 0]
[force_col_wise: 0]
[force_row_wise: 0]
[histogram_pool_size: -1]
[max_depth: -1]
[min_data_in_leaf: 20]
[min_sum_hessian_in_leaf: 0.001]
[bagging_fraction: 1]
[pos_bagging_fraction: 1]
[neg_bagging_fraction: 1]
[bagging_freq: 0]
[bagging_seed: 3]
[feature_fraction: 1]
[feature_fraction_bynode: 1]
[feature_fraction_seed: 2]
[extra_trees: 0]
[extra_seed: 6]
[early_stopping_round: 0]
[first_metric_only: 0]
[max_delta_step: 0]
[lambda_l1: 0]
[lambda_l2: 0]
[linear_lambda: 0]
[min_gain_to_split: 0]
[drop_rate: 0.1]
[max_drop: 50]
[skip_drop: 0.5]
[xgboost_dart_mode: 0]
[uniform_drop: 0]
[drop_seed: 4]
[top_rate: 0.2]
[other_rate: 0.1]
[min_data_per_group: 100]
[max_cat_threshold: 32]
[cat_l2: 10]
[cat_smooth: 10]
[max_cat_to_onehot: 4]
[top_k: 20]
[monotone_constraints: ]
[monotone_constraints_method: basic]
[monotone_penalty: 0]
[feature_contri: ]
[forcedsplits_filename: ]
[refit_decay_rate: 0.9]
[cegb_tradeoff: 1]
[cegb_penalty_split: 0]
[cegb_penalty_feature_lazy: ]
[cegb_penalty_feature_coupled: ]
[path_smooth: 0]
[interaction_constraints: ]
[verbosity: 2]
[saved_feature_importance_type: 0]
[linear_tree: 0]
[max_bin: 255]
[max_bin_by_feature: ]
[min_data_in_bin: 3]
[bin_construct_sample_cnt: 200000]
[data_random_seed: 1]
[is_enable_sparse: 1]
[enable_bundle: 1]
[use_missing: 1]
[zero_as_missing: 0]
[feature_pre_filter: 1]
[pre_partition: 0]
[two_round: 0]
[header: 0]
[label_column: ]
[weight_column: ]
[group_column: ]
[ignore_column: ]
[categorical_feature: ]
[forcedbins_filename: ]
[precise_float_parser: 0]
[objective_seed: 5]
[num_class: 1]
[is_unbalance: 0]
[scale_pos_weight: 1]
[sigmoid: 1]
[boost_from_average: 1]
[reg_sqrt: 0]
[alpha: 0.9]
[fair_c: 1]
[poisson_max_delta_step: 0.7]
[tweedie_variance_power: 1.5]
[lambdarank_truncation_level: 30]
[lambdarank_norm: 1]
[label_gain: ]
[eval_at: ]
[multi_error_top_k: 1]
[auc_mu_weights: ]
[num_machines: 1]
[local_listen_port: 12400]
[time_out: 120]
[machine_list_filename: ]
[machines: ]
[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]
[num_gpu: 1]
end of parameters
pandas_categorical:null
@StrikerRUS Did you test the test case in the latest commit of this PR?
I've tested with v3.3.2 using the test case below:
def test_continue_train_different_feature_size(capsys):
    np.random.seed(0)
    train_X = np.hstack([np.ones(800).reshape(-1, 8), np.arange(200, 0, -1).reshape(-1, 2)])
    train_y = np.sum(train_X[:, -2:], axis=1)
    train_data = lgb.Dataset(train_X, label=train_y)
    params = {
        "objective": "regression",
        "num_trees": 10,
        "num_leaves": 31,
        "verbose": -1,
        'predict_disable_shape_check': True,
    }
    model = lgb.train(train_set=train_data, params=params)
    train_X_cont = np.random.rand(100, 5)
    train_y_cont = np.sum(train_X_cont, axis=1)
    train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont)
    params.update({"verbose": 2})
    lgb.train(train_set=train_data_cont, params=params, init_model=model)
    captured = capsys.readouterr()
    assert captured.out.find("features found in dataset for continual training, but at least") != -1
And the output is
FAILED test_engine.py::test_continue_train_different_feature_size - AssertionError: assert -1 != -1
which shows that the test case fails in v3.3.2.
@shiyu1994 This test should segfault or something like that (I don't remember exactly) in v3.3.2, but it doesn't.
The warning was added after this review thread was started, and we are not talking about it here.
@StrikerRUS Thanks. With the latest commit, it should now fail in v3.3.2. It will either segfault or raise the AssertionError at the line
assert captured.out.find("features found in dataset for continual training, but at least") != -1
in the latest version of the test case.
With the latest commit, it should now fail in v3.3.2.

It still doesn't fail in v3.3.2.
Please use a fixed dataset, not a random one, so that the segfault can be reproduced consistently in versions without the workaround from this PR.
Co-authored-by: Nikita Titov <[email protected]>
#endif
feature_importances[models_[iter]->split_feature(split_idx)] += 1.0;
if (static_cast<size_t>(real_feature_index) >= feature_importances.size()) {
  warn_about_feature_number = true;
Provide a warning instead of enlarging the feature importance vector. Since GBDT::FeatureImportance is used by the Python API, and the Python code doesn't know the final size of the feature importance vector, dynamically enlarging the vector would cause out-of-bound access.
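To illustrate the constraint from the Python side (a small sketch, not part of this PR's diff): the Python wrapper sizes the result array from the booster's own feature count before calling into the C API, so the C++ side cannot safely hand back more entries than that.

import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 10)
booster = lgb.train({"objective": "regression", "verbose": -1},
                    lgb.Dataset(X, label=X.sum(axis=1)), num_boost_round=5)

# The importance buffer is preallocated from num_feature() on the Python side,
# so its length is fixed regardless of how many features the trees reference.
importances = booster.feature_importance(importance_type="split")
assert len(importances) == booster.num_feature()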
I'm very confused by these changes, sorry.
My understanding of what was written in #5156 is that, given an initial model trained on a Dataset with n features, if you attempt to perform continued training on a Dataset with m features (m < n), LightGBM will not complain but there might be some out-of-bound access.
I also believe that #5156 mentions LibSVM files as an example because construction of a Dataset from a file might drop unsplittable columns (like those with all missing values) if feature_pre_filter=True.
Is my interpretation correct? If it is, that'll help me review this more effectively.
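If it helps the review, here is a condensed sketch of that scenario as described above (it mirrors the test case quoted earlier in this thread; as discussed there, random data is not a reliable reproducer because the splits may not land on the high feature indices):

import numpy as np
import lightgbm as lgb

# Initial model trained on a Dataset with 10 features.
X10 = np.random.rand(100, 10)
model = lgb.train({"objective": "regression", "verbose": -1},
                  lgb.Dataset(X10, label=X10.sum(axis=1)), num_boost_round=5)

# Continued training on a Dataset with only 5 features. Before this PR the
# library did not complain here, even though the initial model may reference
# feature indices >= 5, which is where the out-of-bound access can happen.
X5 = np.random.rand(100, 5)
params = {"objective": "regression", "verbose": -1,
          "predict_disable_shape_check": True}
lgb.train(params, lgb.Dataset(X5, label=X5.sum(axis=1)),
          num_boost_round=5, init_model=model)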
"num_trees": 10, | ||
"num_leaves": 31, | ||
"verbose": -1, | ||
'predict_disable_shape_check': True, |
Why is it necessary to pass predict_disable_shape_check? This test isn't generating any predictions. Can this be removed?
Ah, good observation! But predict() is called internally here when you set init scores from a previously trained model:
LightGBM/python-package/lightgbm/engine.py, lines 153 to 154 in 0018206:
elif isinstance(init_model, Booster):
    predictor = init_model._to_predictor(dict(init_model.params, **params))
LightGBM/python-package/lightgbm/engine.py, lines 160 to 161 in 0018206:
train_set._update_params(params) \
    ._set_predictor(predictor) \
LightGBM/python-package/lightgbm/basic.py, lines 1519 to 1522 in 0018206:
if isinstance(predictor, _InnerPredictor):
    if self._predictor is None and init_score is not None:
        _log_warning("The init_score will be overridden by the prediction of init_model.")
    self._set_init_score_by_predictor(predictor, data)
LightGBM/python-package/lightgbm/basic.py, line 1393 in 0018206:
def _set_init_score_by_predictor(self, predictor, data, used_indices=None):
LightGBM/python-package/lightgbm/basic.py, lines 1399 to 1402 in 0018206:
if predictor is not None:
    init_score = predictor.predict(data,
                                   raw_score=True,
                                   data_has_header=data_has_header)
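Put differently, here is a rough sketch of what happens (reusing the variable names from the test case quoted above; the real code path goes through _InnerPredictor rather than Booster.predict, so this is only an approximation): passing init_model is roughly equivalent to computing the init score yourself, and the shape check fires during that implicit prediction.

# Approximately what lgb.train(..., init_model=model) does with the new data:
init_score = model.predict(train_X_cont, raw_score=True)  # shape check happens here
train_data_cont = lgb.Dataset(train_X_cont, label=train_y_cont, init_score=init_score)
booster = lgb.train(params, train_data_cont)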
Ah, I see! So then is the bug addressed by this PR only present when you've also set predict_disable_shape_check = True? I expect that most LightGBM users will not be aware that that prediction parameter is relevant for training continuation.
Yes, without setting predict_disable_shape_check = True users get the following error:
LightGBMError: The number of features in data (5) is not the same as it was in training data (10).
You can set ``predict_disable_shape_check=true`` to discard this error, but please be aware what you are doing.
I think the second sentence of the error ("You can set predict_disable_shape_check=true to discard this error ...") helps users understand this complicated relationship.
predict_disable_shape_check is necessary because a numpy matrix is used in the test case here. If we used a LibSVM file as the dataset in the test case, we wouldn't need predict_disable_shape_check=True. I use a numpy matrix here because I think it is more convenient than LibSVM in that it does not require generating files in test cases.
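For illustration, a sketch of that file-based variant (dump_svmlight_file is just a convenient way to produce a LibSVM file and is not part of this PR; it reuses model and params from the test case above, and assumes, as stated in this thread, that a Dataset built from a file path does not go through the in-memory shape check):

import numpy as np
from sklearn.datasets import dump_svmlight_file
import lightgbm as lgb

train_X_cont = np.random.rand(100, 5)
train_y_cont = np.sum(train_X_cont, axis=1)
# Depending on how feature indices should be numbered, zero_based may need adjusting.
dump_svmlight_file(train_X_cont, train_y_cont, "cont.svm", zero_based=True)

# Build the continuation Dataset from the file path instead of a numpy matrix;
# predict_disable_shape_check is not needed in this form.
train_data_cont = lgb.Dataset("cont.svm")
booster = lgb.train(params, train_data_cont, init_model=model)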
}
}
}
} else {
Log::Fatal("Unknown importance type: only support split=0 and gain=1");
}
if (warn_about_feature_number) {
Log::Warning("Only %d features found in dataset for continual training, but at least %d features found in initial model.",
Is it possible to add some information here that can help users understand what they should do about this warning? Or is this warning just being raised for the purpose of the unit test?
I've added
Log::Warning("Please check the number of features used in continual training.");
in d719d78.
Do you think it is appropriate?
Thanks! But I don't think "please check" will be any more informative for users, and I think that providing two separate WARN-level log messages for the same problem might be confusing.
Would you consider this?
Log::Warning(
  "Only %d features found in dataset, but at least %d features found in initial model. "
  "If this is intentional and the features in dataset match those used to train the initial model, no action is required. "
  "If not, training continuation may fail or produce bad results. "
  "To suppress this warning, provide a dataset with at least %d features.",
  static_cast<int>(feature_importances.size()),
  max_feature_index_found + 1,
  max_feature_index_found + 1);
Here we only focus on the original features, not the features preprocessed by LightGBM. This is to fix cases such as the following: first, a model A is trained with …
Just a reminder: the merge-blocking thread is #5157 (comment).
Ah ok, thanks @shiyu1994, now I understand. I guess the phrase "When the input dataset is from LibSVM file" was used in #5156 because loading from a file bypasses …
Exactly.
Close and reopen to retrigger CI tests.
Just to prevent accidental merge.
#5157 (comment)
@guolinke @jameslamb @StrikerRUS Could you please help to verify again whether the latest test case fails with v3.3.2? I've tested on my Linux machine and it fails every time. Note that now we use a fixed dataset, without any randomness. Thanks!
Just checked again, the latest test doesn't fail on my Windows machine with …
@shiyu1994 could you update this PR? Let me know if you need help with any part of it.
This is to fix #5156.