Clear split info buffer in cost efficient gradient boosting before every iteration (fix partially #3679) #5164

shiyu1994 · 2022-04-21T08:53:13Z

This fixes the case provided by @mshivers in #3679. It turns out that the bug is not in the core algorithm of LightGBM, but in the cost-efficient gradient boosting (CEGB) module.

In the code snippet below, the best split of SerialTreeLearner can be updated by CostEfficientGradientBoosting:

LightGBM/src/treelearner/cost_effective_gradient_boosting.hpp

Lines 85 to 110 in fc0c8fd

    
    void UpdateLeafBestSplits(Tree* tree, int best_leaf,
                              const SplitInfo* best_split_info,
                              std::vector<SplitInfo>* best_split_per_leaf) {
      auto config = tree_learner_->config_;
      auto train_data = tree_learner_->train_data_;
      const int inner_feature_index =
          train_data->InnerFeatureIndex(best_split_info->feature);
      auto& ref_best_split_per_leaf = *best_split_per_leaf;
      if (!config->cegb_penalty_feature_coupled.empty() &&
          !is_feature_used_in_split_[inner_feature_index]) {
        is_feature_used_in_split_[inner_feature_index] = true;
        for (int i = 0; i < tree->num_leaves(); ++i) {
          if (i == best_leaf) continue;
          auto split = &splits_per_leaf_[static_cast<size_t>(i) *
                                             train_data->num_features() +
                                         inner_feature_index];
          split->gain +=
              config->cegb_tradeoff *
              config->cegb_penalty_feature_coupled[best_split_info->feature];
          // Avoid to update the leaf that cannot split
          if (ref_best_split_per_leaf[i].gain > kMinScore &&
              *split > ref_best_split_per_leaf[i]) {
            ref_best_split_per_leaf[i] = *split;
          }
        }
      }

However, CostEfficientGradientBoosting did not clear its split buffer, splits_per_leaf_, before a new boosting iteration started, so the buffer still contained splits from previous trees. Those stale splits carry wrong split information (sums of gradients and hessians, gain, etc.) and get mixed into the current tree.
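To make the failure mode concrete, here is a toy Python model of a per-(leaf, feature) split cache. It is only a sketch and not LightGBM code: `ToySplit`, `ToySplitBuffer`, `slot_seen_by_second_tree`, and the numbers used are hypothetical, and `K_MIN_SCORE` merely plays the role of `kMinScore`. It shows that, without a reset between trees, a slot the second tree has not evaluated yet still holds the first tree's gain and counts.

```python
from dataclasses import dataclass

K_MIN_SCORE = float("-inf")  # stand-in for kMinScore (assumption, not the real constant)


@dataclass
class ToySplit:
    """Hypothetical, heavily reduced stand-in for SplitInfo."""
    gain: float = K_MIN_SCORE
    left_count: int = 0


class ToySplitBuffer:
    """Flat cache indexed by leaf * num_features + feature, like splits_per_leaf_."""

    def __init__(self, num_leaves: int, num_features: int) -> None:
        self.num_features = num_features
        self.slots = [ToySplit() for _ in range(num_leaves * num_features)]

    def slot(self, leaf: int, feature: int) -> ToySplit:
        return self.slots[leaf * self.num_features + feature]

    def store(self, leaf: int, feature: int, gain: float, left_count: int) -> None:
        s = self.slot(leaf, feature)
        s.gain, s.left_count = gain, left_count

    def reset(self) -> None:
        # The idea behind the fix: wipe every cached split before a new tree.
        for s in self.slots:
            s.gain, s.left_count = K_MIN_SCORE, 0


def slot_seen_by_second_tree(reset_between_trees: bool) -> ToySplit:
    buf = ToySplitBuffer(num_leaves=2, num_features=1)
    # Tree 1 caches a candidate split for leaf 1 on feature 0.
    buf.store(leaf=1, feature=0, gain=5.0, left_count=40)
    if reset_between_trees:
        buf.reset()
    # Tree 2 has not evaluated leaf 1 yet, so any real values here are leftovers.
    return buf.slot(leaf=1, feature=0)


stale = slot_seen_by_second_tree(reset_between_trees=False)
fresh = slot_seen_by_second_tree(reset_between_trees=True)
print("without reset:", stale)
print("with reset:   ", fresh)
```

In the toy model the stale slot would pass a `gain > best` style comparison and overwrite the current tree's best split with counts that belong to a different tree, whereas the reset slot cannot win that comparison.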

jameslamb

Did you accidentally tag the wrong issue in the description? This doesn't seem related to #4969 at all.

Also, is it possible to create a test that reproduces the bug this PR addresses, to ensure this fix is working and prevent it from being accidentally re-introduced in the future?

shiyu1994 · 2022-04-22T03:32:49Z

> Did you accidentally tag the wrong issue in the description? This doesn't seem related to #4969 at all.

Sorry, it should be #4946. Corrected.

bluesummerv · 2022-04-24T16:38:25Z

    import numpy as np
    import lightgbm as lgb

    n1 = 800000
    n2 = 2200
    x = np.random.random((n1, n2))
    y = np.random.random(n1)
    model = lgb.LGBMRegressor(num_leaves=100, n_estimators=20, device='gpu')
    model.fit(x, y)

Segmentation fault (core dumped)

Thanks! @shiyu1994

shiyu1994 · 2022-05-05T02:32:28Z

@bluesummerv Thanks for using LightGBM. It seems that your example is not related to this PR. Could you please open a new issue for your bug report?

shiyu1994 · 2022-05-05T02:36:28Z

> Sorry, it should be #4946. Corrected.

More accurately, this PR only fixes the case provided by @mshivers in #3679.

shiyu1994 · 2022-05-05T03:05:07Z

> Also, is it possible to create a test that reproduces the bug this PR addresses, to ensure this fix is working and prevent it from being accidentally re-introduced in the future?

@jameslamb Thanks for the suggestion. But TBH I don't think we need a test case for this fix. It just fixes a logical mistake in the CEGB code, which is not a corner case that is likely to be reintroduced by future modifications. I think it makes sense to add test cases to guarantee that the code works in some tricky scenarios. But perhaps it is not necessary to add a test case for obvious logical mistakes in programming. WDYT?

jameslamb · 2022-05-05T03:46:37Z

> WDYT?

I believe a test should be added in this PR.

I strongly believe that PRs which fix bugs in software should also introduce tests confirming that the software no longer exhibits those bugs, unless the effort involved in creating the tests or the cost of running the tests is too large.

Given that we already have a clear reproducible example of a bug in LightGBM (#3679 (comment)), the effort required to convert it to a test case should not be too much, and I think it is worth it for the increased confidence that this bug won't be reintroduced. That test would also give us some additional code coverage of using CEGB, which isn't tested thoroughly in the project today.

In my opinion, tests should not just be used for "corner cases" or "tricky scenarios". Tests should try to cover the ways that users might use the software, and check that the software behaves as those users would expect it to. Testing gives us confidence that once a bug is fixed, it won't be reintroduced.

This project is very large and there are many, many possible code paths based on e.g. different combinations of training parameters, compilation options, data characteristics, operating systems, etc. That set of combinations is far too large for us to just rely on maintainers' knowledge and pull request reviews to catch regressions.

shiyu1994 · 2022-05-09T03:38:09Z

@jameslamb Done with adding the test case. Please check.

src/treelearner/cost_effective_gradient_boosting.hpp

StrikerRUS · 2022-05-09T22:02:37Z

@guolinke This is a very important bugfix for our long living bug. Could you please help to review it?

shiyu1994 · 2022-05-10T02:33:20Z

> bugfix for our long living bug

Note that this only fixes the case when CEGB is used (that's why the word partially is used in the title of this PR). I need further verification to see if the same bug (Check failed: (best_split_info.left_count) > (0)) still happens in normal cases.

StrikerRUS · 2022-05-10T21:09:35Z

> Note that this only fixes the case when CEGB is used (that's why the word partially is used in the title of this PR).

Thanks for the clarification!

However, GitHub's autoclosing mechanism doesn't understand this and will close #3679 after this PR is merged.
https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword
I propose changing the title to "fix partially #3679" to not close that issue automatically.

tests/python_package_test/test_basic.py

tests/python_package_test/test_engine.py

StrikerRUS

LGTM, thanks!

shiyu1994 · 2022-06-07T07:50:26Z

@StrikerRUS

> LGTM, thanks!

Could you please remove the change requests so that we can merge this PR? Currently it is blocked by the change requests. Thanks!

jameslamb

> Could you please remove the change requests

Looks like that's from my initial review.

I've removed it here. Thanks for the fix!

github-actions · 2023-08-19T03:52:08Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

clear split info buffer in cegb_ before every iteration

703d11a

shiyu1994 requested review from guolinke, btrotta, hzy46 and tongwu-sh as code owners April 21, 2022 08:53

shiyu1994 mentioned this pull request Apr 21, 2022

Fix program stop when split data count equals zero #5087

Closed

shiyu1994 added the fix label Apr 21, 2022

jameslamb self-requested a review April 22, 2022 03:21

jameslamb requested changes Apr 22, 2022

shiyu1994 changed the title ~~Clear split info buffer in cost efficient gradient boosting before every iteration (fix #4969)~~ Clear split info buffer in cost efficient gradient boosting before every iteration (fix #4946) Apr 22, 2022

jameslamb mentioned this pull request Apr 22, 2022

fix typo in CEGB method name #5168

Merged

Merge branch 'master' into fix-4946

b31e1d8

shiyu1994 changed the title ~~Clear split info buffer in cost efficient gradient boosting before every iteration (fix #4946)~~ Clear split info buffer in cost efficient gradient boosting before every iteration (partially fix #3679) May 5, 2022

shiyu1994 added 2 commits May 5, 2022 02:47

check nullable of cegb_ in serial_tree_learner.cpp

fad48d4

Merge branch 'fix-4946' of https://github.com/shiyu1994/LightGBM into…

ff7d89d

… fix-4946

add a test case for checking the split buffer in CEGB

0bc4333

shiyu1994 requested review from StrikerRUS and jmoralez as code owners May 9, 2022 03:37

shiyu1994 requested a review from jameslamb May 9, 2022 05:49

shiyu1994 added the awaiting review label May 9, 2022

StrikerRUS reviewed May 9, 2022

src/treelearner/cost_effective_gradient_boosting.hpp (outdated, resolved)

swith to Threading::For instead of raw OpenMP

45de0a4

StrikerRUS changed the title ~~Clear split info buffer in cost efficient gradient boosting before every iteration (partially fix #3679)~~ Clear split info buffer in cost efficient gradient boosting before every iteration (fix partially #3679) May 10, 2022

StrikerRUS requested changes May 10, 2022

apply review suggestions

a04d004

StrikerRUS reviewed May 11, 2022

tests/python_package_test/test_engine.py (outdated, resolved)

tests/python_package_test/test_engine.py (outdated, resolved)

apply review comments

f8e8170

StrikerRUS reviewed May 14, 2022

tests/python_package_test/test_engine.py (outdated, resolved)

remove device cpu

5235131

shiyu1994 requested a review from StrikerRUS June 1, 2022 14:15

StrikerRUS approved these changes Jun 5, 2022

guolinke approved these changes Jun 7, 2022

jameslamb approved these changes Jun 8, 2022

jameslamb removed the awaiting review label Jun 8, 2022

jameslamb merged commit f1328d5 into microsoft:master Jun 8, 2022

jameslamb mentioned this pull request Jun 24, 2022

Check failed: (best_split_info.left_count) > (0) when using the ''cegb_penalty_feature_coupled'' parameter #5317

Closed

jameslamb mentioned this pull request Oct 7, 2022

[DO NOT MERGE] Release v3.3.3 #5525

Closed

40 tasks

github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
