
more correct way to build node stats in distributed fast hist #4140

Merged: 72 commits into dmlc:master, Feb 18, 2019

Conversation

@CodingCat (Member) commented Feb 14, 2019

The current approach chooses the first feature's histogram to build the node stats.

However, when that feature is too sparse, the result is a tree without any growth.

This PR addresses the issue by:

(1) syncing only the root's stats;

(2) using the stats cache implemented in #4015 to derive the stats of all other nodes.

closes #4127
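
A minimal sketch of step (1), assuming a hypothetical `GradStats` type and `SyncRootStats` helper (neither is the merged XGBoost code; only the rabit allreduce call is real API):

```cpp
// Hypothetical sketch: only the root's stats are built from the local
// gradient pairs and synced once across workers with a single allreduce.
#include <rabit/rabit.h>  // header path may differ across rabit versions

#include <vector>

struct GradStats {
  double sum_grad{0.0};
  double sum_hess{0.0};
};

GradStats SyncRootStats(const std::vector<GradStats>& local_gpair) {
  GradStats root;
  for (const auto& gp : local_gpair) {  // local partial sums on this worker
    root.sum_grad += gp.sum_grad;
    root.sum_hess += gp.sum_hess;
  }
  double buf[2] = {root.sum_grad, root.sum_hess};
  rabit::Allreduce<rabit::op::Sum>(buf, 2);  // one allreduce, root only
  return {buf[0], buf[1]};
}
```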

@CodingCat CodingCat changed the title more correct way to build node stats in distributed fast hist [WIP]more correct way to build node stats in distributed fast hist Feb 14, 2019
@CodingCat (Member Author)

there is a bug when dealing with the dense data layout... sigh...

@hcho3 (Collaborator) commented Feb 14, 2019

@CodingCat Do you need help?

@CodingCat CodingCat changed the title [WIP]more correct way to build node stats in distributed fast hist more correct way to build node stats in distributed fast hist Feb 14, 2019
@CodingCat (Member Author)

@hcho3 thanks, it's fine now, though the logic for depthwise growth becomes a bit tricky

@CodingCat (Member Author)

It's ready for review now.

@CodingCat (Member Author)

@trivialfis @RAMitchell @hcho3 I believe this is the last commit I will put into distributed hist... thanks for the review

@RAMitchell (Member)

Can you please explain a little bit in words the purpose of this PR and what led you to make these changes? I think I have an idea but I want to make sure I understand.

I don't like using the last slot of the histogram to store the node stats. I believe that if something has a distinct purpose, it should be a separate variable. Reusing the slot can very easily create problems later, given that the histogram array may not always be the same size depending on some state. I definitely recommend reading this as a reference for C++ architecture: https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rp-direct

Some recent changes of mine (#4015) may actually mean that we can get the summary statistics from a cache after split finding, so we do not need to recalculate them at every node, only at the root. This might simplify your PR.
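
A minimal sketch of that cache idea (the names `SplitRecord` and `ChildStats` are illustrative, not #4015's actual interface): when the best split is found, the left child's gradient sum is already accumulated, so both children's stats follow by subtraction:

```cpp
// Hypothetical illustration: derive child stats from values cached during
// split finding, with no extra histogram pass and no extra rabit call.
struct GradStats {
  double sum_grad{0.0};
  double sum_hess{0.0};
};

struct SplitRecord {
  GradStats parent;  // cached when the node was expanded
  GradStats left;    // accumulated while enumerating split candidates
};

void ChildStats(const SplitRecord& rec, GradStats* left, GradStats* right) {
  *left = rec.left;
  // Right child = parent - left, since every row goes to exactly one side.
  right->sum_grad = rec.parent.sum_grad - rec.left.sum_grad;
  right->sum_hess = rec.parent.sum_hess - rec.left.sum_hess;
}
```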

@CodingCat (Member Author) commented Feb 14, 2019

@RAMitchell The current approach in the master branch is to use the first feature's histogram to build the node's stats. This assumes the feature is not "too sparse", so that summing up all of its bin values across workers gives an approximately correct result. That assumption does not hold in some cases, and it leads to a tree that never grows, as reported in #4127.

Using the last bin (which is never touched by the histogram-building algorithm) to store the stats mimics the behavior of approx; the motivation is to save a rabit call, which is slow and is the main reason we don't see a significant speedup when increasing the number of workers beyond some point.
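
An illustrative sketch of that "extra slot" trick (not the merged implementation; `GradPair` and `SyncHistogramWithStats` are made-up names, and the flat-double reinterpretation is a simplification):

```cpp
// Hypothetical sketch: the per-node stats ride along in the last histogram
// slot so a single rabit allreduce syncs the histogram and the stats.
#include <rabit/rabit.h>  // header path may differ across rabit versions

#include <vector>

struct GradPair {
  double grad{0.0};
  double hess{0.0};
};

// hist has nbins + 1 entries; the extra slot carries the node stats.
void SyncHistogramWithStats(std::vector<GradPair>* hist,
                            const GradPair& local_node_stats) {
  hist->back() = local_node_stats;  // slot the histogram builder never writes
  // Sum element-wise across workers; GradPair is two contiguous doubles,
  // so the buffer is reinterpreted as a flat double array for the call.
  rabit::Allreduce<rabit::op::Sum>(
      reinterpret_cast<double*>(hist->data()), hist->size() * 2);
  // Afterwards hist->back() holds the globally summed node stats.
}
```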

If I understand #4015 correctly, a lot of code, e.g. the following, is unnecessary (after its first call)?

```cpp
if (rabit::IsDistributed()) {
  // in distributed mode, the node's stats should be calculated from histogram, otherwise,
  // we will have wrong results in EnumerateSplit()
  // here we take the last feature in cut
  auto begin = hist.data();
  for (size_t i = gmat.cut.row_ptr[0]; i < gmat.cut.row_ptr[1]; i++) {
    stats.Add(begin[i].sum_grad, begin[i].sum_hess);
  }
} else {
  if (data_layout_ == kDenseDataZeroBased || data_layout_ == kDenseDataOneBased ||
      rabit::IsDistributed()) {
    /* specialized code for dense data
       For dense data (with no missing value),
       the sum of gradient histogram is equal to snode[nid]
       GHistRow hist = hist_[nid]; */
    const std::vector<uint32_t>& row_ptr = gmat.cut.row_ptr;
    const uint32_t ibegin = row_ptr[fid_least_bins_];
    const uint32_t iend = row_ptr[fid_least_bins_ + 1];
    auto begin = hist.data();
    for (uint32_t i = ibegin; i < iend; ++i) {
      const GradStats et = begin[i];
      stats.Add(et.sum_grad, et.sum_hess);
    }
  } else {
    const RowSetCollection::Elem e = row_set_collection_[nid];
    for (const size_t* it = e.begin; it < e.end; ++it) {
      stats.Add(gpair[*it]);
    }
  }
}
// calculating the weights
{
  bst_uint parentid = tree[nid].Parent();
  snode_[nid].weight = static_cast<float>(
      spliteval_->ComputeWeight(parentid, snode_[nid].stats));
  snode_[nid].root_gain = static_cast<float>(
      spliteval_->ComputeScore(parentid, snode_[nid].stats, snode_[nid].weight));
}
```

@CodingCat (Member Author)

I like the idea in #4015, @RAMitchell; this should save a significant number of CPU cycles. I will move forward and remove InitNewNode in hist here.

@CodingCat (Member Author)

@RAMitchell I have made the changes based on #4015; thanks for the review.

@hcho3 (Collaborator) commented Feb 18, 2019

@CodingCat Is this ready to merge?

@CodingCat (Member Author)

I think it's ready; waiting for confirmation from any of you.

@CodingCat CodingCat merged commit 1dac5e2 into dmlc:master Feb 18, 2019
@lock (bot) locked as resolved and limited conversation to collaborators on May 19, 2019