
more correct way to build node stats in distributed fast hist #4140

Merged: 72 commits into dmlc:master, Feb 18, 2019

Conversation

@CodingCat (Member) commented Feb 14, 2019

The current approach chooses the first feature's histogram to build the node stats.

However, when that feature is too sparse, the result is a tree without any growth.

This PR addresses the issue by:

(1) syncing only the root's stats;

(2) using the stats cache implemented in #4015 to derive the stats of all other nodes.

closes #4127
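
A minimal sketch of step (1), assuming a hypothetical `GradStats` type and `SyncRootStats` helper (neither is the merged XGBoost code; only the rabit allreduce call is real API):

```cpp
// Hypothetical sketch: only the root's stats are built from the local
// gradient pairs and synced once across workers with a single allreduce.
#include <rabit/rabit.h>  // header path may differ across rabit versions

#include <vector>

struct GradStats {
  double sum_grad{0.0};
  double sum_hess{0.0};
};

GradStats SyncRootStats(const std::vector<GradStats>& local_gpair) {
  GradStats root;
  for (const auto& gp : local_gpair) {  // local partial sums on this worker
    root.sum_grad += gp.sum_grad;
    root.sum_hess += gp.sum_hess;
  }
  double buf[2] = {root.sum_grad, root.sum_hess};
  rabit::Allreduce<rabit::op::Sum>(buf, 2);  // one allreduce, root only
  return {buf[0], buf[1]};
}
```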

@CodingCat CodingCat changed the title more correct way to build node stats in distributed fast hist [WIP]more correct way to build node stats in distributed fast hist Feb 14, 2019
@CodingCat (Member Author)

there is a bug when dealing with the dense data layout... sigh...

@hcho3 (Collaborator) commented Feb 14, 2019

@CodingCat Do you need help?

@CodingCat CodingCat changed the title [WIP]more correct way to build node stats in distributed fast hist more correct way to build node stats in distributed fast hist Feb 14, 2019
@CodingCat (Member Author)

@hcho3 thanks, it's fine now, though the logic for depthwise growth becomes a bit tricky

@CodingCat (Member Author)

It's ready for review now.

@CodingCat (Member Author)

@trivialfis @RAMitchell @hcho3 I believe this is the last commit I will put into distributed hist... thanks for the review

@RAMitchell (Member)

Can you please explain a little bit in words the purpose of this PR and what led you to make these changes? I think I have an idea but I want to make sure I understand.

I don't like using the last slot of the histogram to store the node stats. I believe that if something has a distinct purpose, it should be a separate variable. Reusing the slot can very easily create problems later, given that the histogram array may not always be the same size depending on some state. I definitely recommend reading this as a reference for C++ architecture: https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rp-direct

Some recent changes of mine (#4015) may actually mean that we can get the summary statistics from a cache after split finding, so we do not need to recalculate them at every node, only at the root. This might simplify your PR.
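
A minimal sketch of that cache idea (the names `SplitRecord` and `ChildStats` are illustrative, not #4015's actual interface): when the best split is found, the left child's gradient sum is already accumulated, so both children's stats follow by subtraction:

```cpp
// Hypothetical illustration: derive child stats from values cached during
// split finding, with no extra histogram pass and no extra rabit call.
struct GradStats {
  double sum_grad{0.0};
  double sum_hess{0.0};
};

struct SplitRecord {
  GradStats parent;  // cached when the node was expanded
  GradStats left;    // accumulated while enumerating split candidates
};

void ChildStats(const SplitRecord& rec, GradStats* left, GradStats* right) {
  *left = rec.left;
  // Right child = parent - left, since every row goes to exactly one side.
  right->sum_grad = rec.parent.sum_grad - rec.left.sum_grad;
  right->sum_hess = rec.parent.sum_hess - rec.left.sum_hess;
}
```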

@CodingCat (Member Author) commented Feb 14, 2019

@RAMitchell The current approach in the master branch is to use the first feature's histogram to build the node's stats. This assumes the feature is not "too sparse", so that summing up all of its bin values across workers gives an approximately correct result. That assumption does not hold in some cases, and it leads to a tree that never grows, as reported in #4127.

Using the last bin (which is never touched by the histogram-building algorithm) to store the stats mimics the behavior of approx; the motivation is to save a rabit call, which is slow and is the main reason we don't see a significant speedup when increasing the number of workers beyond some point.
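
An illustrative sketch of that "extra slot" trick (not the merged implementation; `GradPair` and `SyncHistogramWithStats` are made-up names, and the flat-double reinterpretation is a simplification):

```cpp
// Hypothetical sketch: the per-node stats ride along in the last histogram
// slot so a single rabit allreduce syncs the histogram and the stats.
#include <rabit/rabit.h>  // header path may differ across rabit versions

#include <vector>

struct GradPair {
  double grad{0.0};
  double hess{0.0};
};

// hist has nbins + 1 entries; the extra slot carries the node stats.
void SyncHistogramWithStats(std::vector<GradPair>* hist,
                            const GradPair& local_node_stats) {
  hist->back() = local_node_stats;  // slot the histogram builder never writes
  // Sum element-wise across workers; GradPair is two contiguous doubles,
  // so the buffer is reinterpreted as a flat double array for the call.
  rabit::Allreduce<rabit::op::Sum>(
      reinterpret_cast<double*>(hist->data()), hist->size() * 2);
  // Afterwards hist->back() holds the globally summed node stats.
}
```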

If I understand #4015 correctly, a lot of code, e.g. the following, is unnecessary (after its first call)?

```cpp
if (rabit::IsDistributed()) {
  // in distributed mode, the node's stats should be calculated from histogram, otherwise,
  // we will have wrong results in EnumerateSplit()
  // here we take the last feature in cut
  auto begin = hist.data();
  for (size_t i = gmat.cut.row_ptr[0]; i < gmat.cut.row_ptr[1]; i++) {
    stats.Add(begin[i].sum_grad, begin[i].sum_hess);
  }
} else {
  if (data_layout_ == kDenseDataZeroBased || data_layout_ == kDenseDataOneBased ||
      rabit::IsDistributed()) {
    /* specialized code for dense data
       For dense data (with no missing value),
       the sum of gradient histogram is equal to snode[nid]
       GHistRow hist = hist_[nid]; */
    const std::vector<uint32_t>& row_ptr = gmat.cut.row_ptr;
    const uint32_t ibegin = row_ptr[fid_least_bins_];
    const uint32_t iend = row_ptr[fid_least_bins_ + 1];
    auto begin = hist.data();
    for (uint32_t i = ibegin; i < iend; ++i) {
      const GradStats et = begin[i];
      stats.Add(et.sum_grad, et.sum_hess);
    }
  } else {
    const RowSetCollection::Elem e = row_set_collection_[nid];
    for (const size_t* it = e.begin; it < e.end; ++it) {
      stats.Add(gpair[*it]);
    }
  }
}
// calculating the weights
{
  bst_uint parentid = tree[nid].Parent();
  snode_[nid].weight = static_cast<float>(
      spliteval_->ComputeWeight(parentid, snode_[nid].stats));
  snode_[nid].root_gain = static_cast<float>(
      spliteval_->ComputeScore(parentid, snode_[nid].stats, snode_[nid].weight));
}
```

@CodingCat (Member Author)

I like the idea in #4015, @RAMitchell; this should save a significant number of CPU cycles. I will move forward and remove InitNewNode in hist here.

@CodingCat (Member Author)

@RAMitchell I have made the changes based on #4015; thanks for the review.

@hcho3 (Collaborator) commented Feb 18, 2019

@CodingCat Is this ready to merge?

@CodingCat (Member Author)

I think it's ready; waiting for confirmation from any of you.

@CodingCat CodingCat merged commit 1dac5e2 into dmlc:master Feb 18, 2019
@lock (bot) locked as resolved and limited conversation to collaborators on May 19, 2019