more correct way to build node stats in distributed fast hist #4140
There is a bug when dealing with the dense data layout... sigh.
@CodingCat Do you need help?
@hcho3 Thanks, it's fine now, though the logic for depthwise growth becomes a bit tricky.
It's ready for review now.
@trivialfis @RAMitchell @hcho3 I believe this is the last commit I will put into distributed hist. Thanks for the review.
Can you please explain in a few words the purpose of this PR and what led you to make these changes? I think I have an idea, but I want to make sure I understand. I don't like using the last slot of the histogram to store the node stats. I believe that if something has a distinct purpose, it should be a separate variable. This can very easily create problems later, given that the histogram array may not always be the same size depending on some state. I definitely recommend reading this as a reference for C++ architecture: https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rp-direct Some recent changes of mine (#4015) may actually mean that we can get the summary statistics from a cache after split finding, so we do not need to recalculate them at every node, only at the root. This might simplify your PR.
@RAMitchell The current approach in the master branch uses the first feature's histogram to build the node's stats. This assumes that the feature is not "too sparse", so that summing all values in the bins across workers gives an approximate result. That does not hold in some cases, and it leads to a no-grow tree, as indicated in #4127. Using the last bin (which is actually untouched by the histogram building algorithm) to store the stats is meant to mimic the behavior of If I understand #4015 correctly, a lot of code (e.g. xgboost/src/tree/updater_quantile_hist.cc Lines 757 to 796 in d506a8b
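To illustrate the failure mode described in the comment above, here is a minimal sketch of how summing the bins of a sparse first feature undercounts a node's stats. All names (`GradStats`, `Row`, `BuildFeature0Hist`, and so on) are hypothetical and simplified for illustration; this is not XGBoost's actual code, and it assumes that rows with a missing value contribute to no bin of that feature's histogram.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a gradient/hessian pair.
struct GradStats {
  double grad;
  double hess;
};

// One training row: its gradient pair and the bin index of feature 0.
// A negative bin index means feature 0 is missing for this row (assumption).
struct Row {
  GradStats gpair;
  int bin_of_feature0;
};

// Build the histogram of feature 0. Rows where the feature is missing
// are skipped, so they contribute to no bin.
std::vector<GradStats> BuildFeature0Hist(const std::vector<Row>& rows,
                                         std::size_t n_bins) {
  std::vector<GradStats> hist(n_bins);  // value-initialized to zero
  for (const Row& r : rows) {
    if (r.bin_of_feature0 >= 0) {
      GradStats& bin = hist[static_cast<std::size_t>(r.bin_of_feature0)];
      bin.grad += r.gpair.grad;
      bin.hess += r.gpair.hess;
    }
  }
  return hist;
}

// Node stats approximated by summing all bins of one feature's histogram.
GradStats StatsFromHist(const std::vector<GradStats>& hist) {
  GradStats total{0.0, 0.0};
  for (const GradStats& bin : hist) {
    total.grad += bin.grad;
    total.hess += bin.hess;
  }
  return total;
}

// Node stats computed directly from every row in the node.
GradStats StatsFromRows(const std::vector<Row>& rows) {
  GradStats total{0.0, 0.0};
  for (const Row& r : rows) {
    total.grad += r.gpair.grad;
    total.hess += r.gpair.hess;
  }
  return total;
}
```

With four rows of which only one has a value for feature 0, `StatsFromHist` sees a quarter of the hessian mass that `StatsFromRows` sees. A node whose stats are this far off can look too small to split, which is one plausible way to end up with a tree that does not grow.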
I like the idea in #4015, @RAMitchell; this should save a significant number of CPU cycles. I would move forward to remove
@RAMitchell I have made the changes based on #4015. Thanks for the review.
4d086b9 to 03ef60a
@CodingCat Is this ready to merge?
I think it's ready; I'm waiting for confirmation from any of you.
The current approach uses the first feature's histogram to build the node stats.
However, when this feature is too sparse, the result is a tree without any growth.
This PR addresses the issue by
(1) syncing only the root's stats across workers
(2) using the stats cache implemented in #4015 to get the stats of all other nodes
closes #4127
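The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `NodeStatsCache`, `SplitCandidate`, and `AllReduceSum` are hypothetical names, the allreduce is simulated by summing a vector of per-worker values, and the cache is assumed to be filled from the left and right sums of the chosen split.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for a gradient/hessian pair.
struct GradStats {
  double grad;
  double hess;
};

// Stand-in for an allreduce-sum across workers: here the per-worker
// values are simply passed in and summed locally (assumption).
GradStats AllReduceSum(const std::vector<GradStats>& per_worker) {
  GradStats total{0.0, 0.0};
  for (std::size_t i = 0; i < per_worker.size(); ++i) {
    total.grad += per_worker[i].grad;
    total.hess += per_worker[i].hess;
  }
  return total;
}

// Result of split finding at one node: the gradient sums that go to
// the left and right child (hypothetical shape).
struct SplitCandidate {
  GradStats left_sum;
  GradStats right_sum;
};

// Hypothetical cache mapping node id -> node stats.
class NodeStatsCache {
 public:
  // (1) Only the root needs a cross-worker sync of its stats.
  void SetRoot(const std::vector<GradStats>& per_worker_local) {
    stats_[0] = AllReduceSum(per_worker_local);
  }

  // (2) Children reuse the sums already computed during split finding,
  // so no histogram pass or extra sync is needed for them.
  void ApplySplit(int left_id, int right_id, const SplitCandidate& best) {
    stats_[left_id] = best.left_sum;
    stats_[right_id] = best.right_sum;
  }

  // Throws std::out_of_range if the node has no cached stats.
  const GradStats& Get(int nid) const { return stats_.at(nid); }

 private:
  std::unordered_map<int, GradStats> stats_;
};
```

The design point is that node stats no longer depend on any single feature's histogram: the root's stats come from a sum over all rows on all workers, and every other node inherits stats from its parent's split.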