
Optimizations of distributed 'hist' mode for CPUs #4824

Closed · wants to merge 5 commits

Conversation

@SmirnovEgorRu (Contributor)

I observed that distributed XGBoost on 1 node is slower than non-distributed (batch) XGBoost. The difference is 1.5-3x depending on the data set and parameters. The reasons for this:

  • We always built the full histogram for the left tree node and computed the right node by the subtraction trick. I added logic to build the full histogram for the smaller of the two children instead, chosen consistently across computation nodes (see the sketch below).
  • Poor threading in the histogram reduction for the distributed case - also fixed.
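
To make the trick concrete, here is a minimal self-contained sketch (all names here are illustrative, not the actual updater code): the smaller child is built by the expensive row scan, and the sibling's histogram falls out by bin-wise subtraction from the parent.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Toy histogram: one double per bin (the real code stores grad/hess pairs).
using Hist = std::vector<double>;

// The expensive part: scan every row that falls into the node.
Hist BuildHist(const std::vector<double>& grad,
               const std::vector<int>& bin_of_row,
               const std::vector<int>& rows, int n_bins) {
  Hist h(n_bins, 0.0);
  for (int r : rows) h[bin_of_row[r]] += grad[r];
  return h;
}

// The cheap part: sibling = parent - built child, bin by bin.
Hist SubtractionTrick(const Hist& parent, const Hist& built) {
  Hist h(parent.size());
  for (std::size_t i = 0; i < h.size(); ++i) h[i] = parent[i] - built[i];
  return h;
}

int main() {
  std::vector<double> grad = {0.5, -1.0, 2.0, 0.25, -0.75, 1.5};
  std::vector<int> bin_of_row = {0, 1, 2, 0, 2, 1};
  std::vector<int> left = {0, 1}, right = {2, 3, 4, 5};
  int n_bins = 3;

  Hist parent = BuildHist(grad, bin_of_row, {0, 1, 2, 3, 4, 5}, n_bins);

  // Build the *smaller* child explicitly and derive the larger one for
  // free. (Before this PR the left child was always the explicit one.)
  // In distributed mode every worker must pick the same child, so the
  // comparison has to use globally synced row counts, not local ones.
  bool left_is_smaller = left.size() <= right.size();
  Hist small_hist = BuildHist(grad, bin_of_row,
                              left_is_smaller ? left : right, n_bins);
  Hist big_hist = SubtractionTrick(parent, small_hist);

  for (double v : big_hist) std::cout << v << ' ';  // larger child's histogram
  std::cout << '\n';
}
```

The row scan costs O(rows in the smaller child) while the subtraction costs only O(number of bins), so always scanning the smaller child bounds the expensive pass.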

Performance measurements:

| Configuration | higgs1m, s |
| --- | --- |
| 1 node (before) | 5.29 |
| 1 node (this PR) | 3.66 |
| Batch mode | 3.53 |

So we see a ~1.5x speedup over the previous version and roughly the same time as batch mode.

@CodingCat (Member)

Besides the optimizations, could you check #4716, which involves a correctness issue in the current syncing strategy?

@hcho3 (Collaborator) commented Sep 5, 2019

@CodingCat Good point, we cannot leave the sync issue unaddressed before 1.0.0. I'll see if there's anyone in my org who can help, since it affects our work, e.g. here. (I'll also try to find some time myself.)

Right now, #4716 solves the correctness issue. However, I suspect that the current XGBoost code still performs a sync per node instead of per depth, slowing down distributed training.

cc @ericangelokim @iyerr3
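
For context, the per-node vs. per-depth distinction can be sketched with rabit's public Allreduce; the packing scheme below is illustrative only, not XGBoost's actual sync code. Per-node sync pays the collective-call latency once per tree node; per-depth sync packs all histograms of one level and pays it once per level.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#include <rabit/rabit.h>

using Hist = std::vector<double>;

// Per-node sync: one collective call for every node -- latency adds up
// as the tree grows.
void SyncPerNode(std::vector<Hist>* hists) {
  for (Hist& h : *hists)
    rabit::Allreduce<rabit::op::Sum>(h.data(), h.size());
}

// Per-depth sync: pack all histograms of one tree level into a single
// buffer, reduce once, then unpack.
void SyncPerDepth(std::vector<Hist>* hists) {
  std::vector<double> packed;
  for (const Hist& h : *hists)
    packed.insert(packed.end(), h.begin(), h.end());
  rabit::Allreduce<rabit::op::Sum>(packed.data(), packed.size());
  std::size_t offset = 0;
  for (Hist& h : *hists) {
    std::copy_n(packed.begin() + offset, h.size(), h.begin());
    offset += h.size();
  }
}
```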

@SmirnovEgorRu (Contributor, Author) commented Sep 10, 2019

@CodingCat @hcho3 I found an issue where distributed workers could build different trees, and I have fixed it.

I also checked the log-loss for higgs (4 local workers) - it is exactly the same across 2 runs:
run 1: 0.48785404282370270490076791247702203691005706787109375
run 2: 0.48785404282370270490076791247702203691005706787109375

Is that enough for this PR?

@codecov-io

Codecov Report

Merging #4824 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #4824   +/-   ##
=======================================
  Coverage   77.51%   77.51%           
=======================================
  Files          11       11           
  Lines        2055     2055           
=======================================
  Hits         1593     1593           
  Misses        462      462


@SmirnovEgorRu (Contributor, Author)

@thvasilo it was done to reuse the existing this->histred_ rabit::Reducer object rather than create a new one specialized for size_t. So we just compute the sum of all samples in each tree node using floats instead of size_t.
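
Concretely, the idea looks roughly like this (GradStat is an illustrative stand-in for common::GradStatHist, not the actual struct):

```cpp
#include <cstddef>

// Illustrative stand-in for common::GradStatHist: a per-node statistic
// that the existing float Reducer already sums element-wise across workers.
struct GradStat {
  using GradType = float;
  GradType sum_grad = 0;
  GradType sum_hess = 0;
  // The sample count is stored as GradType rather than size_t, so the
  // struct stays a plain array of floats and the one existing reducer
  // covers it -- no second Reducer specialized for size_t is needed.
  GradType count = 0;

  void Add(GradType g, GradType h) {
    sum_grad += g;
    sum_hess += h;
    count += 1;
  }
};
```

One caveat worth a comment in real code: if GradType is float, integer counts are exact only up to 2^24, which is exactly the kind of subtlety behind the maintainability concern below.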

@thvasilo (Contributor)

> @thvasilo it was done to reuse the existing this->histred_ rabit::Reducer object rather than create a new one specialized for size_t. So we just compute the sum of all samples in each tree node using floats instead of size_t.

I'm afraid this would confuse potential future maintainers, so at least a comment should be added.

Is it possible to use one of the existing aggregation ops (Sum) here, as in the rabit usage example? I don't think we should need a Reducer object to sum/max a few integers.
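
Something along these lines, following rabit's documented Allreduce usage (the counts and their meaning here are made up for illustration):

```cpp
#include <cstddef>
#include <vector>
#include <rabit/rabit.h>

int main(int argc, char* argv[]) {
  rabit::Init(argc, argv);

  // Per-node sample counts computed locally on this worker.
  std::vector<std::size_t> counts = {1024, 512, 512};

  // Built-in ops (Sum, Max, Min, BitOR) work on plain arrays directly --
  // no custom rabit::Reducer is needed for integer bookkeeping like this.
  rabit::Allreduce<rabit::op::Sum>(counts.data(), counts.size());

  rabit::Finalize();
  return 0;
}
```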

@SmirnovEgorRu (Contributor, Author)

@thvasilo I added your variant of Reduce.

@thvasilo (Contributor)

Thanks @SmirnovEgorRu, I think this gets the purpose of the variables across better now.

@SmirnovEgorRu (Contributor, Author)

CI contains the following error; it doesn't look like it's caused by my changes:

```
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)
../../../miniconda/envs/python3/lib/python3.7/ssl.py:1139: SSLCertVerificationError
```

@trivialfis (Member)

I can see it. It might be that Travis has an outdated SSL certificate; not sure yet.

@trivialfis (Member) left a review comment

Sorry for the long wait. In general, I prefer clean and well-structured code. For example, the Partition*Kernel: is it possible to just use std::partition? If not, I would add some comments to it. If it's kept purely for performance, can we make a partition function that closely resembles the interface of std::partition? Another example: since we are creating tasks in quantile hist, are we using a task graph and async? If so, is it possible to define an explicit reusable structure for it, even if the implementation is not as good as other parallel libraries?

To me, the amount of work needed to clean up quantile hist is just huge ...
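
A sketch of what that interface could look like (PartitionKernel below is hypothetical, with std::partition standing in for the optimized body):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical shape of the suggested interface: keep an optimized kernel,
// but mirror std::partition so readers can map it onto a known algorithm
// at a glance.
template <typename Iter, typename Pred>
Iter PartitionKernel(Iter first, Iter last, Pred pred) {
  // A real implementation could partition per-thread chunks in parallel
  // and stitch them together; the observable behaviour stays the same.
  return std::partition(first, last, pred);
}

int main() {
  std::vector<int> rows = {5, 2, 8, 1, 9, 3};
  // Split row indices by a predicate, e.g. "row goes to the left child".
  auto mid = PartitionKernel(rows.begin(), rows.end(),
                             [](int r) { return r < 4; });
  std::size_t n_left = std::distance(rows.begin(), mid);
  (void)n_left;
}
```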

```cpp
// TODO(egorsmir): add parallel for
for (auto elem : nodes) {
  if (elem.sibling_nid > -1) {
    SubtractionTrick(hist_[elem.sibling_nid], hist_[elem.nid],
```
@trivialfis (Member):

So SubtractionTrick can be deleted and the following becomes an implicit "subtraction trick"? It would be surprising if the subtraction trick were taking significant time. Maybe removing the pruner would help performance more; see #4874.

```diff
 common::GradStatHist::GradType* hist_data =
     reinterpret_cast<common::GradStatHist::GradType*>(hist_[nid].data());

-ReduceHistograms(hist_data, nullptr, nullptr, 0, hist_builder_.GetNumBins() * 2, i,
+ReduceHistograms(hist_data, nullptr, nullptr, cut_ptr[fid] * 2, cut_ptr[fid + 1] * 2, node,
```
@trivialfis (Member):

Pass a Span instead of raw pointers.
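
A sketch of the suggestion, using C++20 std::span as a stand-in for XGBoost's common::Span (this simplified ReduceHistograms signature is hypothetical):

```cpp
#include <cstddef>
#include <numeric>
#include <span>  // C++20; XGBoost carries its own common::Span with the same shape
#include <vector>

// Hypothetical "after" version of the flagged call: the span carries the
// pointer and the bin-range length together, replacing the bare hist_data
// pointer plus separate begin/end offset arguments.
double ReduceHistograms(std::span<const double> hist_bins) {
  return std::accumulate(hist_bins.begin(), hist_bins.end(), 0.0);
}

int main() {
  std::vector<double> hist(64, 1.0);
  std::size_t begin = 8, end = 24;  // e.g. cut_ptr[fid] * 2 .. cut_ptr[fid + 1] * 2
  // subspan expresses "this feature's slice of the histogram" in one object.
  double sum = ReduceHistograms(
      std::span<const double>(hist).subspan(begin, end - begin));
  (void)sum;
}
```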

```diff
@@ -102,3 +102,5 @@ List of Contributors
 * [Haoda Fu](https://github.com/fuhaoda)
 * [Evan Kepner](https://github.com/EvanKepner)
   - Evan Kepner added support for os.PathLike file paths in Python
+* [Egor Smirnov](https://github.com/SmirnovEgorRu)
```
@trivialfis (Member):

You can open a separate PR noting your contributions from previous PRs. I don't think I prefer merging this one, but @hcho3 will make the decision.

```diff
@@ -589,7 +596,7 @@ void QuantileHistMaker::Builder::BuildHistsBatch(const std::vector<ExpandEntry>&

   // 3. Merge grad stats for each node
   // Sync histograms in case of distributed computation
-  SyncHistograms(p_tree, nodes, hist_buffers, hist_is_init, grad_stats);
+  SyncHistograms(p_tree, nodes, hist_buffers, hist_is_init, grad_stats, gmat);

   perf_monitor.UpdatePerfTimer(TreeGrowingPerfMonitor::timer_name::BUILD_HIST);
```
@trivialfis (Member):

I replaced this perf_monitor with common::Monitor before, and somehow it's still here ...

@SmirnovEgorRu (Contributor, Author) commented Feb 25, 2020

Closed since this code is no longer relevant. In any case, the optimization ideas can be reused in further commits.

CC @ShvetsKS

The lock bot locked this as resolved and limited conversation to collaborators on Jun 24, 2020.