
Optimizations for CPU #4529

Merged · 38 commits · Jun 27, 2019

Conversation

@SmirnovEgorRu
Contributor

This is PR #4433, but with all commits merged into one and the review comments applied.
My next steps here:

  • Do some code cleanup and add comments where needed
  • Attach performance data

@trivialfis trivialfis self-requested a review June 2, 2019 19:04
@chenqin
Contributor

chenqin commented Jun 17, 2019

Thanks for the huge effort!

A few general questions:

  • Proactively prefetching data into all levels of cache can help a lot with speed. Why T0 (all levels)? I assume L3 is abundant enough.

  • Given that the compiler also performs optimizations, do we plan to test turning on flags and benchmarking against https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html?

  • The time for histogram computation has been reduced nicely, and now computing new gradients/hessians on each iteration takes a significant share of the time. There are many opportunities here to use AVX/AVX-512 together with mask instructions.

@SmirnovEgorRu
Contributor Author

Hi @chenqin!
I prefetch data that will be used very soon (after reading the next 10 rows of the bin-matrix), so it is better to prefetch it into L1. Also, this data may already be in L3, and we still need to bring it into L1 to reduce latency.
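
For illustration, here is a minimal sketch of the idea (not the actual code from this PR; the function and variable names are hypothetical), using the `_mm_prefetch` intrinsic with the T0 hint:

```cpp
#include <xmmintrin.h>

#include <cstddef>
#include <cstdint>

// Hypothetical sketch: while processing row i of the bin matrix, prefetch
// the row that will be needed ~10 iterations later into all cache levels
// (T0 hint), so it is already in L1 by the time it is actually read.
void BuildHistogramRows(const std::uint8_t* bin_matrix,
                        const std::size_t* row_ptr, std::size_t n_rows) {
  constexpr std::size_t kPrefetchOffset = 10;
  for (std::size_t i = 0; i < n_rows; ++i) {
    if (i + kPrefetchOffset < n_rows) {
      _mm_prefetch(reinterpret_cast<const char*>(
                       bin_matrix + row_ptr[i + kPrefetchOffset]),
                   _MM_HINT_T0);
    }
    // ... accumulate gradient/hessian pairs into the histogram bins of row i ...
  }
}
```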

given compiler also do optimization, do we plan to test turn on flags and benchmark against

Here I mean mostly vector instruction sets. Currently XGBoost uses only SSE2, but AVX-512 could also be used, for example for gradient/hessian computations:

  • Some objective functions contain many code branches. These can be replaced with AVX-512 vector mask instructions (see the sketch below), which sometimes gives a 10x speedup or even more.
  • Using wider vector registers should also help.

One problem is that the compiler sometimes cannot auto-vectorize the code; intrinsic functions help a lot here. Also, if we want to support different instruction sets (for example AVX-512 and SSE2) in one library file, we need separate code paths for them. It can be done, but it requires some changes to the build system.
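
For illustration only, here is a minimal sketch (not code from this PR; the function name and the clamping use case are hypothetical) of replacing a per-element branch with an AVX-512 comparison mask and blend:

```cpp
#include <immintrin.h>

#include <cstddef>

// Hypothetical sketch: clamp hessian values below `eps` using an AVX-512
// comparison mask and blend instead of a per-element branch.
void ClampHessians(float* hess, std::size_t n, float eps) {
  const __m512 v_eps = _mm512_set1_ps(eps);
  std::size_t i = 0;
  for (; i + 16 <= n; i += 16) {
    __m512 h = _mm512_loadu_ps(hess + i);
    // Mask bits are set for lanes where h < eps.
    __mmask16 m = _mm512_cmp_ps_mask(h, v_eps, _CMP_LT_OQ);
    // Masked lanes take eps; the remaining lanes keep h.
    h = _mm512_mask_blend_ps(m, h, v_eps);
    _mm512_storeu_ps(hess + i, h);
  }
  for (; i < n; ++i) {  // scalar tail for the remaining elements
    if (hess[i] < eps) hess[i] = eps;
  }
}
```

Compiled with AVX-512 support (or selected as a separate runtime code path), this removes the branch from the hot loop.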

I think, if the topic of gradient boosting optimizations is of interest, we could write an article for the web site (as was done for the GPU). I can provide materials on this topic from my side. @chenqin, @hcho3, what do you think?

@trivialfis trivialfis requested a review from hcho3 June 18, 2019 08:55
Member

@trivialfis trivialfis left a comment


Thanks for the hard work, this is incredible. ;-) I'm going to work on this along with DMatrix soon, so I will approve as long as all tests pass (as is the case now). @hcho3

@SmirnovEgorRu
Contributor Author

Hi @hcho3,
Do you have an estimate of when you will be free to review this PR? :)

@hcho3
Collaborator

hcho3 commented Jun 24, 2019

Reviewing now

@trivialfis
Member

trivialfis commented Jun 24, 2019

@hcho3 I'm approving the PR because the tests passed, and I think optimizing the histogram build will benefit future improvements. But if we merge this, someone will need to do a major refactor of quantile_hist and clarify how all these pieces fit together.

Personally, I prefer optimizing this by thinking about the whole structure. There are lots of optimization opportunities if we look at how the whole procedure runs: how many of the steps are actually necessary if we preprocess the data (for example, calculating the number of non-zero columns), implement the sparse histogram you mentioned, or group blocks before constructing quantile cuts; or whether it is possible to implement a prediction cache for the multi-class objective so that the CSR format may not be needed at all.

BTW @RAMitchell, when I mentioned "column matrix", I meant this class: https://github.com/dmlc/xgboost/blob/master/src/common/column_matrix.h

Instruction-level optimization is cool, but to me quantile hist has a lot more room for improvement in other places, and vectorization might not be the critical point. I don't have the time to do it myself right now; these are just my opinions.

Collaborator

@hcho3 hcho3 left a comment


@SmirnovEgorRu Thanks a lot for preparing the pull request. The code is better organized and more comprehensible than #4433, and the benchmark results are nice. I also like how all operations are now batched over multiple nodes. Overall, I'm inclined to approve this pull request.

@trivialfis I agree that more clarification of what each component does is in order.

Inline review comments were left on the following files (all resolved):

  • src/common/hist_util.h
  • src/tree/param.h
  • src/tree/updater_quantile_hist.h
  • src/tree/updater_quantile_hist.cc
@SmirnovEgorRu
Contributor Author

@hcho3, thanks for the review. I have addressed all comments, and CI is green.

@hcho3
Collaborator

hcho3 commented Jun 27, 2019

Looks like the R project website (https://cloud.r-project.org) is currently down, causing Windows R tests to fail. For now, we will ignore the failed tests.

@SmirnovEgorRu
Contributor Author

@hcho3, thank you!
Do I need to do anything else to get this PR merged?

@hcho3 hcho3 merged commit 4d6590b into dmlc:master Jun 27, 2019
@hcho3
Collaborator

hcho3 commented Jun 27, 2019

@SmirnovEgorRu It’s merged now. Thanks!

@coding-komek

Thanks for including these changes! Is there an ETA as to when a release can be expected?

@arnocandel

arnocandel commented Jul 2, 2019

Are results (model, predictions) reproducible for the same machine, same random state, and same input?

@SmirnovEgorRu
Contributor Author

@arnocandel I tested this on multiple data sets; it was reproducible even on multicore systems.
Also, XGBoost has very sensitive tests (for example, for feature importance). If you did something that changed the results, they would fail.

@SmirnovEgorRu
Contributor Author

Hi @hcho3, @trivialfis,
Do you know when the next release will be available?

@hcho3
Collaborator

hcho3 commented Jul 22, 2019

See #4680
