
Hist 10x slower than Exact #5405

Closed
shenkev opened this issue Mar 11, 2020 · 14 comments

@shenkev

shenkev commented Mar 11, 2020

XGBoost version: 0.90
System: linux
CPU Cores: 40
Language: Python

I’m training with nthreads=40 on a dataset of size 12M and 48 features. “Exact” mode boosts trees at a rate of 1 tree per 12 seconds. With the same hyperparameters, “hist” mode (I’ve only changed “tree_method”) boosts trees at a rate of 1 per 2 minutes (10x slower). I am loading train and val data from libsvm files.

Furthermore, "hist" has a much longer startup time than "exact".

When I inspect the CPU usage, both "exact" and "hist" use all 40 cores. The CPU usage of "exact" oscillates between 20% and 100%, while the CPU usage of "hist" stays saturated around 100%.
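
For reference, a minimal sketch of the comparison being described; the file names, round count, and evaluation list below are illustrative assumptions, not taken from this report:

import time
import xgboost as xgb

# Hypothetical libsvm files standing in for the 12M x 48 train/val data.
dtrain = xgb.DMatrix("train.libsvm")
dval = xgb.DMatrix("val.libsvm")

base = {"nthread": 40, "objective": "binary:logistic"}

for tree_method in ("exact", "hist"):
    params = dict(base, tree_method=tree_method)
    start = time.time()
    xgb.train(params, dtrain, num_boost_round=20, evals=[(dval, "val")])
    print(tree_method, ":", (time.time() - start) / 20, "seconds per boosting round")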

@hcho3
Collaborator

hcho3 commented Mar 11, 2020

@shenkev Can you try 1.0.2? We made lots of performance improvements in 'hist'.

@trivialfis
Member

ping @SmirnovEgorRu here.

@shenkev
Author

shenkev commented Mar 11, 2020

Thanks for getting back to this; yes, let me try the newest version.

@SmirnovEgorRu
Contributor

@shenkev, thank you for reporting the issue.
How did you obtain the 20-100% CPU usage? If you used the "top" tool, for example, that would mean your CPU utilized only 1 core out of the 40 available (it should be 4000% in the ideal case).
If it's not private information, could you please send the full list of parameters you used for training? I can try to reproduce the numbers.

@shenkev
Author

shenkev commented Mar 12, 2020

Sorry for the slow reply. I've tried the new stable release 1.0.0. "Hist" is no longer 10x slower than "exact"; however, it's still a bit slower.

Given my dataset size, I'm boosting 1 tree per 13 seconds in "exact" and 1 tree per 17 seconds in "hist".

The parameters I'm using for both algorithms are:

{
    'eta': 0.01,
    'colsample_bytree': 0.7,
    'max_depth': 10,
    'objective': 'binary:logistic'
}

Is "hist" expected to be slightly slower than "exact"? I've noticed from previous experience that hist doesn't have as much benefit over "exact" for small max_depth.

@shenkev
Author

shenkev commented Mar 12, 2020

@SmirnovEgorRu I'm using the Gnome System Monitor app, which lets me see the usage of each CPU. By oscillating between 20% and 100%, I mean each CPU oscillates in that range.

@trivialfis
Member

@shenkev How many boosting rounds did you run?

@SmirnovEgorRu
Contributor

@shenkev, I tested XGBoost 1.0.2 with your data dimensions and your parameters:
Performance depending on tree method:

  • not set (auto-selected as 'approx'): 136.856 sec
  • 'exact': 141.309 sec
  • 'hist': 23.859 sec

My reproducer:

import timeit
import xgboost as xgb
from sklearn.datasets import make_classification

print("XGBoost version: ", xgb.__version__)

print("Data generation...")
trainX, trainY = make_classification(n_samples=12000000, n_features=48)

param = {
    'n_estimators': 10,
    'eta': 0.01,
    'colsample_bytree': 0.7,
    'max_depth': 10,
    'objective': 'binary:logistic',
    'verbosity': 3,
    'tree_method': 'hist',
}

print("XGB Training...")
dtrain = xgb.DMatrix(trainX, label=trainY)
t1 = timeit.default_timer()
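# Note: 'n_estimators' in the param dict is not consumed by xgb.train itself;
# the number of boosting rounds is passed explicitly as the third argument below.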
model_xgb = xgb.train(param, dtrain, param['n_estimators'])
t2 = timeit.default_timer()

print("Time =", (t2-t1)*1000, "ms")

HW: Xeon 5120 @ 2.20GHz, 14 cores/socket, 2 sockets, HT: on

Do you see similar numbers on your hardware for this benchmark?

P.S. The current master contains even stronger optimizations of the 'hist' method versus the 1.0 version, due to PR #5244, so you can try it and obtain even better results.

@SmirnovEgorRu
Contributor

@shenkev,

P.S. The current master contains even stronger optimizations of the 'hist' method versus the 1.0 version, due to PR #5244, so you can try it and obtain even better results.

For example, for 100 iterations on the same dataset and parameters with the 'hist' method I see:
XGB 1.0 - 116.053 sec
XGB master - 92.691 sec

@shenkev
Author

shenkev commented Mar 13, 2020

@SmirnovEgorRu Thanks for reproducing this. I'll try again with the new 1.0.2 version. Maybe the problem is with our particular dataset or environment.

@trivialfis I only ran 20 rounds to time the algorithm, but our full model requires hundreds of rounds.

@shenkev
Author

shenkev commented Mar 13, 2020

I tried training in a different environment and the performance of "hist" was much better; it's now ~1.7x faster than "exact".

My original environment was inside a Docker image, using Python. My other environment used xgboost4j, not inside a Docker image.

In both environments, "exact" runs at about the same speed. "Hist" is slower only in the Docker + Python environment.

Any thoughts as to why I'm seeing a difference in "hist" runtime between the two environments? Please close the issue otherwise.
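
(Not from the thread: one hedged way to narrow this down would be to print what each environment actually exposes to the process, since a container CPU quota, affinity mask, or thread limit would show up in a quick check like the sketch below; this is an assumption, not a known cause.)

import os
import multiprocessing
import xgboost as xgb

print("XGBoost version:", xgb.__version__)
print("multiprocessing.cpu_count():", multiprocessing.cpu_count())
# On Linux, this reflects the CPU affinity mask actually granted to the process.
print("usable cores:", len(os.sched_getaffinity(0)))
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))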

@SmirnovEgorRu
Contributor

@shenkev, just for my understanding: do you use the Spark APIs?
I'm asking because I haven't checked the performance of that path. It looks like the performance of 'hist' with the single-node Python API is much better than with the Spark APIs. If so, that's a good reason to invest in optimizing the Java side.

@trivialfis
Member

If the data is extremely sparse, the distributed algorithm can be much slower. I optimized quantile building for sparse data, but that optimization doesn't work in a distributed environment.

@shenkev
Author

shenkev commented Mar 13, 2020

No, we don't use Spark or parallel computing (I'm 95% sure).

@hcho3 hcho3 closed this as completed Jun 17, 2020