
[Discussion] efficiency improvements #2791

Open
guolinke opened this issue Feb 21, 2020 · 22 comments

Comments

@guolinke
Collaborator

This is a call for efficiency improvements in LightGBM, including but not limited to:

  • Tree learning algorithm acceleration
  • I/O related, like dataset loading and model saving
  • Inference speed improvement
  • Better GPU acceleration
  • Distributed learning improvement

If you have any ideas, feel free to discuss them with us here, and open the corresponding issues/pull requests if needed.

@StrikerRUS
Collaborator

Linking #2934 and #2935.

@MaxHalford

MaxHalford commented Apr 11, 2020

I have a suggestion, but I'm not sure it's in scope. It concerns improving the prediction speed of trees. This in turn could improve the overall training time, because gradient boosting requires asking the weak models for predictions when computing gradients.

I'm not familiar with the tree implementation in LightGBM. The intuitive layout of having two pointers that refer to the left and right children doesn't seem to be the best format for evaluating a tree. Basically, there seem to be two alternatives:

  1. Compile the tree to a sequence of ifs and elses. AFAIK this is what treelite allows you to do.
  2. Convert a tree into a heap, which means the nodes are contiguous in memory. Each branch contains two integers that indicate where to go next in the heap. This is quite well explained in this Wikipedia article section.

I thought about suggesting this after having read this great blog post by Andrew Tulloch, which includes some in-depth benchmarks. I don't know if this is something LightGBM already integrates. Also, I'm not well-versed in C++, so I can't help on that side. However, these solutions seem rather simple to implement.

@julioasotodv
Contributor

Apart from the possibility of using CUDA instead of OpenCL for GPU trees (even though it would reduce compatibility), xgboost has made a lot of progress using Dask for distributed training (in fact, I believe the other experimental distributed training approaches they had are now deprecated).

I believe someone started a project for smoother integration between LightGBM and Dask, but I am unable to find it right now...

As for GPUs, I think the folks at RAPIDS (NVIDIA) are looking to develop more CUDA implementations of different ML algorithms (just like they have done with xgboost, where they rewrote the library's whole GPU tree learner). Maybe contacting them could be a good idea.

@jameslamb
Collaborator

dask-lightgbm is here: https://github.com/dask/dask-lightgbm

It's not something we officially maintain or contribute to yet, but I agree it could be a useful thing for us to help with.

@guolinke
Collaborator Author

guolinke commented Aug 6, 2020

@MaxHalford Thanks very much. Actually, LightGBM is able to convert a tree model to C++ code.
However, the trees in LightGBM are not balanced trees, so it is not trivial to optimize them with the methods you posted.
But this is indeed a promising direction.
BTW, we will implement symmetric trees (like in CatBoost) in the future (by @shiyu1994), which will be much faster at prediction time and more robust on small datasets.

@MaxHalford

@guolinke I think the trick still works. You just need to maintain two arrays that indicate the left and right child of each node. That way you can "walk" to the next node with something like:

if x[features[node]] < thresholds[node]:
    node = left[node]
else:
    node = right[node]
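
For concreteness, here is a minimal self-contained version of that walk (the arrays and the ~-encoded leaf indices below are just an illustration, not LightGBM's actual internals):

import numpy as np

# Toy flat encoding of a small tree: node i splits on feature features[i]
# at thresholds[i]; a negative child index means "leaf", encoded as ~leaf_id.
features = np.array([0, 1, 0])
thresholds = np.array([0.5, 0.3, 0.7])
left = np.array([1, ~0, ~2])
right = np.array([2, ~1, ~3])
leaf_values = np.array([1.0, 2.0, 3.0, 4.0])

def predict(x):
    node = 0
    while node >= 0:  # internal nodes have non-negative indices
        if x[features[node]] < thresholds[node]:
            node = left[node]
        else:
            node = right[node]
    return leaf_values[~node]  # decode the leaf index

print(predict(np.array([0.2, 0.9])))  # -> 2.0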

@guolinke
Collaborator Author

guolinke commented Aug 6, 2020

Thank you @julioasotodv .
Actually, the CUDA implementations in existing tools already integrate most of the "histogram tricks" from LightGBM.
So simply re-implementing it again would be more like reinventing the wheel.
We plan to revisit the existing implementations and estimate the potential improvements. If there is still room, we will develop a new GPU version of LightGBM.

@guolinke
Collaborator Author

guolinke commented Aug 6, 2020

@MaxHalford ,
The current prediction logic is already very much in that fashion:

inline int Tree::GetLeaf(const double* feature_values) const {
  // Internal nodes have non-negative indices; a leaf is encoded as the
  // bitwise complement (~) of its leaf index.
  int node = 0;
  if (num_cat_ > 0) {
    // Tree contains categorical splits: use the general decision function.
    while (node >= 0) {
      node = Decision(feature_values[split_feature_[node]], node);
    }
  } else {
    // Numerical splits only: use the specialized decision function.
    while (node >= 0) {
      node = NumericalDecision(feature_values[split_feature_[node]], node);
    }
  }
  return ~node;
}

@MaxHalford

Cheers @guolinke. I don't believe there's much improvement to bring from that side then.

How about the method that is discussed in this scikit-learn issue? Is it already implemented?

@guolinke
Collaborator Author

guolinke commented Sep 6, 2020

@MaxHalford I had a quick look through the PR. I think it relates to the Python-side histogram implementation, and our C++ implementation is already optimized for both speed and memory.

And we have an LRU histogram pool to further reduce memory cost.
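
Roughly, the pool keeps only a bounded number of per-leaf histograms in memory and evicts the least recently used one when full (evicted histograms are recomputed if needed again). A toy sketch of the idea, not LightGBM's actual code:

from collections import OrderedDict
import numpy as np

class HistogramPool:
    """Toy LRU pool: at most `capacity` per-leaf histograms kept in memory."""

    def __init__(self, capacity, num_bins):
        self.capacity = capacity
        self.num_bins = num_bins
        self.cache = OrderedDict()  # leaf_id -> histogram array

    def get(self, leaf_id):
        if leaf_id in self.cache:
            self.cache.move_to_end(leaf_id)   # mark as most recently used
            return self.cache[leaf_id]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        hist = np.zeros(self.num_bins)        # caller must (re)fill this
        self.cache[leaf_id] = hist
        return hist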

@ogrisel

ogrisel commented Sep 7, 2020

For the record, LightGBM is 1.5x to 2x faster than the master branch of scikit-learn in sequential mode at the moment. This difference shrinks with many threads because of the diminishing returns of parallelism beyond 8 to 16 CPU threads, though that depends on the shape of the dataset, the number of nodes, and so on. I haven't investigated why LightGBM is so much faster in sequential mode yet. It wasn't the case 6 months ago, if I remember correctly.

@guolinke
Collaborator Author

guolinke commented Sep 7, 2020

@ogrisel by sequential mode, do you mean num_thread=1?
Did you test LightGBM 3.0.0?
We have several speed optimizations in the 3.0.0 release. In particular, LightGBM now also implements a row-wise histogram algorithm (previously it was column-wise only), which may be faster with fewer threads or fewer features.
Given the data and parameters, LightGBM tests both col-wise and row-wise before training and automatically uses the faster one.
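
You can also override the automatic choice via parameters; a minimal sketch (the random data here is just for illustration):

import numpy as np
import lightgbm as lgb

X = np.random.rand(10000, 20)
y = np.random.rand(10000)

params = {
    "objective": "regression",
    "force_row_wise": True,    # force the row-wise histogram strategy
    # "force_col_wise": True,  # or force column-wise; leave both unset to auto-select
}
booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=10)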

@ogrisel

ogrisel commented Sep 7, 2020

@ogrisel by sequential mode, do you mean num_thread=1?

I just set the OMP_NUM_THREADS=1 environment variable when running my benchmark script. I suppose this is equivalent.

Did you test the LightGBM 3.0.0 ?

Yes.

We have several speed optimizations in the 3.0.0 release. In particular, LightGBM now also implements a row-wise histogram algorithm (previously it was column-wise only), which may be faster with fewer threads or fewer features.

Ah thanks, that's probably it then. Indeed, the Higgs boson benchmark I used doesn't have that many features.

@jameslamb
Collaborator

I've just added a new efficiency-related feature: #3369

If anyone following along here has experience with Apache Arrow, please add any context you think could help on that issue, or let me know if I've made any mistakes.

@jameslamb
Collaborator

I've started a discussion at dask/community#104 about the possibility of bringing dask-lightgbm into this project, similar to the direction xgboost has gone:

@StrikerRUS @guolinke I've been working with Dask a lot and would be willing to take responsibility for maintenance work on this. I agree with the decisions reached in those two xgboost links above, that having a native integration with Dask that we maintain would lead to a better user experience.

@StrikerRUS
Collaborator

@jameslamb Awesome proposal!
Unfortunately, I'm not familiar with dask, but my intuition was that dask is about scaling training up to multiple machines. Something like "for small data use your own laptop, for big data use dask on several machines, for huge data use Spark on several clusters". Please correct me if I'm wrong. Anyway, my main concern is that by adopting the dask wrapper we will have to ensure its quality with unit tests on our side. But we have no resources to run distributed tests. Our current MPI test just ensures that the MPI version can be built, but all tests are executed on one machine.

@jameslamb
Collaborator

Thanks!

but my intuition was that dask is about scaling training up to multiple machines. Something like "for small data use your own laptop, for big data use dask on several machines, for huge data use Spark on several clusters". Please correct me if I'm wrong

Dask is a thing that allows for horizontal scaling across multiple machines, but it's also an API for expressing Python work that could be parallelized (even if that parallelization is all on one machine).

So, for example, people often use Dask Dataframes as a way to work with larger-than-memory data frames on a single machine. You write code that closely follows the pandas API, and Dask takes care of moving data back and forth between disk and memory without you needing to think about it.
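
For example (a minimal sketch; the file pattern and column names are made up):

import dask.dataframe as dd

# Lazily read a potentially larger-than-memory collection of CSVs as partitions.
df = dd.read_csv("data/measurements-*.csv")

# pandas-style operations only build a task graph; nothing is loaded yet.
result = df.groupby("sensor_id")["value"].mean()

# .compute() executes the graph, streaming partitions through memory.
print(result.compute())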

So if we follow that proposal, you'd be able to use a Dask-ified version of lightgbm models to train on Dask collections (Dask DataFrame, Dask Array), which would be beneficial even if not scaling out to multiple machines.

by adopting the dask wrapper we will have to ensure its quality with unit tests on our side. But we have no resources to run distributed tests. Our current MPI test just ensures that the MPI version can be built, but all tests are executed on one machine.

I don't think we'd have to do this for this to be successful. The dask-lightgbm project already does not do this, and approximates it with docker-compose.

One of the main design principles of Dask is that it scales from your laptop to a distributed cluster with minimal code changes...that means that we can test against a dask.distributed.LocalCluster and have some confidence that it will work on a true multi-node cluster.
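
As a rough sketch of what that test setup could look like (the worker counts here are arbitrary):

from dask.distributed import Client, LocalCluster

# A local "cluster" of two single-threaded workers, all on one machine.
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

# Tests can now run against `client`, exercising the same scheduler code
# paths that a real multi-node cluster would use.
print(client)

client.close()
cluster.close()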

Issues that are specific to the multi-node setting might be raised by users and have to be addressed here with manual testing, but that is already the current state of LightGBM-on-Dask with dask-lightgbm, so I don't think it's a reason not to move forward with that proposal.

@StrikerRUS
Collaborator

@jameslamb Thanks a lot for the detailed response!

...that means that we can test against a dask.distributed.LocalCluster and have some confidence that it will work on a true multi-node cluster.

OK, this is great!

@jameslamb
Collaborator

Thanks! @guolinke what do you think?

If we move forward with this, I'll ask the dask-lightgbm maintainers to make a PR here with the main components, and then I can create some issues for what else would need to be done for testing, docs, etc.

@guolinke
Collaborator Author

awesome! sounds good to me

@ppaleja

ppaleja commented Jan 26, 2021

Hi, I hope this is the right place to post this. I was reading through the original LightGBM paper, and it mentions a greedy graph-coloring scheme used for one of the library's main optimizations, the (exclusive) feature bundling. I was wondering why a greedy method was chosen instead of a more targeted approach: the paper mentions that bundling works well in the sparse case, and I believe there are coloring algorithms that specifically target sparse graphs and come with stronger provable guarantees than a greedy method. Sorry if this has already been asked, but has there been any work on using a different algorithm for the bundling process? I feel this might be worth looking into, but I wasn't able to find where the bundling happens in the code (if it exists currently), to see if I could 'plug in' another algorithm and check whether it gives a noticeable improvement.
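
For context, my rough understanding of the paper's greedy bundling, as a simplified sketch (the names and the conflict-count representation are mine, not LightGBM's code):

def greedy_feature_bundling(conflict_counts, max_conflict):
    """conflict_counts[i][j]: number of rows where features i and j are both
    non-zero (a "conflict"); max_conflict: tolerated conflicts per bundle."""
    num_features = len(conflict_counts)
    # Consider the most-conflicting features first.
    order = sorted(range(num_features),
                   key=lambda i: sum(conflict_counts[i]), reverse=True)
    bundles, bundle_conflicts = [], []
    for f in order:
        placed = False
        for b, members in enumerate(bundles):
            extra = sum(conflict_counts[f][g] for g in members)
            if bundle_conflicts[b] + extra <= max_conflict:
                members.append(f)        # add to the first bundle that still fits
                bundle_conflicts[b] += extra
                placed = True
                break
        if not placed:
            bundles.append([f])          # otherwise start a new bundle
            bundle_conflicts.append(0)
    return bundles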

@lorentzenchr
Contributor

We have several speed optimizations in the 3.0.0 release. In particular, LightGBM now also implements a row-wise histogram algorithm (previously it was column-wise only), which may be faster with fewer threads or fewer features.

Is it possible to lay out the algorithm for the row-wise histogram computation (pointing to literature, adding code comments, writing a blog post, ...)? I find it very difficult to follow the logic by reading the code.
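
My current rough mental model, as a toy single-threaded sketch (ignoring binning details, gradient/hessian pairs, and the actual parallelization):

import numpy as np

def col_wise_histograms(binned, grad, num_bins):
    """Column-wise: outer loop over features; each feature's column is scanned
    over all rows (naturally parallelized across features)."""
    n_rows, n_feats = binned.shape
    hists = np.zeros((n_feats, num_bins))
    for f in range(n_feats):
        for i in range(n_rows):
            hists[f, binned[i, f]] += grad[i]
    return hists

def row_wise_histograms(binned, grad, num_bins):
    """Row-wise: outer loop over rows; each row updates every feature's
    histogram (parallelized across row blocks, merging per-thread copies)."""
    n_rows, n_feats = binned.shape
    hists = np.zeros((n_feats, num_bins))
    for i in range(n_rows):
        for f in range(n_feats):
            hists[f, binned[i, f]] += grad[i]
    return hists

# Both orderings give identical histograms; only the memory-access pattern
# and the parallelization strategy differ.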
