feature: Add true streaming APIs to reduce client-side memory usage #5299
Conversation
PR is now green and limited to just pure streaming improvements for now. Ready for review whenever. We will wait for this review to add other components.
Thanks very much for doing the work of splitting this up! I just left some very minor suggestions from a quick initial review. Someone from the team will try to review the substance of this PR more thoroughly in the future.
@guolinke @shiyu1994 I believe only you two are able to provide a thoughtful review for this PR.
inline int32_t num_classes() const {
  if (num_data_) {
    return static_cast<int>(num_init_score_ / num_data_);
Is num_init_score_ always set to non-zero values?
No, sometimes there are no initial scores. Should we return 1 even in the case of no initial score inputs?
But when there is no num_init_score_, num_classes could still be > 1.
Ok, makes sense to return 1 for all non-multiclass scenarios, so I made the change.
It seems that when multiclass and num_init_score_ == 0, this could still return 1?
@svotaw is this addressed?
It looks like I missed this one. Is there a particular concern or bug? What do you suggest it should return? This is currently used as part of allocation; if num_init_score_ == 0, it is not used. Happy to fix it if needed.
num_init_score_ does not always have values. I believe num_class is a hyper-parameter, which can be obtained from the config, so I am not sure why you use num_init_score_ to derive num_class.
If dataset->num_classes() is currently only used for init_score assignment, maybe we should use a different name to avoid future bugs.
Makes sense. How about num_init_score_classes()?
It looks good to me.
@shiyu1994 can you also review this PR?
Hey @shiyu1994, gently poking on this. Thank you so much for your time and consideration!
Thank you, LGTM
@jameslamb you still have "changes requested". Is there something else you'd like me to change?
Thanks a lot for this awesome contribution!
Just please fix some typos and docstring mismatches, listed below.
Also, please fix CI: the tests cannot be compiled.
Ah, my other breaking-change PR needs merging here :)
Thanks for the high-quality PR. The changes LGTM in general. Just left some suggestions.
@jameslamb @StrikerRUS just checking in to see what's left for me to do. PR looks green except for one R failure, which appears to be a flake.
We'll add another review when we have time. You can re-request a review by clicking the little circle next to our names. Please be patient, as our limited time is spread across multiple efforts in this repo, not only this PR.
Please fix the following compiler warnings (I guess the size_t type should be used for the i/j values):
[ 98%] Building CXX object CMakeFiles/testlightgbm.dir/tests/cpp_tests/testutils.cpp.o
/__w/1/s/tests/cpp_tests/test_stream.cpp:48:34: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
for (size_t idx = 0; idx < nrows; ++idx) {
~~~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:49:32: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
for (size_t k = 0; k < ncols; ++k) {
~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:61:30: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
for (size_t i = 0; i < ncols; ++i) {
~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:156:42: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
for (size_t j = start_index; j < stop_index; ++j) {
~ ^ ~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:169:30: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
for (size_t i = 0; i < ncols; ++i) {
~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:271:21: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (int i = 0; i < creation_types.size(); ++i) { // from sampled data or reference
~ ^ ~~~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:272:23: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (int j = 0; j < batch_counts.size(); ++j) {
~ ^ ~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:325:21: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (int i = 0; i < creation_types.size(); ++i) { // from sampled data or reference
~ ^ ~~~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:326:23: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (int j = 0; j < batch_counts.size(); ++j) {
~ ^ ~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/testutils.cpp:318:26: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (auto i = 0; i < ref_init_scores->size(); i++) {
~ ^ ~~~~~~~~~~~~~~~~~~~~~~~
Thanks a lot for this awesome PR!
I don't have any additional comments except the two minor ones below.
However, I'm not qualified to formally approve this PR.
I believe that should be done by @shiyu1994 and/or @guolinke (there have been a lot of code changes since his last review).
Kindly pinging @AlbertoEAF, as you may be interested in this new functionality 🙂
I'm OK with the new changes. Thanks!
@jameslamb Curious... Is there a way to re-run an individual check? There is a lint failure on a file I didn't edit, and a couple of jobs that just seemed to time out. The only way I know to rerun those checks is to push some change. Other than that, it seems to me that the only approval left is for @guolinke to check it over again. Thanks!
Admins in this project can do that. For you, please just merge the latest master into this branch, for the extra benefits described in #5368 (comment).
@jameslamb branch merged with master, and looks all green except for a timeout on 1 check.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Background
Currently, Azure SynapseML wraps the LightGBM library to make it much easier for customers to use it in a distributed Spark environment. The initial approach taken by SynapseML uses the LGBM_DatasetCreateFromMat API to create a Dataset for either each Spark partition or each executor (user option). However, this method requires the SynapseML Scala code to load the entire uncompressed data into memory before sending it to LightGBM, and to merge partition data together manually. This amounts to using an order of magnitude or more memory (raw double arrays and multiple copies) over what LightGBM Datasets use internally (binned data). This requires larger Spark clusters than are really needed, and often causes OOM issues.

In order to improve the memory performance of SynapseML-wrapped LightGBM, we decided to convert the Dataset creation into more of a "streaming" scenario (as opposed to the above "bulk" input matrices), where we do NOT create large arrays on the client side. After initial investigation, there were existing LightGBM APIs that seemed to fit this purpose: LGBM_DatasetCreateFromSampledColumn and LGBM_DatasetPushRows[byCSR]. These seemed to allow creation of a "reference" Dataset with defined feature groups and bins, and pushing small micro-batches of data into the set. However, these APIs suffered from several significant issues as currently implemented:

- FinishDataset is called when the literal last index is pushed.
- Metadata must still be handled client-side.

Changes
must still be handled client sideChanges
This PR adds APIs to LightGBM to fix these issues and create a true “streaming” flow where microbatches of full rows, including metadata, can be pushed directly into LightGBM
Dataset
format. Control over FinishLoad() is given to client to eliminate problem with push order.The new general streaming flow uses the following APIs:
DatasetCreateFromSampledColumn
(final number of rows) with samples from partition2.
DatasetInitStreaming
, to initialize allocation and other setup3.
DatasetPushRows[byCSR]WithMetadata
over the Spark RowIterator
4.
DatasetMarkFinished
Note that there shouldn't be any impact of this PR on training iteration code. The changes are mostly limited to initial Dataset creation and loading.

Testing
I added C++ tests for all of the above so we could test the functionality intensively. There are now C++ test files for streaming. Both dense and sparse scenarios are covered, micro-batch sizes of 1 and 2+, and all types of Metadata.

Also, the jar created from this PR was used in SynapseML, and it passed an extensive list of tests covering old "bulk" mode vs. new "streaming" mode, sparse vs. dense, binary classification vs. multiclass, regression and ranking, weights, initial data in binary and multiclass, validation data, and more. Basically our entire SynapseML LightGBM test suite now passes in both streaming and the older bulk mode (we kept both to be able to compare and test performance).