feature: Add true streaming APIs to reduce client-side memory usage #5299

Merged: 33 commits, Aug 10, 2022

Contributor

@svotaw svotaw commented Jun 16, 2022

Background

Currently, Azure SynapseML wraps the LightGBM library to make it much easier for customers to use in a distributed Spark environment. The initial approach taken by SynapseML uses the LGBM_DatasetCreateFromMat API to create a Dataset for either each Spark partition or each executor (user option). However, this method requires the SynapseML Scala code to load the entire uncompressed data into memory before sending it to LightGBM, and to merge partition data together manually. This uses an order of magnitude or more memory (raw double arrays and multiple copies) than LightGBM Datasets use internally (binned data). As a result, it requires larger Spark clusters than are really needed and often causes OOM issues.
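
To make the order-of-magnitude claim concrete, here is a rough back-of-envelope sketch. It assumes 8-byte doubles on the client side, one byte per binned value (enough for up to 256 bins per feature), and a hypothetical number of client-side copies. The function names and example numbers are illustrative only, not measurements from SynapseML or LightGBM.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold a dense matrix of raw doubles, multiplied by the
// number of client-side copies (a hypothetical parameter for illustration).
inline std::uint64_t RawBytes(std::uint64_t rows, std::uint64_t cols,
                              std::uint64_t copies) {
  return rows * cols * sizeof(double) * copies;
}

// Bytes needed if every value is stored as a 1-byte bin index.
// Assumption: each feature has at most 256 bins, so one byte is enough.
inline std::uint64_t BinnedBytes(std::uint64_t rows, std::uint64_t cols) {
  return rows * cols * sizeof(std::uint8_t);
}

// Ratio of raw to binned footprint under the assumptions above.
inline std::uint64_t RawToBinnedRatio(std::uint64_t rows, std::uint64_t cols,
                                      std::uint64_t copies) {
  assert(rows > 0 && cols > 0);
  return RawBytes(rows, cols, copies) / BinnedBytes(rows, cols);
}
```

With 1,000,000 rows, 100 features and two raw copies, this gives 1.6 GB of doubles against 0.1 GB of binned data, a factor of 16 under these assumptions.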

In order to improve the memory performance of SynapseML-wrapped LightGBM, we decided to convert Dataset creation into more of a “streaming” scenario (as opposed to the “bulk” input matrices above), where we do NOT create large arrays on the client side. Initial investigation turned up existing LightGBM APIs that seemed to fit this purpose: LGBM_DatasetCreateFromSampledColumn and LGBM_DatasetPushRows[byCSR]. These seemed to allow creating a “reference” Dataset with defined feature groups and bins, and then pushing small micro-batches of data into it. However, these APIs suffer from several significant issues as currently implemented:

  1. Not thread-safe for parallel pushes, since FinishDataset is called when the literal last index is pushed, regardless of whether all other rows have arrived.
  2. Only pushes feature data, so Metadata must still be handled on the client side.
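
The first issue can be illustrated with a small toy model. This is only a sketch of the behavior described above, not LightGBM code: `AutoFinishDataset` and its methods are hypothetical names, and the "workers" are simulated by the order of calls rather than by real threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a dataset that finalizes itself as soon as the highest
// row index arrives. All names are hypothetical.
class AutoFinishDataset {
 public:
  explicit AutoFinishDataset(std::int32_t num_rows)
      : filled_(static_cast<std::size_t>(num_rows), false) {}

  // Stores one row; finalizes when the literal last index is pushed.
  void PushRow(std::int32_t index) {
    filled_[static_cast<std::size_t>(index)] = true;
    if (static_cast<std::size_t>(index) + 1 == filled_.size()) {
      finished_ = true;
      rows_present_at_finish_ = CountFilled();
    }
  }

  bool finished() const { return finished_; }
  std::int32_t rows_present_at_finish() const { return rows_present_at_finish_; }

 private:
  std::int32_t CountFilled() const {
    std::int32_t n = 0;
    for (bool f : filled_) n += f ? 1 : 0;
    return n;
  }

  std::vector<bool> filled_;
  bool finished_ = false;
  std::int32_t rows_present_at_finish_ = 0;
};

// Simulates two workers where the one holding the last rows runs first.
// Returns how many rows were actually present when the dataset finalized.
inline std::int32_t RowsPresentWhenLastIndexArrivesFirst() {
  AutoFinishDataset ds(4);
  ds.PushRow(2);  // worker B
  ds.PushRow(3);  // worker B pushes the last index: finalization fires here
  ds.PushRow(0);  // worker A arrives too late
  ds.PushRow(1);  // worker A arrives too late
  return ds.rows_present_at_finish();
}
```

In this model the dataset finalizes with only half of its rows present, which is the ordering hazard that makes index-triggered finalization unsuitable for parallel pushes.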

Changes

This PR adds APIs to LightGBM to fix these issues and create a true “streaming” flow where micro-batches of full rows, including metadata, can be pushed directly into the LightGBM Dataset format. Control over FinishLoad() is given to the client, which eliminates the problem with push order.

The new general streaming flow uses the following APIs:

  1. DatasetCreateFromSampledColumn (final number of rows) with samples from partition
  2. DatasetInitStreaming, to initialize allocation and other setup
  3. DatasetPushRows[byCSR]WithMetadata over the Spark Row Iterator
  4. DatasetMarkFinished
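
The four steps above can be sketched with a toy in-memory model. This is deliberately NOT the LightGBM C API: `StreamingDatasetSketch` and its method names are hypothetical and only mirror the order of the steps, assuming dense row-major input with labels and optional weights as the metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of the create / init / push / mark-finished flow (hypothetical).
class StreamingDatasetSketch {
 public:
  // Step 1: create with the final row count known up front.
  StreamingDatasetSketch(std::int32_t num_rows, std::int32_t num_cols)
      : num_cols_(static_cast<std::size_t>(num_cols)),
        features_(static_cast<std::size_t>(num_rows) * num_cols_, 0.0),
        labels_(static_cast<std::size_t>(num_rows), 0.0f) {}

  // Step 2: allocate optional metadata before any rows arrive.
  void InitStreaming(bool has_weights) {
    if (has_weights) weights_.assign(labels_.size(), 0.0f);
    initialized_ = true;
  }

  // Step 3: push a micro-batch of full rows (features plus metadata).
  void PushRowsWithMetadata(const std::vector<double>& rows,
                            std::int32_t start_row,
                            const std::vector<float>& labels,
                            const std::vector<float>& weights) {
    assert(initialized_ && !finished_);
    assert(rows.size() == labels.size() * num_cols_);
    assert(weights_.empty() || weights.size() == labels.size());
    const std::size_t start = static_cast<std::size_t>(start_row);
    for (std::size_t r = 0; r < labels.size(); ++r) {
      for (std::size_t c = 0; c < num_cols_; ++c) {
        features_[(start + r) * num_cols_ + c] = rows[r * num_cols_ + c];
      }
      labels_[start + r] = labels[r];
      if (!weights_.empty()) weights_[start + r] = weights[r];
    }
    rows_pushed_ += static_cast<std::int32_t>(labels.size());
  }

  // Step 4: the caller, not the push order, decides when loading is done.
  void MarkFinished() { finished_ = true; }

  bool finished() const { return finished_; }
  std::int32_t rows_pushed() const { return rows_pushed_; }
  float label_at(std::int32_t row) const {
    return labels_[static_cast<std::size_t>(row)];
  }

 private:
  std::size_t num_cols_;
  std::vector<double> features_;
  std::vector<float> labels_;
  std::vector<float> weights_;
  std::int32_t rows_pushed_ = 0;
  bool initialized_ = false;
  bool finished_ = false;
};

// Pushes two micro-batches out of order, then finishes explicitly.
inline StreamingDatasetSketch BuildOutOfOrder() {
  StreamingDatasetSketch ds(4, 2);
  ds.InitStreaming(true);
  ds.PushRowsWithMetadata({5.0, 6.0, 7.0, 8.0}, 2, {1.0f, 0.0f}, {1.0f, 1.0f});
  ds.PushRowsWithMetadata({1.0, 2.0, 3.0, 4.0}, 0, {0.0f, 1.0f}, {1.0f, 1.0f});
  ds.MarkFinished();
  return ds;
}
```

Because finishing is an explicit call, the batch containing the last rows can arrive first without finalizing the dataset early.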

Note that there shouldn’t be any impact of this PR on training iteration code. The changes are mostly limited to initial Dataset creation and loading.

Testing

I added C++ tests for all of the above so we could test the functionality intensively. There are now C++ test files for streaming. Both dense and sparse scenarios are covered, with micro-batch sizes of 1 and 2+, and all types of Metadata.
Also, the jar created from this PR was used in SynapseML, and passed an extensive list of tests covering old “bulk” mode vs new “streaming” mode, sparse vs dense, binary classification vs multiclass, regression and ranking, weights, initial data in binary and multiclass, validation data, and more. Basically our entire SynapseML LightGBM test suite now passes in both streaming and the older bulk mode (we kept them both to be able to compare and test performance).

@svotaw
Contributor Author

svotaw commented Jun 17, 2022

PR is now green and limited to just pure streaming improvements for now. Ready for review whenever. We will wait for this review to add other components.

Collaborator

@jameslamb jameslamb left a comment

Thanks very much for doing the work of splitting this up! I just left some very minor suggestions from a quick initial review. Someone from the team will try to review the substance of this PR more thoroughly in the future.

src/boosting/score_updater.hpp (outdated, resolved)
tests/cpp_tests/testutils.h (outdated, resolved)
include/LightGBM/c_api.h (outdated, resolved)
@svotaw svotaw requested a review from jameslamb June 26, 2022 23:42
@StrikerRUS
Collaborator

@guolinke @shiyu1994 I believe only you two are able to provide a thoughtful review for this PR.

*/
inline int32_t num_classes() const {
if (num_data_) {
return static_cast<int>(num_init_score_ / num_data_);
Collaborator

is num_init_score_ always set to non zero values?

Contributor Author

no, sometimes there are no initial scores. Should we return 1 even in the case of no initial score inputs?

Collaborator

but when there is no num_init_score_, num_classes could be > 1.

Contributor Author

Ok, makes sense to return 1 for all non-multiclass scenarios, so I made the change.

Collaborator

It seems that when multiclass, and num_init_score_ == 0, this could still return 1?

Collaborator

@svotaw is this addressed?

Contributor Author

it looks like I missed this one. Is there a particular concern/bug? What is it that you suggest it should return? This is currently used as part of allocation. If num_init_score_ == 0, this is not used. Happy to fix it if needed.

Collaborator

num_init_score_ does not always have values.
I believe num_class is a hyper-parameter, which can be obtained from the config, so I am not sure why you use num_init_score_ to get num_class.

If dataset->num_classes() is currently only used for init_score assignment, maybe we should use a different name to avoid future bugs.

Contributor Author

makes sense. How about num_init_score_classes()?

Collaborator

it looks good to me
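
For readers following this thread, the agreed rename can be sketched as below. This is a minimal stand-in based only on the fragment and comments above, not the merged implementation: `MetadataSketch` is a hypothetical struct, and the fallback to 1 reflects the "return 1" behavior discussed in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in holding only the two counters under discussion.
struct MetadataSketch {
  std::int32_t num_data_ = 0;
  std::int64_t num_init_score_ = 0;

  // Number of init-score values per row. Falls back to 1 when there is no
  // data or no init scores, so it must not be read as the model's num_class.
  std::int32_t num_init_score_classes() const {
    if (num_data_ > 0 && num_init_score_ > 0) {
      return static_cast<std::int32_t>(num_init_score_ / num_data_);
    }
    return 1;
  }
};
```

A multiclass dataset without initial scores still yields 1 here, which is why a name tied to init scores is less misleading than num_classes().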

@svotaw svotaw requested a review from guolinke July 9, 2022 17:06
@guolinke
Collaborator

@shiyu1994 can you also review this PR?

@mhamilton723
Contributor

Hey @shiyu1994 gently poking on this, thank you so much for your time and consideration!

Collaborator

@guolinke guolinke left a comment

Thank you, LGTM

@svotaw
Contributor Author

svotaw commented Jul 23, 2022

@jameslamb you still have "changes requested". Is there something else you'd like me to change?

Collaborator

@StrikerRUS StrikerRUS left a comment

Thanks a lot for this awesome contribution!
Just please fix some typos and docstring mismatches listed below:

include/LightGBM/bin.h (outdated, resolved)
include/LightGBM/c_api.h (outdated, resolved)
include/LightGBM/c_api.h (outdated, resolved)
include/LightGBM/c_api.h (outdated, resolved)
include/LightGBM/c_api.h (outdated, resolved)
tests/cpp_tests/testutils.cpp (outdated, resolved)
tests/cpp_tests/testutils.cpp (outdated, resolved)
tests/cpp_tests/testutils.h (outdated, resolved)
tests/cpp_tests/testutils.h (outdated, resolved)
tests/cpp_tests/testutils.h (outdated, resolved)
@StrikerRUS
Collaborator

Also, please fix CI - tests cannot be compiled:

[ 98%] Building CXX object CMakeFiles/testlightgbm.dir/tests/cpp_tests/testutils.cpp.o
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:67:18: error: no matching function for call to 'LGBM_DatasetCreateFromSampledColumn'
        result = LGBM_DatasetCreateFromSampledColumn(
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/1/s/include/LightGBM/c_api.h:127:23: note: candidate function not viable: requires 9 arguments, but 8 were provided
LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromSampledColumn(double** sample_data,
                      ^
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:48:34: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
        for (size_t idx = 0; idx < nrows; ++idx) {
                             ~~~ ^ ~~~~~
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:49:32: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          for (size_t k = 0; k < ncols; ++k) {
                             ~ ^ ~~~~~
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:61:30: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
        for (size_t i = 0; i < ncols; ++i) {
                           ~ ^ ~~~~~
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:176:18: error: no matching function for call to 'LGBM_DatasetCreateFromSampledColumn'
        result = LGBM_DatasetCreateFromSampledColumn(
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/1/s/include/LightGBM/c_api.h:127:23: note: candidate function not viable: requires 9 arguments, but 8 were provided
LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromSampledColumn(double** sample_data,
                      ^
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:157:42: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
          for (size_t j = start_index; j < stop_index; ++j) {
                                       ~ ^ ~~~~~~~~~~
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:170:30: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
        for (size_t i = 0; i < ncols; ++i) {
                           ~ ^ ~~~~~
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:273:21: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<signed char, std::__1::allocator<signed char> >::size_type' (aka 'unsigned long') [-Wsign-compare]
  for (int i = 0; i < creation_types.size(); ++i) {  // from sampled data or reference
                  ~ ^ ~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:274:23: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
    for (int j = 0; j < batch_counts.size(); ++j) {
                    ~ ^ ~~~~~~~~~~~~~~~~~~~
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:327:21: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<signed char, std::__1::allocator<signed char> >::size_type' (aka 'unsigned long') [-Wsign-compare]
  for (int i = 0; i < creation_types.size(); ++i) {  // from sampled data or reference
                  ~ ^ ~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/1/s/tests/cpp_tests/test_stream.cpp:328:23: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
    for (int j = 0; j < batch_counts.size(); ++j) {
                    ~ ^ ~~~~~~~~~~~~~~~~~~~
/Users/runner/work/1/s/tests/cpp_tests/testutils.cpp:318:26: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<double, std::__1::allocator<double> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (auto i = 0; i < ref_init_scores->size(); i++) {
                       ~ ^ ~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/1/s/tests/cpp_tests/testutils.cpp:345:26: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (auto i = 0; i < ref_query_boundaries.size(); i++) {
                       ~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
9 warnings and 2 errors generated.

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=13122&view=logs&j=b720f717-c673-5cb6-db4e-30b527463a8e&t=3374e212-0f80-5d51-7e1c-28e6268f20d8&l=159

@svotaw
Contributor Author

svotaw commented Jul 24, 2022

Also, please fix CI - tests cannot be compiled:

ah, my other breaking change PR needs merging here :)

Collaborator

@shiyu1994 shiyu1994 left a comment

Thanks for the high-quality PR. The changes LGTM in general. Just left some suggestions.

*/
inline int32_t num_classes() const {
if (num_data_) {
return static_cast<int>(num_init_score_ / num_data_);
Collaborator

It seems that when multiclass, and num_init_score_ == 0, this could still return 1?

src/c_api.cpp (resolved)
tests/cpp_tests/testutils.cpp (resolved)
@svotaw svotaw requested a review from StrikerRUS July 28, 2022 17:27
@svotaw
Contributor Author

svotaw commented Jul 28, 2022

@jameslamb @StrikerRUS just checking in to see what's left for me to do. PR looks green except for one R failure which appears to be a flake.

@jameslamb
Collaborator

checking in to see what's left for me to do

We'll add another review when we have time. Re-requesting a review by clicking the little circle next to our names under Reviewers is sufficient to notify us.

Please be patient as our limited time is spread across multiple efforts in this repo, not only this PR.

Collaborator

@StrikerRUS StrikerRUS left a comment

Please fix the following compiler warnings (I guess size_t type should be used for i/j value):

[ 98%] Building CXX object CMakeFiles/testlightgbm.dir/tests/cpp_tests/testutils.cpp.o
/__w/1/s/tests/cpp_tests/test_stream.cpp:48:34: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
        for (size_t idx = 0; idx < nrows; ++idx) {
                             ~~~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:49:32: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          for (size_t k = 0; k < ncols; ++k) {
                             ~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:61:30: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
        for (size_t i = 0; i < ncols; ++i) {
                           ~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:156:42: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
          for (size_t j = start_index; j < stop_index; ++j) {
                                       ~ ^ ~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:169:30: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
        for (size_t i = 0; i < ncols; ++i) {
                           ~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:271:21: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
  for (int i = 0; i < creation_types.size(); ++i) {  // from sampled data or reference
                  ~ ^ ~~~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:272:23: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
    for (int j = 0; j < batch_counts.size(); ++j) {
                    ~ ^ ~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:325:21: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
  for (int i = 0; i < creation_types.size(); ++i) {  // from sampled data or reference
                  ~ ^ ~~~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:326:23: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
    for (int j = 0; j < batch_counts.size(); ++j) {
                    ~ ^ ~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/testutils.cpp:318:26: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (auto i = 0; i < ref_init_scores->size(); i++) {
                       ~ ^ ~~~~~~~~~~~~~~~~~~~~~~~
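
The warnings above fall into two patterns: a signed index compared against a container's `size()`, and a `size_t` index compared against a signed `int32_t` bound. A hedged sketch of one common way to silence both is shown below. The function names and data are made up for illustration and are not taken from test_stream.cpp or testutils.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pattern 1: loop bounded by a container size. Use an unsigned index so both
// sides of the comparison have the same signedness.
inline std::size_t CountPositive(const std::vector<int>& batch_counts) {
  std::size_t positives = 0;
  for (std::size_t j = 0; j < batch_counts.size(); ++j) {
    if (batch_counts[j] > 0) ++positives;
  }
  return positives;
}

// Pattern 2: loop bounded by a signed count such as an int32_t nrows/ncols.
// Keep the index signed and cast explicitly where a subscript is needed.
inline double SumDense(const std::vector<double>& values, std::int32_t nrows,
                       std::int32_t ncols) {
  double total = 0.0;
  for (std::int32_t row = 0; row < nrows; ++row) {
    for (std::int32_t col = 0; col < ncols; ++col) {
      const std::size_t offset =
          static_cast<std::size_t>(row) * static_cast<std::size_t>(ncols) +
          static_cast<std::size_t>(col);
      total += values[offset];
    }
  }
  return total;
}
```

Matching the index type to the bound's type avoids the implicit signed/unsigned conversion that -Wsign-compare reports.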

include/LightGBM/bin.h (outdated, resolved)
tests/cpp_tests/testutils.cpp (outdated, resolved)
tests/cpp_tests/testutils.cpp (resolved)
tests/cpp_tests/testutils.cpp (resolved)
tests/cpp_tests/testutils.cpp (outdated, resolved)
tests/cpp_tests/testutils.cpp (resolved)
tests/cpp_tests/testutils.cpp (outdated, resolved)
tests/cpp_tests/testutils.cpp (outdated, resolved)
@svotaw svotaw requested a review from StrikerRUS July 31, 2022 13:43
Collaborator

@StrikerRUS StrikerRUS left a comment

Thanks a lot for this awesome PR!

I don't have any additional comments except the two minor ones below.

However, I'm not qualified to formally approve this PR.

I believe it should be done by @shiyu1994 and/or @guolinke (there have been a lot of code changes since his last review).

tests/cpp_tests/testutils.cpp (outdated, resolved)
tests/cpp_tests/testutils.cpp (resolved)
@StrikerRUS
Collaborator

Kindly ping @AlbertoEAF as you may be interested in this new functionality 🙂

@svotaw svotaw requested review from shiyu1994 and guolinke August 2, 2022 21:17
@StrikerRUS StrikerRUS dismissed their stale review August 2, 2022 21:22

All comments were addressed

Collaborator

@shiyu1994 shiyu1994 left a comment

I'm OK with the new changes. Thanks!

@jameslamb jameslamb dismissed their stale review August 4, 2022 16:28

my initial comments were addressed

@svotaw
Contributor Author

svotaw commented Aug 4, 2022

@jameslamb Curious... Is there a way to re-run an individual check? There is a lint failure on a file I didn't edit, and a couple of jobs that just seemed to time out. The only way I know to rerun those checks is to push some change.

Other than that, it seems to me that the only approval left is for @guolinke to check it over again. Thanks!

@jameslamb
Collaborator

is there a way to re-trigger an individual check?

Admins in this project can do that. For you, please just merge latest master into this branch as a way to re-trigger CI, since it's been a few days since the last commit.

For the extra benefits described in #5368 (comment)

@svotaw
Contributor Author

svotaw commented Aug 4, 2022

@jameslamb branch merged with master, and looks all green except for a timeout on 1 check

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023