feature: Add true streaming APIs to reduce client-side memory usage #5299
Conversation
PR is now green and limited to just pure streaming improvements for now. Ready for review whenever. We will wait for this review to add other components.
Thanks very much for doing the work of splitting this up! I just left some very minor suggestions from a quick initial review. Someone from the team will try to review the substance of this PR more thoroughly in the future.
@guolinke @shiyu1994 I believe only you two are able to provide a thoughtful review for this PR.
inline int32_t num_classes() const {
  if (num_data_) {
    return static_cast<int>(num_init_score_ / num_data_);
Is num_init_score_ always set to non-zero values?
No, sometimes there are no initial scores. Should we return 1 even in the case of no initial score inputs?
But when there is no num_init_score_, num_classes could still be > 1.
Ok, makes sense to return 1 for all non-multiclass scenarios, so I made the change.
It seems that when multiclass and num_init_score_ == 0, this could still return 1?
@svotaw is this addressed?
It looks like I missed this one. Is there a particular concern or bug? What do you suggest it should return? This is currently used as part of allocation; if num_init_score_ == 0, it is not used. Happy to fix it if needed.
num_init_score_ does not always have values. I believe num_class is a hyper-parameter, which can be obtained from the config, so I am not sure why you use num_init_score_ to derive num_class.
If dataset->num_classes() is currently only used for init_score assignment, maybe we should use a different name to avoid future bugs.
Makes sense. How about num_init_score_classes()?
It looks good to me.
@shiyu1994 can you also review this PR?
Hey @shiyu1994, gently poking on this. Thank you so much for your time and consideration!
Thank you, LGTM
@jameslamb you still have "changes requested". Is there something else you'd like me to change?
Thanks a lot for this awesome contribution!
Just please fix some typos and docstring mismatches, listed below.
Also, please fix CI: the tests cannot be compiled.
Ah, my other breaking-change PR needs merging here :)
Thanks for the high-quality PR. The changes LGTM in general. Just left some suggestions.
@jameslamb @StrikerRUS just checking in to see what's left for me to do. PR looks green except for one R failure, which appears to be a flake.
We'll add another review when we have time. You can re-request a review by clicking the little circle next to our names. Please be patient, as our limited time is spread across multiple efforts in this repo, not only this PR.
Please fix the following compiler warnings (I guess the size_t type should be used for the i/j values):
[ 98%] Building CXX object CMakeFiles/testlightgbm.dir/tests/cpp_tests/testutils.cpp.o
/__w/1/s/tests/cpp_tests/test_stream.cpp:48:34: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
for (size_t idx = 0; idx < nrows; ++idx) {
~~~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:49:32: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
for (size_t k = 0; k < ncols; ++k) {
~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:61:30: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
for (size_t i = 0; i < ncols; ++i) {
~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:156:42: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
for (size_t j = start_index; j < stop_index; ++j) {
~ ^ ~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:169:30: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
for (size_t i = 0; i < ncols; ++i) {
~ ^ ~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:271:21: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (int i = 0; i < creation_types.size(); ++i) { // from sampled data or reference
~ ^ ~~~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:272:23: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (int j = 0; j < batch_counts.size(); ++j) {
~ ^ ~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:325:21: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (int i = 0; i < creation_types.size(); ++i) { // from sampled data or reference
~ ^ ~~~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/test_stream.cpp:326:23: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (int j = 0; j < batch_counts.size(); ++j) {
~ ^ ~~~~~~~~~~~~~~~~~~~
/__w/1/s/tests/cpp_tests/testutils.cpp:318:26: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
for (auto i = 0; i < ref_init_scores->size(); i++) {
~ ^ ~~~~~~~~~~~~~~~~~~~~~~~
Thanks a lot for this awesome PR!
I don't have any additional comments except the two minor ones below.
However, I'm not qualified to formally approve this PR.
I believe that should be done by @shiyu1994 and/or @guolinke (there have been a lot of code changes since his last review).
Kindly pinging @AlbertoEAF, as you may be interested in this new functionality 🙂
I'm OK with the new changes. Thanks!
@jameslamb Curious... Is there a way to re-run an individual check? There is a lint failure on a file I didn't edit, and a couple of jobs that just seemed to time out. The only way I know to rerun those checks is to push some change. Other than that, it seems to me that the only approval left is for @guolinke to check it over again. Thanks!
Admins in this project can do that. For you, please just merge the latest master into this branch, for the extra benefits described in #5368 (comment).
@jameslamb branch merged with master, and looks all green except for a timeout on 1 check.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Background
Currently, Azure SynapseML wraps the LightGBM library to make it much easier for customers to use it in a distributed Spark environment. The initial approach taken by SynapseML uses the LGBM_DatasetCreateFromMat API to create a Dataset for either each Spark partition or each executor (user option). However, this method requires the SynapseML Scala code to load the entire uncompressed data into memory before sending it to LightGBM, and to merge partition data together manually. This amounts to using an order of magnitude or more memory (raw double arrays and multiple copies) over what LightGBM Datasets use internally (binned data). This requires larger Spark clusters than are really needed, and often causes OOM issues.

In order to improve the memory performance of SynapseML-wrapped LightGBM, we decided to convert the Dataset creation into more of a "streaming" scenario (as opposed to the above "bulk" input matrices), where we do NOT create large arrays on the client side. After initial investigation, there were existing LightGBM APIs that seemed to fit this purpose: LGBM_DatasetCreateFromSampledColumn and LGBM_DatasetPushRows[byCSR]. These seemed to allow creation of a "reference" Dataset with defined feature groups and bins, and pushing small micro-batches of data into the set. However, these APIs suffered from several significant issues as currently implemented:

- FinishDataset is called when the literal last index is pushed.
- Metadata must still be handled client-side.

Changes
must still be handled client sideChanges
This PR adds APIs to LightGBM to fix these issues and create a true “streaming” flow where microbatches of full rows, including metadata, can be pushed directly into LightGBM
Dataset
format. Control over FinishLoad() is given to client to eliminate problem with push order.The new general streaming flow uses the following APIs:
DatasetCreateFromSampledColumn
(final number of rows) with samples from partition2.
DatasetInitStreaming
, to initialize allocation and other setup3.
DatasetPushRows[byCSR]WithMetadata
over the Spark RowIterator
4.
DatasetMarkFinished
Note that there shouldn't be any impact of this PR on training iteration code. The changes are mostly limited to initial Dataset creation and loading.

Testing
I added C++ tests for all of the above so we could test the functionality intensively. There are now C++ test files for streaming. Both dense and sparse scenarios are covered, micro-batch sizes of 1 and 2+, and all types of Metadata.

Also, the jar created from this PR was used in SynapseML, and it passed an extensive list of tests covering old "bulk" mode vs. new "streaming" mode, sparse vs. dense, binary classification vs. multiclass, regression and ranking, weights, initial data in binary and multiclass, validation data, and more. Basically our entire SynapseML LightGBM test suite now passes in both streaming and the older bulk mode (we kept both to be able to compare and test performance).