feature: Add true streaming APIs to reduce client-side memory usage (for SynapseML) #5291

svotaw · 2022-06-14T21:28:59Z

Background

Currently, Azure SynapseML wraps the LightGBM library to make it much easier for customers to use it in a distributed Spark environment. The initial approach taken by SynapseML uses the LGBM_DatasetCreateFromMat API to create a Dataset for either each Spark partition or each executor (user option). However, this method requires the SynapseML Scala code to load the entire uncompressed data into memory before sending it to LightGBM, and merge partition data together manually. This amounts to using an order of magnitude or more memory (raw double arrays and multiple copies) over what LightGBM Datasets use internally (binned data). This requires larger Spark clusters than are really needed, and often causes OOM issues.

In order to improve the memory performance of SynapseML-wrapped LightGBM, we decided to convert the Dataset creation into more of a “streaming” scenario (as opposed to the above “bulk” input matrices), where we do NOT create large arrays on client side. After initial investigation, there were existing LightGBM APIs that seemed to fit this purpose: LGBM_DatasetCreateFromSampledColumn and LGBM_DatasetPushRows[byCSR]. This seemed to allow creation of a “reference” Dataset with defined feature groups and bins, and to push small micro-batches of data into the set. However, these APIs suffered from several significant issues as currently implemented:

Not thread safe for parallelism, since FinishDataset is called when the literal last index is pushed.
Still requires knowing Dataset size up front (not convenient for Spark Row Iterator or general streaming)
Only pushes feature data, and Metadata must still be handled client side
Does not handle partial Datasets or merging Datasets together

Changes

This PR adds APIs to LightGBM to fix these issues and create a true “streaming” flow where microbatches of full rows (including metadata) can be pushed directly into LightGBM Dataset format. The general idea is that we still create an initial Dataset with DatasetCreateFromSampledColumn, but then create other “real” Datasets from that reference. Note that this also fixes a problem where DatasetCreateFromMat does not use the full data corpus to sample the data, and only the data from each separate partition.
The new general flow within SynapseML can be described as 2 approaches: simple streaming and coalesced streaming. With simple streaming, the count of rows must be known up front and uses the following APIs:
Driver:

DatasetCreateFromSampledColumn (final number of rows) with samples from full dataframe
DatasetSerializeReferenceToBinary to create a reference ByteArray to pass to workers

Workers:

Count partition size
DatasetCreateFromSerializedReference(size) using passed buffer
DatasetSetWaitForManualFinish(true), to take manual control over Dataset Finish
DatasetPushRows[byCSR]WithMetadata over the Spark Row Iterator
DatasetMarkFinished

This simple streaming scenario works in some limited cases, but requires counting up front and is not particularly flexible. To address a true “unbounded” streaming scenario, we also added APIs for a “coalesced flow”. The basic approach is to create some intermediate Datasets of some fixed “chunk size” (which are never “Finished”), and then merge them into 1 final Dataset to actually use for training iterations. Datasets keep track if they are only “partially” loaded by keeping a num_pushed_rows. The Driver flow is the same, but the worker flow is different:
Workers

DatasetCreateFromSerializedReference(semi-arbitrary chunk size)
DatasetSetWaitForManualFinish(true), to take manual control over FinishLoad
DatasetPushRows[byCSR]WithMetadata over Spark Row Iterator
If dataset fills, repeat from 1 with new Dataset and add to list
Count sum of num_pushed_rows from all Datasets, creating a final num_rows
Final Dataset = DatasetCreateFromSerializedReference(final num_rows)
DatasetCoalesce(target final Dataset, source list of Dataset chunks) – note this calls FinishLoad internally on the target after coalescing

This “coalesced streaming” flow is very flexible, and allows us to optimize the final training Datasets by Spark partition, executor, or whatever we want. There is no client-side memory overhead, except for a small fixed amout to hold a “microbatch” of pushed data (which can be pushed down to 1 row if desired). And there is no extra special processing for Metadata, so we can send full Spark dataset “Rows” and LightGBM handles it all.
We added the above APIs in this review, which largely do not affect existing APIs. I only refactored some existing code where it made more sense to share sections between old and new methods (e.g. reading header from either file or serialized buffer).
I also added a new ByteBuffer class, to make it easier to pass unbounded buffer back to client layer. This avoids the odd flow of 1) call and get size, 2) allocate some native array, 3) call final API to load buffer. The ByteBuffer can do this in one pass. This also required a little refactoring of the Writer since I wanted to share code between file and memory buffer operations.
Note that there shouldn’t be any impact of this PR on training iteration code. The changes are mostly limited to initial Dataset creation and loading.

Testing

I added C++ tests for all of the above so we could test the functionality intensively. There are now C++ test files for serialization, streaming, coalescing, and the new byte buffer. Both dense and sparse scenarios are covered, 4bit vs 8+bit, microbatch sizes of 1 and 2+, and all types of Metadata.
Also, the jar created from this PR was used in SynapseML, and passed an extensive list of tests covering old “bulk” mode vs new “streaming” mode, sparse vs dense, binary classification vs multiclass, regression and ranking, weights, initial data in binary and multiclass, validation data, and more. Basically our entire SynapseML LightGBM test suite now passes in both streaming and the older bulk mode (we kept them both to be able to compare and test performance).

jameslamb · 2022-06-15T19:34:53Z

@svotaw thanks for the contribution! I'm not ready to review it yet, but since I noticed that you've been pushing multiple commits to try to work around linter issues...wanted to let you know that you can find all the commands run and tools used by the lint CI job for this project here:

LightGBM/.ci/test.sh

Lines 70 to 91 in f96c4ce

    
           if [[ $TASK == "lint" ]]; then 
        
               conda install -q -y -n $CONDA_ENV \ 
        
                   cmakelint \ 
        
                   cpplint \ 
        
                   isort \ 
        
                   mypy \ 
        
                   pycodestyle \ 
        
                   pydocstyle \ 
        
                   "r-lintr>=2.0,<3.0" 
        
               echo "Linting Python code" 
        
               pycodestyle --ignore=E501,W503 --exclude=./.nuget,./external_libs . || exit -1 
        
               pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^external_libs|test|example).*" --match="(?!^test_|setup).*\.py" . || exit -1 
        
               isort . --check-only || exit -1 
        
               mypy --ignore-missing-imports python-package/ || true 
        
               echo "Linting R code" 
        
               Rscript ${BUILD_DIRECTORY}/.ci/lint_r_code.R ${BUILD_DIRECTORY} || exit -1 
        
               echo "Linting C++ code" 
        
               cpplint --filter=-build/c++11,-build/include_subdir,-build/header_guard,-whitespace/line_length --recursive ./src ./include ./R-package ./swig ./tests || exit -1 
        
               cmake_files=$(find . -name CMakeLists.txt -o -path "*/cmake/*.cmake") 
        
               cmakelint --linelength=120 --filter=-convention/filename,-package/stdargs,-readability/wonkycase ${cmake_files} || exit -1 
        
               exit 0 
        
           fi

You might find you have a faster development cycle by running those things directly in your local development environment.

svotaw · 2022-06-15T22:29:34Z

@jameslamb Thanks for the tip! Sadly I have all linting passing now, but next time I will know. :) Looks like everything is green, except 1 R build for some reason, and I have a path issue where tests find the file in Windows but not in Linux/Mac. Will fix that when I get a chance. The Windows cxproj file was updated by me to Win10 (must have been done locally automatically), but I can revert those if needed once things are all settled.

svotaw · 2022-06-16T13:26:31Z

Some PR guidance: The PR looks large, but it’s half test code, and a big chunk is just some refactor of the DatasetLoader to share serialization code (which git does not recognize very well as just some big chunks of code moved around). There are only a few hundred lines of “interesting” changes, mostly in the APIs, and the DenseBin, SparseBin, and Metadata classes to allow streaming and insertion of data into them. I am happy to walk anyone through anything they’d like more details on, “in person” or asynchronously through email/PR comments. With these changes, we have our full streaming on the SynapseML layer working already as a POC, and it is a big improvement. Hopefully other people can take advantage of these LightGBM improvements as well to reduce their memory usage.

All CI elements are passing now, except for a few R builds, which seem to be issues unrelated to the PR. I will continue investigating.

svotaw · 2022-06-16T16:51:23Z

Related issue created here: #5298

svotaw · 2022-06-16T18:48:26Z

We are going to do some restructuring of the PR to split out things into separate PRs, so hold off on review for now.

jameslamb · 2022-06-16T19:26:15Z

split out things into separate PRs

That would be VERY appreciated, thank you!

svotaw · 2022-06-16T22:58:30Z

New PR is here: #5299. It's less than half the size and only adds streaming changes, not serialization or coalescing. Leaving this PR here for a bit until new one has been looked at by our team, and I've done more testing with the limited build. We will extract some more changes from it later.

github-actions · 2023-08-19T03:35:38Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

svotaw added 18 commits May 12, 2022 13:41

Serialization working in tests

e93a369

Merge branch 'master' into poc-lgbm-edits

21d234c

passing coalesce tests

821ed13

more tests

15fafb6

sparse tests pass

1102420

added dataset free calls

4bf0a41

some fixes from testing

b78dc4f

some cleanup, tests pass

c170eee

api changes

fa4b6e1

some cleanup

0a6f60b

more cleanup, and added CUDA

bf4ce6a

cleanup

d75baed

cleanup

8dde92b

Merge branch 'master' into poc-lgbm-edits

0fe4797

cleanup

f2a1b4d

cleanup

45aa62b

improved metadata perf with memcpy plus cleanup

9cabc91

cleanup

18d5422

svotaw requested review from guolinke, shiyu1994, StrikerRUS and jameslamb as code owners June 14, 2022 21:28

svotaw added 7 commits June 14, 2022 17:04

missing files

6cca968

Linting fixes

aeecacb

more linting

b941bf6

more linting

c3c89e4

doc fixes

445e1bd

doc fixes

958bcd8

CI fixes

f357de0

more CI fixes

90c501b

svotaw added 7 commits June 15, 2022 16:51

testing linux path

5320a11

testing linux path

41f510f

testing linux path

eb9a9e1

Testing memory leaks

79a0811

fix path

384c447

more data leak fixes

f2844f4

more data leak fixes

a22eebe

imatiach-msft mentioned this pull request Jun 16, 2022

Errors Training on Large Dataset : LightGBM microsoft/SynapseML#1534

Open

svotaw mentioned this pull request Jun 16, 2022

Improve memory performance of Dataset loading by supporting a more streaming-based approach to loading Datasets. #5298

Closed

svotaw marked this pull request as draft June 16, 2022 18:48

svotaw closed this Jun 24, 2022

github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature: Add true streaming APIs to reduce client-side memory usage (for SynapseML) #5291

feature: Add true streaming APIs to reduce client-side memory usage (for SynapseML) #5291

svotaw commented Jun 14, 2022

jameslamb commented Jun 15, 2022

svotaw commented Jun 15, 2022

svotaw commented Jun 16, 2022

svotaw commented Jun 16, 2022

svotaw commented Jun 16, 2022

jameslamb commented Jun 16, 2022

svotaw commented Jun 16, 2022

github-actions bot commented Aug 19, 2023

feature: Add true streaming APIs to reduce client-side memory usage (for SynapseML) #5291

feature: Add true streaming APIs to reduce client-side memory usage (for SynapseML) #5291

Conversation

svotaw commented Jun 14, 2022

Background

Changes

Testing

jameslamb commented Jun 15, 2022

svotaw commented Jun 15, 2022

svotaw commented Jun 16, 2022

svotaw commented Jun 16, 2022

svotaw commented Jun 16, 2022

jameslamb commented Jun 16, 2022

svotaw commented Jun 16, 2022

github-actions bot commented Aug 19, 2023