
feature: Add true streaming APIs to reduce client-side memory usage (for SynapseML) #5291

Closed · wanted to merge 33 commits

Conversation

@svotaw (Contributor) commented on Jun 14, 2022

Background

Currently, Azure SynapseML wraps the LightGBM library to make it much easier for customers to use in a distributed Spark environment. The initial approach taken by SynapseML uses the LGBM_DatasetCreateFromMat API to create a Dataset for each Spark partition or each executor (a user option). However, this method requires the SynapseML Scala code to load the entire uncompressed dataset into memory before handing it to LightGBM, and to merge partition data together manually. This consumes an order of magnitude or more memory (raw double arrays, plus multiple copies) beyond what LightGBM Datasets use internally (binned data), forcing customers onto larger Spark clusters than training actually needs and often causing OOM failures.
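For context, a minimal sketch of that bulk path using the existing LGBM_DatasetCreateFromMat C API (the helper name and the assumption of a dense row-major double matrix are illustrative):

```cpp
#include <LightGBM/c_api.h>
#include <cstdint>
#include <vector>

// Bulk path: the caller must materialize the whole partition as one dense
// double matrix before LightGBM ever sees a row, so peak client memory is
// roughly the raw copy plus whatever copies were made while merging
// partitions, on top of LightGBM's binned representation.
int CreateBulkDataset(const std::vector<double>& matrix,  // nrow * ncol
                      int32_t nrow, int32_t ncol, DatasetHandle* out) {
  return LGBM_DatasetCreateFromMat(matrix.data(), C_API_DTYPE_FLOAT64,
                                   nrow, ncol, /*is_row_major=*/1,
                                   /*parameters=*/"", /*reference=*/nullptr,
                                   out);
}
```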

To improve the memory performance of SynapseML-wrapped LightGBM, we decided to convert Dataset creation into more of a “streaming” scenario (as opposed to the “bulk” input matrices above), where we do NOT create large arrays on the client side. After initial investigation, there were existing LightGBM APIs that seemed to fit this purpose: LGBM_DatasetCreateFromSampledColumn and LGBM_DatasetPushRows[byCSR]. These seemed to allow creating a “reference” Dataset with defined feature groups and bins, and pushing small microbatches of data into it. However, these APIs suffer from several significant issues as currently implemented:

  1. Not thread safe for parallelism, since FinishLoad is triggered automatically when the literal last row index is pushed.
  2. Still require knowing the Dataset size up front (inconvenient for a Spark Row iterator or general streaming).
  3. Only push feature data; Metadata (labels, weights, etc.) must still be handled client side.
  4. Do not handle partial Datasets or merging Datasets together.
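For reference, the existing push API has the following shape in c_api.h; the added comment flags the behavior behind issues 1–3:

```cpp
// Existing API: pushes nrow rows starting at start_row into a Dataset whose
// total row count was fixed at creation. Internally, loading is finished as
// soon as the push containing the final row index arrives, so concurrent
// pushers can race on the finish (issue 1) and the total size must be known
// in advance (issue 2); labels, weights, etc. are not handled (issue 3).
int LGBM_DatasetPushRows(DatasetHandle dataset,
                         const void* data,
                         int data_type,
                         int32_t nrow,
                         int32_t ncol,
                         int32_t start_row);
```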

Changes

This PR adds APIs to LightGBM to fix these issues and create a true “streaming” flow where microbatches of full rows (including Metadata) can be pushed directly into LightGBM's Dataset format. The general idea is that we still create an initial Dataset with DatasetCreateFromSampledColumn, but then create the other “real” Datasets from that reference. Note that this also fixes a problem where DatasetCreateFromMat sampled bins only from each separate partition's data rather than from the full data corpus.
The new general flow within SynapseML can be described as two approaches: simple streaming and coalesced streaming. With simple streaming, the row count must be known up front, and the flow uses the following APIs (a worker-side sketch follows the Workers list):
Driver:

  1. DatasetCreateFromSampledColumn (with the final number of rows), using samples drawn from the full DataFrame
  2. DatasetSerializeReferenceToBinary, to create a reference byte array to pass to the workers

Workers:

  1. Count the partition size
  2. DatasetCreateFromSerializedReference(size), using the passed buffer
  3. DatasetSetWaitForManualFinish(true), to take manual control over finishing the Dataset
  4. DatasetPushRows[byCSR]WithMetadata over the Spark Row iterator
  5. DatasetMarkFinished
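A minimal worker-side sketch of this simple flow. The exact parameter lists, the helper name, and the single dense microbatch are illustrative assumptions paraphrased from the description above, not quoted from a final header:

```cpp
#include <LightGBM/c_api.h>
#include <cstdint>

// Hypothetical worker-side helper: one dense microbatch push is shown; in
// practice step 4 repeats over the Spark Row iterator, and the CSR variant
// covers sparse rows.
void StreamPartition(const void* ref_buffer, int32_t ref_buffer_size,
                     int64_t partition_rows,   // step 1: counted up front
                     const double* features,   // nrow x ncol, row-major
                     const float* labels, const float* weights,
                     int32_t nrow, int32_t ncol, int32_t start_row) {
  DatasetHandle dataset = nullptr;
  // Step 2: re-create the reference Dataset (feature groups, bin mappers)
  // from the driver's serialized buffer, sized to this partition.
  LGBM_DatasetCreateFromSerializedReference(ref_buffer, ref_buffer_size,
                                            partition_rows, /*num_classes=*/1,
                                            /*parameters=*/"", &dataset);
  // Step 3: opt out of auto-finish, so a push that happens to contain the
  // last row index cannot finish the Dataset while other threads still push.
  LGBM_DatasetSetWaitForManualFinish(dataset, 1);
  // Step 4: one microbatch of features AND metadata in a single call;
  // nullptr stands in for the optional init_score and query/group arrays.
  LGBM_DatasetPushRowsWithMetadata(dataset, features, C_API_DTYPE_FLOAT64,
                                   nrow, ncol, start_row, labels, weights,
                                   /*init_score=*/nullptr, /*query=*/nullptr);
  // Step 5: explicitly finish (bin finalization, metadata checks) once the
  // iterator is drained.
  LGBM_DatasetMarkFinished(dataset);
}
```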

This simple streaming scenario works in some limited cases, but it requires counting rows up front and is not particularly flexible. To address a truly “unbounded” streaming scenario, we also added APIs for a “coalesced” flow. The basic approach is to create intermediate Datasets of some fixed “chunk size” (which are never “Finished”), and then merge them into one final Dataset that is actually used for training iterations. Datasets track whether they are only “partially” loaded by keeping a num_pushed_rows count. The Driver flow is the same, but the worker flow is different (a sketch follows the list):
Workers:

  1. DatasetCreateFromSerializedReference(semi-arbitrary chunk size)
  2. DatasetSetWaitForManualFinish(true), to take manual control over FinishLoad
  3. DatasetPushRows[byCSR]WithMetadata over the Spark Row iterator
  4. If the chunk Dataset fills, repeat from step 1 with a new Dataset and add it to the list
  5. Sum num_pushed_rows across all chunk Datasets to get the final num_rows
  6. Final Dataset = DatasetCreateFromSerializedReference(final num_rows)
  7. DatasetCoalesce(target final Dataset, source list of Dataset chunks) – note that this calls FinishLoad internally on the target after coalescing
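A worker-side sketch of the coalescing step, under the same caveat: DatasetCoalesce exists only in this PR, and the signatures below are guesses from the description (the chunk-filling loop is the same push pattern as the simple-flow sketch above):

```cpp
#include <LightGBM/c_api.h>
#include <cstdint>
#include <vector>

// Hypothetical helper for steps 5-7. The chunk Datasets were created and
// filled exactly as in the simple flow, but never marked finished; each one
// remembers its own num_pushed_rows, whose sum is passed in here.
int CoalesceChunks(const void* ref_buffer, int32_t ref_buffer_size,
                   std::vector<DatasetHandle>* chunks,
                   int64_t total_pushed_rows,  // step 5: sum over all chunks
                   DatasetHandle* final_dataset) {
  // Step 6: create the final Dataset at its exact size from the same
  // serialized reference the chunks were built from.
  int err = LGBM_DatasetCreateFromSerializedReference(
      ref_buffer, ref_buffer_size, total_pushed_rows, /*num_classes=*/1,
      /*parameters=*/"", final_dataset);
  if (err != 0) return err;
  // Step 7: merge the partial chunks into the target; per the description,
  // this calls FinishLoad internally on the target once coalescing is done.
  return LGBM_DatasetCoalesce(*final_dataset, chunks->data(),
                              static_cast<int32_t>(chunks->size()));
}
```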

This “coalesced streaming” flow is very flexible, and allows us to optimize the final training Datasets by Spark partition, by executor, or however we want. There is no client-side memory overhead, except for a small fixed amount to hold a “microbatch” of pushed data (which can be shrunk to a single row if desired). And there is no extra special processing for Metadata, so we can send full Spark dataset “Rows” and LightGBM handles it all.
We added the above APIs in this review, and they largely do not affect existing APIs. I only refactored some existing code where it made more sense to share sections between old and new methods (e.g., reading a header from either a file or a serialized buffer).
I also added a new ByteBuffer class, to make it easier to pass an unbounded buffer back to the client layer. This avoids the awkward three-step flow of 1) call an API to get the size, 2) allocate a native array, and 3) call a final API to fill the buffer; the ByteBuffer does this in one pass. This also required a little refactoring of the Writer, since I wanted to share code between file and memory-buffer operations.
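A sketch of the resulting one-pass pattern on the driver, again with signatures paraphrased from this PR rather than quoted from a header (SerializeReference is an illustrative helper name):

```cpp
#include <LightGBM/c_api.h>
#include <cstdint>
#include <vector>

// One call produces a native ByteBuffer that LightGBM grew itself, so the
// client never has to pre-query a size and allocate its own native array.
int SerializeReference(DatasetHandle reference, std::vector<uint8_t>* out) {
  ByteBufferHandle buffer = nullptr;
  int32_t len = 0;
  int err = LGBM_DatasetSerializeReferenceToBinary(reference, &buffer, &len);
  if (err != 0) return err;
  out->resize(static_cast<size_t>(len));
  for (int32_t i = 0; i < len; ++i) {
    LGBM_ByteBufferGetAt(buffer, i, &(*out)[i]);  // copy out to client memory
  }
  return LGBM_ByteBufferFree(buffer);  // release the native buffer
}
```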
Note that there shouldn’t be any impact of this PR on training iteration code. The changes are mostly limited to initial Dataset creation and loading.

Testing

I added C++ tests for all of the above so we could exercise the functionality intensively. There are now C++ test files for serialization, streaming, coalescing, and the new ByteBuffer. Both dense and sparse scenarios are covered, 4-bit vs. 8+-bit bins, microbatch sizes of 1 and 2+, and all types of Metadata.
Also, the jar created from this PR was used in SynapseML, where it passed an extensive list of tests covering old “bulk” mode vs. new “streaming” mode, sparse vs. dense, binary classification vs. multiclass, regression and ranking, weights, initial scores in binary and multiclass, validation data, and more. Basically, our entire SynapseML LightGBM test suite now passes in both streaming mode and the older bulk mode (we kept both so we can compare and test performance).

@jameslamb (Collaborator) commented

@svotaw thanks for the contribution! I'm not ready to review it yet, but since I noticed you've been pushing multiple commits to try to work around linter issues, I wanted to let you know that you can find all the commands run and tools used by this project's lint CI job here:

LightGBM/.ci/test.sh, lines 70 to 91 at f96c4ce:
```bash
if [[ $TASK == "lint" ]]; then
    conda install -q -y -n $CONDA_ENV \
        cmakelint \
        cpplint \
        isort \
        mypy \
        pycodestyle \
        pydocstyle \
        "r-lintr>=2.0,<3.0"
    echo "Linting Python code"
    pycodestyle --ignore=E501,W503 --exclude=./.nuget,./external_libs . || exit -1
    pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^external_libs|test|example).*" --match="(?!^test_|setup).*\.py" . || exit -1
    isort . --check-only || exit -1
    mypy --ignore-missing-imports python-package/ || true
    echo "Linting R code"
    Rscript ${BUILD_DIRECTORY}/.ci/lint_r_code.R ${BUILD_DIRECTORY} || exit -1
    echo "Linting C++ code"
    cpplint --filter=-build/c++11,-build/include_subdir,-build/header_guard,-whitespace/line_length --recursive ./src ./include ./R-package ./swig ./tests || exit -1
    cmake_files=$(find . -name CMakeLists.txt -o -path "*/cmake/*.cmake")
    cmakelint --linelength=120 --filter=-convention/filename,-package/stdargs,-readability/wonkycase ${cmake_files} || exit -1
    exit 0
fi
```

You might find you have a faster development cycle by running those things directly in your local development environment.

@svotaw (Contributor, Author) commented on Jun 15, 2022

@jameslamb Thanks for the tip! Sadly, I already have all linting passing now, but next time I will know. :) Looks like everything is green except one R build, for some reason, and I have a path issue where the tests find the file on Windows but not on Linux/Mac. I will fix that when I get a chance. The Windows vcxproj file was updated by me to Win10 (it must have been done automatically by my local tools), but I can revert that if needed once things are all settled.

@svotaw (Contributor, Author) commented on Jun 16, 2022

Some PR guidance: the PR looks large, but half of it is test code, and a big chunk is just a refactor of the DatasetLoader to share serialization code (which git does not render well, since it is mostly large blocks of code moved around). There are only a few hundred lines of “interesting” changes, mostly in the APIs and in the DenseBin, SparseBin, and Metadata classes, to allow streaming and insertion of data into them. I am happy to walk anyone through anything they'd like more details on, “in person” or asynchronously through email/PR comments. With these changes, we already have our full streaming flow working on the SynapseML layer as a POC, and it is a big improvement. Hopefully other people can take advantage of these LightGBM improvements as well to reduce their memory usage.

All CI checks are passing now, except for a few R builds, which seem to be failing for reasons unrelated to this PR. I will continue investigating.

@svotaw (Contributor, Author) commented on Jun 16, 2022

Related issue created here: #5298

@svotaw (Contributor, Author) commented on Jun 16, 2022

We are going to do some restructuring of the PR to split out things into separate PRs, so hold off on review for now.

svotaw marked this pull request as a draft on June 16, 2022
@jameslamb (Collaborator) commented

> split out things into separate PRs

That would be VERY appreciated, thank you!

@svotaw (Contributor, Author) commented on Jun 16, 2022

The new PR is here: #5299. It's less than half the size and only adds the streaming changes, not serialization or coalescing. I'm leaving this PR here for a bit until the new one has been looked at by our team and I've done more testing with the limited build. We will extract more changes from it later.

svotaw closed this on Jun 24, 2022
@github-actions (bot) commented

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked this as resolved and limited conversation to collaborators on Aug 19, 2023