feature: Add true streaming APIs to reduce client-side memory usage (for SynapseML) #5291
Conversation
@svotaw thanks for the contribution! I'm not ready to review it yet, but since I noticed that you've been pushing multiple commits to try to work around linter issues, I wanted to let you know that you can find all the commands run and tools used by the CI linting jobs (lines 70 to 91 in f96c4ce). You might find you have a faster development cycle by running those things directly in your local development environment.
@jameslamb Thanks for the tip! Sadly I have all linting passing now, but next time I will know. :) Looks like everything is green except one R build for some reason, and I have a path issue where tests find the file on Windows but not on Linux/Mac. Will fix that when I get a chance. The Windows vcxproj file was updated by me to Win10 (must have been done locally automatically), but I can revert those changes if needed once things are all settled.
Some PR guidance: the PR looks large, but half of it is test code, and a big chunk is just a refactor of the DatasetLoader to share serialization code (which git does not recognize well as big blocks of code moved around). There are only a few hundred lines of "interesting" changes, mostly in the APIs and in the DenseBin, SparseBin, and Metadata classes, to allow streaming and insertion of data into them. I am happy to walk anyone through anything they'd like more details on, "in person" or asynchronously through email/PR comments. With these changes, we already have our full streaming on the SynapseML layer working as a POC, and it is a big improvement. Hopefully other people can take advantage of these LightGBM improvements as well to reduce their memory usage. All CI elements are passing now except for a few R builds, which seem to be failing for reasons unrelated to this PR. I will continue investigating.
Related issue created here: #5298

We are going to do some restructuring of the PR to split things out into separate PRs, so hold off on review for now.

That would be VERY appreciated, thank you!

New PR is here: #5299. It's less than half the size and only adds streaming changes, not serialization or coalescing. Leaving this PR here for a bit until the new one has been looked at by our team and I've done more testing with the limited build. We will extract some more changes from it later.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Background
Currently, Azure SynapseML wraps the LightGBM library to make it much easier for customers to use in a distributed Spark environment. The initial approach taken by SynapseML uses the `LGBM_DatasetCreateFromMat` API to create a `Dataset` for either each Spark partition or each executor (a user option). However, this method requires the SynapseML Scala code to load the entire uncompressed data into memory before sending it to LightGBM, and to merge partition data together manually. This amounts to using an order of magnitude or more memory (raw double arrays and multiple copies) over what LightGBM `Datasets` use internally (binned data). This requires larger Spark clusters than are really needed, and often causes OOM issues.

In order to improve the memory performance of SynapseML-wrapped LightGBM, we decided to convert the `Dataset` creation into more of a "streaming" scenario (as opposed to the above "bulk" input matrices), where we do NOT create large arrays on the client side. After initial investigation, there were existing LightGBM APIs that seemed to fit this purpose: `LGBM_DatasetCreateFromSampledColumn` and `LGBM_DatasetPushRows[byCSR]`. These seemed to allow creation of a "reference" `Dataset` with defined feature groups and bins, and pushing small micro-batches of data into the set. However, these APIs suffer from several significant issues as currently implemented:

- `FinishDataset` is called automatically when the literal last index is pushed, with no manual control.
- The total `Dataset` size must be known up front (not convenient for a Spark `RowIterator` or general streaming).
- `Metadata` must still be handled client side.
- There is no way of serializing reference `Datasets` or merging `Datasets` together.
togetherChanges
This PR adds APIs to LightGBM to fix these issues and create a true "streaming" flow where micro-batches of full rows (including metadata) can be pushed directly into the LightGBM `Dataset` format. The general idea is that we still create an initial `Dataset` with `DatasetCreateFromSampledColumn`, but then create other "real" `Datasets` from that reference. Note that this also fixes a problem where `DatasetCreateFromMat` samples only the data from each separate partition rather than the full data corpus.

The new general flow within SynapseML can be described as two approaches: simple streaming and coalesced streaming. With simple streaming, the count of rows must be known up front, and the flow uses the following APIs (a worker-side sketch follows the lists):
Driver:

- `DatasetCreateFromSampledColumn` (final number of rows) with samples from the full dataframe
- `DatasetSerializeReferenceToBinary` to create a reference `ByteArray` to pass to workers

Workers:

- `DatasetCreateFromSerializedReference(size)` using the passed buffer
- `DatasetSetWaitForManualFinish(true)`, to take manual control over the `Dataset` finish
- `DatasetPushRows[byCSR]WithMetadata` over the Spark `RowIterator`
- `DatasetMarkFinished`
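A rough sketch of the worker side of this simple streaming flow. The `LGBM_*` prototypes below are hypothetical signatures based on the API names above, not the PR's actual declarations:

```cpp
#include <LightGBM/c_api.h>  // real header: DatasetHandle
#include <algorithm>
#include <cstdint>

// Hypothetical prototypes for the new APIs named above; the parameter lists
// here are assumptions for illustration only.
extern "C" {
int LGBM_DatasetCreateFromSerializedReference(const void* buffer, int64_t len,
                                              int32_t num_rows,
                                              DatasetHandle* out);
int LGBM_DatasetSetWaitForManualFinish(DatasetHandle dataset, int wait);
int LGBM_DatasetPushRowsWithMetadata(DatasetHandle dataset, const double* rows,
                                     int32_t nrow, int32_t ncol,
                                     const float* metadata, int32_t start_row);
int LGBM_DatasetMarkFinished(DatasetHandle dataset);
}

// Worker-side simple streaming: row count known up front, manual finish.
void WorkerSimpleStreaming(const void* ref_buf, int64_t ref_len,
                           const double* features,  // num_rows x ncol, row-major
                           const float* labels,     // stand-in for row metadata
                           int32_t num_rows, int32_t ncol) {
  DatasetHandle ds = nullptr;
  // Recreate the reference Dataset (bin mappers, feature groups) on the worker.
  LGBM_DatasetCreateFromSerializedReference(ref_buf, ref_len, num_rows, &ds);
  // Opt out of the implicit finish-on-last-index behavior.
  LGBM_DatasetSetWaitForManualFinish(ds, 1);
  // Push micro-batches (here: up to 1000 rows at a time) with their metadata.
  const int32_t kBatch = 1000;
  for (int32_t start = 0; start < num_rows; start += kBatch) {
    const int32_t nrow = std::min(kBatch, num_rows - start);
    LGBM_DatasetPushRowsWithMetadata(ds, features + int64_t{start} * ncol,
                                     nrow, ncol, labels + start, start);
  }
  // Explicitly finish now that all rows are pushed.
  LGBM_DatasetMarkFinished(ds);
}
```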
This simple streaming scenario works in some limited cases, but requires counting rows up front and is not particularly flexible. To address a true "unbounded" streaming scenario, we also added APIs for a "coalesced" flow. The basic approach is to create some intermediate `Datasets` of some fixed "chunk size" (which are never "finished"), and then merge them into one final `Dataset` to actually use for training iterations. `Datasets` keep track of whether they are only "partially" loaded by keeping a `num_pushed_rows` count. The Driver flow is the same, but the worker flow is different:

Workers:

- `DatasetCreateFromSerializedReference` (semi-arbitrary chunk size)
- `DatasetSetWaitForManualFinish(true)`, to take manual control over `FinishLoad`
- `DatasetPushRows[byCSR]WithMetadata` over the Spark `RowIterator`
- When a chunk fills up, create a new `Dataset` and add it to the list
- Sum `num_pushed_rows` from all `Datasets`, creating a final num_rows
- `DatasetCreateFromSerializedReference(final num_rows)`
- `DatasetCoalesce(target final Dataset, source list of Dataset chunks)` – note this calls `FinishLoad` internally on the target after coalescing

This "coalesced streaming" flow is very flexible, and allows us to optimize the final training `Datasets` by Spark partition, executor, or whatever we want. There is no client-side memory overhead, except for a small fixed amount to hold a "microbatch" of pushed data (which can be reduced to one row if desired). And there is no extra special processing for `Metadata`, so we can send full Spark dataset "Rows" and LightGBM handles it all. A sketch of this worker flow follows.
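This is an illustrative version of the coalesced worker flow, reusing the hypothetical declarations from the previous sketch plus an assumed coalesce signature; `RowSource` and `Batch` are stand-ins for the Spark `RowIterator` micro-batching:

```cpp
#include <cstdint>
#include <vector>

// Assumed signature for the coalesce API named above (illustrative only).
extern "C" int LGBM_DatasetCoalesce(DatasetHandle target,
                                    const DatasetHandle* sources,
                                    int32_t num_sources);

struct Batch { const double* features; const float* metadata; int32_t nrow; };
struct RowSource {
  bool HasNext() const;
  Batch NextBatch(int32_t max_rows);  // returns at most max_rows rows
};

void WorkerCoalescedStreaming(const void* ref_buf, int64_t ref_len,
                              RowSource& rows, int32_t ncol) {
  const int32_t kChunkSize = 100000;  // semi-arbitrary chunk size
  std::vector<DatasetHandle> chunks;
  int64_t total_rows = 0;

  while (rows.HasNext()) {
    // Start a new never-finished chunk from the serialized reference.
    DatasetHandle chunk = nullptr;
    LGBM_DatasetCreateFromSerializedReference(ref_buf, ref_len, kChunkSize,
                                              &chunk);
    LGBM_DatasetSetWaitForManualFinish(chunk, 1);
    int32_t pushed = 0;  // the PR tracks this as num_pushed_rows per Dataset
    while (pushed < kChunkSize && rows.HasNext()) {
      Batch b = rows.NextBatch(kChunkSize - pushed);
      LGBM_DatasetPushRowsWithMetadata(chunk, b.features, b.nrow, ncol,
                                       b.metadata, pushed);
      pushed += b.nrow;
    }
    chunks.push_back(chunk);  // chunk may be only partially filled
    total_rows += pushed;
  }

  // The row count is now known: create the final Dataset and merge the
  // chunks. DatasetCoalesce calls FinishLoad on the target internally.
  DatasetHandle final_ds = nullptr;
  LGBM_DatasetCreateFromSerializedReference(ref_buf, ref_len,
                                            static_cast<int32_t>(total_rows),
                                            &final_ds);
  LGBM_DatasetCoalesce(final_ds, chunks.data(),
                       static_cast<int32_t>(chunks.size()));
}
```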
We added the above APIs in this PR; they largely do not affect existing APIs. I only refactored some existing code where it made more sense to share sections between old and new methods (e.g. reading a header from either a file or a serialized buffer).
I also added a new `ByteBuffer` class, to make it easier to pass an unbounded buffer back to the client layer. This avoids the odd three-step flow of 1) call to get the size, 2) allocate a native array, 3) call a final API to fill the buffer. The `ByteBuffer` can do this in one pass. This also required a little refactoring of the Writer, since I wanted to share code between file and memory buffer operations.
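A minimal sketch of the idea behind such a class (illustrative, not the PR's exact class): a growable byte store that the native layer fills during serialization, after which the client reads size and bytes in one round trip.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Growable byte store in the spirit of the ByteBuffer described above.
// Replaces the size/allocate/fill three-step dance with a single pass.
class ByteBufferSketch {
 public:
  // Append raw bytes produced by the serializer.
  void Write(const void* data, size_t bytes) {
    const auto* src = static_cast<const uint8_t*>(data);
    buffer_.insert(buffer_.end(), src, src + bytes);
  }
  int64_t GetSize() const { return static_cast<int64_t>(buffer_.size()); }
  const uint8_t* Data() const { return buffer_.data(); }

 private:
  std::vector<uint8_t> buffer_;  // grows as serialization proceeds
};
```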
Dataset
creation and loading.Testing
I added C++ tests for all of the above so we could exercise the functionality intensively. There are now C++ test files for serialization, streaming, coalescing, and the new byte buffer. Both dense and sparse scenarios are covered, 4-bit vs 8+ bit bins, microbatch sizes of 1 and 2+, and all types of `Metadata`.
Also, the jar created from this PR was used in SynapseML and passed an extensive list of tests covering old "bulk" mode vs new "streaming" mode, sparse vs dense data, binary classification vs multiclass, regression and ranking, weights, initial data in binary and multiclass, validation data, and more. Basically, our entire SynapseML LightGBM test suite now passes in both streaming and the older bulk mode (we kept both so we can compare and test performance).
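For flavor, here is an illustrative gtest-style sketch of the kind of check such a C++ test can make (not a test from the PR; it reuses the hypothetical signatures from the sketches above, and `ref_buf`/`ref_len` are an assumed fixture holding a serialized reference):

```cpp
#include <gtest/gtest.h>
#include <cstdint>
#include <vector>

// Assumed fixture state: a serialized reference Dataset prepared elsewhere.
extern const void* ref_buf;
extern int64_t ref_len;

// Checks that a streamed Dataset reports the expected row count after a
// manual finish, pushing one row at a time (microbatch size of 1).
TEST(StreamingSketch, PushOneRowMicrobatchesThenManualFinish) {
  const int32_t kRows = 4, kCols = 2;
  const std::vector<double> features = {1.0, 2.0,  3.0, 4.0,
                                        5.0, 6.0,  7.0, 8.0};
  const std::vector<float> labels = {0.f, 1.f, 0.f, 1.f};

  DatasetHandle ds = nullptr;
  ASSERT_EQ(0, LGBM_DatasetCreateFromSerializedReference(ref_buf, ref_len,
                                                         kRows, &ds));
  LGBM_DatasetSetWaitForManualFinish(ds, 1);
  for (int32_t i = 0; i < kRows; ++i) {
    LGBM_DatasetPushRowsWithMetadata(ds, features.data() + i * kCols, 1, kCols,
                                     labels.data() + i, i);
  }
  LGBM_DatasetMarkFinished(ds);

  int num_rows = 0;
  ASSERT_EQ(0, LGBM_DatasetGetNumData(ds, &num_rows));  // real v3.x C API
  EXPECT_EQ(kRows, num_rows);
  LGBM_DatasetFree(ds);
}
```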