Support Parallel Data Loading for Shufflable Iterable Datasets/DataStreams #100

Open
alex-jw-brooks wants to merge 6 commits into main

Conversation

@alex-jw-brooks (Collaborator) commented Jul 25, 2023

This PR adds support for multiple workers when processing an iterable dataset in such a way that:

  • Data is split evenly across the workers in a true partition, i.e., no sampling
  • The data can still be shuffled after every iteration, even with multiple workers

The caveat to this is that for the shuffling to work correctly, we need to use persistent_workers=True when creating our data loader.

This is accomplished by defining a shuffle_seed, which is essentially a random seed that gets incremented every time we cycle through the data and is used as the seed when creating the shuffled stream generator. The workers must be persistent; otherwise the shuffle_seed is reset on every iteration. With persistent workers, this approach lets us shuffle consistently across workers without them communicating.

Then, to divide the data across n workers, we create an iterator that yields every nth item of the preprocessed stream (which is already shuffled at this point), with an offset based on the worker ID.
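
A rough sketch of the idea (the class below and its attribute names are illustrative assumptions, not the exact code in this PR):

import random
from torch.utils.data import IterableDataset, get_worker_info

class ShardedShuffleStream(IterableDataset):
    def __init__(self, data, shuffle=True):
        self.data = list(data)
        self.shuffle = shuffle
        # Incremented on every pass; this only survives across epochs if the
        # worker processes are persistent.
        self.shuffle_seed = 0

    def __iter__(self):
        items = list(self.data)
        if self.shuffle:
            # Every worker uses the same seed, so they all see the same order
            random.Random(self.shuffle_seed).shuffle(items)
            self.shuffle_seed += 1
        worker_info = get_worker_info()
        num_workers = worker_info.num_workers if worker_info is not None else 1
        worker_id = worker_info.id if worker_info is not None else 0
        # Yield every nth item with an offset of the worker ID, so the workers
        # form a true partition of the shuffled data with no duplication
        for idx, item in enumerate(items):
            if idx % num_workers == worker_id:
                yield item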

Also adds docstrings to the stream wrapper and caches the stream length, since len() is an expensive operation on the data stream.
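
For illustration, caching the length can look roughly like this (a generic sketch with assumed attribute names, not the PR's actual code):

class CachedLenWrapper:
    def __init__(self, stream):
        self.stream = stream
        self._cached_len = None

    def __len__(self):
        # len() may require a full pass over the underlying data stream, so
        # compute it once and reuse the cached value on later calls
        if self._cached_len is None:
            self._cached_len = len(self.stream)
        return self._cached_len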

Closes: #74

Signed-off-by: Alex-Brooks <[email protected]>
@alex-jw-brooks (Collaborator, Author) commented Jul 25, 2023

Minimal example showing shuffling (run from a venv on Linux):

import torch
from caikit.core.data_model import DataStream
from caikit_nlp.toolkit.data_stream_wrapper import SimpleIterableStreamWrapper

SAMPLE_DATA = [{"label": "a"}, {"label": "b"}, {"label": "c"}, {"label": "d"}]
SAMPLE_STREAM = DataStream.from_iterable(SAMPLE_DATA)
wrapper = SimpleIterableStreamWrapper(stream=SAMPLE_STREAM, shuffle=True)

torch_loader = torch.utils.data.DataLoader(
    wrapper,
    num_workers=2,
    persistent_workers=True, # Needed, otherwise every iteration shuffles the same way!
)

for epoch in range(3):
    for idx, x in enumerate(torch_loader):
        print(x)
    print("Finished iteration: {}".format(epoch))

Sample output:

{'label': ['c']}
{'label': ['b']}
{'label': ['d']}
{'label': ['a']}
Finished iteration: 0
{'label': ['c']}
{'label': ['d']}
{'label': ['b']}
{'label': ['a']}
Finished iteration: 1
{'label': ['b']}
{'label': ['a']}
{'label': ['c']}
{'label': ['d']}
Finished iteration: 2

In some preliminary benchmarking I did, this is unfortunately slower than running with no worker processes, at least for the way we handle tokenizer mapping onto train streams in prompt tuning (on the order of 2-3x slower). While a bit of a bummer, this is a generic utility for data streams, and it may be beneficial for re-entrant streams that have hefty iteration costs, e.g., loading from files.

@alex-jw-brooks (Collaborator, Author) commented Jul 26, 2023

There are some other potential optimizations that could be made around this, but they break the genericism a bit; it might be better to get this in first and handle the optimizations as a follow-up.

The two main ones I can think of are:

  • Tokenizer function mapping. Mapping over the data stream this way effectively builds an on-the-fly tokenizer that retokenizes every time because of the way re-entry on the iterator works, i.e., the same situation as the sample code below:
import caikit
s = caikit.core.data_model.DataStream.from_iterable([1])

def map_func(example):
    print(f"Called the map func on example: {example}") # printed 10 times since every iteration calls this again upon reentry
    return example + 1

mapped_s = s.map(map_func)
for _ in range(10):
    for x in mapped_s:
        pass
  • Skipping tokenization of unyielded samples. The retokenization above is a much bigger issue with the approach we have here, which effectively divides the iterator across n processes, because we waste time tokenizing samples that are never yielded by a given worker, since they are supposed to be yielded by other processes. Since we know which samples each worker will yield, it's probably a good idea to hold the function to be mapped in the stream wrapper and only apply it while iterating over the samples actually being yielded, assuming we want on-the-fly tokenization. I.e., if we have 4 processes, each worker only applies the mapped function to every 4th sample (see the sketch after this list). This one could also be added to this PR since it's more isolated from the models using it; I don't have a strong preference either way.
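
As a sketch of that second idea (the class and names below are hypothetical, not part of this PR):

from torch.utils.data import IterableDataset, get_worker_info

class LazyMappedStream(IterableDataset):
    def __init__(self, data, map_func):
        self.data = list(data)
        # Hold the map function (e.g., on-the-fly tokenization) instead of
        # pre-mapping the whole stream
        self.map_func = map_func

    def __iter__(self):
        worker_info = get_worker_info()
        num_workers = worker_info.num_workers if worker_info is not None else 1
        worker_id = worker_info.id if worker_info is not None else 0
        for idx, item in enumerate(self.data):
            # Only pay the mapping cost for samples this worker actually yields
            if idx % num_workers == worker_id:
                yield self.map_func(item)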

Successfully merging this pull request may close these issues.

Change SimpleIterableStreamWrapper to work with multiple workers allowing shuffling