Align IterableDataset.shuffle with Dataset.shuffle #3842

lhoestq · 2022-03-07T12:10:46Z

From #3444, Dataset.shuffle can have the same API as IterableDataset.shuffle (i.e. in streaming mode).

You can already pass an optional seed to both. However, IterableDataset.shuffle currently always requires a buffer_size, which is used for approximate shuffling. I propose using a reasonable default value (maybe 1000) instead.

In this PR, I set the default buffer_size value to 1,000, and I reorder the IterableDataset.shuffle arguments to match Dataset.shuffle, i.e. making seed the first argument.
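
For readers unfamiliar with buffer-based approximate shuffling, here is a minimal sketch of the general technique in plain Python. It is not the code from this PR: the function name `buffer_shuffle` and its internals are hypothetical, and only the argument order (seed first) and the default buffer_size of 1000 mirror what is described above.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffer_shuffle(
    examples: Iterable[T],
    seed: Optional[int] = None,
    buffer_size: int = 1000,
) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for example in examples:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(example)
            continue
        # Buffer is full: emit a random element and store the new one in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = example
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer
```

An example can only be emitted once it has entered the buffer, so the result is a local shuffle rather than a uniform permutation of the whole stream. A larger buffer_size gives a better shuffle at the cost of more memory.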

HuggingFaceDocBuilderDev · 2022-03-07T12:16:57Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

mariosasko · 2022-03-07T13:38:51Z

We should also add generator as a param to shuffle to fully align the APIs, no?

lhoestq · 2022-03-07T16:27:07Z

I added the generator argument.

I had to make a few other adjustments to make it work. In particular when you call set_epoch() on a streaming dataset, it updates the underlying random generator by using a new effective seed. The effective seed is generated using the previous generator and the epoch number.
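
To illustrate the idea of deriving a per-epoch seed from a base generator, here is a rough sketch using the standard library's `random.Random` as a stand-in for the real generator type. The function name `generator_for_epoch` and the way the two values are combined (a draw from the generator plus the epoch number) are assumptions made for illustration, not the code in this PR.

```python
import random


def generator_for_epoch(base: random.Random, epoch: int) -> random.Random:
    """Return a new generator whose seed depends on `base` and `epoch` (illustrative only)."""
    # Work on a copy so that the state of the base generator is left untouched.
    scratch = random.Random()
    scratch.setstate(base.getstate())
    # Assumption: combine one draw from the copied generator with the epoch number.
    effective_seed = scratch.randrange(1 << 63) + epoch
    return random.Random(effective_seed)
```

With this scheme the same base generator and epoch always give the same shuffling order, while changing the epoch changes the effective seed and therefore the order.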

mariosasko

Thanks! One comment:

src/datasets/iterable_dataset.py

Co-authored-by: Mario Šaško <[email protected]>

lhoestq added 3 commits March 7, 2022 13:07

set buffer_size=1000 + reorder shuffle args

e344015

tests

2f11541

docs

a08d297

lhoestq requested review from albertvillanova and mariosasko March 7, 2022 12:10

PhaniKanagala approved these changes Mar 7, 2022

add generator arg

60833d0

mariosasko approved these changes Mar 7, 2022

src/datasets/iterable_dataset.py (resolved)

lhoestq and others added 3 commits March 7, 2022 19:34

Update src/datasets/iterable_dataset.py

c00bbb0

Co-authored-by: Mario Šaško <[email protected]>

forgot parenthesis

3e169f3

typo

63a78b8

lhoestq merged commit 2a5149b into master Mar 7, 2022

lhoestq deleted the align-iterable-dataset-shuffle branch March 7, 2022 19:03

lhoestq mentioned this pull request Mar 10, 2022

Fix some shuffle docs #3885

Merged
