
Make DistributedSampler stateful #1269

Closed
gokulavasan opened this issue Jun 10, 2024 · 4 comments · Fixed by #1315
Labels: enhancement (New feature or request)

Comments

@gokulavasan (Contributor)

🚀 The feature

Currently, RandomSampler and BatchSampler are patched here (https://github.com/pytorch/data/blob/main/torchdata/stateful_dataloader/sampler.py#L134-L135) to make them stateful so that they work out of the box with StatefulDataLoader.

It would be useful to consider making DistributedSampler (https://github.com/pytorch/pytorch/blob/2176ef7dfaf02dd6dbb8484a50c99d5fadf3ea0b/torch/utils/data/distributed.py#L13) implement the stateful methods as well, and to patch it in torchdata.

Motivation, pitch

So that users can also use DistributedSampler out of the box with checkpointing capability.

Alternatives

Users would have to implement the stateful interface for DistributedSampler themselves by extending it.
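
For illustration, a minimal sketch of what such a user-side extension might look like, assuming the sampler only needs to record its epoch and how many indices it has handed out. `StatefulDistributedSampler` and its internal attribute names are hypothetical, not torchdata API:

```python
from torch.utils.data import DistributedSampler


class StatefulDistributedSampler(DistributedSampler):
    """Hypothetical extension: adds the state_dict/load_state_dict
    methods that StatefulDataLoader looks for on a sampler."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._yielded = 0          # indices handed out in the current epoch
        self._next_yielded = None  # resume target set by load_state_dict

    def __iter__(self):
        self._yielded = 0
        it = super().__iter__()
        if self._next_yielded is not None:
            # Naive fast-forward: replay (and discard) the indices that
            # were already consumed before the checkpoint was taken.
            for _ in range(self._next_yielded):
                next(it)
            self._yielded = self._next_yielded
            self._next_yielded = None
        for index in it:
            self._yielded += 1
            yield index

    def state_dict(self):
        return {"epoch": self.epoch, "yielded": self._yielded}

    def load_state_dict(self, state_dict):
        self.set_epoch(state_dict["epoch"])
        self._next_yielded = state_dict["yielded"]
```

Restoring the epoch before iterating keeps the shuffle permutation identical across the checkpoint boundary, so skipping the first `yielded` indices lands exactly where training left off.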

Additional context

No response

@andrewkho added the enhancement (New feature or request) label on Jun 12, 2024
@andrewkho (Contributor)

This currently isn't broken, right? I.e. fast-forwarding the sampler will work, but may be inefficient. I'm OK either way with doing this before or after the release branch cut.

@ShoufaChen

Hi @gokulavasan @andrewkho,

I found that the current StatefulDataLoader works well with DistributedSampler without any modifications.

Would you mind explaining why it might be inefficient?

Thanks in advance.

@ShoufaChen commented Jul 8, 2024

> This currently isn't broken, right? I.e. fast-forwarding the sampler will work, but may be inefficient. I'm OK either way with doing this before or after the release branch cut.

Hi @andrewkho ,
Does fast-forwarding here mean that the sampler would iterate from the beginning up to the checkpointed position? If so, isn't that inefficient?
An efficient way would be to jump directly to the checkpointed position, right?

Please correct me if my understanding is wrong. Thank you.

@andrewkho (Contributor)

Hi @ShoufaChen, you're correct: it should work without modifications, but may be slow for large tables. Here is where we've done the conversion for RandomSampler and BatchSampler as examples: https://github.com/pytorch/data/blob/main/torchdata/stateful_dataloader/sampler.py#L47

You can see, for example, the default batch sampler calling next() to naively fast-forward the sampler.
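
In other words, the strategy looks roughly like the sketch below (illustrative only, not torchdata's actual code; the dataset size and resume point are made up):

```python
from torch.utils.data import BatchSampler, RandomSampler

# Naive fast-forward: rebuild the sampler, then discard the batches that
# were already consumed before the checkpoint. The cost of the skip loop
# grows linearly with the resume point.
sampler = BatchSampler(RandomSampler(range(10_000_000)), batch_size=32, drop_last=False)
it = iter(sampler)
batches_already_consumed = 50_000  # restored from a checkpoint
for _ in range(batches_already_consumed):
    next(it)  # O(resume point); a direct index jump would avoid this loop
# `it` now yields batch 50_000 onward, as if iteration had never stopped.
```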

Here's an example where you can see that increasing the number of samples to iterate through increases the time required to fast-forward; when you get to very large scales (e.g. billion scale), this slows down to the order of minutes: https://colab.research.google.com/drive/1UlJAMqzaCjtbW4RPaaoHxGd9sjiKFk7O?usp=sharing
