Add DistributedGeoSamplers #305

Open
RitwikGupta opened this issue Dec 19, 2021 · 9 comments
Comments

@RitwikGupta
Contributor

set_epoch is used by a lot of codebases in an effort to be deterministic.
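
For context, a minimal sketch of the pattern those codebases typically follow, based on the DistributedSampler docs (train_dataset and num_epochs are placeholders):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    # Without this call, every epoch reuses the same shuffled ordering.
    sampler.set_epoch(epoch)
    for batch in loader:
        ...
```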

@adamjstewart
Collaborator

a lot of codebases

Can you give some examples? I would like to see how they implement this.

@adamjstewart adamjstewart added the samplers Samplers for indexing datasets label Dec 19, 2021
@adamjstewart adamjstewart added this to the 0.2.0 milestone Dec 19, 2021
@adamjstewart
Collaborator

Okay, so all of the examples above use torch.utils.data.distributed.DistributedSampler. The set_epoch feature was added in pytorch/pytorch#39628 with little explanation beyond what is in the docs/source code. The docs include this note:

In distributed mode, calling the set_epoch() method at the beginning of each epoch before creating the DataLoader iterator is necessary to make shuffling work properly across multiple epochs. Otherwise, the same ordering will be always used.

The set_epoch method docstring (which doesn't make it into the docs for some reason) contains:

Sets the epoch for this sampler. When shuffle=True, this ensures all replicas use a different random ordering for each epoch. Otherwise, the next iteration of this sampler will yield the same ordering.

TL;DR: So this feature is not used for reproducibility/determinism, but for randomness across replicas. It's unclear to me why PyTorch needs this to prevent subsequent iterations from having the same ordering. Until I understand that, I'm not sure whether we need this feature or not.

Our samplers definitely weren't created with distributed sampling in mind, but they should still work since our datasets don't have a finite length that needs to be subsampled (other than GridGeoSampler which would need special treatment). RandomGeoSampler and RandomBatchGeoSampler imply shuffle=True.

P.S. PyTorch's other (non-distributed) samplers have a generator arg that can be used to control random sampling. We may want to add this to our samplers as well.
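
For reference, this is the generator pattern in PyTorch's built-in samplers; the commented torchgeo line is a hypothetical signature, not the current API (dataset is a placeholder):

```python
import torch
from torch.utils.data import RandomSampler

g = torch.Generator()
g.manual_seed(0)

# PyTorch's non-distributed samplers accept a generator for reproducible sampling.
sampler = RandomSampler(dataset, generator=g)

# Hypothetical torchgeo equivalent if we added the arg (not implemented):
# sampler = RandomGeoSampler(dataset, size=256, length=1000, generator=g)
```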

@isaaccorley
Collaborator

isaaccorley commented Dec 21, 2021

PyTorch Lightning automatically wraps samplers as DistributedSamplers so we don't need to handle any of this since we use PL. You would only need to mess with this if you were rolling your own distributed training scripts. See https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#replace-sampler-ddp
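
A minimal sketch of that flag in use, assuming a Lightning version where replace_sampler_ddp exists (it defaults to True; model and datamodule are placeholders):

```python
import pytorch_lightning as pl

# With replace_sampler_ddp=True (the default), Lightning wraps the sampler
# in a DistributedSampler automatically when training with DDP.
trainer = pl.Trainer(gpus=2, strategy="ddp", replace_sampler_ddp=True)
trainer.fit(model, datamodule=datamodule)
```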

@adamjstewart
Collaborator

@RitwikGupta does PL satisfy your use case?

@RitwikGupta
Contributor Author

@adamjstewart Not entirely; I'd have to refactor this entire FB codebase into PyTorch Lightning, which would be a massive pain.

@isaaccorley
Collaborator

isaaccorley commented Dec 22, 2021

Adding a set_epoch method to our samplers wouldn't actually solve this. The links above use DistributedSampler, which splits up the dataset indices to be sampled across nodes/GPUs. I've done this for another project, but we would need to create our own modification of a DistributedSamplerWrapper, similar to this

Edit:
The source of torch.utils.data.DistributedSampler also provides some insight: set_epoch exists because the epoch is used to seed the shuffle of indices across nodes/GPUs.
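
A simplified paraphrase of that shuffling logic (padding and drop_last handling omitted; not the actual implementation):

```python
from typing import List

import torch

def distributed_indices(dataset_len: int, rank: int, num_replicas: int,
                        epoch: int, seed: int = 0) -> List[int]:
    # This is why set_epoch matters: the epoch feeds the shuffle seed, so each
    # epoch produces a new ordering, and all replicas shuffle identically.
    g = torch.Generator()
    g.manual_seed(seed + epoch)
    indices = torch.randperm(dataset_len, generator=g).tolist()
    # Each replica keeps every num_replicas-th index, offset by its rank.
    return indices[rank::num_replicas]
```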

@RitwikGupta
Contributor Author

RitwikGupta commented Dec 22, 2021

Right, I should clarify that set_epoch isn't the fix here; the larger fix is to create a distributed wrapper for the existing samplers (rough sketch below). This issue just files the symptom.
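
A hypothetical sketch of what such a wrapper could look like; the name, signatures, and details are illustrative only and not part of torchgeo:

```python
from typing import Iterator, Optional

import torch.distributed as dist
from torch.utils.data import Sampler

class DistributedSamplerWrapper(Sampler):
    """Shard any sampler's output across replicas (illustrative sketch).

    Assumes every rank draws the same sequence from the wrapped sampler
    (i.e., identical seeding); otherwise the shards would overlap.
    """

    def __init__(self, sampler: Sampler, num_replicas: Optional[int] = None,
                 rank: Optional[int] = None) -> None:
        self.sampler = sampler
        self.num_replicas = num_replicas or dist.get_world_size()
        self.rank = rank if rank is not None else dist.get_rank()
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Forwarded so training loops that call set_epoch don't fail; a real
        # implementation would use the epoch to reseed the wrapped sampler.
        self.epoch = epoch

    def __iter__(self) -> Iterator:
        # Materialize the wrapped sampler's output (indices, or bounding
        # boxes for GeoSamplers) and keep only this rank's share.
        items = list(self.sampler)
        return iter(items[self.rank::self.num_replicas])

    def __len__(self) -> int:
        return len(self.sampler) // self.num_replicas
```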

@adamjstewart adamjstewart changed the title 'RandomGeoSampler' object has no attribute 'set_epoch' Add DistributedGeoSamplers Dec 28, 2021
@adamjstewart adamjstewart modified the milestones: 0.2.0, 0.3.0 Jan 1, 2022
@adamjstewart adamjstewart modified the milestones: 0.3.0, 0.4.0 Jul 9, 2022
@adamjstewart adamjstewart removed this from the 0.4.0 milestone Dec 6, 2022
@dylanrstewart
Contributor

I think this is still an issue. I am unable to train on multiple GPUs when leveraging a RandomBatchGeoSampler: AttributeError: 'RandomBatchGeoSampler' object has no attribute 'drop_last'
