Feature request
Enable use of IterableDataset when training with NeuronTrainer and DDP. Or is there a design limitation that prevents this?
I can't share the project code, but the minimal case below (issue.py) reproduces the same issue: DistributedSampler expects a dataset with a known length, which an IterableDataset does not have by design.
Setup
OS: Ubuntu 22.04.4 LTS (kernel 6.5.0-1023-aws)
apt packages
pip packages
Command:
torchrun --nproc_per_node=2 issue.py
Code (issue.py)
Issue
Traceback (most recent call last):
File "/home/ubuntu/issue.py", line 29, in <module>
result = trainer.train()
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1414, in train
result = super().train(
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/optimum/neuron/utils/require_utils.py", line 51, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 686, in _inner_training_loop
train_dataloader = self.get_train_dataloader()
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/transformers/trainer.py", line 897, in get_train_dataloader
return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/accelerate/accelerator.py", line 1274, in prepare
result = tuple(
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/accelerate/accelerator.py", line 1275, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/accelerate/accelerator.py", line 1149, in _prepare_one
return self.prepare_data_loader(obj, device_placement=device_placement)
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/optimum/neuron/accelerate/accelerator.py", line 223, in prepare_data_loader
data_loader = self._prepare_data_loader_for_distributed(
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/optimum/neuron/accelerate/accelerator.py", line 191, in _prepare_data_loader_for_distributed
sampler = DistributedSampler(data_loader.dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)
File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/torch/utils/data/distributed.py", line 91, in __init__
self.num_samples = math.ceil(len(self.dataset) / self.num_replicas) # type: ignore[arg-type]
TypeError: object of type 'CustomDataset' has no len()
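For reference, the failure can be reproduced outside the trainer as well. Below is a minimal sketch (not the original issue.py; CustomDataset here is just an illustrative stand-in): DistributedSampler calls len(dataset) in its constructor, which an IterableDataset without __len__ cannot satisfy.

```python
# Minimal sketch, not the original issue.py: DistributedSampler needs
# len(dataset) in __init__, so an IterableDataset without __len__ fails.
import torch
from torch.utils.data import IterableDataset
from torch.utils.data.distributed import DistributedSampler

class CustomDataset(IterableDataset):
    """Streams samples one at a time; deliberately defines no __len__."""
    def __iter__(self):
        for i in range(8):
            yield {"input_ids": torch.tensor([i])}

# Raises: TypeError: object of type 'CustomDataset' has no len()
DistributedSampler(CustomDataset(), num_replicas=2, rank=0, shuffle=False)
```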
Motivation
I have a project for distributed training on Trainium with DDP that requires Hugging Face's IterableDataset (returned when streaming=True is passed to load_dataset() in load.py of the datasets==2.19.0 package).
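For context, a short sketch of how such a dataset arises (the dataset name is only an illustration, not from the project):

```python
# Sketch: with streaming=True, load_dataset returns an IterableDataset
# rather than a map-style Dataset with a known length.
from datasets import IterableDataset, load_dataset

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", streaming=True)
assert isinstance(ds, IterableDataset)  # iteration only, no __len__
```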
Your contribution
N/A. I did notice that on NVIDIA A100 GPUs (with the transformers Trainer) this case goes through accelerate.data_loader.DataLoaderDispatcher and does not use DistributedSampler.
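To illustrate that DDP over an IterableDataset does not fundamentally require a DistributedSampler, here is a hypothetical sketch (my own assumption, not something from the optimum-neuron docs, and not a working workaround for NeuronTrainer today) where each rank simply consumes its own shard of the stream:

```python
# Hypothetical sketch: shard a streaming dataset per rank so each DDP worker
# iterates over a disjoint slice, instead of relying on DistributedSampler
# (which requires a dataset length).
import os
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

stream = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", streaming=True)
stream = split_dataset_by_node(stream, rank=rank, world_size=world_size)
```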
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!