
Issue with running multiple models in PyTorch Lightning #2807

Closed
tioans opened this issue Aug 3, 2020 · 9 comments · Fixed by #2997
Assignees: awaelchli
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments


tioans commented Aug 3, 2020

🐛 Bug

I am developing a system which needs to train dozens of individual models (>50) using Lightning, each with its own TensorBoard plots and logs. My current implementation has one Trainer object per model, and I run into an error once I go over roughly 90 Trainer objects. Interestingly, the error only appears when I run the .test() method, not during .fit().

As I have just started with Lightning, I am not sure whether having one Trainer per model is the best approach. However, I require individual plots from each model, and it seems that if I use a single Trainer for multiple models the results get overwritten.

To Reproduce

Steps to reproduce the behaviour:

1. Define more than 90 Trainer objects, each with their own model.
2. Run training for each model.
3. Run testing for each model.
4. See error

Traceback (most recent call last):
  File "lightning/main_2.py", line 193, in <module>
    main()
  File "lightning/main_2.py", line 174, in main
    new_trainer.test(model=new_model, test_dataloaders=te_loader)
  File "\Anaconda3\envs\pysyft\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1279, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "\Anaconda3\envs\pysyft\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1343, in __test_given_model
    self.set_random_port(force=True)
  File "\Anaconda3\envs\pysyft\lib\site-packages\pytorch_lightning\trainer\distrib_data_parallel.py", line 398, in set_random_port
    default_port = RANDOM_PORTS[-1]
IndexError: index -1 is out of bounds for axis 0 with size 0

Code sample

Defining the Trainer objects:

for i in range(args["num_users"]):
    trainer_list_0.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                          fast_dev_run=args["fast_dev_run"], weights_summary=None))
    trainer_list_1.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                            fast_dev_run=args["fast_dev_run"], weights_summary=None))
    trainer_list_2.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                            fast_dev_run=args["fast_dev_run"], weights_summary=None))

Training:

for i in range(args["num_users"]):
    trainer_list_0[i].fit(model_list_0[i], train_dataloader=dataloader_list[i],
                                      val_dataloaders=val_loader)
    trainer_list_1[i].fit(model_list_1[i], train_dataloader=dataloader_list[i],
                                        val_dataloaders=val_loader)
    trainer_list_2[i].fit(model_list_2[i], train_dataloader=dataloader_list[i],
                                        val_dataloaders=val_loader)

Testing:

for i in range(args["num_users"]):
    trainer_list_0[i].test(test_dataloaders=te_loader)
    trainer_list_1[i].test(test_dataloaders=te_loader)
    trainer_list_2[i].test(test_dataloaders=te_loader)

Expected behaviour

I expected the code to work without crashing.

Environment

  • PyTorch Version (e.g., 1.0): 1.4
  • OS (e.g., Linux): Windows 10 Pro 2004
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.7.6
  • CUDA/cuDNN version: CUDA 10.1/cuDNN 7.0
  • GPU models and configuration: RTX 2060 Super
tioans added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Aug 3, 2020

github-actions bot commented Aug 3, 2020

Hi! Thanks for your contribution, great first issue!

awaelchli self-assigned this on Aug 3, 2020

awaelchli (Contributor) commented:

Thanks for the report. Will fix it in #2790, as it is directly related.


tioans commented Aug 3, 2020

That's great, thanks. Looking forward to getting the patch!

awaelchli (Contributor) commented:

@epiicme Just a quick follow-up on why this issue got closed:
In your case, you are calling trainer.fit() multiple times, or even instantiating Trainer multiple times. In DDP mode this cannot work, since the Trainer will call the same script multiple times. When this happens, we have no way of controlling which trainer.fit() is executed. I added a note in the docs.

For your use case it means (see the sketch after this list):

  • if you want to call trainer.fit() multiple times, use distributed_backend="ddp_spawn"
  • if you want to use distributed_backend="ddp", you must make sure your script only calls trainer.fit once (or trainer.test)

It is a tradeoff between these two backends, both have their advantages and disadvantages, as outlined in the docs.
Hope this helps you!
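To make the tradeoff concrete, here is a minimal sketch of the two options. The names models (a list of LightningModules), single_model, train_loader, val_loader, and test_loader are hypothetical and not taken from this issue:

from pytorch_lightning import Trainer

# Option 1: ddp_spawn launches its worker processes from the already-running
# script, so the script itself executes only once and repeated fit()/test()
# calls are safe.
for model in models:
    trainer = Trainer(gpus=1, distributed_backend="ddp_spawn", max_epochs=5)
    trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
    trainer.test(model, test_dataloaders=test_loader)

# Option 2: ddp re-launches the whole script in subprocesses (one per GPU),
# so the script must contain exactly one fit() (or test()) call.
trainer = Trainer(gpus=2, distributed_backend="ddp", max_epochs=5)
trainer.fit(single_model, train_dataloader=train_loader, val_dataloaders=val_loader)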

MarioIshac (Contributor) commented:

* if you want to call trainer.fit() multiple times, use distributed_backend="ddp_spawn"

* if you want to use distributed_backend="ddp", you must make sure your script only calls trainer.fit once (or trainer.test)

It is a tradeoff between these two backends, both have their advantages and disadvantages, as outlined in the docs.
Hope this helps you!

I was running into the same error as the OP in 0.8.5 when going with the first option in "Using multiple trainers vs. single trainer if max_epochs needs to change". After upgrading to 0.9.0rc16 the issue looks like it went away for now, but I want to confirm what's going on:

I'm not passing any distributed_backend to Trainer, so it ends up being None. Did "ddp fix for trainer.test() + add basic ddp tests" address this issue only for ddp mode, or also for the case of no distributed backend?

awaelchli (Contributor) commented:

Only for ddp mode. But distributed_backend defaults to ddp_spawn if you run multi-GPU (single node), so that should not be affected by this issue. Does that answer your question?

MarioIshac (Contributor) commented:

I am running single node / single GPU (passing gpus=1 to Trainer); would the default backend for that be affected?

awaelchli (Contributor) commented:

No, in this case the backend is nothing special. For running on a single GPU, we don't need to do any extra work beyond putting the tensors on that device. That's why you see distributed_backend=None: there is nothing distributed about running on one GPU.
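A minimal single-GPU sketch of what this looks like, assuming a hypothetical LitModel LightningModule and the same hypothetical loader names as in the sketch above (none of these names come from this issue):

from pytorch_lightning import Trainer

# No distributed_backend is passed: with gpus=1, Lightning simply moves the
# model and each batch onto that device. No process group or random port is
# set up, so this code path is not affected by the RANDOM_PORTS error above.
model = LitModel()
trainer = Trainer(gpus=1, max_epochs=5)
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
trainer.test(model, test_dataloaders=test_loader)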

guochengqian commented:

Hi guys, if using multiple trainers, how do you set the starting epoch for the subsequent trainers so that TensorBoard does not get messed up?
E.g., if the first trainer trains for 5 epochs, then the second trainer should start from epoch 5, not 0.
Any suggestions?
