
Issue with running multiple models in PyTorch Lightning #2807

Closed
tioans opened this issue Aug 3, 2020 · 9 comments · Fixed by #2997
Assignees: awaelchli
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments


tioans commented Aug 3, 2020

🐛 Bug

I am developing a system which needs to train dozens of individual models (>50) using Lightning, each with its own TensorBoard plots and logs. My current implementation has one Trainer object per model, and I run into an error once I go over roughly 90 Trainer objects. Interestingly, the error only appears when I run the .test() method, not during .fit().

As I have just started with Lightning, I am not sure whether having one Trainer per model is the best approach. However, I require individual plots from each model, and it seems that if I use a single Trainer for multiple models the results get overwritten.

To Reproduce

Steps to reproduce the behaviour:

1. Define more than 90 Trainer objects, each with their own model.
2. Run training for each model.
3. Run testing for each model.
4. See error

Traceback (most recent call last):
  File "lightning/main_2.py", line 193, in <module>
    main()
  File "lightning/main_2.py", line 174, in main
    new_trainer.test(model=new_model, test_dataloaders=te_loader)
  File "\Anaconda3\envs\pysyft\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1279, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "\Anaconda3\envs\pysyft\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1343, in __test_given_model
    self.set_random_port(force=True)
  File "\Anaconda3\envs\pysyft\lib\site-packages\pytorch_lightning\trainer\distrib_data_parallel.py", line 398, in set_random_port
    default_port = RANDOM_PORTS[-1]
IndexError: index -1 is out of bounds for axis 0 with size 0

Code sample

Defining the Trainer objects:

for i in range(args["num_users"]):
    trainer_list_0.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                          fast_dev_run=args["fast_dev_run"], weights_summary=None))
    trainer_list_1.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                            fast_dev_run=args["fast_dev_run"], weights_summary=None))
    trainer_list_2.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                            fast_dev_run=args["fast_dev_run"], weights_summary=None))

Training:

for i in range(args["num_users"]):
    trainer_list_0[i].fit(model_list_0[i], train_dataloader=dataloader_list[i],
                                      val_dataloaders=val_loader)
    trainer_list_1[i].fit(model_list_1[i], train_dataloader=dataloader_list[i],
                                        val_dataloaders=val_loader)
    trainer_list_2[i].fit(model_list_2[i], train_dataloader=dataloader_list[i],
                                        val_dataloaders=val_loader)

Testing:

for i in range(args["num_users"]):
    trainer_list_0[i].test(test_dataloaders=te_loader)
    trainer_list_1[i].test(test_dataloaders=te_loader)
    trainer_list_2[i].test(test_dataloaders=te_loader)

Expected behaviour

I expected the code to work without crashing.

Environment

  • PyTorch Version (e.g., 1.0): 1.4
  • OS (e.g., Linux): Windows 10 Pro 2004
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.7.6
  • CUDA/cuDNN version: CUDA 10.1/cuDNN 7.0
  • GPU models and configuration: RTX 2060 Super
tioans added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Aug 3, 2020

github-actions bot commented Aug 3, 2020

Hi! Thanks for your contribution, great first issue!

awaelchli self-assigned this on Aug 3, 2020

awaelchli (Contributor) commented:

Thanks for the report. Will fix it in #2790, as it is directly related.


tioans commented Aug 3, 2020

That's great, thanks. Looking forward to getting the patch!

awaelchli (Contributor) commented:

@epiicme Just a quick follow-up on why this issue got closed:
In your case, you are calling trainer.fit() multiple times, or even instantiating Trainer multiple times. In DDP mode this cannot work, since the Trainer will call the same script multiple times. When this happens, we have no way of controlling which trainer.fit() is executed. I added a note in the docs.

For your use case it means (see the sketch after this list):

  • if you want to call trainer.fit() multiple times, use distributed_backend="ddp_spawn"
  • if you want to use distributed_backend="ddp", you must make sure your script only calls trainer.fit once (or trainer.test)

It is a tradeoff between these two backends, both have their advantages and disadvantages, as outlined in the docs.
Hope this helps you!
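To make the tradeoff concrete, here is a minimal sketch of the two options. The names models (a list of LightningModules), single_model, train_loader, val_loader, and test_loader are hypothetical and not taken from this issue:

from pytorch_lightning import Trainer

# Option 1: ddp_spawn launches its worker processes from the already-running
# script, so the script itself executes only once and repeated fit()/test()
# calls are safe.
for model in models:
    trainer = Trainer(gpus=1, distributed_backend="ddp_spawn", max_epochs=5)
    trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
    trainer.test(model, test_dataloaders=test_loader)

# Option 2: ddp re-launches the whole script in subprocesses (one per GPU),
# so the script must contain exactly one fit() (or test()) call.
trainer = Trainer(gpus=2, distributed_backend="ddp", max_epochs=5)
trainer.fit(single_model, train_dataloader=train_loader, val_dataloaders=val_loader)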

MarioIshac (Contributor) commented:

* if you want to call trainer.fit() multiple times, use distributed_backend="ddp_spawn"

* if you want to use distributed_backend="ddp", you must make sure your script only calls trainer.fit once (or trainer.test)

It is a tradeoff between these two backends, both have their advantages and disadvantages, as outlined in the docs.
Hope this helps you!

I was running into the same error as the OP in 0.8.5 when going with the first option in "Using multiple trainers vs. single trainer if max_epochs needs to change". After upgrading to 0.9.0rc16 the issue looks like it went away for now, but I want to confirm what's going on:

I'm not passing any distributed_backend to Trainer, so it ends up being None. Did "ddp fix for trainer.test() + add basic ddp tests" address this issue only for ddp mode, or also for the case of no distributed backend?

awaelchli (Contributor) commented:

Only for ddp mode. But distributed_backend defaults to ddp_spawn if you run multi-GPU (single node), so that should not be affected by this issue. Does that answer your question?

MarioIshac (Contributor) commented:

I am running single node / single GPU (passing gpus=1 to Trainer); would the default backend for that be affected?

awaelchli (Contributor) commented:

No, in this case the backend is nothing special. For running on a single GPU, we don't need to do any extra work beyond putting the tensors on that device. That's why you see distributed_backend=None: there is nothing distributed about running on one GPU.
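A minimal single-GPU sketch of what this looks like, assuming a hypothetical LitModel LightningModule and the same hypothetical loader names as in the sketch above (none of these names come from this issue):

from pytorch_lightning import Trainer

# No distributed_backend is passed: with gpus=1, Lightning simply moves the
# model and each batch onto that device. No process group or random port is
# set up, so this code path is not affected by the RANDOM_PORTS error above.
model = LitModel()
trainer = Trainer(gpus=1, max_epochs=5)
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
trainer.test(model, test_dataloaders=test_loader)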

guochengqian commented:

Hi guys, if using multiple trainers, how do you set the starting epoch for the subsequent trainers so that TensorBoard does not get messed up?
E.g., if the first trainer trains for 5 epochs, then the second trainer should start from epoch 5, not 0.
Any suggestions?
