Issue with running multiple models in PyTorch Lightning #2807
Comments
Hi! Thanks for your contribution, and great first issue!
Thanks for the report. Will fix it in #2790, as it is directly related.
That's great, thanks. Looking forward to the patch!
@epiicme Just a quick follow-up on why this issue was closed. For your use case, it means: it is a trade-off between the two backends; both have their advantages and disadvantages, as outlined in the docs.
I was running into the same error as the OP. I'm not passing any
Only for ddp mode. But distributed_backend defaults to ddp_spawn if you run multi-GPU (single node), so that should not be affected by this issue. Does that answer your question?
I am running single node / single GPU (passing
No, in this case the backend is nothing special. For running on a single GPU, we don't need to do any extra work beyond putting the tensors on that device. That's why you see distributed_backend=None: there is nothing distributed about running on one GPU.
Hi guys, if using multiple trainers, how do you set the starting epoch for the subsequent trainers so that TensorBoard does not get messed up?
🐛 Bug
I am developing a system which needs to train dozens of individual models (>50) using Lightning, each with their own TensorBoard plots and logs. My current implementation has one Trainer object per model and it seems like I'm running into an error when I go over ~90 Trainer objects. Interestingly, the error only appears when I run the .test() method, not during .fit().
As I just started with Lightning, I am not sure whether one Trainer per model is the best approach. However, I require individual plots from each model, and it seems that if I use a single Trainer for multiple models, the results get overwritten.
To Reproduce
Steps to reproduce the behaviour:
1. Define more than 90 Trainer objects, each with their own model.
2. Run training for each model.
3. Run testing for each model.
4. See error
Code sample
Defining the Trainer objects:
Training:
Testing:
Expected behaviour
I expected the code to work without crashing.
Environment
How you installed PyTorch (conda, pip, source): conda