When using neptune logger and multi-gpu DDP, duplicate runs are logged, one for each gpu. #10604
Comments
let me check it
I have the same problem
Same issue here, blocking upgrade to latest PL version. @kamil-kaczmarek I guess the problem is that
We are double checking this, but it seems that you are exactly right 👍 Thanks
Whew! I was planning to open this issue a month or so ago! I was facing the same issue, along with a couple of other problems, on the same dataset. Thought I'd share my libraries and issues in case it provides extra information.
- Libraries:
- Code:
- Dataset size: ~8.5 million. Everything works here on a smaller dataset! With the newer dataset the model kept blowing up to a NaN/inf error, maybe because of ddp_sharded?
- Library update discussed on Slack: Torch + Lightning
- Issues:
Facing issues with -
Code works, but there is an issue with the Neptune client:
-> Discussed on Slack and updated Lightning to the latest version:
- Issues:
Also, I can't kill the tasks with Ctrl+C. Neptune has updated their library and now provides another API to use as well, so there are some inconsistencies between the Lightning docs for Neptune and the Neptune website's docs for the Lightning integration. For the code I've been working on for so long, the library versions that work well for me are:
And using: and in Trainer:
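(Not from the comment above, just context on the "two APIs" point: neptune-client's legacy API and its newer `neptune.new` API are initialized differently, which is why docs written against different versions disagree. A rough sketch; treat the exact call and keyword names as assumptions to check against the neptune-client version you have installed:)

```python
# Legacy neptune-client API, used by older NeptuneLogger integrations:
import neptune
neptune.init(project_qualified_name="workspace/project", api_token="<TOKEN>")

# Newer API (neptune.new), used by the PL 1.5-era NeptuneLogger:
import neptune.new as neptune
run = neptune.init(project="workspace/project", api_token="<TOKEN>")
```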
Yes, I can confirm: the run instance should be created lazily, the first time the `experiment` property is accessed:

```python
class AnyLogger(Logger):
    def __init__(self, ...):
        self._run_instance = None

    @property
    def experiment(self):
        if self._run_instance is None:
            # create it on first access only
            self._run_instance = self._init_run_instance(api_key, project, name, run, neptune_run_kwargs)
        return self._run_instance
```

This way we can prevent experiments from being created on multiple processes when the user creates a logger:

```python
logger = AnyLogger(...)
trainer = Trainer(logger=logger, ...)
```

Since the above code runs N times in a distributed setting with N devices, the logger instantiation should 1) not create any files or send messages to the server, 2) not create any unpicklable attributes, and 3) not print any messages. All of these actions need to be delayed, as shown above.
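(Not from the thread, but for context: Lightning's built-in loggers pair this lazy creation with rank-zero guards so that only the main process ever talks to the logging backend. A minimal sketch of that pattern for a custom logger, assuming a PL 1.5.x install; `MyLazyLogger` and the dict "backend" are hypothetical, while `rank_zero_experiment` and `rank_zero_only` are the decorators Lightning provides:)

```python
from pytorch_lightning.loggers.base import LightningLoggerBase, rank_zero_experiment
from pytorch_lightning.utilities import rank_zero_only


class MyLazyLogger(LightningLoggerBase):
    """Hypothetical logger: nothing touches the backend until `experiment` is read on rank 0."""

    def __init__(self, project: str):
        super().__init__()
        self._project = project      # store only plain, picklable config in __init__
        self._run_instance = None    # created lazily, never in __init__

    @property
    @rank_zero_experiment           # non-zero ranks get a no-op dummy experiment
    def experiment(self):
        if self._run_instance is None:
            # placeholder for the real backend call (e.g. creating a Neptune run)
            self._run_instance = {"project": self._project, "metrics": []}
        return self._run_instance

    @rank_zero_only
    def log_metrics(self, metrics, step=None):
        self.experiment["metrics"].append((step, metrics))

    @rank_zero_only
    def log_hyperparams(self, params):
        self.experiment["params"] = params

    @property
    def name(self):
        return "my_lazy_logger"

    @property
    def version(self):
        return "0"
```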
We are fixing it here: neptune-ai#8
Hi @ankitvad and @ppaysan, Prince Canuma here, a Data Scientist at Neptune.ai. I want to personally share the good news: this issue is now fixed in the latest release of PyTorch Lightning, v1.5.7 🎊 🥳 All you need to do is upgrade the library to the latest release 👍 Happy Christmas and a prosperous New Year in advance!
🐛 Bug
When using the Neptune logger and a Trainer with n_gpu > 1 and accelerator='ddp', duplicate experiments are created and logged, one for each individual GPU.
To Reproduce
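(The original reproduction script is not preserved above. As a rough sketch only, not the reporter's exact code: a setup along these lines, with the Neptune logger and a multi-GPU DDP Trainer on PL 1.5-era APIs, matches the described behavior. The model, data, `gpus=4`, and the placeholder token/project are assumptions.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.loggers import NeptuneLogger


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    logger = NeptuneLogger(
        api_key="<YOUR_NEPTUNE_API_TOKEN>",  # replace with a real token and project
        project="workspace/project",
    )
    trainer = pl.Trainer(
        max_epochs=1,
        gpus=4,
        accelerator="ddp",  # each of the 4 DDP processes ends up creating its own Neptune run
        logger=logger,
    )
    trainer.fit(TinyModel(), DataLoader(dataset, batch_size=32))
```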
Output:
Expected behavior
A single run should be logged (as happens when gpus=1 and/or the DDP accelerator is turned off), not four of them...
Environment
- How you installed PyTorch (conda, pip, source): pip

cc @tchaton