Hydra configs with multi GPU DDP training in Pytorch Lightning #2727
Comments
I've just found out that the process local rank can be accessed. Nevertheless, it's not possible to delete the Hydra base directory from a hydra.main-decorated function, which is weird. |
You could configure the Hydra run dir via the command line or your config file to be whatever you want it to be, see this. However, I think the right approach to DDP training with Hydra is to use multirun. With multirun, each running script gets its own subdirectory under the primary working directory by design and not by accident. |
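For readers following along, here is a minimal sketch of what this looks like in practice; it is not code from the thread, and the script name `my_app.py` and the `+seed` override are made up for illustration.

```python
import os

import hydra
from omegaconf import DictConfig


@hydra.main(config_path=None, config_name=None)
def main(cfg: DictConfig) -> None:
    # Hydra changes the working directory per run/job, so each job
    # ends up writing into its own directory.
    print("working dir:", os.getcwd())


if __name__ == "__main__":
    main()

# Single run with an explicit run dir:
#   python my_app.py hydra.run.dir=outputs/my_fixed_dir
# Multirun: each job gets its own numbered subdirectory under the sweep dir:
#   python my_app.py --multirun +seed=1,2,3
```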
I tried to use the --multirun option |
I am not familiar with the difference between ddp and ddp-spawn in PL. Are you using the Submitit Launcher plugin to run it on SLURM? |
@rakhimovv when you use |
Thanks for the clarification @yukw777. I got it, I misunderstood at first. In my comment above I meant the situation when I try to run several experiments using

@yukw777, do I understand correctly that running

@omry, no plugin for SLURM. I did not have a chance to check Submitit or PL's SlurmCluster object. I used just a plain sbatch script. There are two options I tried:
The main problem here is that setting |
By ntask you mean the sbatch parameter? I have no intention of supporting it. If you want to use sbatch, you are on your own, at least from my perspective. Try with the Submitit plugin and we can discuss further. |
@omry I'm trying to think through what you are suggesting here and what Lightning does. Lightning currently handles multiple GPUs per node by launching a subprocess for each additional GPU on the node using the

Would you suggest that the config for each node be a multirun job across the available GPUs (ntasks == ngpus)? If we launch with the submitit launcher with command

I'll think about this because I would like to find a simple approach to this issue. I don't think Lightning can support choosing the GPU for a job. I ran into this issue using |
@rakhimovv sorry for the late reply, but yes you're correct, currently they don't work together. |
Here's an example script that outlines the Hydra/Lightning issue with

This example simulates how Lightning spawns a process on a node with 2 GPUs (spawns one process along with the main process). You can see how

Also, if you updated the example to run

```python
import os
import subprocess
import sys

import hydra
from omegaconf import DictConfig


def spawner(cfg):
    # Relaunch the current command as a child process with LOCAL_RANK set,
    # the way Lightning spawns one process per additional GPU.
    command = sys.argv
    full_path = hydra.utils.to_absolute_path(command[0])
    command[0] = full_path
    command = [sys.executable] + command
    cwd = hydra.utils.get_original_cwd()
    env_copy = os.environ.copy()
    env_copy['LOCAL_RANK'] = '1'
    proc = subprocess.Popen(command, env=env_copy, cwd=cwd)


def objective(cfg):
    if 'LOCAL_RANK' in os.environ:
        print('bar')
    else:
        print('foo')
        spawner(cfg)


@hydra.main(config_path='.', config_name='argv.yaml')
def main(cfg: DictConfig):
    objective(cfg)


if __name__ == '__main__':
    main()
```

Here is the test: 1 |
Yes, it is an issue.
I think for multiprocessing you need to treat the application as a single run (not multirun) and let PL do the multiprocessing. Hydra can set environment variables for jobs, see this. |
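The environment-variable mechanism referred to here is presumably Hydra's `hydra.job.env_set`; the sketch below is hypothetical (the variable name `MY_FLAG` is made up) and only illustrates the idea.

```python
import os

import hydra
from omegaconf import DictConfig


@hydra.main(config_path=None, config_name=None)
def main(cfg: DictConfig) -> None:
    # If the job config sets hydra.job.env_set, the variable is visible here,
    # e.g. when launched as:  python app.py +hydra.job.env_set.MY_FLAG=1
    print("MY_FLAG =", os.environ.get("MY_FLAG"))


if __name__ == "__main__":
    main()
```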
So it actually works out great to just have a configuration for submitit. For my example above, if you call |
@jgbos To clarify, the solution is to use submitit, but not with the Hydra submitit plugin or Hydra

Also, does this mean there is currently no solution in the case that Slurm is not being used (outside of @omry's suggestion to go down one level of abstraction and deal with Torch distributed ourselves)? |
Sorry, but I don't have an example for DDP with Hydra. In the meantime, try to get help from people who have been successful using PL DDP with Hydra. I think you can find a few on this issue. |
@AlexSchuy I'm still trying to figure out the best options, but there are two steps I take to ensure Lightning DDP works. First I modify

```python
if distributed_backend == 'ddp':
    cwd = os.getcwd()
    # Strip the original Hydra overrides and pin the child processes to the
    # parent's run dir with Hydra logging disabled.
    sys.argv = sys.argv[:1]
    sys.argv.extend([
        f"hydra.run.dir={cwd}",
        "hydra/hydra_logging=disabled",
        "hydra/job_logging=disabled",
    ])
    # Re-apply the original overrides, except the sweeper/launcher ones.
    overrides = OmegaConf.load('.hydra/overrides.yaml')
    for o in overrides:
        if 'hydra/sweeper' in o:
            continue
        if 'hydra/launcher' in o:
            continue
        sys.argv.append(o)
```

For launching via

```python
job = executor.submit(train, *args, **kwargs)
```
|
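For anyone wondering where `executor` comes from, here is a hedged sketch of a typical submitit setup; the parameter values and the `train` placeholder are illustrative, not taken from the thread.

```python
import submitit


def train(cfg):
    # Placeholder task function; in the thread this would be the
    # Hydra-decorated training entry point (or a wrapper around it).
    print("training with", cfg)


executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(
    timeout_min=60,
    slurm_partition="gpu",
    gpus_per_node=2,
    tasks_per_node=1,
    cpus_per_task=8,
)

job = executor.submit(train, {"lr": 1e-3})
print("submitted:", job.job_id)
```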
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team! |
@jgbos thanks for sharing your temporary solution. |
If anyone could provide a small example of how to use PL with Hydra and submitit (with or without the plugin), that would be great! |
@mees and @lukashermann. I don't have a nice simple solution to copy and paste into an issue. But here's the gist of how I have gotten things to work:
This treats each execution of your code as a single task (none of the subprocess spawning that Lightning does by default). Once all tasks are running

I recommend this path as it removes any special processing of

Edit: Looking at Lightning's latest, it looks like they may have an accelerator that behaves this way already: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/accelerators/ddp_hpc_accelerator.py |
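For context on the "one process per task" pattern and the HPC accelerator linked above: each task can derive its ranks from the standard SLURM environment variables instead of having PL spawn children. A rough sketch (not code from the thread):

```python
import os

# Under srun/sbatch with ntasks == total number of GPUs, SLURM exports one set
# of these variables per task, so Lightning does not need to spawn anything.
global_rank = int(os.environ.get("SLURM_PROCID", "0"))
local_rank = int(os.environ.get("SLURM_LOCALID", "0"))
world_size = int(os.environ.get("SLURM_NTASKS", "1"))

print(f"global_rank={global_rank} local_rank={local_rank} world_size={world_size}")
```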
Hi! Sorry to bring up a stale issue, is there any update on this? I am currently able to use Hydra with DDP; it's the combination hydra-multirun + lightning-DDP that's still not functioning properly, generating extra folders for child processes. If there's something to work with, even a workaround, I'm willing to try, the multirun option is quite awesome! |
To work around the extra folders, I did the following. At the very beginning of the program (of each process), I check if an env variable called |
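The variable name is cut off above; the following is a hypothetical reconstruction of the idea, assuming the check is on `LOCAL_RANK` (which PL sets for the DDP child processes it spawns).

```python
import os
import sys

# Must run before @hydra.main parses sys.argv. If this is a spawned child
# process, reuse the parent's run dir and skip creating a new .hydra subdir.
if os.environ.get("LOCAL_RANK") is not None:
    sys.argv += [
        f"hydra.run.dir={os.getcwd()}",
        "hydra.output_subdir=null",
    ]
```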
A fix that works for me is to update the PL code starting here to:

```python
if _HYDRA_AVAILABLE:
    if HydraConfig.initialized():
        cwd = get_original_cwd()
        os_cwd = f'"{os.getcwd()}"'  # quote the path so spaces survive the shell
        # Keep only the executable and script path, then point the child at the
        # parent's already-composed config in the .hydra subdirectory.
        command = command[:2]
        command += ["-cp", str(Path.cwd().relative_to(Path(cwd)) / ".hydra"), "-cn", "config.yaml"]
        command += [f"hydra.run.dir={os_cwd}", f"hydra.job.name=train_ddp_process_{local_rank}"]
```
|
If you also want it to work for
But in general this does not work with Hydra + Submitit + PL with DDP, since the command there is submit.py and you cannot pass the Hydra config to it. If I understood it correctly, @jgbos, the solution you provided works when you force PL to use only 1 GPU and therefore you can only use 1 GPU per task. Am I right?
I would like to run 1 task per node and N GPUs per task (1 node, 1 task, N GPUs) using Hydra (--multirun) + Submitit (gpus_per_node=N or gpus_per_task=N) and PL with DDP (to handle the multiprocessing for the N GPUs per task/node). As far as I understood this option is not working right now. Is there any workaround for this or should I write my own sweeper/launcher and avoid --multirun + submitit? |
Yes, I wasn't using that feature, your solution is correct
No, this works for multi-GPU. It runs using the Hydra config in the current experiment, so it will launch the process correctly for each rank. There are several caveats to this though, mostly related to complex task functions and multirun (you need to destroy distributed processes and remove PL-related environment variables for multirun to work). By no means is this solution robust. In most cases you should just write your own custom strategy, which is pretty easy to do.
For submitit (local or Slurm) you should set |
Thank you, I did not get it at the beginning. I will give it a try! Unfortunately in my case there is a reason for that: my time on the clusters is limited and it is accounted as |
That's too bad because that's the most robust solution I've come up with. What we really need is a Hydra-centric |
Here's a potential solution: Create a file for a minimal Lightning entry point (invoked below as the lightning_cli module):

```python
import hydra
from hydra.utils import instantiate
from pytorch_lightning import LightningModule, Trainer


@hydra.main(config_path=None, config_name="config")
def main(cfg):
    trainer: Trainer = instantiate(cfg.trainer)
    model: LightningModule = instantiate(cfg.model)

    if cfg.testing:
        trainer.test(model)
    else:
        trainer.fit(model)


if __name__ == "__main__":
    main()
```

Create your own

```python
command = [sys.executable, "-m", "lightning_cli"]
command += ["-cp", hydra_output, "-cn", "config.yaml"]
command += [
    f"hydra.output_subdir={hydra_cfg.output_subdir}",
    f"hydra.run.dir={os_cwd}",
    f"hydra.job.name=train_ddp_process_{local_rank}",
]
```

Now it will only spawn a process running Lightning and not your task function. |
Thank you for the suggestions. I think your solution should work, but I don't know why it gets stuck at the Trainer without giving any error (I also tried manually calling my main function from the correct directory). The job(s) keep running normally but the out/err files do not get updated. In the end I wrote my own launcher and I am using that one instead of submitit. It is just a modification of the BasicLauncher which spawns a subprocess for each combination generated by --multirun. |
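Not the actual launcher plugin described above, but a rough standalone sketch of the same idea: expand the sweep yourself and start one fresh process per combination, so every child composes its own config and run dir like a normal single run. The script name `train.py` and the swept parameters are made up.

```python
import itertools
import subprocess
import sys

# Hypothetical sweep; each combination becomes one fresh process.
sweep = {
    "model.lr": ["1e-3", "1e-4"],
    "trainer.max_epochs": ["10", "20"],
}

keys = list(sweep)
for values in itertools.product(*sweep.values()):
    overrides = [f"{k}={v}" for k, v in zip(keys, values)]
    subprocess.run([sys.executable, "train.py", *overrides], check=True)
```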
Oh, multirun has some issues that I haven't quite figured out. The solution that seems to work consistently is to modify

First make your own

```python
def setup_environment(self) -> None:
    # ADD THIS STATEMENT ###############
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
    ####################################

    # start the other scripts
    if not self.cluster_environment.creates_processes_externally:
        self._call_children_scripts()

    self.setup_distributed()
    super().setup_environment()
```

Second in

```python
def teardown(self) -> None:
    log.detail(f"{self.__class__.__name__}: tearing down DDP plugin")
    super().teardown()
    if isinstance(self.model, DistributedDataParallel):
        self.model = self.lightning_module

    if self.sync_batchnorm:
        self.model = _revert_sync_batchnorm(self.model)

    if self.root_device.type == "cuda":
        # GPU teardown
        log.detail(f"{self.__class__.__name__}: moving model to CPU")
        self.lightning_module.cpu()
        # clean up memory
        torch.cuda.empty_cache()

    # ADD THIS STATEMENT ###############
    # Remove PL environments so next multirun starts fresh
    envs = (
        "LOCAL_RANK",
        "NODE_RANK",
        "WORLD_SIZE",
        "MASTER_ADDR",
        "MASTER_PORT",
    )
    for name in envs:
        os.environ.pop(name, None)
    ####################################
```
|
Sorry for the late reply, I was extremely busy lately. I tested this now on PL 1.5.4 and 1.5.10 and it still seems to be stuck at the trainer (which version did you test this on? The code looks slightly different from the versions I tested). |
@jgbos Thanks for your suggestions! They were helpful for me :) @AlessioQuercia for me, setting |
As far as I understand, the DDP backend runs my training script from the beginning for each GPU that I use. Is there a way to avoid creating different hydra output directories in each of the scripts? Should I somehow block every process except the one with local rank 0? In my case I'm saving model checkpoints and a .yaml file to the default hydra output directory, but the config file is copied twice and the checkpoints are saved once. Anyway, spawning too many directories is not convenient. What can I do?
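One common way to restrict the saving to a single process is a rank-zero guard; the sketch below is not from the thread and assumes PL's spawned child processes carry `LOCAL_RANK` in their environment.

```python
import os


def is_rank_zero() -> bool:
    # The original process has no LOCAL_RANK; PL sets it for the DDP
    # processes it spawns on the same node.
    return os.environ.get("LOCAL_RANK", "0") == "0"


if is_rank_zero():
    print("save config / checkpoints here, exactly once per node")
```

PyTorch Lightning also ships a `rank_zero_only` decorator in `pytorch_lightning.utilities` that wraps the same kind of check.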
Code
What's your environment?