
Hydra configs with multi GPU DDP training in Pytorch Lightning #2727

Closed
topshik opened this issue Jul 27, 2020 · 32 comments
Labels: distributed (Generic distributed-related topic), question (Further information is requested), won't fix (This will not be worked on)

Comments

topshik commented Jul 27, 2020

As far as I understand, the DDP backend runs my training script from the beginning for each GPU that I use. Is there a way to avoid creating different Hydra output directories in each of the scripts? Should I somehow block every process except the one with local rank 0? In my case I'm saving model checkpoints and a .yaml file to the default Hydra output directory, but the config file is copied twice while the checkpoints are saved only once. In any case, spawning so many directories is not convenient.

What can I do?

Code

import os
import shutil

import hydra
import pytorch_lightning as pl
from omegaconf import DictConfig
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger


@hydra.main(config_path="train-config.yaml", strict=False)
def train(config: DictConfig) -> None:
    config.hydra_base_dir = os.getcwd()
    original_wd = hydra.utils.get_original_cwd()
    os.chdir(original_wd)

    checkpoint_callback = ModelCheckpoint(
        filepath=config.hydra_base_dir,
        save_top_k=3,
        verbose=True,
        monitor="val_loss",
        mode="min",
    )
    shutil.copy2("train-config.yaml", os.path.join(config.hydra_base_dir, "train-config.yaml"))

    wandb_logger = WandbLogger(
        offline=False,
    )

    model = MyModel(config)

    trainer = pl.Trainer(
        max_epochs=config.train.max_epochs,
        gpus=config.train.n_gpu,
        auto_select_gpus=True,
        distributed_backend="ddp",
        checkpoint_callback=checkpoint_callback,
        logger=wandb_logger,
    )

    trainer.fit(model)

What's your environment?

  • OS: Ubuntu 18.04
  • Conda, Python 3.7.7
  • hydra-core==0.11.3
  • pytorch-lightning==0.8.5
  • wandb==0.9.3
topshik added the question label Jul 27, 2020

topshik commented Jul 27, 2020

I've just found out that the process local rank can be accessed via local_rank = os.environ.get("LOCAL_RANK", 0), because this is how Lightning handles it under the hood. It seems the docs need some clarification on how to work with the different DDP processes.

Nevertheless, it's not possible to delete the Hydra base directory from the hydra.main-decorated function, which is weird.
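A minimal sketch of such a rank guard inside the hydra.main-decorated function, assuming the LOCAL_RANK environment variable mentioned above (config.hydra_base_dir comes from the snippet in the original post):

import os
import shutil

local_rank = int(os.environ.get("LOCAL_RANK", 0))
if local_rank == 0:
    # only the rank-0 process copies the config and touches the output directory
    shutil.copy2("train-config.yaml", os.path.join(config.hydra_base_dir, "train-config.yaml"))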

edenlightning added the distributed label Jul 29, 2020

Borda commented Jul 31, 2020

cc: @yukw777 @omry


omry commented Jul 31, 2020

You could configure the Hydra run dir via the command line or your config file to be whatever you want it to be, see this.

However, I think the right approach to do DDP training with Hydra is to use multirun. With multirun, each running script gets its own subdirectory under the primary working directory by design and not by accident.
I haven't seen anyone do it yet and I do not know how it will work with PL.
I think it's important to create a PL example for DDP, following the great work from @anthonytec2 (#2639) which is still in limbo.
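For reference, a run-dir override of the kind described above can be passed on the command line; a minimal sketch (the output path is only an illustration):

python train.py hydra.run.dir=outputs/my_run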


rakhimovv commented Jul 31, 2020

I tried to use the --multirun option.
It does not work with trainer.distributed_backend=ddp,
but it works with trainer.distributed_backend=ddp_spawn, with one exception: running it in slurm fails.


omry commented Jul 31, 2020

I am not familiar with the difference between ddp and ddp-spawn in PL.
We may still need to make some changes or at least provide a working example. I am counting on people like you to help create that example.

Are you using the Submitit Launcher plugin to run it on SLURM?
What kind of failure?


yukw777 commented Jul 31, 2020

@rakhimovv when you use ddp, PL starts subprocesses to run the training script (by simply passing the command for the training script to https://docs.python.org/3/library/subprocess.html) and gathers the gradients. What @omry means when he recommends using --multirun is to use --multirun to start the subprocesses and gather the gradients, instead of relying on the subprocess module as PL does. This is not supported by PL currently, as far as I know.

ddp_spawn on the other hand starts subprocesses by using this method https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn, which is fundamentally different from directly calling the training script as ddp does.
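As a rough illustration of the difference, a sketch (not PL's actual code; the worker function and world size are placeholders):

import torch.multiprocessing as mp


def worker(local_rank: int, world_size: int) -> None:
    # each spawned process would set up torch.distributed here
    print(f"worker {local_rank} of {world_size}")


if __name__ == "__main__":
    world_size = 2
    # ddp_spawn style: fork worker processes from the already-running interpreter
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
    # ddp style (what PL does) re-invokes the training script instead, roughly:
    #   subprocess.Popen([sys.executable] + sys.argv, env={**os.environ, "LOCAL_RANK": "1"})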


rakhimovv commented Jul 31, 2020

Thanks for the clarification @yukw777. I got it.

I misunderstood at first. In my comment above I meant the situation where I try to run several experiments using the --multirun option, like
python train.py --multirun trainer.gpus=4 trainer.distributed_backend=ddp_spawn encoder.blocks_num=2,4

@yukw777, do I understand correctly that running --multirun and ddp simultaneously is fundamentally not correct?

@omry no plugin for slurm. I did not have a chance to check Submitit or PL's SlurmCluster object. I just used a plain sbatch script.

there are two options I tried:

  1. when I set --ntasks=1 regardless of the number of GPUs used
    it works; in this case PL manages spawning the processes itself

  2. when I set --ntasks equal to the number of GPUs used
    In this case PL delegates starting the subprocesses to Slurm, which does so if one uses the 'srun' command inside the sbatch script.
    But I get the error below. It fails during the test phase. In my script, I ran fit and test sequentially.

Traceback (most recent call last):
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 202, in run_and_report
    return func()
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 345, in <lambda>
    lambda: hydra.multirun(
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in multirun
    return sweeper.sweep(arguments=task_overrides)
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 135, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_launcher.py", line 64, in launch
    ret = run_job(
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 123, in run_job
    ret.return_value = task_function(task_cfg)
  File "train_net.py", line 62, in main
    trainer.fit(model)
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1074, in fit
    self.ddp_train(process_idx=task, mp_queue=None, model=model)
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 577, in ddp_train
    model = model.configure_ddp(model, device_ids)
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 887, in configure_ddp
    model = LightningDistributedDataParallel(model, device_ids=device_ids, find_unused_parameters=True)
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 283, in __init__
    self._distributed_broadcast_coalesced(
  File "/trinity/home/r.rakhimov/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: Broken pipe
(the same traceback is printed by each of the other processes)

The main problem here is that setting --ntasks equal to the number of GPUs used is the only option if I, for example, want to run multi-node (not just multi-GPU) training.


omry commented Jul 31, 2020

By ntasks you mean the sbatch parameter? I have no intention of supporting it. If you want to use sbatch you are on your own, at least from my perspective.

Try with the submitit plugin and we can discuss further.
Also, please create a minimal example demonstrating the problem. (again, using the submitit plugin).


jgbos commented Aug 7, 2020

@omry I'm trying to think through what you are suggesting here and what Lightning does.

Lightning currently handles multiple GPUs per node by launching a subprocess for each additional GPU on the node using the sys.argv command (this apparently reduces training times over multiprocessing). The use of sys.argv obviously causes issues with Hydra because it will include all the Hydra commands. For example, if multirun command is used, each node would spawn a new multirun job. I currently modify the sys.argv commands to get around this.

Would you suggest that the config for each node be a multirun job across available GPUs (ntasks == ngpus)? If we launch with the submitit launcher with command task=1,2 --multirun would that launch two jobs per node? I like this idea but I don't think Lightning supports it.

I'll think about this because I would like to find a simple approach to this issue. I don't think Lightning can support choosing the GPU for a job. I ran into this issue using mpirun (#2408). It might work to modify the CUDA_VISIBLE_DEVICES per job.
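A minimal sketch of that CUDA_VISIBLE_DEVICES idea (the per-job rank value is a placeholder for however a job would obtain it):

import os

job_local_rank = 0  # placeholder: each launched job would get its own value
# restrict this job to a single GPU; must happen before CUDA is initialized
os.environ["CUDA_VISIBLE_DEVICES"] = str(job_local_rank)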


yukw777 commented Aug 7, 2020

@rakhimovv sorry for the late reply, but yes you're correct, currently they don't work together.


jgbos commented Aug 7, 2020

Here's an example script that outlines the Hydra/Lightning issue with ddp backend

This example simulates how lightning spawns a process on a node with 2 GPUs (spawns one process along with the main process). You can see how sys.argv is used here. There will be a log directory for each process.

Also, if you update the example to run multirun you will see it executes twice: python test_argv.py test=1,2 --multirun

import os
import sys
import subprocess

import hydra
from omegaconf import DictConfig


def spawner(cfg):
    # rebuild the current command line, the way Lightning's ddp backend does via sys.argv
    command = sys.argv
    full_path = hydra.utils.to_absolute_path(command[0])
    command[0] = full_path
    command = [sys.executable] + command
    cwd = hydra.utils.get_original_cwd()

    # mark the child as local rank 1 and launch it as a subprocess
    env_copy = os.environ.copy()
    env_copy['LOCAL_RANK'] = '1'
    return subprocess.Popen(command, env=env_copy, cwd=cwd)


def objective(cfg):
    # the spawned child (LOCAL_RANK set) prints 'bar'; the parent prints 'foo' and spawns it
    if 'LOCAL_RANK' in os.environ:
        print('bar')
    else:
        print('foo')
        spawner(cfg)


@hydra.main(config_path='.', config_name='argv.yaml')
def main(cfg: DictConfig):
    objective(cfg)

if __name__ == '__main__':
    main()

Here is argv.yaml

test: 1


omry commented Aug 10, 2020

@omry I'm trying to think through what you are suggesting here and what Lightning does.

Lightning currently handles multiple GPUs per node by launching a subprocess for each additional GPU on the node using the sys.argv command (this apparently reduces training times over multiprocessing). The use of sys.argv obviously causes issues with Hydra because it will include all the Hydra commands. For example, if multirun command is used, each node would spawn a new multirun job. I currently modify the sys.argv commands to get around this.

Yes, it is an issue.
It also sounds like you need to create a Sweeper plugin for this, but I am not sure it's the best course of action.
Sweeper plugins take the input overrides (command line) for a multirun and break them down into overrides for individual jobs.

Would you suggest that the config for each node be a multirun job across available GPUs (ntasks == ngpus)? If we launch with the submitit launcher with command task=1,2 --multirun would that launch two jobs per node? I like this idea but I don't think Lightning supports it.

I'll think about this because I would like to find a simple approach to this issue. I don't think Lightning can support choosing the GPU for a job. I ran into this issue using mpirun (#2408). It might work to modify the CUDA_VISIBLE_DEVICES per job.

I think for multiprocessing you need to treat the application as a single run (not multirun) and let PL do the multiprocessing.
My suggestion to use the Slurm launcher was made with the multi-node case in mind, but to be honest I think this is not going to work right now and more work is needed to enable it.

Hydra can set environment variables for jobs, see this.
It's likely that this will help. In fact, I am specifically calling out RANK in Torch distributed as a use case there.


jgbos commented Aug 11, 2020

So it actually works out great to just have a configuration for submitit. For my example above, if you call objective via executor.submit(objective, cfg), there is no need to mess with sys.argv. I think this is because submitit generates a new command (using pickles?) for the submission. I wonder if the Lightning folks would benefit from generating a subprocess similar to how submitit generates the submission?

AlexSchuy commented

@jgbos To clarify, the solution is to use submitit, but not with the hydra submitit plugin or hydra --multirun? Do you mind showing a working example using PL?

Also, does this mean there is currently no solution in the case that Slurm is not being used (outside of @omry's suggestion to go down one level of abstraction and deal with Torch distributed ourselves)?


omry commented Aug 14, 2020

Sorry, but I don't have an example for DDP with Hydra.
Supporting it properly will take some development.

In the meantime, try to get help from people who have been successful using PL DDP with Hydra. I think you can find a few on this issue.


jgbos commented Aug 14, 2020

@AlexSchuy I'm still trying to figure out the best options, but there are two steps I take to ensure Lightning DDP works. First, I modify sys.argv in the main function (after Hydra has been initialized) using the following (which should support multirun):

import os
import sys

from omegaconf import OmegaConf

# run inside the @hydra.main task function, after Hydra has set up the run directory;
# `distributed_backend` comes from the user's own config
if distributed_backend == 'ddp':
    cwd = os.getcwd()
    sys.argv = sys.argv[:1]
    sys.argv.extend([
        f"hydra.run.dir={cwd}",
        "hydra/hydra_logging=disabled",
        "hydra/job_logging=disabled",
    ])
    overrides = OmegaConf.load('.hydra/overrides.yaml')
    for o in overrides:
        if 'hydra/sweeper' in o:
            continue

        if 'hydra/launcher' in o:
            continue

        sys.argv.append(o)

For launching via submitit, I have a command function, such as train(cfg: DictConfig), that is used for the submission

job = executor.submit(train, *args, **kwargs)


stale bot commented Oct 22, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix label Oct 22, 2020
The stale bot closed this as completed Oct 29, 2020

mees commented Dec 14, 2020

@jgbos thanks for sharing your temporary solution.
I am facing the issue that, since I have hydra.job.override_dirname in the runs folder name, it does not parse the '=' symbol correctly. So basically it would be like trying to do python training.py hydra.run.dir=/home/foo/runs/2020-12-11_17-32-24_trainer.gpus=-1. The error says "mismatched input '=' expecting". Any suggestion on what would be the best solution?


lukashermann commented Dec 14, 2020

Could anyone provide a small example of how to use PL with Hydra and submitit (it doesn't matter whether with the plugin or not)? That would be great! @jgbos


jgbos commented Dec 16, 2020

@mees and @lukashermann. I don't have a nice simple solution to copy and paste into an issue. But here's the gist of how I have gotten things to work:

  • Lightning supports writing your own accelerator function. I created an accelerator function similar to ddp that does not spawn a new subprocess for each gpu on the machine. My implementation requires Trainer.num_gpus=1 and Trainer.num_nodes=<world size>.
  • This requires the user to spawn their own process for every GPU on every node. Any Slurm manager supports this with tasks per node (and thus submitit)
  • Be sure to use the setup flag in submitit to set MASTER_ADDR and MASTER_PORT (see the sketch below)

This treats each execution of your code as a single task (none of the subprocess spawning that Lightning does by default). Once all tasks are running, torch.distributed will initialize.

I recommend this path as it removes any special processing of sys.argv, and honestly behaves how I expect distributed code to behave.
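For the setup flag mentioned in the list above, a sketch of exporting MASTER_ADDR and MASTER_PORT for each Slurm task (the exact shell commands and port number are illustrative; the setup parameter is passed as slurm_setup through AutoExecutor):

import submitit

executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(
    slurm_setup=[
        # lines added to the generated sbatch script before the task starts
        "export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)",
        "export MASTER_PORT=29500",
    ],
)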

edit

Looking at Lightning's latest code, it looks like they may already have an accelerator that behaves this way: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/accelerators/ddp_hpc_accelerator.py


edornd commented Apr 17, 2021

Hi! Sorry to bring up a stale issue, but is there any update on this? I am currently able to use Hydra with DDP; it's the combination Hydra multirun + Lightning DDP that's still not functioning properly, generating extra folders for the child processes.
Unfortunately I cannot use SLURM, I'm on a single multi-GPU machine, if that helps.

If there's something to work with, even a workaround, I'm willing to try; the multirun option is quite awesome!


bryant1410 commented Jan 13, 2022

To work around the extra folders, I did the following. At the very beginning of the program (of each process), I check if an env variable called SWEEP_DIR is set, and if not I set it to the multirun directory I want to use (e.g., multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}; note it has to be set before the Hydra config is loaded). Then, in the Hydra config, I set hydra.sweep.dir to this variable (${oc.env:SWEEP_DIR}).
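A minimal sketch of that workaround, assuming the SWEEP_DIR variable name from the comment and Hydra's oc.env resolver:

import os
from datetime import datetime

# must run before @hydra.main composes the config
os.environ.setdefault("SWEEP_DIR", datetime.now().strftime("multirun/%Y-%m-%d/%H-%M-%S"))

# and in the Hydra config:
#   hydra:
#     sweep:
#       dir: ${oc.env:SWEEP_DIR}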


jgbos commented Jan 16, 2022

A fix that works for me is to update the PL code starting here to:

            if _HYDRA_AVAILABLE:
                if HydraConfig.initialized():
                    cwd = get_original_cwd()
                    os_cwd = f'"{os.getcwd()}"'
                    command = command[:2]
                    command += ["-cp", str(Path.cwd().relative_to(Path(cwd)) / ".hydra"), "-cn", "config.yaml"]
                    command += [f"hydra.run.dir={os_cwd}", f"hydra.job.name=train_ddp_process_{local_rank}"]


AlessioQuercia commented Feb 4, 2022

A fix that works for me is to update the PL code starting here to:

            if _HYDRA_AVAILABLE:
                if HydraConfig.initialized():
                    cwd = get_original_cwd()
                    os_cwd = f'"{os.getcwd()}"'
                    command = command[:2]
                    command += ["-cp", str(Path.cwd().relative_to(Path(cwd)) / ".hydra"), "-cn", "config.yaml"]
                    command += [f"hydra.run.dir={os_cwd}", f"hydra.job.name=train_ddp_process_{local_rank}"]

If you also want it to work for python -m script.py you need to modify it slightly:

            if _HYDRA_AVAILABLE:
                if HydraConfig.initialized():
                    cwd = get_original_cwd()
                    os_cwd = f'"{os.getcwd()}"'
                    if __main__.__spec__ is None:
                        command = command[:2]
                    else:
                        command = command[:3]
                    command += ["-cp", str(Path.cwd().relative_to(Path(cwd)) / ".hydra"), "-cn", "config.yaml"]
                    command += [f"hydra.run.dir={os_cwd}", f"hydra.job.name=train_ddp_process_{local_rank}"]

But in general this does not work with Hydra + Submitit + PL with DDP, since the command there is submit.py and you cannot pass the Hydra config to it.

If I understood correctly, @jgbos, the solution you provided works when you force PL to use only 1 GPU, and therefore you can only use 1 GPU per task. Am I right?

(quoting @jgbos's earlier comment about the custom accelerator approach: one GPU per task, one process per GPU spawned by the cluster manager, and MASTER_ADDR/MASTER_PORT set via submitit's setup flag)

I would like to run 1 task per node and N GPUs per task (1 node, 1 task, N GPUs) using Hydra (--multirun) + Submitit (gpus_per_node=N or gpus_per_task=N) and PL with DDP (to handle the multiprocessing for the N GPUs per task/node). As far as I understand, this option is not working right now. Is there any workaround for this, or should I write my own sweeper/launcher and avoid --multirun + submitit?


jgbos commented Feb 4, 2022

If you also want it to work for python -m script.py you need to modify it slightly

Yes, I wasn't using that feature, your solution is correct

If I understood it correctly, @jgbos the solution you provided works when you force PL to use 1 GPU only and therefore you can only use 1 GPU per task. Am I right?

No, this works for multi-GPU. It runs using the Hydra config of the current experiment, so it will launch the process correctly for each rank. There are several caveats to this though, mostly related to complex task functions and multirun (you need to destroy the distributed processes and remove PL-related environment variables for multirun to work). By no means is this solution robust. In most cases you should just write your own custom strategy, which is pretty easy to do.

I would like to run 1 task per node and N GPUs per task (1 node, 1 task, N GPUs) using Hydra (--multirun) + Submitit (gpus_per_node=N or gpus_per_task=N) and PL with DDP (to handle the multiprocessing for the N GPUs per task/node

For submitit (local or slurm) you should set gpus=1 and num_nodes=<num gpus wanted>. There is no reason to have PL launch subtasks for each job; just let the cluster manager (or submitit local tasking) handle spawning the job for each GPU. I've used Hydra+PL+DDP with submitit for a couple of years with no issues or updates to PL.

AlessioQuercia commented

For submitit (local or slurm) you should set gpus=1 and num_nodes=<num gpus wanted>. There is no reason to have PL launch subtasks for each job; just let the cluster manager (or submitit local tasking) handle spawning the job for each GPU. I've used Hydra+PL+DDP with submitit for a couple of years with no issues or updates to PL.

Thank you, I did not get that at first. I will give it a try!

Unfortunately, in my case there is a reason for that: my time on the clusters is limited and is accounted as nodes * cpus * time, regardless of the GPUs, i.e. using 1 GPU per node or all available ones is accounted the same. Therefore using only 1 GPU per node would be a waste of resources and time.


jgbos commented Feb 4, 2022

Therefore using only 1 GPU per node would be a waste of resources and time.

That's too bad, because that's the most robust solution I've come up with. What we really need is a Hydra-centric LightningCLI so that the spawning command is something like python -m pytorch_lightning.main <path to hydra config.yaml>. This would be robust since it doesn't call your task function again. I'm looking into it (shameless plug: it might be simple using our tool hydra-zen).


jgbos commented Feb 4, 2022

Here's a potential solution:

Create a file lightning_cli.py with something like the following:

import hydra
from hydra.utils import instantiate
from pytorch_lightning import LightningModule, Trainer


@hydra.main(config_path=None, config_name="config")
def main(cfg):
    trainer: Trainer = instantiate(cfg.trainer)
    model: LightningModule = instantiate(cfg.model)

    if cfg.testing:
        trainer.test(model)
    else:
        trainer.fit(model)

if __name__ == "__main__":
    main()

Create your own CustomDDPStrategy and use this for the command to spawn a new job:

            # (hydra_output, hydra_cfg, os_cwd and local_rank come from the surrounding strategy code)
            command = [sys.executable, "-m", "lightning_cli"]
            command += ["-cp", hydra_output, "-cn", "config.yaml"]
            command += [
                f"hydra.output_subdir={hydra_cfg.output_subdir}",
                f"hydra.run.dir={os_cwd}",
                f"hydra.job.name=train_ddp_process_{local_rank}",
            ]

Now it will only spawn the Lightning entry point and not your task function.


AlessioQuercia commented Feb 5, 2022

Thank you for the suggestions. I think your solution should work, but I don't know why it gets stuck at the Trainer without giving any error (I also tried manually calling my main function from the correct directory). The job(s) keep running normally, but the out/err files do not get updated.

In the end I wrote my own launcher and I am using that one instead of submitit. It is just a modification of the BasicLauncher which spawns a subprocess for each combination generated by --multirun.


jgbos commented Feb 7, 2022

Oh, multirun has some issues that I haven't quite figured out. The solution that seems to work consistently is to modify CustomDDPStrategy in two spots:

First, make your own setup_environment method. For some reason it only works to call destroy_process_group in this function; it will hang anywhere else.

    def setup_environment(self) -> None:
        # ADD THIS STATEMENT ###############
        if torch.distributed.is_initialized():
            torch.distributed.destroy_process_group()
        #################################

        # start the other scripts
        if not self.cluster_environment.creates_processes_externally:
            self._call_children_scripts()

        self.setup_distributed()
        super().setup_environment()

Second, in teardown you need to remove the PL environment variables. I'm unsure exactly which variables to remove, but this seems to work:

    def teardown(self) -> None:
        log.detail(f"{self.__class__.__name__}: tearing down DDP plugin")
        super().teardown()
        if isinstance(self.model, DistributedDataParallel):
            self.model = self.lightning_module

        if self.sync_batchnorm:
            self.model = _revert_sync_batchnorm(self.model)

        if self.root_device.type == "cuda":
            # GPU teardown
            log.detail(f"{self.__class__.__name__}: moving model to CPU")
            self.lightning_module.cpu()
            # clean up memory
            torch.cuda.empty_cache()

        # ADD THIS STATEMENT ###############
        # Remove PL environments so next multirun starts fresh
        envs = (
            "LOCAL_RANK",
            "NODE_RANK",
            "WORLD_SIZE",
            "MASTER_ADDR",
            "MASTER_PORT",
        )
    
        for name in envs:
            os.environ.pop(name, None)
        #################################

AlessioQuercia commented

Sorry for the late reply, I have been extremely busy lately. I tested this now on PL 1.5.4 and 1.5.10 and it still seems to get stuck at the Trainer (which version did you test this on? The code looks slightly different from the versions I tested).

@alexsax
Copy link

alexsax commented May 25, 2022

@jgbos Thanks for your suggestions! They were helpful for me :)

@AlessioQuercia for me, setting tasks_per_node equal to the number of GPUs worked, and might work well for your accounting scheme
