deepcam dummy wireup error #38

Open
sparticlesteve opened this issue Oct 6, 2022 · 0 comments

It's probably not a common use case, but the "dummy" wireup method for deepcam doesn't seem to work.

Here's an example script at NERSC:

#!/bin/bash
#SBATCH -A nstaff_g
#SBATCH -q early_science
#SBATCH -C gpu
#SBATCH -J mlperf-deepcam
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node 1
#SBATCH --cpus-per-task=32
#SBATCH --time 30
#SBATCH --image sfarrell/deepcam:ref-21.12

# Configuration
local_batch_size=2
batchnorm_group_size=1
data_dir="/global/cfs/cdirs/mpccc/gsharing/sfarrell/climate-data/All-Hist"
output_dir="$SCRATCH/deepcam/results"
run_tag="test_dummy_${SLURM_JOB_ID}"

srun --mpi=pmi2 shifter --module gpu \
       python ./train.py \
       --wireup_method "dummy" \
       --run_tag ${run_tag} \
       --data_dir_prefix ${data_dir} \
       --output_dir ${output_dir} \
       --model_prefix "segmentation" \
       --optimizer "LAMB" \
       --start_lr 0.0055 \
       --lr_schedule type="multistep",milestones="800",decay_rate="0.1" \
       --lr_warmup_steps 400 \
       --lr_warmup_factor 1. \
       --weight_decay 1e-2 \
       --logging_frequency 10 \
       --save_frequency 0 \
       --max_epochs 1 \
       --max_inter_threads 4 \
       --seed $(date +%s) \
       --batchnorm_group_size ${batchnorm_group_size} \
       --local_batch_size ${local_batch_size}

This gives a runtime error when constructing the DDP wrapper:

Traceback (most recent call last):
  File "./train.py", line 256, in <module>
    main(pargs)
  File "./train.py", line 167, in main
    ddp_net = DDP(net, device_ids=[device.index],
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 551, in __init__
    self.process_group = _get_default_group()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 412, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
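For reference, here is a minimal workaround sketch (my own, not the repo's code): if the intent of the "dummy" wireup is just to run a single process without a real launcher, initializing a trivial single-rank process group before building the DDP wrapper avoids this error. The backend choice and the MASTER_ADDR/MASTER_PORT defaults below are assumptions, and the Linear model is just a stand-in for the deepcam network.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical workaround (not the repo's code): create a trivial single-rank
# process group so DDP's default-group lookup succeeds even when no real
# wireup (MPI / NCCL env) has been performed.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

net = torch.nn.Linear(8, 2)   # stand-in for the deepcam model
ddp_net = DDP(net)            # constructs without the RuntimeError above
print(dist.get_world_size())  # -> 1
dist.destroy_process_group()

Alternatively, train.py could skip the DDP wrapper entirely when dist.is_initialized() is False, since a one-rank dummy run has nothing to synchronize.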