Training bug when using `--sync_bn` #3754

twotwoiscute · 2021-06-24T02:15:11Z

❔Question

Hi, I have a problem using convert_sync_batchnorm. With plain DDP everything works fine, but when I turn on sync_bn mode, the training process starts and gets stuck right away…

## Additional context
Here's some info:
torch 1.8.1
torchaudio 0.7.0
torchvision 0.9.1

#How I run the script:
python -m torch.distributed.launch \
    --nproc_per_node 4 \
    --master_addr $master_addr \
    --master_port $port \
    train.py \
    --batch 256 --weights yolov5s.pt --device 0,1,2,3 \
    --sync_bn

#How I init:
#note: opt.local_rank is -1 here
if opt.local_rank != -1:
    assert torch.cuda.device_count() > opt.local_rank
    torch.cuda.set_device(opt.local_rank)  # bind this process to its own GPU
    device = torch.device('cuda', opt.local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')

Thanks

github-actions · 2021-06-24T02:15:50Z

👋 Hello @twotwoiscute, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at [email protected].

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Colab and Kaggle notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher · 2021-06-24T10:37:56Z

@twotwoiscute your code is out of date, the opt class no longer carries DDP variables. To update your code:

Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Notebooks – View updated notebooks
Docker – sudo docker pull ultralytics/yolov5:latest to update your image

twotwoiscute · 2021-06-25T02:37:33Z

@glenn-jocher Thanks for the suggestion.
However, I cloned the latest version and it still gets stuck at the forward pass. Here's what I did:

with amp.autocast(enabled=cuda):
    try:
        print("pred")
        pred = model(imgs)  # forward
        print("done pred")
        loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
        if RANK != -1:
            loss *= WORLD_SIZE  # gradient averaged between devices in DDP mode
        if opt.quad:
            loss *= 4.
    except Exception as error:
        print(error)

I found that with 4 GPUs, the first 3 GPUs print "pred" on the screen and the last GPU never prints "pred". Furthermore, the first 3 GPUs never reach `print("done pred")`, which means the forward pass never completes on any of the GPUs.

This happens after the first iteration has completed. By the way, sync_bn only fails when I use multiple GPUs; with a single GPU everything works perfectly.
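Print statements only locate a hang coarsely. As a rough sketch of a finer-grained check, Python's standard `faulthandler` module can dump the stack of every thread if a step does not finish within a timeout. The names `hang_watchdog` and `fake_forward` below are hypothetical; `fake_forward` merely stands in for `pred = model(imgs)`.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s: float):
    """Dump all thread stacks to stderr if the wrapped block exceeds timeout_s."""
    # sys.__stderr__ is the interpreter's original stderr, which has a real
    # file descriptor even when sys.stderr has been redirected.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=sys.__stderr__)
    try:
        yield
    finally:
        # The block finished (or raised) in time, so cancel the pending dump.
        faulthandler.cancel_dump_traceback_later()


def fake_forward(x):
    # Hypothetical stand-in for the real forward pass.
    time.sleep(0.01)
    return x * 2


with hang_watchdog(timeout_s=30):
    pred = fake_forward(21)
```

In a multi-process run, each process would install its own watchdog, and including the rank in every message and printing with `flush=True` makes the output easier to attribute.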

glenn-jocher · 2021-06-25T09:20:30Z

@twotwoiscute ok understood, I will try to reproduce this today.

glenn-jocher · 2021-06-25T10:44:05Z

@twotwoiscute yes I can reproduce this. I'm not sure what the cause is. In looking at the documentation perhaps we are missing a process_group input to the convert function.
https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html

The convert function is here:

yolov5/train.py

Lines 217 to 221 in 3749573

    # SyncBatchNorm
    if opt.sync_bn and cuda and RANK != -1:
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
        logger.info('Using SyncBatchNorm()')

twotwoiscute · 2021-06-26T03:39:16Z

I followed the script you mentioned, something like:

r1 = [0, 1, 2, 3]
process_groups = [torch.distributed.new_group(pids) for pids in [r1]]
process_group = process_groups[0]

However, it still does not work...
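For reference, creating a group is only half of the step: it would also have to be handed to the convert call. Below is a minimal sketch, assuming nothing beyond the documented `process_group` argument of `convert_sync_batchnorm`. The toy model and the `to_sync_bn` helper are hypothetical and not part of YOLOv5, and `None` is passed so that the example runs without an initialized process group.

```python
import torch.nn as nn


def to_sync_bn(model: nn.Module, process_group=None) -> nn.Module:
    """Replace every BatchNorm*D layer in `model` with SyncBatchNorm.

    `process_group=None` means statistics are synchronized across the
    default (whole-world) group.
    """
    return nn.SyncBatchNorm.convert_sync_batchnorm(model, process_group)


# Toy stand-in for the detection model (hypothetical, for illustration only).
toy = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# In a real DDP run this would be: to_sync_bn(model, process_group).to(device)
converted = to_sync_bn(toy)
print(type(converted[1]).__name__)
```

A group built from ranks 0 to 3 in a 4-process run covers the same processes as the default group, so passing it explicitly would not be expected to change the behavior.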

glenn-jocher · 2021-06-26T11:44:08Z

@twotwoiscute I'm not sure if this is due to torch 1.9.0 or our own code, but I don't see anything wrong with the implementation.

I would actually not use --sync if I were you though. It mainly helps in early training but we found it to have little to no effect on final mAP (provided you train long enough, i.e. 300 COCO epochs).

twotwoiscute · 2021-06-29T10:01:42Z

In my case I have a batch size of 64 per GPU, which I think is enough to calculate the running mean and variance correctly. However, for completeness, maybe we should ask the PyTorch team to have a look at this issue?

glenn-jocher · 2021-06-29T10:10:14Z

@twotwoiscute yes, that's a good idea. I would raise a bug report on the pytorch repo linking to this issue.

glenn-jocher · 2021-07-17T11:08:43Z

@twotwoiscute --sync known issue PR in #4032 to alert future users to the existing problem

github-actions · 2021-08-17T00:10:48Z

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Wiki – https://github.com/ultralytics/yolov5/wiki
Tutorials – https://docs.ultralytics.com/yolov5
Docs – https://docs.ultralytics.com

Access additional Ultralytics ⚡ resources:

Ultralytics HUB – https://ultralytics.com
Vision API – https://ultralytics.com/yolov5
About Us – https://ultralytics.com/about
Join Our Team – https://ultralytics.com/work
Contact Us – https://ultralytics.com/contact

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

twotwoiscute · 2021-08-23T01:44:33Z

> @twotwoiscute --sync known issue PR in #4032 to alert future users to the existing problem

Have you gotten any response from the PyTorch team? I posted a similar question in the PyTorch Forums, but have not received any response.

glenn-jocher · 2021-08-23T11:55:05Z

@twotwoiscute we still have no resolution on this issue. If you find anything or hear from the PyTorch team please update here, thank you!

twotwoiscute · 2021-08-30T03:58:06Z

> @twotwoiscute we still have no resolution on this issue. If you find anything or hear from the PyTorch team please update here, thank you!

I think this issue may already be solved in the current version. The way I solved it was by commenting out:

Thread(target=plot_images, args=(imgs, targets, paths, f), daemon=True).start()

After that, sync-bn works perfectly fine.

glenn-jocher · 2021-08-30T10:01:49Z

@twotwoiscute do you mean you comment out L81 in loggers/__init__.py?

yolov5/utils/loggers/__init__.py

Lines 79 to 81 in b894e69

    if ni < 3:
        f = self.save_dir / f'train_batch{ni}.jpg'  # filename
        Thread(target=plot_images, args=(imgs, targets, paths, f), daemon=True).start()

glenn-jocher · 2021-08-30T10:10:36Z

@twotwoiscute what about the daemon Thread plots in val.py here?

yolov5/val.py

Lines 221 to 227 in b894e69

    # Plot images
    if plots and batch_i < 3:
        f = save_dir / f'val_batch{batch_i}_labels.jpg'  # labels
        Thread(target=plot_images, args=(img, targets, paths, f, names), daemon=True).start()
        f = save_dir / f'val_batch{batch_i}_pred.jpg'  # predictions
        Thread(target=plot_images, args=(img, output_to_target(out), paths, f, names), daemon=True).start()

twotwoiscute · 2021-08-30T10:18:40Z

@glenn-jocher Oops! I think I made a wrong statement. Actually, I commented out this part of the code:

if tb_writer:
    tb_writer.add_graph(torch.jit.trace(de_parallel(model), imgs, strict=False), [])  # model graph
    # tb_writer.add_image(f, result, dataformats='HWC', global_step=epoch)

but kept the plotting part.

glenn-jocher · 2021-08-30T10:22:46Z

@twotwoiscute I don't think this line correlates with the issue at all. It doesn't actually exist anymore; it's been replaced with loggers/__init__.py L137, which only runs on the on_train_end() callback and so will not have any effect on starting training with or without --sync:

yolov5/utils/loggers/__init__.py

Lines 126 to 138 in b894e69

    def on_train_end(self, last, best, plots, epoch):
        # Callback runs on training end
        if plots:
            plot_results(file=self.save_dir / 'results.csv')  # save results.png
        files = ['results.png', 'confusion_matrix.png', *[f'{x}_curve.png' for x in ('F1', 'PR', 'P', 'R')]]
        files = [(self.save_dir / f) for f in files if (self.save_dir / f).exists()]  # filter
        if self.tb:
            from PIL import Image
            import numpy as np
            for f in files:
                self.tb.add_image(f.stem, np.asarray(Image.open(f)), epoch, dataformats='HWC')

glenn-jocher · 2021-08-30T10:26:17Z

@twotwoiscute wait, are you saying that you commented out the line that is shown uncommented in your comment, i.e. the add_graph() line?
This line does run at train start; it's been moved to loggers/__init__.py L78. I can try commenting it, thanks for the tip!

yolov5/utils/loggers/__init__.py

Lines 72 to 78 in b894e69

    def on_train_batch_end(self, ni, model, imgs, targets, paths, plots):
        # Callback runs on train batch end
        if plots:
            if ni == 0:
                with warnings.catch_warnings():
                    warnings.simplefilter('ignore')  # suppress jit trace warning
                    self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), [])

twotwoiscute · 2021-08-30T10:27:28Z

@glenn-jocher okay, last weekend I tried commenting out the plot_images part and it worked, but today it got stuck again. When I comment out self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), []) it works again.
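One possible explanation, which is an assumption and not something confirmed in this thread: `torch.jit.trace` runs a forward pass, SyncBatchNorm in training mode communicates across processes during forward, and if the logging callback runs on only one rank, that rank issues a collective that the other ranks never join. The sketch below imitates this pattern with plain threads, using a `threading.Barrier` as a stand-in for the collective. `WORLD_SIZE`, `TIMEOUT_S`, and `simulate` are hypothetical names, and no PyTorch is involved.

```python
import threading

WORLD_SIZE = 4   # hypothetical number of ranks, matching the 4-GPU run
TIMEOUT_S = 0.5  # how long a simulated rank waits before giving up


def simulate(extra_forward_on_rank0: bool) -> list:
    """Return the ranks whose simulated collective never completed."""
    barrier = threading.Barrier(WORLD_SIZE)  # stand-in for a cross-rank sync
    stalled = []

    def sync_forward(rank: int) -> None:
        # Every "forward" needs all ranks to arrive, like a collective call.
        try:
            barrier.wait(timeout=TIMEOUT_S)
        except threading.BrokenBarrierError:
            stalled.append(rank)

    def worker(rank: int) -> None:
        sync_forward(rank)  # regular training forward, issued by every rank
        if rank == 0 and extra_forward_on_rank0:
            sync_forward(rank)  # extra forward that only rank 0 performs

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stalled)


print(simulate(extra_forward_on_rank0=False))
print(simulate(extra_forward_on_rank0=True))
```

In the simulation the lone extra call times out; a real collective without a timeout would simply block, which would resemble the hang after the first iteration described earlier.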

glenn-jocher · 2021-08-30T10:34:27Z

@twotwoiscute ok got it, thanks for the feedback! These are two very different things: add_image just adds images to TensorBoard, whereas add_graph adds an interactive YOLOv5 model visualizer, which is a much more complicated operation. I don't know what it has to do with --sync, but if it's causing the hang we can simply not add graphs on --sync trainings as a workaround.
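Such a workaround could look roughly like the following. This is only a sketch of the idea, not the code from the repository: `maybe_add_graph` and its parameters are hypothetical, and the real `tb.add_graph(torch.jit.trace(...))` call is passed in as a callable so that the example runs without PyTorch.

```python
from typing import Any, Callable


def maybe_add_graph(ni: int, plots: bool, sync_bn: bool,
                    add_graph: Callable[[], Any]) -> bool:
    """Call add_graph() on the first batch only, and never when sync_bn is on.

    Returns True if the graph was added, False if it was skipped.
    """
    if plots and ni == 0 and not sync_bn:
        add_graph()
        return True
    return False


# Usage with a dummy callable standing in for the TensorBoard call.
added = []
for batch_index in range(3):
    maybe_add_graph(batch_index, plots=True, sync_bn=False,
                    add_graph=lambda: added.append("graph"))
print(added)

skipped = maybe_add_graph(0, plots=True, sync_bn=True,
                          add_graph=lambda: added.append("graph"))
print(skipped)
```

Passing the flag in explicitly keeps the decision in one place, so the TensorBoard graph is still logged for runs without --sync.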

twotwoiscute added the "question" (Further information is requested) label Jun 24, 2021

glenn-jocher linked a pull request Jul 17, 2021 that will close this issue

Add --sync-bn known issue #4032

Merged

github-actions bot added the "Stale" (Stale and schedule for closing soon) label Aug 17, 2021

github-actions bot closed this as completed Aug 23, 2021

glenn-jocher added the "TODO" (High priority items) label and removed the "Stale" (Stale and schedule for closing soon) label Aug 30, 2021

glenn-jocher reopened this Aug 30, 2021

glenn-jocher changed the title ~~Training process stop when using --sync_bn~~ Training bug when using --sync_bn Aug 30, 2021

glenn-jocher linked a pull request Aug 30, 2021 that will close this issue

DDP torch.jit.trace() --sync-bn fix #4615

Merged

glenn-jocher closed this as completed in #4615 Aug 30, 2021

glenn-jocher removed the "TODO" (High priority items) label Aug 30, 2021
