Training bug when using --sync_bn #3754

Closed
twotwoiscute opened this issue Jun 24, 2021 · 21 comments · Fixed by #4032 or #4615
Labels
question Further information is requested

Comments

@twotwoiscute

❔Question

Hi, I have a problem using convert_sync_batchnorm. With plain DDP everything works fine, but when I turn on sync_bn mode, training starts and gets stuck right away…

## Additional context
Here’s some info :
torch                         1.8.1                   
torchaudio                0.7.0                    
torchvision               0.9.1              

# How I run the script:
python -m torch.distributed.launch \
--nproc_per_node 4 \
--master_addr $master_addr \
--master_port $port train.py \
--batch 256 --weights yolov5s.pt --device 0,1,2,3 \
--sync_bn

#How I init :
#note : opt.local_rank is -1 here 
    if opt.local_rank != -1:
        assert torch.cuda.device_count() > opt.local_rank
        torch.cuda.set_device(opt.local_rank)
        device = torch.device('cuda', opt.local_rank)
        dist.init_process_group(backend='nccl', init_method='env://')
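
For reference, torch.distributed.launch passes --local_rank to each worker as a command-line argument (unless --use_env is given), while newer launchers such as torchrun export a LOCAL_RANK environment variable instead. A minimal sketch of resolving the rank from either source; resolve_local_rank is a hypothetical helper name, not part of YOLOv5:

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from --local_rank or LOCAL_RANK, else -1.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('--local_rank', type=int, default=-1)
    # parse_known_args ignores the other training flags on the command line
    args, _ = parser.parse_known_args(argv)
    if args.local_rank != -1:
        return args.local_rank
    return int(environ.get('LOCAL_RANK', -1))
```

A result of -1 means the process was not started by a distributed launcher, in which case the init block above is skipped entirely.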

Thanks

@twotwoiscute twotwoiscute added the question Further information is requested label Jun 24, 2021
@github-actions
Contributor

github-actions bot commented Jun 24, 2021

👋 Hello @twotwoiscute, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at [email protected].

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

glenn-jocher commented Jun 24, 2021

@twotwoiscute your code is out of date, the opt class no longer carries DDP variables. To update your code:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View updated notebooks
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

@twotwoiscute
Author

twotwoiscute commented Jun 25, 2021

@glenn-jocher Thanks for the suggestion.
However, I cloned the latest version and it still gets stuck at the forward pass. Here's what I did:

with amp.autocast(enabled=cuda):
    try:
        print("pred")
        pred = model(imgs)  # forward
        print("done pred")
        loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
        if RANK != -1:
            loss *= WORLD_SIZE  # gradient averaged between devices in DDP mode
        if opt.quad:
            loss *= 4.
    except Exception as error:
        print(error)

I found that with 4 GPUs, the first 3 GPUs print "pred" on the screen and the last GPU never prints "pred". Furthermore, the first 3 GPUs never reach print("done pred"), which means the forward pass never completes on any GPU.

This happens after the first iteration has completed. By the way, sync_bn only fails when I use multiple GPUs; with a single GPU everything works perfectly.
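
When ranks stall like this, a stack dump can show where each process is blocked. A minimal sketch using the standard-library faulthandler module; the hang_watchdog name and the timeout value are illustrative assumptions, not part of YOLOv5:

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(seconds, file=None):
    """Dump all thread tracebacks if the wrapped block runs longer than `seconds`.

    Hypothetical helper: the process keeps running after the dump (exit=False).
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(seconds, exit=False, file=target)
    try:
        yield
    finally:
        # Cancel the pending dump once the block finishes in time
        faulthandler.cancel_dump_traceback_later()


# Usage inside the training loop (sketch):
# with hang_watchdog(120):
#     pred = model(imgs)  # forward
```

Running this on every rank shows which call each stuck process is waiting in, which helps tell a blocked collective apart from a blocked data loader.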

@glenn-jocher
Member

@twotwoiscute ok understood, I will try to reproduce this today.

@glenn-jocher
Member

@twotwoiscute yes I can reproduce this. I'm not sure what the cause is. In looking at the documentation perhaps we are missing a process_group input to the convert function.
https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html

The convert function is here:

yolov5/train.py

Lines 217 to 221 in 3749573

# SyncBatchNorm
if opt.sync_bn and cuda and RANK != -1:
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    logger.info('Using SyncBatchNorm()')
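
For reference, convert_sync_batchnorm accepts an optional process_group argument; per the linked documentation, leaving it as None synchronizes statistics across the whole world. A small sketch on a toy model (not YOLOv5 code). As far as I can tell, the conversion itself needs neither GPUs nor an initialized process group; only a training-mode forward pass does:

```python
import torch.nn as nn


def to_sync_bn(model, process_group=None):
    """Replace every BatchNorm*d layer in `model` with SyncBatchNorm.

    `to_sync_bn` is a hypothetical wrapper; process_group=None means the
    default (world) group is used when statistics are synchronized.
    """
    return nn.SyncBatchNorm.convert_sync_batchnorm(model, process_group)


def count_sync_bn(model):
    """Count SyncBatchNorm layers, e.g. to confirm the conversion took effect."""
    return sum(isinstance(m, nn.SyncBatchNorm) for m in model.modules())


# Toy model standing in for the detector
toy = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.SiLU())
converted = to_sync_bn(toy)
```

Passing an explicit group would look like to_sync_bn(model, process_group=group), where group comes from torch.distributed.new_group after init_process_group has run.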

@twotwoiscute
Author

I followed what you mentioned, with something like:

r1 = [ 0,1,2,3 ]
process_groups = [torch.distributed.new_group(pids) for pids in [r1] ] 
process_group = process_groups[0]

However, it still does not work...

@glenn-jocher
Member

@twotwoiscute I'm not sure if this is due to torch 1.9.0 or our own code, but I don't see anything wrong with the implementation.

I would actually not use --sync if I were you though. It mainly helps in early training but we found it to have little to no effect on final mAP (provided you train long enough, i.e. 300 COCO epochs).

@twotwoiscute
Author

twotwoiscute commented Jun 29, 2021

In my case, I have a batch size of 64 per GPU, which I think is enough to estimate the running mean and variance correctly. However, for completeness, maybe we should ask the PyTorch team to have a look at this issue?

@glenn-jocher
Member

@twotwoiscute yes, that's a good idea. I would raise a bug report on the pytorch repo linking to this issue.

@glenn-jocher glenn-jocher linked a pull request Jul 17, 2021 that will close this issue
@glenn-jocher
Member

@twotwoiscute --sync known issue PR in #4032 to alert future users to the existing problem

@github-actions
Contributor

github-actions bot commented Aug 17, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Aug 17, 2021
@twotwoiscute
Author

> @twotwoiscute --sync known issue PR in #4032 to alert future users to the existing problem

Have you gotten any response from the PyTorch team? I posted a similar question on the PyTorch Forums but have not received any response.

@glenn-jocher
Member

@twotwoiscute we still have no resolution on this issue. If you find anything or hear from the PyTorch team please update here, thank you!

@twotwoiscute
Author

> @twotwoiscute we still have no resolution on this issue. If you find anything or hear from the PyTorch team please update here, thank you!

I think this issue can be solved in the current version, since I solved it by commenting out:

Thread(target=plot_images, args=(imgs, targets, paths, f), daemon=True).start()

And sync-bn works perfectly fine.

@glenn-jocher
Member

@twotwoiscute do you mean you commented out L81 in loggers/__init__.py?

if ni < 3:
    f = self.save_dir / f'train_batch{ni}.jpg'  # filename
    Thread(target=plot_images, args=(imgs, targets, paths, f), daemon=True).start()

@glenn-jocher
Member

@twotwoiscute what about the daemon Thread plots in val.py here?

yolov5/val.py

Lines 221 to 227 in b894e69

# Plot images
if plots and batch_i < 3:
    f = save_dir / f'val_batch{batch_i}_labels.jpg'  # labels
    Thread(target=plot_images, args=(img, targets, paths, f, names), daemon=True).start()
    f = save_dir / f'val_batch{batch_i}_pred.jpg'  # predictions
    Thread(target=plot_images, args=(img, output_to_target(out), paths, f, names), daemon=True).start()

@glenn-jocher glenn-jocher added TODO High priority items and removed Stale Stale and schedule for closing soon labels Aug 30, 2021
@glenn-jocher glenn-jocher reopened this Aug 30, 2021
@glenn-jocher glenn-jocher changed the title Training process stop when using --sync_bn Training bug when using --sync_bn Aug 30, 2021
@twotwoiscute
Author

twotwoiscute commented Aug 30, 2021

@glenn-jocher Oops! I think I made a wrong statement. Actually, I commented out this part of the code:

if tb_writer:
    tb_writer.add_graph(torch.jit.trace(de_parallel(model), imgs, strict=False), [])  # model graph
    # tb_writer.add_image(f, result, dataformats='HWC', global_step=epoch) 

but kept the plotting part.

@glenn-jocher
Member

glenn-jocher commented Aug 30, 2021

@twotwoiscute I don't think this line correlates with the issue at all. It doesn't actually exist anymore; it's been replaced with loggers/__init__.py L137, which only runs on the on_train_end() callback and will not have any effect on starting training with or without --sync:

def on_train_end(self, last, best, plots, epoch):
    # Callback runs on training end
    if plots:
        plot_results(file=self.save_dir / 'results.csv')  # save results.png
    files = ['results.png', 'confusion_matrix.png', *[f'{x}_curve.png' for x in ('F1', 'PR', 'P', 'R')]]
    files = [(self.save_dir / f) for f in files if (self.save_dir / f).exists()]  # filter
    if self.tb:
        from PIL import Image
        import numpy as np
        for f in files:
            self.tb.add_image(f.stem, np.asarray(Image.open(f)), epoch, dataformats='HWC')

@glenn-jocher
Member

@twotwoiscute wait, are you saying that the line you commented out is the add_graph line in your comment, this line?
This line does run at train start; it's been moved to loggers/__init__.py L78. I can try commenting it out, thanks for the tip!

def on_train_batch_end(self, ni, model, imgs, targets, paths, plots):
    # Callback runs on train batch end
    if plots:
        if ni == 0:
            with warnings.catch_warnings():
                warnings.simplefilter('ignore')  # suppress jit trace warning
                self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), [])

@twotwoiscute
Author

@glenn-jocher okay, last weekend I tried commenting out the plot_images part and it worked, but today it got stuck again. When I comment out self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), []) it works again...
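
One possible explanation, stated here as an assumption rather than a confirmed diagnosis: if the graph trace runs an extra forward pass on a single rank, SyncBatchNorm would issue a collective call there that the other ranks never join, so that rank waits forever. The sketch below simulates that pattern with standard-library threads; threading.Barrier stands in for the collective, and all names are hypothetical:

```python
import threading


def run_ranks(world_size, extra_call_rank=None, timeout=0.5):
    """Simulate ranks joining a collective; one rank may make an unmatched extra call."""
    barrier = threading.Barrier(world_size)
    results = {}

    def worker(rank):
        try:
            barrier.wait(timeout=timeout)  # collective that every rank joins
            if rank == extra_call_rank:
                # Extra collective (e.g. a forward pass only this rank runs)
                barrier.wait(timeout=timeout)
            results[rank] = 'ok'
        except threading.BrokenBarrierError:
            # A real collective would block indefinitely instead of timing out
            results[rank] = 'hung'

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With extra_call_rank=None every simulated rank finishes; with extra_call_rank=0 only rank 0 is reported as hung, because no peer ever matches its second call.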

@glenn-jocher
Member

@twotwoiscute ok got it, thanks for the feedback! These are two very different things: add_image just adds images to TensorBoard, while add_graph adds an interactive YOLOv5 model visualizer (below), which is a much more complicated operation. I don't know what it has to do with --sync, but if it's causing the hang we can simply not add graphs on --sync trainings as a workaround.
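
The workaround described above could be sketched as a small guard. This is only an illustration: should_add_graph is a hypothetical helper, and the actual change in the linked pull request may differ:

```python
def should_add_graph(ni, plots, has_tensorboard, sync_bn):
    """Return True only on the first batch, with plots and TensorBoard enabled,
    and only when SyncBatchNorm is not in use (hypothetical guard)."""
    return bool(plots and has_tensorboard and ni == 0 and not sync_bn)


# Sketch of use inside the batch-end callback:
# if should_add_graph(ni, plots, self.tb is not None, opt.sync_bn):
#     self.tb.add_graph(...)
```

Keeping the condition in one place makes it easy to re-enable graph logging later if the underlying hang is resolved.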

image

@glenn-jocher glenn-jocher linked a pull request Aug 30, 2021 that will close this issue
@glenn-jocher glenn-jocher removed the TODO High priority items label Aug 30, 2021