Training RuntimeError using MPS: Expected all tensors to be on the same device #9613

ehallein · 2022-09-27T06:29:19Z

Search before asking

I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

Trying to train a YOLOv5 model on an M1 Pro. Training runs until the end of the first epoch, then crashes during validation:

PYTORCH_ENABLE_MPS_FALLBACK=1 python ~/git/yolov5/train.py --img 640 --batch 6 --save-period 3 --epochs 100 --data dataset.yaml --weights md_v5a.0.0_rebuild_pt-1.12_zerolr.pt --device mps

train: weights=md_v5a.0.0_rebuild_pt-1.12_zerolr.pt, cfg=, data=dataset.yaml, hyp=../git/yolov5/data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=6, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=mps, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=../git/yolov5/runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=3, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.2-175-g1460e57 Python-3.8.13 torch-1.13.0.dev20220926 MPS

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs in Weights & Biases
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir ../git/yolov5/runs/train', view at http://localhost:6006/
Overriding model.yaml nc=3 with nc=2

                 from  n    params  module                                  arguments                     
  0                -1  1      8800  models.common.Conv                      [3, 80, 6, 2, 2]              
  1                -1  1    115520  models.common.Conv                      [80, 160, 3, 2]               
  2                -1  4    309120  models.common.C3                        [160, 160, 4]                 
  3                -1  1    461440  models.common.Conv                      [160, 320, 3, 2]              
  4                -1  8   2259200  models.common.C3                        [320, 320, 8]                 
  5                -1  1   1844480  models.common.Conv                      [320, 640, 3, 2]              
  6                -1 12  13125120  models.common.C3                        [640, 640, 12]                
  7                -1  1   5531520  models.common.Conv                      [640, 960, 3, 2]              
  8                -1  4  11070720  models.common.C3                        [960, 960, 4]                 
  9                -1  1  11061760  models.common.Conv                      [960, 1280, 3, 2]             
 10                -1  4  19676160  models.common.C3                        [1280, 1280, 4]               
 11                -1  1   4099840  models.common.SPPF                      [1280, 1280, 5]               
 12                -1  1   1230720  models.common.Conv                      [1280, 960, 1, 1]             
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 14           [-1, 8]  1         0  models.common.Concat                    [1]                           
 15                -1  4  11992320  models.common.C3                        [1920, 960, 4, False]         
 16                -1  1    615680  models.common.Conv                      [960, 640, 1, 1]              
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 18           [-1, 6]  1         0  models.common.Concat                    [1]                           
 19                -1  4   5332480  models.common.C3                        [1280, 640, 4, False]         
 20                -1  1    205440  models.common.Conv                      [640, 320, 1, 1]              
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 22           [-1, 4]  1         0  models.common.Concat                    [1]                           
 23                -1  4   1335040  models.common.C3                        [640, 320, 4, False]          
 24                -1  1    922240  models.common.Conv                      [320, 320, 3, 2]              
 25          [-1, 20]  1         0  models.common.Concat                    [1]                           
 26                -1  4   4922880  models.common.C3                        [640, 640, 4, False]          
 27                -1  1   3687680  models.common.Conv                      [640, 640, 3, 2]              
 28          [-1, 16]  1         0  models.common.Concat                    [1]                           
 29                -1  4  11377920  models.common.C3                        [1280, 960, 4, False]         
 30                -1  1   8296320  models.common.Conv                      [960, 960, 3, 2]              
 31          [-1, 12]  1         0  models.common.Concat                    [1]                           
 32                -1  4  20495360  models.common.C3                        [1920, 1280, 4, False]        
 33  [23, 26, 29, 32]  1     67284  models.yolo.Detect                      [2, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [320, 640, 960, 1280]]
Model summary: 575 layers, 140045044 parameters, 140045044 gradients, 208.8 GFLOPs

Transferred 955/963 items from md_v5a.0.0_rebuild_pt-1.12_zerolr.pt
optimizer: SGD(lr=0.01) with parameter groups 159 weight(decay=0.0), 163 weight(decay=0.000515625), 163 bias
train: Scanning '/megadetector/train/labels.cache' images and labels.
val: Scanning '/megadetector/val/labels.cache' images and labels... 3

AutoAnchor: 6.81 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to ../git/yolov5/runs/train/exp2/labels.jpg... 
Image sizes 640 train, 640 val
Using 6 dataloader workers
Logging results to ../git/yolov5/runs/train/exp2
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
  0%|          | 0/21 [00:00<?, ?it/s]                                          git/yolov5/utils/loss.py:208: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1664176335355/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  t = t[j]  # filter
       0/99         0G     0.1015     0.0242    0.02756         15        640: 100%|██████████| 21/21 [05:22<00:
                 Class     Images  Instances          P          R      mAP50   mAP50-95:   0%|          | 0/3 [
Traceback (most recent call last):
  File "/git/yolov5/train.py", line 630, in <module>
    main(opt)
  File "/git/yolov5/train.py", line 526, in main
    train(opt.hyp, opt, device, callbacks)
  File "/git/yolov5/train.py", line 349, in train
    results, maps, _ = validate.run(data_dict,
  File "/miniforge3/envs/cameratraps-detector/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "git/yolov5/val.py", line 254, in run
    correct = process_batch(predn, labelsn, iouv)
  File "git/yolov5/val.py", line 82, in process_batch
    iou = box_iou(labels[:, 1:], detections[:, :4])
  File "/git/yolov5/utils/metrics.py", line 286, in box_iou
    inter = (torch.min(a2, b2) - torch.max(a1, b1)).clamp(0).prod(2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, mps:0 and cpu!
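
The last frame of the traceback combines label boxes and detection boxes in one elementwise expression, which fails when one tensor is on mps:0 and the other on cpu. A minimal sketch of the failure mode and one possible guard, assuming a simplified IoU function modelled on the line shown in the traceback (not the actual YOLOv5 source); the `.to(box1.device)` guard is an illustrative assumption, not the fix that was merged:

```python
import torch


def box_iou(box1: torch.Tensor, box2: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Pairwise IoU for boxes in (x1, y1, x2, y2) format; returns an N x M matrix."""
    # Assumed guard: put both inputs on the same device before combining them.
    box2 = box2.to(box1.device)
    a1, a2 = box1[:, None, :2], box1[:, None, 2:]  # shape (N, 1, 2)
    b1, b2 = box2[None, :, :2], box2[None, :, 2:]  # shape (1, M, 2)
    # Same expression as in the traceback: overlap width times overlap height.
    inter = (torch.min(a2, b2) - torch.max(a1, b1)).clamp(0).prod(2)
    area1 = (a2 - a1).prod(2)  # shape (N, 1)
    area2 = (b2 - b1).prod(2)  # shape (1, M)
    return inter / (area1 + area2 - inter + eps)


# Use MPS when it is available, otherwise stay on CPU so the sketch still runs.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
labels = torch.tensor([[0.0, 0.0, 2.0, 2.0]], device=device)
detections = torch.tensor([[1.0, 1.0, 3.0, 3.0], [0.0, 0.0, 2.0, 2.0]])  # deliberately left on CPU
iou = box_iou(labels, detections)
print(iou.shape, iou.device)
```

Without the guard, the same call on an MPS machine would be expected to raise the RuntimeError shown above, because `labels` and `detections` would sit on different devices.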

Environment

YOLOv5 v6.2-175-g1460e57 Python-3.8.13 torch-1.13.0.dev20220926 MPS
MacOS 12.6
M1 Pro

Minimal Reproducible Example

PYTORCH_ENABLE_MPS_FALLBACK=1 python ~/git/yolov5/train.py --img 640 --batch 6 --save-period 3 --epochs 100 --data dataset.yaml --weights md_v5a.0.0_rebuild_pt-1.12_zerolr.pt --device mps

Additional

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!


github-actions · 2022-09-27T06:29:55Z

👋 Hello @ehallein, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email [email protected].

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

May resolve #9613 Signed-off-by: Glenn Jocher <[email protected]>

* NMS MPS device wrapper (may resolve #9613)
* Update general.py
Signed-off-by: Glenn Jocher <[email protected]>

glenn-jocher · 2022-09-27T16:06:12Z

@ehallein good news 😃! Your original issue may now be fixed ✅ in PR #9620. To receive this update:

Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
PyTorch Hub – Force-reload model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Notebooks – View updated notebooks
Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
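
The referenced PR is titled "NMS MPS device wrapper", but its diff is not shown in this thread. As a rough illustration of what a device wrapper of this kind could look like (an assumption, not the merged code), the sketch below runs a post-processing function on the CPU whenever its input lives on MPS and moves the result back afterwards. Both `cpu_fallback_for_mps` and `keep_confident` are hypothetical names used only for this example:

```python
import functools

import torch


def cpu_fallback_for_mps(fn):
    """Run fn on CPU when its first tensor argument is on MPS, then move the result back."""

    @functools.wraps(fn)
    def wrapper(x: torch.Tensor, *args, **kwargs):
        device = x.device
        if device.type == "mps":
            out = fn(x.cpu(), *args, **kwargs)  # compute on CPU
            return out.to(device)  # hand the result back on the caller's device
        return fn(x, *args, **kwargs)

    return wrapper


@cpu_fallback_for_mps
def keep_confident(pred: torch.Tensor, conf_thres: float = 0.25) -> torch.Tensor:
    """Toy stand-in for a post-processing step: keep rows whose 5th column exceeds conf_thres."""
    return pred[pred[:, 4] > conf_thres]


pred = torch.tensor([[0.0, 0.0, 1.0, 1.0, 0.9],
                     [0.0, 0.0, 1.0, 1.0, 0.1]])
kept = keep_confident(pred)
print(kept.shape, kept.device)
```

The design choice here is to keep the caller's device contract intact: whatever device the input arrives on is the device the output is returned on, so downstream code does not have to know that a CPU detour happened.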

Zengyf-CVer · 2022-09-27T23:47:17Z

@glenn-jocher What is MPS?

glenn-jocher · 2022-09-28T10:59:07Z

@Zengyf-CVer MPS is Metal Performance Shaders, the PyTorch backend for Apple silicon GPUs. --device mps is for M1 and M2 Apple computers.

On an M2 MacBook, for example, inference runs in about 10 ms, but torch does not support all ops for MPS yet.
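
Because some ops are still missing on MPS, the command in the original report sets PYTORCH_ENABLE_MPS_FALLBACK=1, and the aten::nonzero warning in the log shows an op falling back to the CPU. A minimal sketch of launching a similar run from Python, assuming the variable should be in place before torch is imported in the training process (the conservative choice). The helper names are hypothetical; the flags are copied from the report, with the --weights argument left out for brevity:

```python
import os
import subprocess
import sys


def mps_fallback_env(base=None):
    """Return a copy of the environment with the MPS CPU fallback enabled."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    return env


def train_command(script="train.py", device="mps"):
    """Build the argument list for a training run like the one in the report."""
    return [sys.executable, script,
            "--img", "640", "--batch", "6", "--epochs", "100",
            "--data", "dataset.yaml", "--device", device]


def launch(script="train.py"):
    """Start training in a child process so the variable is set before torch loads there."""
    return subprocess.run(train_command(script), env=mps_fallback_env(), check=True)
```

Calling `launch("train.py")` from inside the yolov5/ directory would start the run; nothing is executed until `launch` is called.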

ehallein · 2022-09-28T11:04:59Z

The update has fixed the bug, in that it allows training to continue. However, training does not appear to work correctly: trained models do not detect anything (training with --device cpu works fine).
mAP values are also zero during training.

Should training work with mps at all?

glenn-jocher · 2022-09-28T12:43:16Z

@ehallein we have not validated MPS training yet, but I suspect it will not work because detection inference does not work correctly (classification does).

Since MPS classification works the issue may be in the Detect() module. I have not had time to debug, but before you start training I would try to debug detection. Please submit a PR for any fixes you discover.

PYTORCH_ENABLE_MPS_FALLBACK=1 python classify/predict.py --device mps  # correct results
PYTORCH_ENABLE_MPS_FALLBACK=1 python detect.py --device mps  # bad results
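
One way to narrow down where detection diverges, sketched under assumptions: run the same module with identical weights and the same input on the CPU and on MPS, then compare the outputs. `max_abs_diff` is a hypothetical helper, and the small Conv2d/SiLU/Upsample stack only stands in for real YOLOv5 layers; on a machine without MPS the sketch compares CPU against CPU and reports zero.

```python
import copy

import torch
import torch.nn as nn


def max_abs_diff(module: nn.Module, x: torch.Tensor, device: torch.device) -> float:
    """Largest absolute difference between the CPU output and the output on `device`."""
    module = module.eval()
    other = copy.deepcopy(module).to(device)  # same weights, different device
    with torch.no_grad():
        ref = module.cpu()(x.cpu())
        out = other(x.to(device)).cpu()
    return (ref - out).abs().max().item()


torch.manual_seed(0)
block = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
)
x = torch.rand(1, 3, 16, 16)
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"max abs diff on {device}: {max_abs_diff(block, x, device):.3e}")
```

Applying the same comparison to successive submodules of a loaded model, ending with the Detect() head, would show the first place where the MPS output departs from the CPU reference.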

ehallein added the bug Something isn't working label Sep 27, 2022

glenn-jocher added a commit that referenced this issue Sep 27, 2022

NMS MPS device wrapper

a610f0c

May resolve #9613 Signed-off-by: Glenn Jocher <[email protected]>

glenn-jocher mentioned this issue Sep 27, 2022

NMS MPS device wrapper #9620

Merged

glenn-jocher closed this as completed in #9620 Sep 27, 2022
