Training RuntimeError using MPS: Expected all tensors to be on the same device #9613

Closed
1 of 2 tasks
ehallein opened this issue Sep 27, 2022 · 6 comments · Fixed by #9620
Labels
bug Something isn't working

Comments

@ehallein

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

Trying to train a YOLOv5 model on an M1 Pro. Training seems to run until the end of the first epoch, then crashes during validation:

PYTORCH_ENABLE_MPS_FALLBACK=1 python ~/git/yolov5/train.py --img 640 --batch 6 --save-period 3 --epochs 100 --data dataset.yaml --weights md_v5a.0.0_rebuild_pt-1.12_zerolr.pt --device mps

train: weights=md_v5a.0.0_rebuild_pt-1.12_zerolr.pt, cfg=, data=dataset.yaml, hyp=../git/yolov5/data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=6, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=mps, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=../git/yolov5/runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=3, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.2-175-g1460e57 Python-3.8.13 torch-1.13.0.dev20220926 MPS

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs in Weights & Biases
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir ../git/yolov5/runs/train', view at http://localhost:6006/
Overriding model.yaml nc=3 with nc=2

                 from  n    params  module                                  arguments                     
  0                -1  1      8800  models.common.Conv                      [3, 80, 6, 2, 2]              
  1                -1  1    115520  models.common.Conv                      [80, 160, 3, 2]               
  2                -1  4    309120  models.common.C3                        [160, 160, 4]                 
  3                -1  1    461440  models.common.Conv                      [160, 320, 3, 2]              
  4                -1  8   2259200  models.common.C3                        [320, 320, 8]                 
  5                -1  1   1844480  models.common.Conv                      [320, 640, 3, 2]              
  6                -1 12  13125120  models.common.C3                        [640, 640, 12]                
  7                -1  1   5531520  models.common.Conv                      [640, 960, 3, 2]              
  8                -1  4  11070720  models.common.C3                        [960, 960, 4]                 
  9                -1  1  11061760  models.common.Conv                      [960, 1280, 3, 2]             
 10                -1  4  19676160  models.common.C3                        [1280, 1280, 4]               
 11                -1  1   4099840  models.common.SPPF                      [1280, 1280, 5]               
 12                -1  1   1230720  models.common.Conv                      [1280, 960, 1, 1]             
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 14           [-1, 8]  1         0  models.common.Concat                    [1]                           
 15                -1  4  11992320  models.common.C3                        [1920, 960, 4, False]         
 16                -1  1    615680  models.common.Conv                      [960, 640, 1, 1]              
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 18           [-1, 6]  1         0  models.common.Concat                    [1]                           
 19                -1  4   5332480  models.common.C3                        [1280, 640, 4, False]         
 20                -1  1    205440  models.common.Conv                      [640, 320, 1, 1]              
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 22           [-1, 4]  1         0  models.common.Concat                    [1]                           
 23                -1  4   1335040  models.common.C3                        [640, 320, 4, False]          
 24                -1  1    922240  models.common.Conv                      [320, 320, 3, 2]              
 25          [-1, 20]  1         0  models.common.Concat                    [1]                           
 26                -1  4   4922880  models.common.C3                        [640, 640, 4, False]          
 27                -1  1   3687680  models.common.Conv                      [640, 640, 3, 2]              
 28          [-1, 16]  1         0  models.common.Concat                    [1]                           
 29                -1  4  11377920  models.common.C3                        [1280, 960, 4, False]         
 30                -1  1   8296320  models.common.Conv                      [960, 960, 3, 2]              
 31          [-1, 12]  1         0  models.common.Concat                    [1]                           
 32                -1  4  20495360  models.common.C3                        [1920, 1280, 4, False]        
 33  [23, 26, 29, 32]  1     67284  models.yolo.Detect                      [2, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [320, 640, 960, 1280]]
Model summary: 575 layers, 140045044 parameters, 140045044 gradients, 208.8 GFLOPs

Transferred 955/963 items from md_v5a.0.0_rebuild_pt-1.12_zerolr.pt
optimizer: SGD(lr=0.01) with parameter groups 159 weight(decay=0.0), 163 weight(decay=0.000515625), 163 bias
train: Scanning '/megadetector/train/labels.cache' images and labels.
val: Scanning '/megadetector/val/labels.cache' images and labels... 3

AutoAnchor: 6.81 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to ../git/yolov5/runs/train/exp2/labels.jpg... 
Image sizes 640 train, 640 val
Using 6 dataloader workers
Logging results to ../git/yolov5/runs/train/exp2
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
  0%|          | 0/21 [00:00<?, ?it/s]                                          git/yolov5/utils/loss.py:208: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1664176335355/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  t = t[j]  # filter
       0/99         0G     0.1015     0.0242    0.02756         15        640: 100%|██████████| 21/21 [05:22<00:
                 Class     Images  Instances          P          R      mAP50   mAP50-95:   0%|          | 0/3 [
Traceback (most recent call last):
  File "/git/yolov5/train.py", line 630, in <module>
    main(opt)
  File "/git/yolov5/train.py", line 526, in main
    train(opt.hyp, opt, device, callbacks)
  File "/git/yolov5/train.py", line 349, in train
    results, maps, _ = validate.run(data_dict,
  File "/miniforge3/envs/cameratraps-detector/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "git/yolov5/val.py", line 254, in run
    correct = process_batch(predn, labelsn, iouv)
  File "git/yolov5/val.py", line 82, in process_batch
    iou = box_iou(labels[:, 1:], detections[:, :4])
  File "/git/yolov5/utils/metrics.py", line 286, in box_iou
    inter = (torch.min(a2, b2) - torch.max(a1, b1)).clamp(0).prod(2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, mps:0 and cpu!
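
The traceback shows `box_iou` receiving one tensor on `mps:0` and one on `cpu`. Below is a minimal sketch of one way to guard against that kind of mismatch, assuming it is acceptable to move the other operands onto the device of a chosen reference tensor. `to_common_device` is a hypothetical helper and is not part of YOLOv5. It is duck-typed, so it should work with `torch.Tensor` or any object exposing `.device` and `.to()`.

```python
def to_common_device(reference, *tensors):
    """Return `tensors` as a tuple, each placed on `reference.device`.

    Hypothetical helper (not YOLOv5 code). Objects already on the target
    device are returned unchanged; the others are moved with `.to()`.
    """
    target = reference.device
    return tuple(t if t.device == target else t.to(target) for t in tensors)


# Illustrative use before an IoU computation (names taken from the traceback):
#   (labels,) = to_common_device(detections, labels)
#   iou = box_iou(labels[:, 1:], detections[:, :4])
```

This only illustrates the general idea of aligning devices before an element-wise operation. It is not the change made in PR #9620, and it does not say which of the two tensors ended up on the CPU in this run.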

Environment

YOLOv5 v6.2-175-g1460e57 Python-3.8.13 torch-1.13.0.dev20220926 MPS
MacOS 12.6
M1 Pro

Minimal Reproducible Example

PYTORCH_ENABLE_MPS_FALLBACK=1 python ~/git/yolov5/train.py --img 640 --batch 6 --save-period 3 --epochs 100 --data dataset.yaml --weights md_v5a.0.0_rebuild_pt-1.12_zerolr.pt --device mps

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@ehallein ehallein added the bug Something isn't working label Sep 27, 2022
@github-actions
Contributor

github-actions bot commented Sep 27, 2022

👋 Hello @ehallein, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email [email protected].

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher added a commit that referenced this issue Sep 27, 2022
May resolve #9613

Signed-off-by: Glenn Jocher <[email protected]>
glenn-jocher added a commit that referenced this issue Sep 27, 2022
* NMS MPS device wrapper

May resolve #9613

Signed-off-by: Glenn Jocher <[email protected]>

* Update general.py

Signed-off-by: Glenn Jocher <[email protected]>

Signed-off-by: Glenn Jocher <[email protected]>
@glenn-jocher
Member

glenn-jocher commented Sep 27, 2022

@ehallein good news 😃! Your original issue may now be fixed ✅ in PR #9620. To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View updated notebooks
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@Zengyf-CVer
Contributor

@glenn-jocher What is MPS?

@glenn-jocher
Member

@Zengyf-CVer --device MPS is for M1 and M2 Apple computers.

On M2 Macbook for example inference runs in about 10 ms, but torch does not support all ops for MPS yet.
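
Since MPS is only present on Apple Silicon builds of PyTorch, a script may want to check for it before passing `--device mps`. A rough sketch follows, assuming PyTorch 1.12 or later, where `torch.backends.mps.is_available()` exists. `pick_device` is a hypothetical name, and `torch` is imported lazily so the function falls back to `"cpu"` if PyTorch is not installed.

```python
def pick_device(allow_mps=True):
    """Return 'mps', 'cuda' or 'cpu' depending on what this machine supports.

    Hypothetical helper. Falls back to 'cpu' when torch is not installed.
    """
    try:
        import torch  # imported lazily so the sketch loads without torch
    except ImportError:
        return "cpu"

    # torch.backends.mps was added in PyTorch 1.12; guard for older builds.
    mps = getattr(torch.backends, "mps", None)
    if allow_mps and mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

Availability of the backend does not mean every operator is supported on it, which is why the commands in this thread also set `PYTORCH_ENABLE_MPS_FALLBACK=1`.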

@ehallein
Author

The update has fixed the bug, in that it allows training to continue. However, training does not appear to work correctly: trained models do not detect anything (training with --device cpu works fine).
mAP values are zero during training as well:
[Screenshot: Screen Shot 2022-09-28 at 7 02 02 pm]

Should training work with mps at all?

@glenn-jocher
Member

glenn-jocher commented Sep 28, 2022

@ehallein we have not validated MPS training yet, but I suspect it will not work because detection inference does not work correctly on MPS (classification does).

Since MPS classification works the issue may be in the Detect() module. I have not had time to debug, but before you start training I would try to debug detection. Please submit a PR for any fixes you discover.

PYTORCH_ENABLE_MPS_FALLBACK=1 python classify/predict.py --device mps  # correct results
PYTORCH_ENABLE_MPS_FALLBACK=1 python detect.py --device mps  # bad results
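
To help narrow down where detection goes wrong, the same script can be run once on the CPU and once on MPS and the saved results compared by hand. The sketch below assumes it is launched from the yolov5/ directory and that the script accepts `--device`, as in the commands above. The helper names are hypothetical. The fallback variable is set in the child process environment, on the assumption that it needs to be in place before `torch` is imported.

```python
import os
import subprocess
import sys


def mps_fallback_env(base=None):
    """Copy an environment mapping and enable the MPS CPU fallback in it."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    return env


def build_command(script, device):
    """Build the argv list for running `script` on the given device."""
    return [sys.executable, script, "--device", device]


def run_on_devices(script, devices=("cpu", "mps")):
    """Run `script` once per device and return {device: exit code}."""
    env = mps_fallback_env()
    return {d: subprocess.run(build_command(script, d), env=env).returncode
            for d in devices}


if __name__ == "__main__":
    # Compare the outputs written by each run manually afterwards.
    print(run_on_devices("detect.py"))
```

The exit codes only show whether each run crashed. Whether the detections are correct still has to be judged from the output images or labels.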


3 participants