some bugs when training #1547
Comments
@wuzuiyuzui it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and reinstall the requirements.
Requirements
Python 3.8 or later with all requirements.txt dependencies installed. To install run:
$ pip install -r requirements.txt
Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.
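As a quick way to verify the points above, here is a minimal sketch (not part of the YOLOv5 repo, only standard PyTorch/Python calls) that prints the interpreter and torch versions and confirms the GPU is visible:

import sys
import torch

print("Python        :", sys.version.split()[0])          # YOLOv5 expects 3.8+
print("torch         :", torch.__version__)               # e.g. 1.7.0+cu101 as reported below
print("CUDA (torch)  :", torch.version.cuda)              # CUDA version the wheel was built with
print("cuDNN (torch) :", torch.backends.cudnn.version())  # cuDNN bundled with the wheel
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU           :", torch.cuda.get_device_name(0))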
@louis-she oh interesting. Does this have performance drawbacks, i.e. slower dataloading?
@glenn-jocher A possible reason for the bug could be the non-fixed size of the labels yielded from the dataloader, which is how the collate_fn works. I haven't tested the performance drawbacks, but I think there should not be much difference. Anyway, I have opened a PR; it just leaves the option at its default (which is False).
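For readers following along: each image's label tensor has one row per object, so its first dimension varies from image to image and the batch cannot be stacked into a single fixed-size tensor. A simplified sketch along the lines of YOLOv5's dataset collate_fn (the real one also returns image paths and shapes) illustrates this:

import torch

def collate_fn(batch):
    # batch is a list of (img, labels) pairs; labels has shape (num_objects, 6),
    # and num_objects differs per image, so the labels cannot be stacked.
    imgs, labels = zip(*batch)
    for i, l in enumerate(labels):
        l[:, 0] = i  # record the image index within the batch in column 0
    # images are fixed-size -> stack; labels are variable-size -> concatenate
    return torch.stack(imgs, 0), torch.cat(labels, 0)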
@wuzuiyuzui please try PR #1555 and confirm that this fixes the problem for you.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello, I ran into a difficult problem when using yolov5, and reinstalling the system did not help. I am very confused about what happened; this problem has been bothering me for several days. I have closed my previous issue and am reporting the detailed bug here. Can you give me some help? I can run test and detect, but I cannot train.
🐛 Bug
I have some problems when training: the run crashes with a cuDNN error during the backward pass (full output below).
To Reproduce (REQUIRED)
Input:
Output:
Using torch 1.7.0+cu101 CUDA:0 (GeForce RTX 2080 Ti, 10997MB)
Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp10', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=7
0 -1 1 3520 models.common.Focus [3, 32, 3]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 19904 models.common.BottleneckCSP [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 1 161152 models.common.BottleneckCSP [128, 128, 3]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 1 641792 models.common.BottleneckCSP [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 656896 models.common.SPP [512, 512, [5, 9, 13]]
9 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 378624 models.common.BottleneckCSP [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 95104 models.common.BottleneckCSP [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 313088 models.common.BottleneckCSP [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
24 [17, 20, 23] 1 32364 models.yolo.Detect [7, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7271276 parameters, 7271276 gradients
Transferred 364/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning 'coco128/labels/train2017.cache' for images and labels... 3219 found, 0 missing, 20 empty, 0 corrupted: 100%|██████████| 3219/3219 [00:00<?, ?it/s]
Scanning 'coco128/labels/val.cache' for images and labels... 246 found, 2 missing, 0 empty, 0 corrupted: 100%|██████████| 248/248 [00:00<?, ?it/s]
Analyzing anchors... anchors/target = 4.42, Best Possible Recall (BPR) = 0.9894
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp10
Starting training for 300 epochs...
0%| | 0/202 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ljy/yolov5-master/train.py", line 492, in
train(hyp, opt, device, tb_writer, wandb)
File "/home/ljy/yolov5-master/train.py", line 293, in train
scaler.scale(loss).backward()
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/autograd/init.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 256, 20, 20], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f8ba4002b60
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
output: TensorDescriptor 0x7f8ba40033a0
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
weight: FilterDescriptor 0x7f8ba403e080
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 256, 256, 3, 3,
Pointer addresses:
input: 0x7f8a73b60000
output: 0x7f8c792e0000
weight: 0x7f8d5b660000
Process finished with exit code 1
Environment
- cudnn 7.6.4
- nvidia-driver 440.95
- torch 1.7.0+cu101
- cuda 10.1
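One hedged diagnostic (an assumption, not an official fix): pip wheels of PyTorch bundle their own cuDNN, so the system cudnn 7.6.4 may not be the library actually in use. Printing the bundled versions and re-running the conv repro from the error message with cuDNN autotuning disabled can help narrow things down:

import torch

# versions actually used by the installed wheel (may differ from the system install)
print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())

# same shapes as the failing layer, but with cuDNN benchmark/autotuning turned off
torch.backends.cudnn.benchmark = False
data = torch.randn(16, 256, 20, 20, dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
print('FP16 conv forward/backward completed without a cuDNN error')

If the error still appears even in this standalone script, the problem is probably in the environment rather than in YOLOv5's training code, which matches the advice given above.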