-
Notifications
You must be signed in to change notification settings - Fork 566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
translation faq finished #33
Changes from 1 commit
6e400ea
6cb2533
63c9054
beac55c
be9ccfb
1f9fbee
eded488
f6482bb
b416a70
1690578
7f78fc4
403a507
4e74050
ea5c169
75a3fb9
2992880
4a1a7af
deb9280
4eb9f96
c370bf2
c9d3343
ee22b28
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,90 +1,95 @@ | ||
# Frequently Asked Questions | ||
# 常见问题解答 | ||
|
||
We list some common troubles faced by many users and their corresponding solutions here. Feel free to enrich the list if you find any frequent issues and have ways to help others to solve them. If the contents here do not cover your issue, please create an issue using the [provided templates](https://github.com/open-mmlab/mmdetection/blob/master/.github/ISSUE_TEMPLATE/error-report.md/) and make sure you fill in all required information in the template. | ||
我们在这里列出了使用时的一些常见问题及其相应的解决方案。 如果您发现有一些问题被遗漏,请随时提 PR 丰富这个列表。 如果您无法在此获得帮助,请使用 [issue模板](https://github.com/open-mmlab/mmdetection/blob/master/.github/ISSUE_TEMPLATE/error-report.md/ )创建问题,但是请在模板中填写所有必填信息,这有助于我们更快定位问题。 | ||
|
||
## MMCV Installation | ||
## MMCV 安装相关 | ||
|
||
- Compatibility issue between MMCV and MMDetection; "ConvWS is already registered in conv layer"; "AssertionError: MMCV==xxx is used but incompatible. Please install mmcv>=xxx, <=xxx." | ||
- MMCV 与 MMDetection 的兼容问题: "ConvWS is already registered in conv layer"; "AssertionError: MMCV==xxx is used but incompatible. Please install mmcv>=xxx, <=xxx." | ||
|
||
Please install the correct version of MMCV for the version of your MMDetection following the [installation instruction](https://mmdetection.readthedocs.io/en/latest/get_started.html#installation). | ||
请按 [安装说明](https://mmdetection.readthedocs.io/zh_CN/latest/get_started.html#installation) 为你的 MMDetection 安装正确版本的 MMCV 。 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 最后的。前不需要空格 |
||
|
||
- "No module named 'mmcv.ops'"; "No module named 'mmcv._ext'". | ||
|
||
1. Uninstall existing mmcv in the environment using `pip uninstall mmcv`. | ||
2. Install mmcv-full following the [installation instruction](https://mmcv.readthedocs.io/en/latest/#installation). | ||
原因是安装了 `mmcv` 而不是 `mmcv-full`。 | ||
|
||
## PyTorch/CUDA Environment | ||
1. `pip uninstall mmcv` 卸载安装的 `mmcv` | ||
|
||
2. 安装 `mmcv-full` 根据 [安装说明](https://mmcv.readthedocs.io/zh/latest/#installation)。 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
## PyTorch/CUDA 环境相关 | ||
|
||
- "RTX 30 series card fails when building MMCV or MMDet" | ||
|
||
1. Temporary work-around: do `MMCV_WITH_OPS=1 MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80' pip install -e .`. | ||
The common issue is `nvcc fatal : Unsupported gpu architecture 'compute_86'`. This means that the compiler should optimize for sm_86, i.e., nvidia 30 series card, but such optimizations have not been supported by CUDA toolkit 11.0. | ||
This work-around modifies the compile flag by adding `MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80'`, which tells `nvcc` to optimize for **sm_80**, i.e., Nvidia A100. Although A100 is different from the 30 series card, they use similar ampere architecture. This may hurt the performance but it works. | ||
2. PyTorch developers have updated that the default compiler flags should be fixed by [pytorch/pytorch#47585](https://github.com/pytorch/pytorch/pull/47585). So using PyTorch-nightly may also be able to solve the problem, though we have not tested it yet. | ||
1. 临时解决方案为使用命令 `MMCV_WITH_OPS=1 MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80' pip install -e .` 进行编译。 常见报错信息为 `nvcc fatal : Unsupported gpu architecture 'compute_86'` 意思是你的编译器不支持 sm_86 架构(包括英伟达 30 系列的显卡)的优化,至 CUDA toolkit 11.0 依旧未支持. 这个命令是通过增加宏 `MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80` 让 nvcc 编译器为英伟达 30 系列显卡进行 `sm_80` 的优化,虽然这有可能会无法发挥出显卡所有性能。 | ||
|
||
2. 有开发者已经在 [pytorch/pytorch#47585](https://github.com/pytorch/pytorch/pull/47585) 更新了 PyTorch 默认的编译 flag, 但是我们对此并没有进行测试。 | ||
|
||
- "invalid device function" or "no kernel image is available for execution". | ||
|
||
1. Check if your cuda runtime version (under `/usr/local/`), `nvcc --version` and `conda list cudatoolkit` version match. | ||
2. Run `python mmdet/utils/collect_env.py` to check whether PyTorch, torchvision, and MMCV are built for the correct GPU architecture. | ||
You may need to set `TORCH_CUDA_ARCH_LIST` to reinstall MMCV. | ||
The GPU arch table could be found [here](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list), | ||
i.e. run `TORCH_CUDA_ARCH_LIST=7.0 pip install mmcv-full` to build MMCV for Volta GPUs. | ||
The compatibility issue could happen when using old GPUS, e.g., Tesla K80 (3.7) on colab. | ||
3. Check whether the running environment is the same as that when mmcv/mmdet has compiled. | ||
For example, you may compile mmcv using CUDA 10.0 but run it on CUDA 9.0 environments. | ||
1. 检查您正常安装了 CUDA runtime (一般在`/usr/local/`),或者使用 `nvcc --version` 检查本地版本,有时安装 PyTorch 会顺带安装一个 CUDA runtime,并且实际优先使用 conda 环境中的版本,你可以使用 `conda list cudatoolkit` 查看其版本。 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please check this part, the translation does not match. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. From line 29 to 40. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thanks,i'll check it |
||
|
||
2. 编译 extension 的 CUDA Toolkit 版本与运行时的 CUDA Toolkit 版本是否相符, | ||
|
||
* 如果您从源码自己编译的,使用 `python mmdet/utils/collect_env.py` 检查编译编译 extension 的 CUDA Toolkit 版本,然后使用 `conda list cudatoolkit` 检查当前 conda 环境是否有 CUDA Toolkit,若有检查版本是否匹配, 如不匹配,更换 conda 环境的 CUDA Toolkit,或者使用匹配的 CUDA Toolkit 中的 nvcc 编译即可,如环境中无 CUDA Toolkit,可以使用 `nvcc -V`。 | ||
|
||
等命令查看当前使用的 CUDA runtime。 | ||
|
||
* 如果您是通过 pip 下载的预编译好的版本,请确保与当前 CUDA runtime 一致。 | ||
|
||
3. 运行 `python mmdet/utils/collect_env.py` 检查是否为正确的 GPU 架构编译的 PyTorch, torchvision, 与 MMCV。 你或许需要设置 `TORCH_CUDA_ARCH_LIST` 来重新安装 MMCV,可以参考 [GPU 架构表](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list), | ||
例如, 运行 `TORCH_CUDA_ARCH_LIST=7.0 pip install mmcv-full` 为 Volta GPU 编译 MMCV。这种架构不匹配的问题一般会出现在使用一些旧型号的 GPU 时候出现, 例如, Tesla K80。 | ||
|
||
- "undefined symbol" or "cannot open xxx.so". | ||
|
||
1. If those symbols are CUDA/C++ symbols (e.g., libcudart.so or GLIBCXX), check whether the CUDA/GCC runtimes are the same as those used for compiling mmcv, | ||
i.e. run `python mmdet/utils/collect_env.py` to see if `"MMCV Compiler"`/`"MMCV CUDA Compiler"` is the same as `"GCC"`/`"CUDA_HOME"`. | ||
2. If those symbols are PyTorch symbols (e.g., symbols containing caffe, aten, and TH), check whether the PyTorch version is the same as that used for compiling mmcv. | ||
3. Run `python mmdet/utils/collect_env.py` to check whether PyTorch, torchvision, and MMCV are built by and running on the same environment. | ||
1. 如果这些 symbol 属于 CUDA/C++ (如 libcudart.so 或者 GLIBCXX),使用 `python mmdet/utils/collect_env.py`检查 CUDA/GCC runtime 与编译 MMCV 的 CUDA 版本是否相同。 | ||
2. 如果这些 symbols 属于 PyTorch,(例如, symbols containing caffe, aten, and TH), 检查当前 Pytorch 版本是否与编译 MMCV 的版本一致。 | ||
3. 运行 `python mmdet/utils/collect_env.py` 检查 PyTorch, torchvision, MMCV 等的编译环境与运行环境一致。 | ||
|
||
- setuptools.sandbox.UnpickleableException: DistutilsSetupError("each element of 'ext_modules' option must be an Extension instance or 2-tuple") | ||
- "setuptools.sandbox.UnpickleableException: DistutilsSetupError("each element of 'ext_modules' option must be an Extension instance or 2-tuple")" | ||
ZwwWayne marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
1. If you are using miniconda rather than anaconda, check whether Cython is installed as indicated in [#3379](https://github.com/open-mmlab/mmdetection/issues/3379). | ||
You need to manually install Cython first and then run command `pip install -r requirements.txt`. | ||
2. You may also need to check the compatibility between the `setuptools`, `Cython`, and `PyTorch` in your environment. | ||
1. 如果你在使用 miniconda 而不是 anaconda,检查是否正确的安装了 Cython 如 [#3379](https://github.com/open-mmlab/mmdetection/issues/3379). | ||
2. 检查环境中的 `setuptools`, `Cython`, and `PyTorch` 相互之间版本是否匹配。 | ||
|
||
- "Segmentation fault". | ||
1. Check you GCC version and use GCC 5.4. This usually caused by the incompatibility between PyTorch and the environment (e.g., GCC < 4.9 for PyTorch). We also recommend the users to avoid using GCC 5.5 because many feedbacks report that GCC 5.5 will cause "segmentation fault" and simply changing it to GCC 5.4 could solve the problem. | ||
|
||
2. Check whether PyTorch is correctly installed and could use CUDA op, e.g. type the following command in your terminal. | ||
1. 检查 GCC 的版本,通常是因为 PyTorch 版本与 GCC 版本不匹配 (例如 GCC < 4.9 ),我们推荐用户使用 GCC 5.4,我们也不推荐使用 GCC 5.5, 因为有反馈 GCC 5.5 会导致 "segmentation fault" 并且切换到 GCC 5.4 就可以解决问题。 | ||
|
||
```shell | ||
python -c 'import torch; print(torch.cuda.is_available())' | ||
``` | ||
2. 检查是否正确安装了 CUDA 版本的 PyTorch 。 | ||
|
||
And see whether they could correctly output results. | ||
```shell | ||
python -c 'import torch; print(torch.cuda.is_available())' | ||
``` | ||
|
||
3. If Pytorch is correctly installed, check whether MMCV is correctly installed. | ||
是否返回True。 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True前没有空格,其他地方请再次检查一下。 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 谢谢杨同学 |
||
|
||
```shell | ||
python -c 'import mmcv; import mmcv.ops' | ||
``` | ||
3. 如果 `torch` 的安装是正确的,检查是否正确编译了 MMCV。 | ||
|
||
If MMCV is correctly installed, then there will be no issue of the above two commands. | ||
```shell | ||
python -c 'import mmcv; import mmcv.ops' | ||
``` | ||
|
||
4. If MMCV and Pytorch is correctly installed, you man use `ipdb`, `pdb` to set breakpoints or directly add 'print' in mmdetection code and see which part leads the segmentation fault. | ||
4. 如果 MMCV 与 PyTorch 都被正确安装了,则使用 `ipdb`, `pdb` 设置断点,直接查找哪一部分的代码导致了 `segmentation fault`。 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. , 换成, |
||
|
||
## Training | ||
## Training 相关 | ||
|
||
- "Loss goes Nan" | ||
1. Check if the dataset annotations are valid: zero-size bounding boxes will cause the regression loss to be Nan due to the commonly used transformation for box regression. Some small size (width or height are smaller than 1) boxes will also cause this problem after data augmentation (e.g., instaboost). So check the data and try to filter out those zero-size boxes and skip some risky augmentations on the small-size boxes when you face the problem. | ||
2. Reduce the learning rate: the learning rate might be too large due to some reasons, e.g., change of batch size. You can rescale them to the value that could stably train the model. | ||
3. Extend the warmup iterations: some models are sensitive to the learning rate at the start of the training. You can extend the warmup iterations, e.g., change the `warmup_iters` from 500 to 1000 or 2000. | ||
4. Add gradient clipping: some models requires gradient clipping to stabilize the training process. The default of `grad_clip` is `None`, you can add gradient clippint to avoid gradients that are too large, i.e., set `optimizer_config=dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))` in your config file. If your config does not inherits from any basic config that contains `optimizer_config=dict(grad_clip=None)`, you can simply add `optimizer_config=dict(grad_clip=dict(max_norm=35, norm_type=2))`. | ||
1. 检查数据的标注是否正常, 长或宽为 0 的框可能会导致回归 loss 变为 nan,一些小尺寸(宽度或高度小于 1)的框在数据增强(例如,instaboost)后也会导致此问题。 因此,可以检查标注并过滤掉那些特别小甚至面积为 0 的框,并关闭一些可能会导致 0 面积框出现数据增强。 | ||
2. 降低学习率:由于某些原因,例如 batch size 大小的变化, 导致当前学习率可能太大。 您可以降低为可以稳定训练模型的值。 | ||
3. 延长 warm up 的时间:一些模型在训练初始时对学习率很敏感,您可以把 `warmup_iters` 从 500 更改为 1000 或 2000。 | ||
4. 添加 gradient clipping: 一些模型需要梯度裁剪来稳定训练过程。 默认的 `grad_clip` 是 `None`, 你可以在 config 设置 `optimizer_config=dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))` 如果你的 config 没有继承任何包含 `optimizer_config=dict(grad_clip=None)`, 你可以直接设置`optimizer_config=dict(grad_clip=dict(max_norm=35, norm_type=2))`. | ||
- ’GPU out of memory" | ||
1. There are some scenarios when there are large amounts of ground truth boxes, which may cause OOM during target assignment. You can set `gpu_assign_thr=N` in the config of assigner thus the assigner will calculate box overlaps through CPU when there are more than N GT boxes. | ||
2. Set `with_cp=True` in the backbone. This uses the sublinear strategy in PyTorch to reduce GPU memory cost in the backbone. | ||
3. Try mixed precision training using following the examples in `config/fp16`. The `loss_scale` might need further tuning for different models. | ||
|
||
1. 存在大量 ground truth boxes 或者大量 anchor 的场景,可能在 assigner 会 OOM。 您可以在 assigner 的配置中设置 `gpu_assign_thr=N`,这样当超过 N 个 GT boxes 时,assigner 会通过 CPU 计算 IOU。 | ||
2. 在 backbone 中设置 `with_cp=True`。 这使用 PyTorch 中的 `sublinear strategy` 来降低 backbone 占用的 GPU 显存。 | ||
3. 使用 `config/fp16` 中的示例尝试混合精度训练。`loss_scale` 可能需要针对不同模型进行调整。 | ||
- "RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one" | ||
1. This error indicates that your module has parameters that were not used in producing loss. This phenomenon may be caused by running different branches in your code in DDP mode. | ||
2. You can set ` find_unused_parameters = True` in the config to solve the above problems or find those unused parameters manually. | ||
1. 这个错误出现在存在参数没有在 forward 中使用,容易在 DDP 中运行不同分支时发生。 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 此错误表明您的模块存在未用于损失的参数。 这种现象可能是由于你的代码在 DDP 模式下运行了不同的分支造成的。 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed |
||
2. 你可以在 config 设置 `find_unused_parameters = True` 进行训练 (会降低训练速度)。 | ||
3. 你也可以通过在 config 中的 `optimizer_config` 里设置 `detect_anomalous_params=True` 查找哪些参数没有用到,但是需要 MMCV 的版本 >= 1.4.1。 | ||
|
||
|
||
|
||
## Evaluation | ||
## Evaluation 相关 | ||
|
||
- COCO Dataset, AP or AR = -1 | ||
1. According to the definition of COCO dataset, the small and medium areas in an image are less than 1024 (32\*32), 9216 (96\*96), respectively. | ||
2. If the corresponding area has no object, the result of AP and AR will set to -1. | ||
- 使用 COCO Dataset 的测评接口时, 测评结果中 AP 或者 AR = -1 | ||
1. 根据COCO数据集的定义,一张图像中的中等物体与小物体面积的阈值分别为 9216(96\*96)与 1024(32\*32)。 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. COCO前后空格缺失 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed |
||
2. 如果在某个区间没有检测框 AP 与 AR 认定为 -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue模板 创建问题