Update get_started.md
1. translate `get_started.md` into Chinese
2. fix some typos in `en/get_started.md`
3. fix a broken link in `en/get_started.md`
RangeKing committed Feb 21, 2022
1 parent 6825e4b commit a1f14f1
Showing 2 changed files with 43 additions and 45 deletions.

docs/en/get_started.md (13 changes: 6 additions & 7 deletions)
@@ -1,8 +1,8 @@
 ## Test a model

 - single GPU
-- single node multiple GPU
-- multiple node
+- single node multiple GPUs
+- multiple nodes

 You can use the following commands to infer a dataset.

@@ -76,7 +76,7 @@ If you want to specify the working directory in the command, you can add an argument `--work_dir ${YOUR_WORK_DIR}`.

 Optional arguments are:

-- `--no-validate` (**not suggested**): By default, the codebase will perform evaluation during the training. To disable this behavior, use `--no-validate`.
+- `--no-validate` (**not suggested**): By default, the codebase will perform an evaluation during the training. To disable this behavior, use `--no-validate`.
 - `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
 - `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file.
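
As a concrete illustration of the flags above, a 4-GPU run that writes to a custom work directory and resumes from an earlier checkpoint might look like this sketch (the experiment directory and checkpoint name are placeholders):

```shell
# Sketch: override the work dir and resume an interrupted run.
# work_dirs/my_experiment and latest.pth are placeholder names.
./tools/dist_train.sh configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_le90.py 4 \
    --work-dir work_dirs/my_experiment \
    --resume-from work_dirs/my_experiment/latest.pth
```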

@@ -92,9 +92,8 @@ If you run MMRotate on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`.
 [GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}
 ```

-If you have just multiple machines connected with ethernet, you can refer to
-PyTorch [launch utility](https://pytorch.org/docs/stable/distributed_deprecated.html#launch-utility).
-Usually it is slow if you do not have high speed networking like InfiniBand.
+If you have just multiple machines connected with Ethernet, you can refer to PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
+Usually, it is slow if you do not have high-speed networking like InfiniBand.
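
For reference, the launch utility linked above is invoked once per machine. A minimal sketch for two Ethernet-connected nodes might look as follows; the master address, port, and per-node GPU count are assumptions to adapt, and `tools/train.py` is assumed to accept `--launcher pytorch`, as mmdet-style codebases usually do:

```shell
# On the first node (rank 0); 10.0.0.1 stands in for this node's reachable IP.
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --master_addr=10.0.0.1 --master_port=29500 \
    tools/train.py ${CONFIG_FILE} --launcher pytorch

# On the second node (rank 1), pointing at the same master address and port.
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=8 \
    --master_addr=10.0.0.1 --master_port=29500 \
    tools/train.py ${CONFIG_FILE} --launcher pytorch
```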

 ### Launch multiple jobs on a single machine

@@ -122,7 +121,7 @@ In `config2.py`,
 dist_params = dict(backend='nccl', port=29501)
 ```

-Then you can launch two jobs with `config1.py` ang `config2.py`.
+Then you can launch two jobs with `config1.py` and `config2.py`.

 ```shell
 CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}

docs/zh_cn/get_started.md (75 changes: 37 additions & 38 deletions)
@@ -1,54 +1,54 @@
-## Test a model
+## Model testing

-- single GPU
-- single node multiple GPU
-- multiple node
+- single-GPU testing
+- single-node multi-GPU testing
+- multi-node testing

-You can use the following commands to infer a dataset.
+You can use the following commands to run inference on a dataset.

 ```shell
-# single-gpu
+# single-GPU testing
 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

-# multi-gpu
+# multi-GPU testing
 ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]

-# multi-node in slurm environment
+# multi-node testing in a slurm environment
 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments] --launcher slurm
 ```


-Examples:
+Examples:

-Inference RotatedRetinaNet on DOTA-1.0 dataset, which can generate compressed files for online [submission](https://captain-whu.github.io/DOTA/evaluation.html). (Please change the [data_root](../../configs/_base_/datasets/dotav1.py) firstly.)
+Run inference with the RotatedRetinaNet model on the DOTA-1.0 dataset and generate compressed files for the [official evaluation](https://captain-whu.github.io/DOTA/evaluation.html). (You need to modify the [dataset config file](../../configs/_base_/datasets/dotav1.py) first.)
 ```shell
 python ./tools/test.py \
   configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_le90.py \
   checkpoints/SOME_CHECKPOINT.pth --format-only \
   --eval-options submission_dir=work_dirs/Task1_results
 ```
-or
+or
 ```shell
 ./tools/dist_test.sh \
   configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_le90.py \
   checkpoints/SOME_CHECKPOINT.pth 1 --format-only \
   --eval-options submission_dir=work_dirs/Task1_results
 ```

-You can change the test set path in the [data_root](.../configs/_base_/datasets/dotav1.py) to the val set or trainval set for the offline evaluation.
+You can change the test set directory in the [dataset config file](.../configs/_base_/datasets/dotav1.py) to the val set or trainval set directory for offline evaluation.
 ```shell
 python ./tools/test.py \
   configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_le90.py \
   checkpoints/SOME_CHECKPOINT.pth --eval mAP
 ```
-or
+or
 ```shell
 ./tools/dist_test.sh \
   configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_le90.py \
   checkpoints/SOME_CHECKPOINT.pth 1 --eval mAP
 ```

-You can also visualize the results.
+You can also use the following command to visualize the results.
 ```shell
 python ./tools/test.py \
 configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_le90.py \
@@ -58,71 +58,70 @@ python ./tools/test.py \



-## Train a model
+## Model training

-### Train with a single GPU
+### Single-GPU training

 ```shell
 python tools/train.py ${CONFIG_FILE} [optional arguments]
 ```

-If you want to specify the working directory in the command, you can add an argument `--work_dir ${YOUR_WORK_DIR}`.
+If you want to specify the working directory in the command, you can add the argument `--work_dir ${YOUR_WORK_DIR}`.

-### Train with multiple GPUs
+### Multi-GPU training

 ```shell
 ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
 ```

-Optional arguments are:
+The optional arguments [optional arguments] are as follows:

-- `--no-validate` (**not suggested**): By default, the codebase will perform evaluation during the training. To disable this behavior, use `--no-validate`.
-- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
-- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file.
+- `--no-validate` (**not suggested**): By default, evaluation is performed during training; use `--no-validate` to disable it.
+- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
+- `--resume-from ${CHECKPOINT_FILE}`: Resume training from a previous checkpoint file.

-Difference between `resume-from` and `load-from`:
-`resume-from` loads both the model weights and optimizer status, and the epoch is also inherited from the specified checkpoint. It is usually used for resuming the training process that is interrupted accidentally.
-`load-from` only loads the model weights and the training epoch starts from 0. It is usually used for finetuning.
+The difference between `resume-from` and `load-from`:
+`resume-from` loads both the model weights and the optimizer state, and also inherits the epoch count from the specified checkpoint. It is often used to resume a training process that was interrupted accidentally.
+`load-from` only loads the model weights, and training starts from epoch 0. It is often used for fine-tuning a model.

-### Train with multiple machines
+### Multi-machine multi-GPU training

-If you run MMRotate on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single machine training.)
+If you run MMRotate on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh` for training (this script also supports single-machine training).

 ```shell
 [GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}
 ```

-If you have just multiple machines connected with ethernet, you can refer to
-PyTorch [launch utility](https://pytorch.org/docs/stable/distributed_deprecated.html#launch-utility).
-Usually it is slow if you do not have high speed networking like InfiniBand.
+If you have just multiple machines connected with Ethernet, you can refer to the PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).

-### Launch multiple jobs on a single machine
+Usually, training will be relatively slow if you do not have a high-speed network like InfiniBand.

-If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
-you need to specify different ports (29500 by default) for each job to avoid communication conflict.
+### Launching multiple jobs on a single machine

-If you use `dist_train.sh` to launch training jobs, you can set the port in commands.
+If you want to launch multiple jobs on a single machine, e.g., 2 jobs each requiring 4 GPUs on a machine with 8 GPUs, you need to assign a different port (29500 by default) to each training job to avoid conflicts.

+If you use `dist_train.sh` to launch training jobs, you can set the port in the command.

 ```shell
 CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
 CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
 ```

-If you use launch training jobs with Slurm, you need to modify the config files (usually the 6th line from the bottom in config files) to set different communication ports.
+If you launch training jobs with Slurm, you need to modify the config files (usually the sixth line from the bottom of the config file) to set different communication ports.

-In `config1.py`,
+In `config1.py`, set:

 ```python
 dist_params = dict(backend='nccl', port=29500)
 ```

-In `config2.py`,
+In `config2.py`, set:

 ```python
 dist_params = dict(backend='nccl', port=29501)
 ```

-Then you can launch two jobs with `config1.py` ang `config2.py`.
+Then you can use `config1.py` and `config2.py` to launch the two jobs.

 ```shell
 CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
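
The `resume-from`/`load-from` distinction discussed above has no accompanying command; a minimal sketch might be the following (the checkpoint path is a placeholder, and `load_from` is normally set inside the config file rather than passed on the command line):

```shell
# Resume an accidentally interrupted run: restores the weights, the optimizer
# state, and the epoch counter (the checkpoint path is a placeholder).
python tools/train.py ${CONFIG_FILE} --resume-from work_dirs/my_exp/latest.pth
```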

