The training is not end-to-end: it must be split into several phases to reach the best performance, so reviewing the paper and the code structure will help you understand the training phases better.

As the paper shows, the pipeline needs two separate models: `EdgeModel` and `InpaintingModel`. In practice, however, the work needs three training phases built on these two models, which gives the best results but can be confusing.

IMPORTANT: the three training phases I define here are called `model` in the original code (e.g. the `--model` flag in the training commands selects the phase); they should not be confused with `EdgeModel` and `InpaintingModel`.
Phase | Command | Model | Input | Output | Description |
---|---|---|---|---|---|
1st | `--model 1` | `EdgeModel` | Masked grayscale image + masked edge + mask | Full edge | Train `EdgeModel` alone |
2nd | `--model 2` | `InpaintingModel` | Masked image + full Canny edge from the original image + mask | Full image | Pre-train `InpaintingModel` alone so it learns the importance of edges |
3rd | `--model 3` | `InpaintingModel` | Masked image + full edge from the 1st-phase output + mask | Full image | Actually train `InpaintingModel` with the edges predicted in phase 1 |
- We need to prepare both an image dataset and a mask dataset.
- Mask dataset:
  - The Irregular Mask Dataset (download link) provided by Liu et al. is recommended for handling common irregular defects.
  - Block masks need no dataset; the code generates them randomly.
- Image dataset:
  - Split the whole image dataset into train/validation/test parts (the mask dataset does not need splitting):

```shell
python scripts/flist_train_split.py --path <your dataset directory> --output <output path> --train 28 --val 1 --test 1
```

This script splits every 30 images into 28 for training, 1 for validation and 1 for test. Images are assigned in name order rather than shuffled, round by round, so that the split stays evenly distributed over time (the anime-face dataset is sorted by year); adapt the script to your own dataset.
Now there should be three `.flist` files in your `<output path>`, each containing absolute image paths.
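As a sketch of what the split script above produces, the stand-alone Python below mimics the round-robin split and the `.flist` layout (one absolute path per line). `split_and_write_flists` is a hypothetical helper for illustration, not part of the repository:

```python
from itertools import cycle
from pathlib import Path

def split_and_write_flists(image_paths, out_dir, train=28, val=1, test=1):
    # Assign sorted paths round-robin: per round of 30, the first 28 go
    # to train, the next 1 to val, the last 1 to test; then write one
    # absolute path per line into <name>.flist.
    plan = ["train"] * train + ["val"] * val + ["test"] * test
    buckets = {"train": [], "val": [], "test": []}
    for path, bucket in zip(sorted(image_paths), cycle(plan)):
        buckets[bucket].append(str(Path(path).resolve()))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, paths in buckets.items():
        (out / f"{name}.flist").write_text("\n".join(paths))
    return buckets

paths = [f"dataset/img_{i:04d}.png" for i in range(60)]
parts = split_and_write_flists(paths, "flists")
print(len(parts["train"]), len(parts["val"]), len(parts["test"]))  # 56 2 2
```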
- Copy `config.yml.example` from the root directory into your model path, rename it to `config.yml`, and edit it. Here are the key parameters related to the dataset:
  - Set `MASK: 3` (recommended, as above; 4 is also feasible).
  - Set `TRAIN_FLIST`, `VAL_FLIST` and `TEST_FLIST` to the `.flist` paths produced in step 2.
  - Set `TRAIN_MASK_FLIST`, `VAL_MASK_FLIST` and `TEST_MASK_FLIST` to the mask dataset path from step 1 (all three can be the same).
Now my `config.yml` is:
```yml
MODE: 1             # 1: train, 2: test, 3: eval
MODEL: 1            # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3             # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1             # 1: canny, 2: external
NMS: 1              # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10            # random seed
DEVICE: 1           # 0: CPU, 1: GPU
GPU: [0]            # list of gpu ids
DEBUG: 1            # turns on debugging mode
VERBOSE: 0          # turns on verbose mode in the output console
SKIP_PHASE2: 1      # when training the inpaint model, the 2nd and 3rd phases (model 2 -> model 3) are needed in order, but the 2nd can be merged into the 3rd to speed up (at some cost in performance)
TRAIN_FLIST: <your path>/train.flist
VAL_FLIST: <your path>/val.flist
TEST_FLIST: <your path>/test.flist
TRAIN_EDGE_FLIST: ./
VAL_EDGE_FLIST: ./
TEST_EDGE_FLIST: ./
# three options below could be the same
TRAIN_MASK_FLIST: <your mask dataset path>
VAL_MASK_FLIST: <your mask dataset path>
TEST_MASK_FLIST: <your mask dataset path>
```
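To illustrate the flat key/value layout of this file, here is a tiny stdlib-only parser sketch. The actual project loads `config.yml` with a real YAML parser, so this is only to show the structure the file is expected to have:

```python
def parse_flat_yaml(text):
    # Minimal parser for flat "KEY: value  # comment" lines, enough for
    # the config.yml shape shown above (not a full YAML parser).
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        options[key.strip()] = value.strip()
    return options

cfg = parse_flat_yaml(
    "MODE: 1  # 1: train\n"
    "MASK: 3\n"
    "TRAIN_FLIST: <your path>/train.flist\n"
)
print(cfg["MASK"])         # prints 3
print(cfg["TRAIN_FLIST"])  # prints <your path>/train.flist
```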
- Download the weight files, which are available on my page and from edge-connect.
- I strongly recommend starting with transfer learning from these weight files; training from scratch takes about 10 days and 2 million iterations to converge (transfer learning takes roughly a tenth of that time).
- Make a model directory containing `config.yml` and the four `.pth` weight files.
- Edit the training-related options in `config.yml`:
  - Set `DEVICE: 1` (a new option) to choose between GPU (1) and CPU (0).
  - Set `GPU: [0]` to the list of GPU ids for multi-GPU training.
  - Set `INPUT_SIZE` to define the size input images are resized to.
  - Set `BATCH_SIZE` to fit your GPU RAM.
  - Edit the following options as you wish:

```yml
SAVE_INTERVAL: 1000    # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 200   # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 12        # number of images to sample
EVAL_INTERVAL: 0       # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 1000     # how many iterations to wait before logging training status (0: never)
PRINT_INTERVAL: 20     # how many iterations to wait before terminal prints training status (0: never)
```
Before training, there are two optimizations in this work you should know about:
- A skip-phase-2 mode (`SKIP_PHASE2`) can combine phases 2 and 3 to accelerate training. If it is unclear what this means, refer to the Introduction above.
- Don't worry about checkpoint housekeeping: new checkpoint files are saved in your model path, named with an iteration suffix, e.g. `InpaintingModel_dis_2074000.pth`, and the latest checkpoints (identified by name) are auto-loaded when training begins.
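The auto-load behaviour amounts to picking the file with the largest iteration suffix. The sketch below shows that selection; `latest_checkpoint` is a hypothetical helper, and the real loader may differ in detail:

```python
import re

def latest_checkpoint(filenames, prefix="InpaintingModel_dis"):
    # Pick the newest checkpoint by the iteration number embedded in its
    # name, e.g. InpaintingModel_dis_2074000.pth -> 2074000.
    pattern = re.compile(re.escape(prefix) + r"_(\d+)\.pth$")
    candidates = [(int(m.group(1)), name)
                  for name in filenames
                  for m in [pattern.search(name)] if m]
    return max(candidates)[1] if candidates else None

files = ["InpaintingModel_dis_100000.pth",
         "InpaintingModel_dis_2074000.pth",
         "InpaintingModel_gen_2074000.pth"]
print(latest_checkpoint(files))  # InpaintingModel_dis_2074000.pth
```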
- Train phase 1, which trains the `EdgeModel`:

```shell
python train.py --model 1 --path <your model dir path>
```

Check the samples from time to time and stop the training yourself.
- Train phases 2 and 3 together, which trains the `InpaintingModel` using the well-trained `EdgeModel` from step 1. IMPORTANT: `SKIP_PHASE2` must be `1` in `config.yml`!

```shell
python train.py --model 3 --path <your model dir path>
```

Check the samples from time to time and stop the training yourself. That's all!
- You can set `SKIP_PHASE2` to `0` in `config.yml` to train phase 2 (with `--model 2`) and phase 3 separately, alternating in any order, e.g. one day of phase 2, then one day of phase 3, then phase 2 again; the checkpoints are handled for you.
- You can also stop the training, change `SIGMA` in `config.yml`, and then restart. This is a handy trick.