This is a submission (under development) for the ICLR 2018 Reproducibility Challenge. The central idea of the original work is to incorporate adversarial training into the semantic segmentation task, which lets the segmentation network learn in a semi-supervised fashion on top of traditional supervised learning. The authors claim a significant improvement in the performance of the segmentation network (measured as mean IoU) when supervised training is extended with adversarial and semi-supervised training.
My plan is to reproduce the improvement in the performance of the segmentation network (ResNet-101) obtained by adding the adversarial and semi-supervised learning schemes to the baseline supervised training, and to document my experience along the way. The authors use two datasets, PASCAL VOC 12 (extended version) and Cityscapes, to demonstrate the benefits of their proposed training scheme. I will focus on the PASCAL VOC 12 dataset for this work. Specifically, the target is to reproduce the following table from the paper.
Method | Data Amount: 1/2 | Data Amount: Full |
---|---|---|
Baseline (ResNet-101) | 69.8 | 73.6 |
Baseline + Adversarial Training | 72.6 | 74.9 |
Baseline + Adversarial Training + Semi-supervised Learning | 73.2 | NA |
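
All scores in this README are mean IoU values (in percent). For reference, here is a minimal pure-Python sketch of how mean IoU can be computed from flattened label maps. It is illustrative only and is not the evaluation code used for the numbers reported here. The function names, the ignore label 255 (the usual PASCAL VOC void label), and the choice to skip classes that never occur are all assumptions.

```python
def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count (ground truth, prediction) pairs; rows are ground truth, columns are predictions.

    `pred` and `target` are flat sequences of integer class ids.
    Pixels whose ground truth equals `ignore_index` are skipped (assumed void label).
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average the per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # ground truth c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, ground truth otherwise
        denom = tp + fp + fn
        if denom > 0:                               # assumption: skip classes absent everywhere
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes and one void pixel.
target = [0, 0, 1, 1, 255]
pred = [0, 1, 1, 1, 0]
cm = confusion_matrix(pred, target, num_classes=2)
score = mean_iou(cm)  # class 0: 1/2, class 1: 2/3
```

Multiplying `score` by 100 gives a number on the same scale as the tables in this README.
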
The following table summarizes the results I have been able to reproduce for the full dataset. Since all of the training data is labeled in this setting, only adversarial training on top of the baseline can be evaluated.
Method (Full Dataset) | Original | Challenge |
---|---|---|
Baseline (ResNet-101) | 73.6 | 68.86 |
Baseline + Adversarial Training | 74.9 | 69.93 |
Baseline + Adversarial Training + Semi-supervised Learning | NA | NA |
The following table summarizes the results I was able to reproduce for semi-supervised training. In my runs, semi-supervised training has a negative impact on performance, even compared to the baseline. This behavior is likely related to one of the comments made by a reviewer: during the early epochs of training, the discriminator is probably making noisy predictions, which might make the training unstable. I'll update the scores after rerunning with semi-supervised training enabled only once the discriminator has been trained for 5 epochs.
Method (1/2 Dataset) | Original | Challenge |
---|---|---|
Baseline (ResNet-101) | 69.8 | 68.05 |
Baseline + Adversarial Training | 72.6 | 70.31 |
Baseline + Adversarial Training + Semi-supervised Learning | 73.2 | 66.75 |
As the above two tables show, adversarial training improves the baseline model in both settings. For the full dataset, the improvement (68.86 to 69.93, +1.07 mIoU) is of a similar order to the original work (73.6 to 74.9, +1.3 mIoU). For the 1/2 split, the improvement from adversarial training is again significant (68.05 to 70.31, +2.26 mIoU, against +2.8 mIoU in the original work).
However, I was not able to obtain any improvement from semi-supervised training (update pending).
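
The planned fix, delaying the semi-supervised term until the discriminator has had time to train, can be sketched as a simple gate on the loss weight. This is a hedged illustration in plain Python, not the code in this repository. The 5000-iteration warm-up follows the authors' "5k iterations" comment, while `lambda_adv`, `lambda_semi`, the confidence threshold, and all function names are hypothetical placeholders chosen for the example.

```python
def semi_loss_weight(iteration, semi_start_iter=5000, lambda_semi=0.1):
    """Return 0 during warm-up so noisy early discriminator outputs are ignored."""
    return lambda_semi if iteration >= semi_start_iter else 0.0


def confident_pixels(confidence_map, threshold=0.2):
    """Mark the pixels whose discriminator confidence exceeds the threshold.

    Only these pixels would contribute to the semi-supervised loss on
    unlabeled images (threshold value is an assumption for illustration).
    """
    return [[1 if c > threshold else 0 for c in row] for row in confidence_map]


def total_generator_loss(ce_loss, adv_loss, semi_loss, iteration, lambda_adv=0.01):
    """Combine supervised, adversarial and (gated) semi-supervised terms."""
    return ce_loss + lambda_adv * adv_loss + semi_loss_weight(iteration) * semi_loss


# Before the warm-up ends the semi-supervised term is switched off entirely.
early = total_generator_loss(1.0, 2.0, 3.0, iteration=100)
late = total_generator_loss(1.0, 2.0, 3.0, iteration=6000)
mask = confident_pixels([[0.1, 0.9], [0.2, 0.3]])
```

If the warm-up is counted in epochs instead of iterations, the same gate applies with `iteration` replaced by an epoch counter and `semi_start_iter` set to the epoch threshold.
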
- 8th Dec 2017: Semi-supervised learning with 1/2 of the training data treated as unlabeled degrades performance compared to both the baseline (68.05 mIoU) and baseline + adversarial training (70.31 mIoU). This might be related to one of the reviewer's comments: initial predictions by the discriminator might be noisy, which renders semi-supervised training unstable during early epochs. The authors have commented that semi-supervised training is only applied after 5k iterations. I'll include the results with this addition soon.
- 4th Dec 2017: Started working on semi-supervised training.
- 2nd Dec 2017: Adversarial training based on base105 improves mIoU from 68.86 to 69.93.
- 30th Nov 2017: Managed to improve adversarial training performance. For base105, mIoU improved from 68.86 to 69.33.
- 28th Nov 2017: Started experiments with an ImageNet-pretrained ResNet-101 segmentation network as the baseline. The best mIoU achieved is 65.97, so I am moving forward to unsupervised training with base104 (best baseline model) and base105 (baseline with the best adversarial training results).
- 27th Nov 2017: Finally managed to stabilize the GAN training. I couldn't reproduce any significant improvement over the baseline segmentation network. In fact, the best performing segmentation network (base104 with mIoU 69.78) was worse off with adversarial training (mIoU dropped to 68.07). I have documented the details of the experiments performed for adversarial training. As GAN training is considered to be very sensitive to weight initialization, I feel this is the right time to incorporate an ImageNet-pretrained network into the training.
- 20th Nov 2017: Started working on adding adversarial learning to the base-104 segmentation network.
- 17th Nov 2017: Baseline model (base-104) achieved a mean IoU of 69.78 on the full dataset. The model is still significantly short of the target mIoU of 73.6. The only significant component missing from the implementation is a ResNet-101 pre-trained on ImageNet (I am currently using an MS-COCO pretrained network as the baseline). Other minor additions (normalization of the input (included in base-105), number of iterations to wait before lr decay, etc.) will also be included.
Name | Details | mIoU |
---|---|---|
base-101 | No normalization; no gradient for batch norm; drop last batch if not complete; Volatile = false for eval; poly decay every 10 iterations; learnable upsampling with transposed convolution | 35.91 |
base102 | Exactly like base-101, except: no polynomial decay; fixed bilinear upsampling layers | 68.84 |
base103 | Exactly like base-102, except: with polynomial decay (every 10 iterations) | 68.88 |
base104 | Exactly like base-103, except: with poly decay (every iteration) | 69.78 |
base105 | Exactly like base-104, except: input normalized to 0 mean and unit variance | 68.86 |
base110 | ImageNet pretrained; normalization; poly decay (every iteration); same lr for all layers | 65.97 |
base111 | ImageNet pretrained; normalization; poly decay (every iteration); 10x lr for classification module | 65.67 |
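
The difference between the base103 and base104 schedules ("poly decay every 10 iterations" versus "every iteration") can be sketched as follows. This is a minimal illustration assuming the common polynomial rule `lr = base_lr * (1 - iter / max_iter) ** power`. The exponent 0.9, the value of `max_iter`, and the function names are assumptions, not values taken from this repository.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomially decayed learning rate, updated at every iteration."""
    iteration = min(iteration, max_iter)  # clamp so the base never goes negative
    return base_lr * (1.0 - iteration / max_iter) ** power


def poly_lr_stepped(base_lr, iteration, max_iter, step=10, power=0.9):
    """Same rule, but the rate is only refreshed every `step` iterations."""
    held_iteration = (iteration // step) * step
    return poly_lr(base_lr, held_iteration, max_iter, power)


# Hypothetical settings for illustration only.
base_lr, max_iter = 0.00025, 20000
every_iter = [poly_lr(base_lr, i, max_iter) for i in (0, 17, 10000, 20000)]
every_10 = [poly_lr_stepped(base_lr, i, max_iter) for i in (0, 17, 10000, 20000)]
```

With the stepped variant, iteration 17 still uses the rate computed at iteration 10, so the two schedules differ only slightly at any given point in training.
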

Name | Details | mIoU |
---|---|---|
adv101 | base105 as G; Optim(D): SGD, lr 0.0001, momentum 0.5, decay 0.0001 | 68.96 |
adv102 | base105; 0.25 label smoothing for real labels in D; Optim(D): SGD, lr 0.0001, momentum 0.5, decay 0.0001 | 67.14 |
adv103 | base105; 0.25 label smoothing for real labels in D; Optim(D): Adam | 68.07 |
adv104 | base104; 0.25 label smoothing for real labels in D; Optim(D): SGD, lr 0.0001, momentum 0.5, decay 0.0001 | 63.37 |
adv105 | base104 as G; everything else like adv103 | Very poor (didn't finish training) |
adv105-cuda | base105; 0.25 label smoothing for real labels in D; Optim(D): SGD, lr 0.0001, momentum 0.5, decay 0.0001; batch size 21 | Very poor (didn't finish training) |
adv106 | base104; Optim(D): Adam; batch size 21 | 61.50 |
adv201 | base105; label smoothing 0.25; Adam | 69.33 |
adv202 | base105; label smoothing 0.1; Optim(D): Adam | 69.93 |
adv203 | base105; label smoothing 0.1; Adam, d_lr = 0.0001, g_lr = 0.00025 | 69.72 |
adv204 | base105; label smoothing 0.1; Adam, d_lr = 0.00001, g_lr = 0.00025 | 69.28 |
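
Several of the runs above use "label smoothing for real labels in D". A minimal sketch of what that can look like is given below, assuming one-sided smoothing in which the target for real (ground truth) inputs becomes `1 - smoothing` while the target for generated inputs stays at 0. That reading of the setting, and the function names, are assumptions. The example uses plain Python so that it runs without a deep learning framework.

```python
import math


def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one probability and one (possibly soft) target."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def discriminator_loss(real_probs, fake_probs, smoothing=0.1):
    """Mean BCE over D's outputs, with the real target softened to 1 - smoothing."""
    real_target = 1.0 - smoothing
    losses = [bce(p, real_target) for p in real_probs]
    losses += [bce(p, 0.0) for p in fake_probs]
    return sum(losses) / len(losses)


# With smoothing 0.25, an over-confident D (0.99 on real inputs) is penalized
# more than a D that outputs the softened target 0.75.
confident = discriminator_loss([0.99], [0.01], smoothing=0.25)
calibrated = discriminator_loss([0.75], [0.01], smoothing=0.25)
```

The intent of the softened target is to discourage the discriminator from becoming over-confident on real inputs, which is one of the usual motivations for this trick in GAN training.
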