
Distilling doesn't work as expected. #27

Open
youjin-c opened this issue Oct 20, 2022 · 12 comments

@youjin-c

Hello,

Since my last question, #24, I have tried 512x512 resolution training for both the teacher and student models.
I found that the teacher model at 512x512 works fine, but the student training is not working.
I wonder if I could get some hints as to why.

Tfake image:
[image]
Sfake image (epoch 274/1000):
[image]

Training options:

!python train.py --dataroot database/face2smile \
  --model cycle_gan \
  --log_dir logs/cycle_gan/face2smile/teacher_512 \
  --netG inception_9blocks \
  --real_stat_A_path real_stat_512/face2smile_A.npz \
  --real_stat_B_path real_stat_512/face2smile_B.npz \
  --batch_size 4 \
  --num_threads 32 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --save_latest_freq 10000 --save_epoch_freq 5 \
  --epoch_base 176 --iter_base 223395 \
  --nepochs 324 --nepochs_decay 500 \
  --preprocess scale_width --load_size 512

!python distill.py --dataroot database/face2smile \
  --dataset_mode unaligned \
  --distiller inception \
  --gan_mode lsgan \
  --log_dir logs/cycle_gan/face2smile/student_512 \
  --restore_teacher_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
  --restore_pretrained_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
  --restore_D_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_D_A.pth \
  --real_stat_path real_stat_512/face2smile_B.npz \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
  --ndf 64 \
  --num_threads 32 \
  --eval_batch_size 4 \
  --batch_size 32 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --lambda_distill 1.0 \
  --lambda_recon 5 \
  --prune_cin_lb 16 \
  --target_flops 2.6e9 \
  --distill_G_loss_type ka \
  --preprocess scale_width --load_size 512 \
  --save_epoch_freq 2 --save_latest_freq 1000 \
  --nepochs 500 --nepochs_decay 500 \
  --norm_student batch \
  --padding_type_student zero \
  --norm_affine_student \
  --norm_track_running_stats_student
@alanspike
Contributor

Maybe you could try increasing the reconstruction loss weight for student training. For example, increase --lambda_recon to 100 and give it a try.

@youjin-c
Author

@alanspike Thanks. Let me give it a try!

@youjin-c
Author

Hi @alanspike,
I tried the options below, but sadly the student Sfake looks like the original input image rather than a translated result.

  --lambda_distill 1.0 \
  --lambda_recon 100 \

I also tried the values below, taken from the Jupyter notebook tutorial:

  --lambda_distill 2.8 \
  --lambda_recon 1000 \

[image]
This is the Sfake at epoch 526/1000.

[image]
This is the input image.

@alanspike
Contributor

Thanks for sharing the results. It's a bit strange, since lambda_recon weights the reconstruction loss between the student model and the teacher model (as shown here), so the student's generated images should become more similar to the teacher's. Just to confirm, the output from the teacher model is normal, right? Also, did you observe a larger reconstruction loss after increasing lambda_recon?
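For reference, here is a minimal sketch (not this repo's exact code; the tensor names and helper are placeholders) of how lambda_recon and lambda_distill typically enter the student generator objective, which is why a larger lambda_recon should pull the student output toward the teacher output:

import torch.nn.functional as F

def student_generator_loss(student_fake, teacher_fake, gan_loss, distill_loss,
                           lambda_recon=100.0, lambda_distill=1.0):
    # Reconstruction term: penalizes the gap between student and teacher outputs.
    recon_loss = F.l1_loss(student_fake, teacher_fake)
    # A larger lambda_recon makes matching the teacher dominate the objective.
    return gan_loss + lambda_recon * recon_loss + lambda_distill * distill_loss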

@youjin-c
Author

Hello @alanspike, thanks for checking in.

Yes, the teacher model works fine. This is the Tfake image from the same epoch:
[image]

Below are the logs with different options. I noticed that G_recon increased while the D losses decreased dramatically, almost to zero. Do you think increasing the learning rate would help?
--lambda_distill 1.0 --lambda_recon 5
(epoch: 274, iters: 43200, time: 0.964) G_gan: 0.928 G_distill: -15.980 G_recon: 0.393 D_fake: 0.014 D_real: 0.010
--lambda_distill 1.0 --lambda_recon 100
(epoch: 342, iters: 54000, time: 1.009) G_gan: 0.996 G_distill: -15.916 G_recon: 5.616 D_fake: 0.004 D_real: 0.005
--lambda_distill 2.8 --lambda_recon 1000
(epoch: 527, iters: 83200, time: 0.966) G_gan: 0.991 G_distill: -44.196 G_recon: 53.165 D_fake: 0.001 D_real: 0.001
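(For context, with --gan_mode lsgan the D_fake / D_real numbers above are roughly mean-squared errors like the sketch below, a hypothetical helper rather than the repo's exact implementation, so values near zero mean the discriminator separates real images from student outputs almost perfectly.)

import torch
import torch.nn.functional as F

def lsgan_discriminator_losses(D, real, fake):
    pred_real = D(real)
    pred_fake = D(fake.detach())
    # Reported as D_real: distance of real predictions from the "real" label 1.
    loss_real = F.mse_loss(pred_real, torch.ones_like(pred_real))
    # Reported as D_fake: distance of fake predictions from the "fake" label 0.
    loss_fake = F.mse_loss(pred_fake, torch.zeros_like(pred_fake))
    return loss_real, loss_fake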

@alanspike
Contributor

Could you try setting the weight of the adversarial loss to zero and see whether the reconstruction loss decreases?

@youjin-c
Author

youjin-c commented Nov 1, 2022

@alanspike
By the weight of the adversarial loss, do you mean --lambda_B (since I am training A to B) or --lambda_identity?

@alanspike
Contributor

Could you maybe set this loss to zero and comment out the training of the discriminator here? I'm not sure about the cause, so I wonder whether we could remove the discriminator (no adversarial training) and use only the reconstruction and distillation losses, to see whether G_recon decreases during training. If G_recon can decrease to a reasonable value, then we should be able to get the smiling face.
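If it helps, here is a rough sketch of the ablation I mean (hypothetical variable names, not the exact lines in distill.py): keep only the reconstruction and distillation terms and skip the discriminator update entirely.

def student_step_without_adversarial(gan_loss, recon_loss, distill_loss,
                                     optimizer_G, lambda_recon, lambda_distill,
                                     lambda_gan=0.0):
    # With lambda_gan = 0 the adversarial term contributes nothing; only the
    # reconstruction and distillation losses drive the student generator.
    loss_G = (lambda_gan * gan_loss
              + lambda_recon * recon_loss
              + lambda_distill * distill_loss)
    optimizer_G.zero_grad()
    loss_G.backward()
    optimizer_G.step()
    # The discriminator update (loss_D.backward(); optimizer_D.step()) is
    # simply skipped for this experiment.
    return loss_G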

@youjin-c
Author

youjin-c commented Nov 4, 2022

Thank you @alanspike!
I am running the distiller as you suggested.

While waiting for the result, I'd like to share some images I collected during the last distillation run.

  --lambda_distill 1.1 \
  --lambda_recon 10 \
(epoch: 338, iters: 53400, time: 0.941) G_gan: 0.992 G_distill: -17.534 G_recon: 0.668 D_fake: 0.006 D_real: 0.004 
End of epoch 338 / 1000 	 Time Taken: 167.23 sec
###(Evaluate epoch: 338, iters: 53405, time: 63.128) fid: 73.587 fid-mean: 73.792 fid-best: 72.101 
Saving the model at the end of epoch 338, iters 53405
learning rate = 0.0002000

[images]

I could see these patterns in every distillation run, regardless of the options.

@youjin-c
Author

youjin-c commented Nov 7, 2022

@alanspike Hello,
I ran the distiller with --lambda_gan set to zero and the discriminator training commented out (lines 182-185), with learning rate 0.0002000, decay after epoch 500, running until epoch 1000.

I could see that G_recon decreased, but it oscillated and fell very slowly. Even after 1000 epochs, it only reached about 0.437 at its minimum.
I guess the learning rate is too small, or I should run more epochs to see G_recon decrease further, or both.
I'd like to hear your opinion on this; below are some photos from the latest epoch.

This is the log file during distillation:
[images]

@alanspike
Contributor

Maybe the obtained student network is too small when using the default target FLOPs at the larger resolution. Could you try compressing with a larger FLOPs target?
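As a back-of-the-envelope check (my own assumption, not an official number): for a fully convolutional generator, FLOPs grow roughly linearly with the number of pixels, so a budget chosen for 256x256 inputs is about 4x too tight at 512x512.

# Rough scaling of --target_flops with resolution (an assumption, not an
# official recommendation): conv FLOPs grow roughly with H * W.
base_target_flops = 2.6e9            # value used in the posted distill command
scale = (512 * 512) / (256 * 256)    # 4x more pixels at 512x512
print(f"suggested target: ~{base_target_flops * scale:.2e} FLOPs")  # ~1.04e+10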

@youjin-c
Author

@alanspike Sure, thanks for your opinion. Let me run with a larger FLOPs target and update the results here.
