
Distilling doesn't work as expected. #27

Open
youjin-c opened this issue Oct 20, 2022 · 12 comments

@youjin-c

Hello,

Since my last question, #24, I have tried 512x512 resolution training for both the teacher and student models.
I found that the teacher model at 512x512 works fine, but the student training is not working.
I wonder if I could get some hints as to why.

Tfake image:
[image]
Sfake image (epoch 274/1000):
[image]

Training options:

!python train.py --dataroot database/face2smile \
  --model cycle_gan \
  --log_dir logs/cycle_gan/face2smile/teacher_512 \
  --netG inception_9blocks \
  --real_stat_A_path real_stat_512/face2smile_A.npz \
  --real_stat_B_path real_stat_512/face2smile_B.npz \
  --batch_size 4 \
  --num_threads 32 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --save_latest_freq 10000 --save_epoch_freq 5 \
  --epoch_base 176 --iter_base 223395 \
  --nepochs 324 --nepochs_decay 500 \
  --preprocess scale_width --load_size 512

!python distill.py --dataroot database/face2smile \
  --dataset_mode unaligned \
  --distiller inception \
  --gan_mode lsgan \
  --log_dir logs/cycle_gan/face2smile/student_512 \
  --restore_teacher_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
  --restore_pretrained_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
  --restore_D_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_D_A.pth \
  --real_stat_path real_stat_512/face2smile_B.npz \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
  --ndf 64 \
  --num_threads 32 \
  --eval_batch_size 4 \
  --batch_size 32 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --lambda_distill 1.0 \
  --lambda_recon 5 \
  --prune_cin_lb 16 \
  --target_flops 2.6e9 \
  --distill_G_loss_type ka \
  --preprocess scale_width --load_size 512 \
  --save_epoch_freq 2 --save_latest_freq 1000 \
  --nepochs 500 --nepochs_decay 500 \
  --norm_student batch \
  --padding_type_student zero \
  --norm_affine_student \
  --norm_track_running_stats_student
@alanspike
Contributor

Maybe you could try increasing the reconstruction loss weight for student training. For example, increase --lambda_recon to 100 and give it a try.

@youjin-c
Author

@alanspike Thanks. Let me give it a try!

@youjin-c
Author

Hi @alanspike,
I tried the options below, but sadly the student Sfake looks like the original input image rather than a translated result.

  --lambda_distill 1.0 \
  --lambda_recon 100 \

I also tried the values below, taken from the Jupyter notebook tutorial:

  --lambda_distill 2.8 \
  --lambda_recon 1000 \

[image]
This is the Sfake at epoch 526/1000.

[image]
This is the input image.

@alanspike
Contributor

Thanks for sharing the results. It's a bit strange, since lambda_recon weights the reconstruction loss between the student model and the teacher model (as shown here), so the student's generated images should become more similar to the teacher's. Just to confirm, the output from the teacher model is normal, right? Also, did you observe a larger reconstruction loss after increasing lambda_recon?
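For reference, here is a minimal sketch (not this repo's exact code; the tensor names and helper are placeholders) of how lambda_recon and lambda_distill typically enter the student generator objective, which is why a larger lambda_recon should pull the student output toward the teacher output:

import torch.nn.functional as F

def student_generator_loss(student_fake, teacher_fake, gan_loss, distill_loss,
                           lambda_recon=100.0, lambda_distill=1.0):
    # Reconstruction term: penalizes the gap between student and teacher outputs.
    recon_loss = F.l1_loss(student_fake, teacher_fake)
    # A larger lambda_recon makes matching the teacher dominate the objective.
    return gan_loss + lambda_recon * recon_loss + lambda_distill * distill_loss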

@youjin-c
Author

Hello @alanspike, thanks for checking in.

Yes, the teacher model works fine. This is the Tfake image from the same epoch:
[image]

Below are the logs with different options. I noticed that G_recon increased while the D losses decreased dramatically, almost to zero. Do you think increasing the learning rate would help?
--lambda_distill 1.0 --lambda_recon 5
(epoch: 274, iters: 43200, time: 0.964) G_gan: 0.928 G_distill: -15.980 G_recon: 0.393 D_fake: 0.014 D_real: 0.010
--lambda_distill 1.0 --lambda_recon 100
(epoch: 342, iters: 54000, time: 1.009) G_gan: 0.996 G_distill: -15.916 G_recon: 5.616 D_fake: 0.004 D_real: 0.005
--lambda_distill 2.8 --lambda_recon 1000
(epoch: 527, iters: 83200, time: 0.966) G_gan: 0.991 G_distill: -44.196 G_recon: 53.165 D_fake: 0.001 D_real: 0.001
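(For context, with --gan_mode lsgan the D_fake / D_real numbers above are roughly mean-squared errors like the sketch below, a hypothetical helper rather than the repo's exact implementation, so values near zero mean the discriminator separates real images from student outputs almost perfectly.)

import torch
import torch.nn.functional as F

def lsgan_discriminator_losses(D, real, fake):
    pred_real = D(real)
    pred_fake = D(fake.detach())
    # Reported as D_real: distance of real predictions from the "real" label 1.
    loss_real = F.mse_loss(pred_real, torch.ones_like(pred_real))
    # Reported as D_fake: distance of fake predictions from the "fake" label 0.
    loss_fake = F.mse_loss(pred_fake, torch.zeros_like(pred_fake))
    return loss_real, loss_fake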

@alanspike
Contributor

Could you try setting the weight of the adversarial loss to zero and see whether the reconstruction loss decreases?

@youjin-c
Author

youjin-c commented Nov 1, 2022

@alanspike
By the weight of the adversarial loss, do you mean --lambda_B (since I am training A to B) or --lambda_identity?

@alanspike
Contributor

Could you maybe set this loss to zero and comment out the training of the discriminator here? I'm not sure about the cause, so I wonder whether we could remove the discriminator (no adversarial training) and use only the reconstruction and distillation losses, to see whether G_recon decreases during training. If G_recon can decrease to a reasonable value, then we should be able to get the smiling face.
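If it helps, here is a rough sketch of the ablation I mean (hypothetical variable names, not the exact lines in distill.py): keep only the reconstruction and distillation terms and skip the discriminator update entirely.

def student_step_without_adversarial(gan_loss, recon_loss, distill_loss,
                                     optimizer_G, lambda_recon, lambda_distill,
                                     lambda_gan=0.0):
    # With lambda_gan = 0 the adversarial term contributes nothing; only the
    # reconstruction and distillation losses drive the student generator.
    loss_G = (lambda_gan * gan_loss
              + lambda_recon * recon_loss
              + lambda_distill * distill_loss)
    optimizer_G.zero_grad()
    loss_G.backward()
    optimizer_G.step()
    # The discriminator update (loss_D.backward(); optimizer_D.step()) is
    # simply skipped for this experiment.
    return loss_G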

@youjin-c
Author

youjin-c commented Nov 4, 2022

Thank you @alanspike!
I am running the distiller as you suggested.

While waiting for the result, I'd like to share some images I collected during the last distillation run.

  --lambda_distill 1.1 \
  --lambda_recon 10 \
(epoch: 338, iters: 53400, time: 0.941) G_gan: 0.992 G_distill: -17.534 G_recon: 0.668 D_fake: 0.006 D_real: 0.004 
End of epoch 338 / 1000 	 Time Taken: 167.23 sec
###(Evaluate epoch: 338, iters: 53405, time: 63.128) fid: 73.587 fid-mean: 73.792 fid-best: 72.101 
Saving the model at the end of epoch 338, iters 53405
learning rate = 0.0002000

[images]

I could see these patterns in every distillation run, regardless of the options.

@youjin-c
Author

youjin-c commented Nov 7, 2022

@alanspike Hello,
I ran the distiller with --lambda_gan set to zero and the discriminator training commented out (lines 182-185), with learning rate 0.0002000, decay after epoch 500, running until epoch 1000.

I could see that G_recon decreased, but it oscillated and fell very slowly. Even after 1000 epochs, it only reached about 0.437 at its minimum.
I guess the learning rate is too small, or I should run more epochs to see G_recon decrease further, or both.
I'd like to hear your opinion on this; below are some photos from the latest epoch.

This is the log file during distillation:
[images]

@alanspike
Contributor

Maybe the obtained student network is too small when using the default target FLOPs at the larger resolution. Could you try compressing with a larger FLOPs target?
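As a back-of-the-envelope check (my own assumption, not an official number): for a fully convolutional generator, FLOPs grow roughly linearly with the number of pixels, so a budget chosen for 256x256 inputs is about 4x too tight at 512x512.

# Rough scaling of --target_flops with resolution (an assumption, not an
# official recommendation): conv FLOPs grow roughly with H * W.
base_target_flops = 2.6e9            # value used in the posted distill command
scale = (512 * 512) / (256 * 256)    # 4x more pixels at 512x512
print(f"suggested target: ~{base_target_flops * scale:.2e} FLOPs")  # ~1.04e+10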

@youjin-c
Author

@alanspike Sure, thanks for your opinion. Let me run with a larger FLOPs target and update the results here.
