Loss is always nan #4

taoyu17 · 2020-10-27T15:21:58Z

Hello PeteryuX,

Thanks a lot for sharing your implementation of ESRGAN.

I have been testing some GAN-based super-resolution networks recently. I have a large set of HR/LR training images and would like to train the ESRGAN (PSNR + ESRGAN) network using your training code.

I followed your data preparation instructions and converted my 1,825,587 pairs of LR/HR samples to *bin.tfrecord. Checking them with dataset_checker showed no problems: the LR/HR images displayed well. I modified a few lines of your code for the hardcoded paths etc. and started PSNR training on an RTX 3090 GPU. However, the printed "loss" is "nan" at every iteration, and even after PSNR training "successfully" finished, loss_D and loss_G in ESRGAN training are also shown as "nan".

In PSNR training:
...
Training [>> ] 20004/600000, loss=nan, lr=2.0e-04 2.0 step/sec
...

In ESRGAN training:
...
Training [>>> ] 40000/285240, loss_G=nan, loss_D=nan, lr_G=1.0e-04, lr_D=1.0e-04 1.4 step/sec
[*] save ckpt file at ./checkpoints/esrgan/ckpt-32
Training [>>>> ] 47877/285240, loss_G=nan, loss_D=nan, lr_G=1.0e-04, lr_D=1.0e-04 1.4 step/sec
...

Do you have any suggestions on this issue?
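
One possibility worth ruling out first is a non-finite value already present in the decoded samples, because a single NaN or inf in either image is enough to turn an L1 mean into nan. The sketch below only illustrates that propagation and a simple scan for the first bad value. It uses plain Python lists with made-up pixel values, and the helper names `first_non_finite` and `l1_loss` are hypothetical, not functions from this repository. On real tensors, TensorFlow's `tf.debugging.check_numerics` plays a similar role.

```python
import math


def first_non_finite(values):
    """Return the index of the first NaN/inf entry in `values`, or None."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None


def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical pixel values: one bad entry in the HR target poisons the loss.
hr = [0.1, 0.5, float("nan"), 0.9]
sr = [0.2, 0.4, 0.3, 0.8]

print(first_non_finite(hr))         # 2
print(math.isnan(l1_loss(sr, hr)))  # True
```

If such a scan over the decoded LR/HR pairs comes back clean, the inputs are probably not the source and the nan would have to arise inside the network or the optimizer step.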

I attach the PSNR and ESRGAN parameter files here:

psnr.yaml:
batch_size: 64
input_size: 32
gt_size: 128
ch_size: 3
scale: 4
sub_name: 'psnr_pretrain'
pretrain_name: null

network_G:
    nf: 64
    nb: 23

train_dataset:
    path: '/data/EOSC/EOSC_sub_bin.tfrecord'
    num_samples: 1825587
    using_bin: True
    using_flip: True
    using_rot: True
test_dataset:
    EOSC_path: '/data2/EOSC_test'

niter: 600000
lr: !!float 2e-4
lr_steps: [200000, 300000, 400000, 500000]
lr_rate: 0.5

adam_beta1_G: 0.9
adam_beta2_G: 0.99

w_pixel: 1.0
pixel_criterion: l1
save_steps: 20000

esrgan.yaml:
batch_size: 64
input_size: 32
gt_size: 128
ch_size: 3
scale: 4
sub_name: 'esrgan'
pretrain_name: 'psnr_pretrain'

network_G:
    nf: 64
    nb: 23
network_D:
    nf: 64

train_dataset:
    path: '/data/EOSC/EOSC_sub_bin.tfrecord'
    num_samples: 1825587
    using_bin: True
    using_flip: False
    using_rot: False
test_dataset:
    EOSC_path: '/data2/EOSC_test'

niter: 285240
lr_G: !!float 1e-4
lr_D: !!float 1e-4
lr_steps: [60000, 120000, 180000, 240000]
lr_rate: 0.5

adam_beta1_G: 0.9
adam_beta2_G: 0.99
adam_beta1_D: 0.9
adam_beta2_D: 0.99

w_pixel: !!float 1e-2
pixel_criterion: l1

w_feature: 1.0
feature_criterion: l1

w_gan: !!float 5e-3
gan_type: ragan # gan | ragan

save_steps: 20000
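
As a sanity check on the two configs, the learning rates printed in the logs can be compared with what `lr`, `lr_steps` and `lr_rate` imply. The sketch below assumes these keys describe a piecewise-constant (multi-step) decay in which the base rate is multiplied by `lr_rate` once for each boundary in `lr_steps` that has been passed. That reading, the boundary convention, and the function name `step_lr` are assumptions for illustration, not taken from the training code.

```python
from bisect import bisect_right


def step_lr(step, base_lr, lr_steps, lr_rate):
    """Piecewise-constant decay: apply lr_rate once per boundary already passed."""
    passed = bisect_right(lr_steps, step)  # number of boundaries <= step
    return base_lr * lr_rate ** passed


# Values copied from psnr.yaml and esrgan.yaml above.
psnr_steps = [200000, 300000, 400000, 500000]
esrgan_steps = [60000, 120000, 180000, 240000]

print(step_lr(20004, 2e-4, psnr_steps, 0.5))    # PSNR log line at step 20004
print(step_lr(47877, 1e-4, esrgan_steps, 0.5))  # ESRGAN log line at step 47877
```

Under that assumption no boundary has been reached at either logged step, so the expected rates are the base values 2e-4 and 1e-4, which is what the log lines show.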

Any help would be much appreciated! Thank you!

Lfywx · 2022-04-11T07:17:36Z

Did you solve it?
