Unable to reproduce results #11

Open
alexgaskell10 opened this issue Jun 26, 2020 · 3 comments

@alexgaskell10

I have been unable to reproduce the results shown in the paper. I trained the model for 20k steps and the loss fell nicely throughout training. However, when I generate the output summaries (using mode=decode, which I presume is correct?), they are not good; an illustrative output summary is shown below. If I resume from that checkpoint and train the model further, the loss is reported as NaN and training stops.

What am I missing here? The command I use to train the model is:

python run_summarization.py \
  --mode=train \
  --data_path=$DATA_DIR/train.bin \
  --vocab_path=$DATA_DIR/vocab \
  --log_root=logroot \
  --exp_name=exp \
  --max_dec_steps=210 \
  --max_enc_steps=2500 \
  --num_sections=5 \
  --max_section_len=500 \
  --batch_size=1 \
  --vocab_size=50000 \
  --use_do=True \
  --optimizer=adagrad \
  --do_prob=0.25 \
  --hier=True \
  --split_intro=True \
  --fixed_attn=True \
  --legacy_encoder=False \
  --coverage=False \
  --lr=0.05
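
For decoding I presume the invocation simply mirrors the training command, with only the mode and data path changed (this is just my assumption, and there may be additional decode-time flags I am missing; I am also assuming a test.bin split exists next to train.bin):

python run_summarization.py \
  --mode=decode \
  --data_path=$DATA_DIR/test.bin \
  --vocab_path=$DATA_DIR/vocab \
  --log_root=logroot \
  --exp_name=exp \
  --max_dec_steps=210 \
  --max_enc_steps=2500 \
  --num_sections=5 \
  --max_section_len=500 \
  --batch_size=1 \
  --vocab_size=50000 \
  --use_do=True \
  --optimizer=adagrad \
  --do_prob=0.25 \
  --hier=True \
  --split_intro=True \
  --fixed_attn=True \
  --legacy_encoder=False \
  --coverage=False \
  --lr=0.05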

Illustrative example
background : of of .
.
the the under of either of the private public private medicine medicine private private other has successfully investigated .
here we by case of this by in first chronic chronic of of the the private of the patients [UNK] 75 symptoms causing the .
it history .
this is method the first successful chronic mortality.19 without chronic chronic of .
, [ , the condition the percentage of adult .
without mortality.19 mortality.19 the with without without without without without of without without without without of other private private the other .
results results the would suggest and identifying private private private of malignancy improve increases .
we also demonstrated the susceptibility and new new elderly this report chronic chronic .
chronic of of with with asthma4 without asthma4 the significantly higher .
there , it it greater greater than .

@armancohan
Owner

I think at 20K steps the model is still undertrained.
I suggest starting with a smaller section length and fewer sections, and then increasing them in the final steps. Something like --max_section_len=400, --num_sections=4, --max_dec_steps=100, --max_enc_steps=1600 (spelled out below).
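
Against your training command above, with every other flag left unchanged, the first phase would look roughly like this (just a sketch of the suggestion, not a verified configuration):

python run_summarization.py \
  --mode=train \
  --data_path=$DATA_DIR/train.bin \
  --vocab_path=$DATA_DIR/vocab \
  --log_root=logroot \
  --exp_name=exp \
  --max_dec_steps=100 \
  --max_enc_steps=1600 \
  --num_sections=4 \
  --max_section_len=400 \
  --batch_size=1 \
  --vocab_size=50000 \
  --use_do=True \
  --optimizer=adagrad \
  --do_prob=0.25 \
  --hier=True \
  --split_intro=True \
  --fixed_attn=True \
  --legacy_encoder=False \
  --coverage=False \
  --lr=0.05

Then, for the final steps only, raise the four length/section flags back towards the larger values (--max_section_len=500, --num_sections=5, --max_dec_steps=210, --max_enc_steps=2500).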

@alexgaskell10
Author

Thanks for getting back to me. Two follow-up questions:

  1. Should I start training from scratch with this setup, or resume from my latest checkpoint?
  2. Will this approach help prevent training from being corrupted by the loss going NaN?

@armancohan
Owner

I would start from scratch. I also remember seeing some NaN issues, although this was a while ago (as far as I recall, NaNs were more likely to occur with longer sequences).
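
If it happens again, one simple guard you could drop into the training loop is to check the loss each step and stop before a corrupted checkpoint gets written (illustrative only; the names below are not from run_summarization.py):

import math

def check_finite_loss(loss, step):
    # Stop as soon as the loss goes NaN/inf, so the last saved checkpoint
    # stays usable instead of being overwritten by a corrupted one.
    if not math.isfinite(loss):
        raise ValueError("Non-finite loss %r at step %d; restore the last good "
                         "checkpoint and retry with shorter sequences or a lower lr"
                         % (loss, step))
    return loss

# Example usage inside the training loop (names are illustrative):
# loss = check_finite_loss(results['loss'], train_step)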
