[Finetune] print metrics by rank0 for train and evaluation in fine-tuning #258

harborn · 2024-06-20T08:15:45Z

Print the train and evaluation metrics from rank 0 only during fine-tuning.
Example output:

 ***** train metrics *****
   epoch                    =        1.0
   throughput               =     0.0583
   total_flos               =   605304GF
   train_loss               =     0.8543
   train_runtime            = 0:06:52.01
   train_samples_per_second =      0.058
   train_steps_per_second   =      0.005
 ***** eval metrics *****
   epoch                   =        1.0
   eval_loss               =        nan
   eval_runtime            = 0:01:24.54
   eval_samples_per_second =      0.012
   eval_steps_per_second   =      0.012

Here, throughput is measured in samples per second.
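For readers who want to see the idea in isolation, here is a minimal sketch of rank-0-only metric printing in a layout similar to the example above. It is not the code from this PR: `format_metrics` and `print_metrics_on_rank0` are hypothetical helper names, and the rank is assumed to be passed in by the caller rather than read from a distributed runtime.

```python
def format_metrics(split, metrics):
    """Return the report lines for one split, e.g. 'train' or 'eval'."""
    # Render every value as text so keys and values can be padded to a common width.
    rendered = {key: str(value) for key, value in sorted(metrics.items())}
    key_width = max(len(key) for key in rendered)
    val_width = max(len(value) for value in rendered.values())
    lines = [f"***** {split} metrics *****"]
    for key, value in rendered.items():
        # Left-align the key, right-align the value.
        lines.append(f"  {key:<{key_width}} = {value:>{val_width}}")
    return lines


def print_metrics_on_rank0(rank, split, metrics):
    """Print the report only when `rank` is 0; return the lines that were printed."""
    if rank != 0:
        # Other workers stay silent so the report appears once per job.
        return []
    lines = format_metrics(split, metrics)
    print("\n".join(lines))
    return lines
```

Calling `print_metrics_on_rank0(0, "train", {"epoch": 1.0, "train_loss": 0.8543})` prints a two-row report, while the same call with any other rank prints nothing.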

carsonwang · 2024-06-24T06:59:23Z

llm_on_ray/finetune/finetune.py

+        result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
+        trainer.save_model()
+        metrics = result.metrics
+        metrics["throughput"] = len(tokenized_dataset["train"]) / metrics["train_runtime"]


Do you need to multiply by the number of epochs? Could you please share the result of one of your runs? How does it compare with the result on the Gaudi dashboard?
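To make the epoch question concrete, here is a minimal sketch of a throughput calculation that scales the sample count by the number of epochs. It assumes `train_runtime` covers all epochs and is expressed in seconds; `samples_per_second` is a hypothetical helper, not a function from this PR, and the dataset size of 24 is an illustrative value only.

```python
def samples_per_second(num_train_samples, num_epochs, train_runtime_s):
    """Return training throughput in samples per second over the whole run."""
    if train_runtime_s <= 0:
        raise ValueError("train_runtime_s must be positive")
    # Every epoch visits each training sample once, so the total number of
    # samples processed is the dataset size times the number of epochs.
    return num_train_samples * num_epochs / train_runtime_s


# With one epoch the factor has no effect; with two epochs over the same
# runtime, omitting it would under-report throughput by half.
one_epoch = samples_per_second(24, 1, 412.01)
two_epochs = samples_per_second(24, 2, 412.01)
print(f"{one_epoch:.4f} {two_epochs:.4f}")
```

For a single-epoch run, such as the example output in the description, the result is the same with or without the factor; the difference only shows up when more than one epoch is trained.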

harborn changed the title ~~print metrics by rank0 for train and evaluation in fine-tuning~~ [Finetune] print metrics by rank0 for train and evaluation in fine-tuning Jun 20, 2024

harborn force-pushed the fine-tuning-matrics branch from ee262e6 to 7a89e4f Compare June 24, 2024 01:54

carsonwang reviewed Jun 24, 2024

harborn added 3 commits June 24, 2024 09:39

print metrics by rank0 for train and evalutation in fine-tuning

89a5fc7

update

402564e

update

7a89e4f
