# ORTModule Training Convergence Investigation

## 1. Discovering

Convergence issues can be identified by:

- Large discrepancies in core training metrics, including training loss, evaluation loss, and model-specific AUC metrics.
- Runtime failures (for example, when the loss scaler reaches the minimum, triggering an exception).

Before looking into this further, we should clarify a few things (if possible):

- If we change the seed for the baseline run, is the metric difference still large? (Make sure the discrepancy is not simply introduced by randomness.)
- At which steps do we first see an obvious divergence?
- Is the issue still reproducible once randomness is removed? (See the sketch after this list.)
  - Set the same seeds everywhere.
  - Set the dropout ratio to 0.
  - Set compute to be deterministic and torch-comparable (TODO(pengwa): need a flag for this).
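
Until such a flag exists, a hand-rolled helper along the following lines can serve as a starting point. This is only a sketch using standard PyTorch/NumPy APIs; the exact set of knobs needed depends on the model and the data pipeline.

```python
import os
import random

import numpy as np
import torch


def remove_randomness(model: torch.nn.Module, seed: int = 0) -> None:
    """Best-effort determinism for a convergence comparison run (a sketch, not the TODO flag above)."""
    # Same seeds everywhere.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Dropout ratio 0: disable dropout layers in place.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.p = 0.0

    # Deterministic compute (may slow training down or raise on ops without
    # deterministic implementations).
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```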

## 2. Collect Activation Statistics

Add a few lines of code and run the training script to collect statistics:

**Baseline**

```diff
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ sub_m = SubscriberManager()
+ sub_m.subscribe(model, [StatisticsSubscriber(output_dir="pt_out",
+                                              override_output_dir=True)])
```

- Run the training script up to the steps that trigger the divergence.
- A folder named `pt_out` is created in the current working directory.
- For each step, there is a folder containing summaries for every activation tensor.

**ORTModule**

```diff
  model = ORTModule(model)
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ sub_m = SubscriberManager()
+ sub_m.subscribe(model, [StatisticsSubscriber(output_dir="ort_out",
+                                              override_output_dir=True)])
```

- Run the training script up to the steps that trigger the divergence.
- Similarly, a folder named `ort_out` is created in the current working directory.
- `StatisticsSubscriber` can be subscribed before or after wrapping with `ORTModule`.
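
For orientation, here is a minimal, self-contained sketch of where these calls sit in a full training script. The toy model, optimizer, data, and step count are hypothetical placeholders; only the `SubscriberManager`/`StatisticsSubscriber`/`ORTModule` usage comes from the snippets above.

```python
import torch
from onnxruntime.training.ortmodule import ORTModule
from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber

# Hypothetical toy model standing in for the real training script's model.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))

# Attach the statistics subscriber. Use output_dir="pt_out" for the PyTorch
# baseline run and output_dir="ort_out" for the ORTModule run.
sub_m = SubscriberManager()
sub_m.subscribe(model, [StatisticsSubscriber(output_dir="ort_out", override_output_dir=True)])

model = ORTModule(model)  # omit this line for the PyTorch baseline run

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for step in range(10):  # run at least up to the step where divergence appears
    x = torch.randn(4, 8)
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```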

Arguments:

- `output_dir`: the directory in which all activation statistics files will be stored.
- `start_step` [optional]: the first step on which the subscriber actions run.
- `end_step` [optional]: the end step (exclusive) for running the subscriber actions.
- `override_output_dir`: whether `output_dir` can be overridden if it already exists.
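
As an illustration of these arguments, a small example restricting collection to a window of steps (the step numbers and directory name here are arbitrary):

```python
from onnxruntime.training.utils.hooks import StatisticsSubscriber

# Collect statistics only for steps 90 (inclusive) through 100 (exclusive),
# writing into "ort_out" and overwriting any previous run's output.
subscriber = StatisticsSubscriber(
    output_dir="ort_out",
    start_step=90,
    end_step=100,
    override_output_dir=True,
)
```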

Check the `StatisticsSubscriber` implementation for more information.

Run the following command to generate the per-step summary:

```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```

Manually compare the generated per-step summaries to find the first step with a significant difference.