Fix zero stage2 cpu_offload when some model trainable parameters are skipped in training (#861)

* Fix zero stage2 cpu_offload when some model trainable parameters are skipped in training, as in #707

  Because some trainable model parameters are skipped during training, the backward hooks registered for them in self.create_reduce_and_remove_grad_hooks() never run, so they get no entry in norm_for_param_grads (a standalone sketch of this failure mode follows the diff below).

* Trim space

* Trim space

Co-authored-by: Olatunji Ruwase <[email protected]>
ghosthamlet and tjruwase authored Mar 27, 2021
1 parent 39013dd commit 7fcc891
Showing 1 changed file with 6 additions and 2 deletions.
8 changes: 6 additions & 2 deletions in deepspeed/runtime/zero/stage2.py

@@ -883,8 +883,12 @@ def complete_grad_norm_calculation_for_cpu_offload(self, params):
         for p in params:
             if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0):
                 param_id = self.get_param_id(p)
-                param_norm = self.norm_for_param_grads[param_id]
-                total_norm += param_norm.item()**2
+                # As some models have trainable parameters that are skipped in training,
+                # their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run,
+                # so they have no entry in norm_for_param_grads.
+                if param_id in self.norm_for_param_grads:
+                    param_norm = self.norm_for_param_grads[param_id]
+                    total_norm += param_norm.item()**2
 
         # Sum across all model parallel GPUs.
         total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)])
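
The following is a minimal, self-contained PyTorch sketch (not DeepSpeed code; the module and all names in it are hypothetical) of the failure mode described in the commit message: a parameter can be trainable yet play no part in the forward pass, in which case autograd never fires a backward hook for it and nothing is recorded for it, mirroring the missing norm_for_param_grads entry.

import torch

# Hypothetical module: `skipped` is trainable but never used in forward(),
# analogous to a model parameter that is skipped in training.
class PartlyUsedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.skipped = torch.nn.Linear(4, 4)  # trainable, but unused below

    def forward(self, x):
        return self.used(x)  # self.skipped never joins the autograd graph

model = PartlyUsedModel()
fired = []
for name, p in model.named_parameters():
    # Tensor hooks only fire for parameters that actually receive a gradient.
    p.register_hook(lambda grad, name=name: fired.append(name))

model(torch.randn(2, 4)).sum().backward()

print(sorted(set(fired)))                 # only 'used.bias' and 'used.weight'
print(model.skipped.weight.grad is None)  # True: no hook ran, no grad recorded

Guarding the dictionary lookup with `if param_id in self.norm_for_param_grads`, as the patch does, simply treats such parameters as contributing zero to the gradient norm.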
