Update from MSFT (#21)
* [WarmupDecayLR] fix log(0) & 1/log(1) bugs (microsoft#772)

* fix log(0) & 1/log(1) bugs

* simplify

Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Cheng Li <[email protected]>

* bump to v0.3.12

* Bug fix: Remove client optimizer param_group list item that does not have 'params' (microsoft#827)

Co-authored-by: Jeff Rasley <[email protected]>

* [doc] pipeline doc typos/improvements (microsoft#659)

Admin merging for pure-doc PR that does not trigger build.

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
5 people authored Mar 15, 2021
1 parent 8b080ac commit 43b69c3
Showing 4 changed files with 20 additions and 13 deletions.
6 changes: 6 additions & 0 deletions deepspeed/runtime/engine.py
@@ -585,6 +585,12 @@ def _configure_distributed_model(self, model):
     def _configure_optimizer(self, client_optimizer, model_parameters):
 
         if client_optimizer is not None:
+            client_optimizer.param_groups[:] = [
+                pg for pg in client_optimizer.param_groups if len(pg["params"]) != 0
+            ]
+            logger.info(
+                "Removing param_group that has no 'params' in the client Optimizer")
+
             basic_optimizer = client_optimizer
             if self.global_rank == 0:
                 logger.info('Using client Optimizer as basic optimizer')
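For context on the fix above, here is a minimal, hypothetical sketch (not part of the commit) of the situation it guards against: a client optimizer handed a parameter group whose `params` list ended up empty, and the same in-place filter applied before DeepSpeed wraps the optimizer.

```python
import torch

model = torch.nn.Linear(4, 2)

# Hypothetical client optimizer: the second group lost all of its parameters
# upstream, leaving an empty 'params' list that would otherwise be wrapped.
optimizer = torch.optim.SGD([
    {"params": list(model.parameters()), "lr": 0.1},
    {"params": [], "lr": 0.01},
])

# Same in-place filter as the commit: drop groups with no 'params'.
optimizer.param_groups[:] = [
    pg for pg in optimizer.param_groups if len(pg["params"]) != 0
]
print(len(optimizer.param_groups))  # 1
```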
4 changes: 2 additions & 2 deletions deepspeed/runtime/lr_schedules.py
@@ -706,8 +706,8 @@ def __init__(self,
         self.min_lrs = self._format_param(self.optimizer, warmup_min_lr, "min_lr")
         self.max_lrs = self._format_param(self.optimizer, warmup_max_lr, "max_lr")
         self.delta_lrs = [big - small for big, small in zip(self.max_lrs, self.min_lrs)]
-        self.warmup_num_steps = warmup_num_steps
-        self.inverse_log_warm_up = 1.0 / math.log(warmup_num_steps)
+        self.warmup_num_steps = max(2, warmup_num_steps)
+        self.inverse_log_warm_up = 1.0 / math.log(self.warmup_num_steps)
         self.last_batch_iteration = last_batch_iteration
 
     def get_lr(self):
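The clamp above matters because the warmup curve divides by `math.log(warmup_num_steps)`: a value of 0 hits `log(0)` and a value of 1 divides by `log(1) == 0`. A minimal sketch (my own, assuming the scheduler scales the learning rate by `log(step + 1) / log(warmup_num_steps)` during warmup, as the fields above suggest):

```python
import math

def warmup_factor(step, warmup_num_steps):
    # The fix: clamp so math.log() never sees 0 and the divisor is never log(1) == 0.
    warmup_num_steps = max(2, warmup_num_steps)
    inverse_log_warm_up = 1.0 / math.log(warmup_num_steps)
    if step < warmup_num_steps:
        return inverse_log_warm_up * math.log(step + 1)
    return 1.0

min_lr, max_lr = 0.0, 1e-3
for step in (0, 1, 10, 100):
    print(step, min_lr + (max_lr - min_lr) * warmup_factor(step, warmup_num_steps=100))
```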
21 changes: 11 additions & 10 deletions docs/_tutorials/pipeline.md
@@ -132,7 +132,7 @@ net = PipelineModule(layers=net.to_layers(), num_stages=2)
 ```
 
 **Note:**
-the `lamda` in the middle of `layers` above is not a `torch.nn.Module`
+the `lambda` in the middle of `layers` above is not a `torch.nn.Module`
 type. Any object that implements `__call__()` can be a layer in a
 `PipelineModule`: this allows for convenient data transformations in the
 pipeline.
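A rough, hypothetical sketch of such a data transformation (layer sizes and names are mine, not from the tutorial): the lambda below is not an `nn.Module`, only a callable, yet it can sit between module layers.

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule

layers = [
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    lambda x: x.flatten(start_dim=1),  # plain callable, not an nn.Module
    nn.Linear(16 * 32 * 32, 10),       # assumes 3x32x32 inputs
]

# Constructing the PipelineModule itself needs an initialized distributed
# environment (e.g. deepspeed.init_distributed()), so treat this as a sketch.
net = PipelineModule(layers=layers, num_stages=2)
```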
@@ -165,7 +165,7 @@ These modifications can be accomplished with a short subclass:
 class TransformerBlockPipe(TransformerBlock):
     def forward(self, inputs):
         hidden, mask = inputs
-        outputs = super().forward(hidden, mask)
+        output = super().forward(hidden, mask)
         return (output, mask)
 stack = [ TransformerBlockPipe() for _ in range(num_layers) ]
 ```
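The tuple returned by each block is exactly what the next stage receives as `inputs`, so the mask travels with the activations across stage boundaries. A one-line sketch (assuming the `stack` above and an initialized distributed environment):

```python
from deepspeed.pipe import PipelineModule

# The (output, mask) tuple produced by one stage becomes `inputs` for the next.
net = PipelineModule(layers=stack, num_stages=2)  # `stack` from the snippet above
```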
@@ -269,17 +269,18 @@ by DeepSpeed:
 * `partition_method="uniform"` balances the number of layers per stage.
 
 ### Memory-Efficient Model Construction
-Building a `Sequential` and providing it `PipelineModule` is a convenient way
-of specifying a pipeline parallel model. However, this approach encounters
-scalability issues for massive models. Starting from a `Sequential` allocates
-the model in CPU memory redundantly by every worker. A machine with 16 GPUs
-must have as much local CPU memory as 16 times the model size.
+Building a `Sequential` container and providing it to a `PipelineModule` is a convenient way
+of specifying a pipeline parallel model. However, this approach encounters scalability issues
+for massive models because each worker replicates the whole model in CPU memory.
+For example, a machine with 16 GPUs must have as much local CPU memory as 16 times the model size.
 
 DeepSpeed provides a `LayerSpec` class that delays the construction of
-modules until the model layers have been partitioned across workers. Then,
-the modules are built on the GPU that owns the layer.
+modules until the model layers have been partitioned across workers.
+Then each worker will allocate only the layers it's assigned to. So, continuing the
+example from the previous paragraph, a machine with 16 GPUs will need to allocate a
+total of 1x model size on its CPU, compared to 16x with the `Sequential` approach.
 
-Here's an example of the abbreviated AlexNet model, but expressed only
+Here is an example of the abbreviated AlexNet model, but expressed only
 with `LayerSpec`s. Note that the syntax is almost unchanged: `nn.ReLU(inplace=True)`
 simply becomes `LayerSpec(nn.ReLU, inplace=True)`.
```python
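# Illustrative sketch only: the commit's full AlexNet example is elided in
# this diff view. It assumes the abbreviated AlexNet from earlier in the
# tutorial, plus deepspeed.pipe.LayerSpec and PipelineModule.
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

class AlexNetPipe(PipelineModule):
    def __init__(self, num_classes=10, **kwargs):
        specs = [
            # Layers are described, not constructed: nn.ReLU(inplace=True)
            # simply becomes LayerSpec(nn.ReLU, inplace=True).
            LayerSpec(nn.Conv2d, 3, 64, kernel_size=11, stride=4, padding=5),
            LayerSpec(nn.ReLU, inplace=True),
            LayerSpec(nn.MaxPool2d, kernel_size=2, stride=2),
            LayerSpec(nn.Flatten),
            LayerSpec(nn.Linear, 64 * 4 * 4, num_classes),  # assumes 3x32x32 inputs
        ]
        super().__init__(layers=specs, loss_fn=nn.CrossEntropyLoss(), **kwargs)
```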
2 changes: 1 addition & 1 deletion version.txt
@@ -1 +1 @@
-0.3.11
+0.3.12
