Merge branch 'master' into optimizer_step/training_step
ananyahjha93 committed Nov 2, 2020
2 parents 56e4688 + ac3f739 commit 68ffdb0
Showing 47 changed files with 1,553 additions and 221 deletions.
7 changes: 5 additions & 2 deletions .github/workflows/ci_dockers.yml
@@ -108,8 +108,11 @@ jobs:
pytorch_version: 1.6
- python_version: 3.6
pytorch_version: 1.4
#- python_version: 3.7
# pytorch_version: 1.8 # todo
- python_version: 3.7
pytorch_version: 1.7
# TODO
# - python_version: 3.7
# pytorch_version: 1.8
steps:
- name: Checkout
uses: actions/checkout@v2
2 changes: 1 addition & 1 deletion .github/workflows/docker-builds.yml
@@ -14,7 +14,7 @@ jobs:
fail-fast: false
matrix:
python_version: [3.6, 3.7, 3.8]
pytorch_version: [1.3, 1.4, 1.5, 1.6]
pytorch_version: [1.3, 1.4, 1.5, 1.6, 1.7]
exclude:
# excludes PT 1.3 as it is missing on pypi
- python_version: 3.8
12 changes: 11 additions & 1 deletion CHANGELOG.md
@@ -17,10 +17,18 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Added multiclass AUROC metric ([#4236](https://github.com/PyTorchLightning/pytorch-lightning/pull/4236))

- Added timeout for `tpu_device_exists` to ensure process does not hang indefinitely ([#4340](https://github.com/PyTorchLightning/pytorch-lightning/pull/4340))

- Added global step indexing to the checkpoint name for a better sub-epoch checkpointing experience ([#3807](https://github.com/PyTorchLightning/pytorch-lightning/pull/3807))

### Changed

- W&B log in sync with Trainer step ([#4405](https://github.com/PyTorchLightning/pytorch-lightning/pull/4405))

- Hook `on_after_backward` is called only when `optimizer_step` is being called ([#4439](https://github.com/PyTorchLightning/pytorch-lightning/pull/4439))

- Moved `track_and_norm_grad` into `training loop` and called only when `optimizer_step` is being called ([#4439](https://github.com/PyTorchLightning/pytorch-lightning/pull/4439))

### Deprecated

- Deprecated passing `ModelCheckpoint` instance to `checkpoint_callback` Trainer argument ([#4336](https://github.com/PyTorchLightning/pytorch-lightning/pull/4336))
@@ -31,6 +39,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Fixed error using `auto_select_gpus=True` with `gpus=-1` ([#4209](https://github.com/PyTorchLightning/pytorch-lightning/pull/4209))

- Fixed that metrics do not store computational graph for all seen data ([#4313](https://github.com/PyTorchLightning/pytorch-lightning/pull/4313))

- Fixed AMP unscale for `on_after_backward` ([#4439](https://github.com/PyTorchLightning/pytorch-lightning/pull/4439))

## [1.0.4] - 2020-10-27

@@ -74,7 +85,6 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Fixed WandbLogger not uploading checkpoint artifacts at the end of training ([#4341](https://github.com/PyTorchLightning/pytorch-lightning/pull/4341))


## [1.0.3] - 2020-10-20

### Added
2 changes: 1 addition & 1 deletion README.md
@@ -183,7 +183,7 @@ trainer = pl.Trainer()
trainer.fit(autoencoder, DataLoader(train), DataLoader(val))
```

#### And without changing a single line of code, you could run on GPU/TPUss
#### And without changing a single line of code, you could run on GPUs/TPUs
```python
# 8 GPUs
trainer = Trainer(max_epochs=1, gpus=8)
2 changes: 1 addition & 1 deletion dockers/base-conda/Dockerfile
@@ -74,7 +74,7 @@ ENV CONDA_ENV=lightning
COPY environment.yml environment.yml

# conda init
RUN conda create -y --name $CONDA_ENV && \
RUN conda create -y --name $CONDA_ENV cudatoolkit=${CUDA_VERSION} && \
conda init bash && \
# NOTE: this requires that the channel is present in the yaml before packages
# replace channel to nightly if needed, fix PT version and remove Horovod as it will be installed later
14 changes: 13 additions & 1 deletion docs/source/metrics.rst
@@ -150,6 +150,19 @@ Example implementation:
def compute(self):
return self.correct.float() / self.total
Metrics support backpropagation, if all computations involved in the metric calculation
are differentiable. However, note that the cached state is detached from the computational
graph and cannot be backpropagated. Not doing this would mean storing the computational
graph for each update call, which can lead to out-of-memory errors.
In practice this means that:

.. code-block:: python

    metric = MyMetric()
    val = metric(pred, target)  # this value can be backpropagated
    val = metric.compute()  # this value cannot be backpropagated
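A small illustrative sketch of that distinction (not part of the diff): it assumes a metric whose computations are differentiable (unlike the accuracy-style example above), plus placeholder `model`, `x`, and `target` tensors.

```python
metric = SomeDifferentiableMetric()   # hypothetical metric with differentiable update/compute

pred = model(x)                       # placeholder model and batch
val = metric(pred, target)            # forward value: still attached to the graph
loss = 1.0 - val
loss.backward()                       # gradients flow through this single update only

epoch_val = metric.compute()          # aggregated from state that was detached at update time
# epoch_val.backward()                # would not work: no graph is kept for the cached state
```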
**********
Metric API
**********
@@ -453,4 +466,3 @@ embedding_similarity [func]

.. autofunction:: pytorch_lightning.metrics.functional.self_supervised.embedding_similarity
:noindex:

6 changes: 5 additions & 1 deletion docs/source/optimizers.rst
@@ -48,6 +48,10 @@ to manually manage the optimization process. To do so, do the following:
opt_d.step()
opt_d.zero_grad()
# log losses
self.log('loss_a', loss_a)
self.log('loss_b', loss_b)
.. note:: This is only recommended for experts who need ultimate flexibility

Manual optimization does not yet support accumulated gradients; support will arrive in 1.1.0.
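For orientation, here is a hedged sketch of how the fragment above typically sits inside a full ``training_step`` under the manual-optimization API of this release; the two loss computations are placeholders, manual optimization is enabled via ``Trainer(automatic_optimization=False)``, and exact signatures (e.g. whether ``manual_backward`` takes the optimizer) changed in later versions.

```python
import pytorch_lightning as pl

class TwoOptimizerModel(pl.LightningModule):  # hypothetical module with two optimizers
    def training_step(self, batch, batch_idx, optimizer_idx):
        # with manual optimization, ignore optimizer_idx and drive both optimizers yourself
        (opt_a, opt_b) = self.optimizers()

        loss_a = self.compute_loss_a(batch)   # placeholder loss
        self.manual_backward(loss_a, opt_a)   # also handles AMP loss scaling
        opt_a.step()
        opt_a.zero_grad()

        loss_b = self.compute_loss_b(batch)   # placeholder loss
        self.manual_backward(loss_b, opt_b)
        opt_b.step()
        opt_b.zero_grad()

        # log losses
        self.log('loss_a', loss_a)
        self.log('loss_b', loss_b)

# manual optimization is switched on at the Trainer level in this release
trainer = pl.Trainer(automatic_optimization=False)
```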
@@ -108,7 +112,7 @@ Every optimizer you use can be paired with any `LearningRateScheduler <https://p
def configure_optimizers(self):
return {
'optimizer': Adam(...),
'scheduler': ReduceLROnPlateau(optimizer, ...),
'lr_scheduler': ReduceLROnPlateau(optimizer, ...),
'monitor': 'metric_to_track'
}
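To make the corrected key concrete, a minimal self-contained version of the return value shown above; the optimizer hyperparameters and the monitored metric name are placeholders.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

def configure_optimizers(self):
    optimizer = Adam(self.parameters(), lr=1e-3)       # placeholder hyperparameters
    return {
        'optimizer': optimizer,
        'lr_scheduler': ReduceLROnPlateau(optimizer),   # key renamed from 'scheduler' in this diff
        'monitor': 'metric_to_track',                   # placeholder metric name
    }
```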
20 changes: 17 additions & 3 deletions docs/source/tpu.rst
@@ -128,13 +128,27 @@ That's it! Your model will train on all 8 TPU cores.

----------------

Single TPU core training
TPU core training

------------------------
Lightning supports training on a single TPU core. Just pass the TPU core ID [1-8] in a list.

Lightning supports training on a single TPU core or 8 TPU cores.

The Trainer parameter ``tpu_cores`` defines how many TPU cores to train on (1 or 8), or which single TPU core to train on, given as a list (e.g. [1]).

For single-TPU training, just pass the TPU core ID [1-8] in a list.

Single TPU core training. Model will train on TPU core ID 5.

.. code-block:: python

    trainer = pl.Trainer(tpu_cores=[1])
    trainer = pl.Trainer(tpu_cores=[5])

8 TPU cores training. Model will train on 8 TPU cores.

.. code-block:: python

    trainer = pl.Trainer(tpu_cores=8)

----------------

8 changes: 4 additions & 4 deletions docs/source/weights_loading.rst
@@ -65,8 +65,8 @@ You can customize the checkpointing behavior to monitor any quantity of your tra
# 3. Init ModelCheckpoint callback, monitoring 'val_loss'
checkpoint_callback = ModelCheckpoint(monitor='val_loss')
# 4. Pass your callback to checkpoint_callback trainer flag
trainer = Trainer(checkpoint_callback=checkpoint_callback)
# 4. Add your callback to the callbacks list
trainer = Trainer(callbacks=[checkpoint_callback])
You can also control more advanced options, like `save_top_k` to save the best k models, the mode of the monitored quantity (min/max/auto, where the mode is automatically inferred from the name of the monitored quantity), `save_weights_only`, or `period` to set the interval of epochs between checkpoints and avoid slowdowns.

@@ -89,14 +89,14 @@ You can also control more advanced options, like `save_top_k`, to save the best
save_top_k=3,
mode='min')
trainer = Trainer(checkpoint_callback=checkpoint_callback)
trainer = Trainer(callbacks=[checkpoint_callback])
You can retrieve the checkpoint after training by calling

.. code-block:: python

    checkpoint_callback = ModelCheckpoint(dirpath='my/path/')
    trainer = Trainer(checkpoint_callback=checkpoint_callback)
    trainer = Trainer(callbacks=[checkpoint_callback])
    trainer.fit(model)
    checkpoint_callback.best_model_path
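Taken together, the edits to this file boil down to the following migration; a minimal sketch that assumes a placeholder ``model`` and the imports used elsewhere on this page.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(monitor='val_loss', save_top_k=3, mode='min')

# before this change: Trainer(checkpoint_callback=checkpoint_callback)  (now deprecated)
trainer = Trainer(callbacks=[checkpoint_callback])

trainer.fit(model)                          # `model` is a placeholder LightningModule
print(checkpoint_callback.best_model_path)  # the best checkpoint is still retrieved the same way
```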
4 changes: 2 additions & 2 deletions environment.yml
@@ -26,7 +26,7 @@ dependencies:
- python>=3.6
- pip>20.1
- numpy>=1.16.4
- pytorch>=1.3
- pytorch>=1.3,<1.8
- future>=0.17.1
- PyYAML>=5.1
- tqdm>=4.41.0
@@ -41,7 +41,7 @@ dependencies:
- torchtext>=0.3.1

# Examples
- torchvision>=0.4.1
- torchvision>=0.4.1,<0.9.0

- pip:
- test-tube>=0.7.5
6 changes: 3 additions & 3 deletions notebooks/05-trainer-flags-overview.ipynb
@@ -2223,7 +2223,7 @@
"source": [
"from pytorch_lightning.callbacks import ModelCheckpoint\n",
"\n",
"trainer = pl.Trainer(checkpoint_callback=ModelCheckpoint(monitor='val_loss'))\n",
"trainer = pl.Trainer(callbacks=[ModelCheckpoint(monitor='val_loss')])\n",
"\n",
"trainer.fit(model, train_loader, val_loader)"
],
@@ -2265,7 +2265,7 @@
" prefix='',\n",
")\n",
"\n",
"trainer = Trainer(checkpoint_callback=checkpoint_callback)\n",
"trainer = Trainer(callbacks=[checkpoint_callback])\n",
"\n",
"trainer.fit(model, train_loader, val_loader)"
],
@@ -2471,7 +2471,7 @@
"# **NOTE: this saves weights to some/path NOT my/path\n",
"checkpoint = ModelCheckpoint(filepath='some/path')\n",
"trainer = pl.Trainer(\n",
" checkpoint_callback=checkpoint,\n",
" callbacks=[checkpoint],\n",
" weights_save_path='my/path'\n",
")\n",
"trainer.fit(model, train_loader, val_loader)"
5 changes: 0 additions & 5 deletions pytorch_lightning/accelerators/accelerator.py
@@ -132,11 +132,6 @@ def optimizer_zero_grad(self, batch_idx, optimizer, opt_idx):
model_ref.optimizer_zero_grad(self.trainer.current_epoch, batch_idx, optimizer, opt_idx)

def clip_gradients(self, optimizer, clip_val=None):

if self.trainer.amp_backend == AMPType.NATIVE:
self.trainer.scaler.unscale_(optimizer)

# apply clip gradients
# TODO: separate TPU case from here
self._clip_gradients(optimizer, clip_val)

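The deleted lines above move the native-AMP unscale out of ``clip_gradients`` and into the training loop (see the CHANGELOG entries for #4439). For readers unfamiliar with why the ordering matters, here is a plain-PyTorch sketch of the pattern involved; the model, data loader, and clip value are placeholders, not Lightning internals.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch, target in loader:                       # placeholder data loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(batch), target)       # placeholder model and loss
    scaler.scale(loss).backward()

    # gradients must be unscaled exactly once, right before they are inspected or
    # clipped, which is why the unscale is tied to the actual optimizer step
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)                         # skips the step if grads turned inf/nan
    scaler.update()
```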