+
+
+.. Add callout items below this line
+
+.. displayitem::
+ :header: Prepare your code (Optional)
+ :description: Prepare your code to run on any hardware
+ :col_css: col-md-4
+ :button_link: accelerator_prepare.html
+ :height: 150
+ :tag: basic
+
+.. displayitem::
+ :header: Basic
+ :description: Learn the basics of single and multi-GPU training.
+ :col_css: col-md-4
+ :button_link: gpu_basic.html
+ :height: 150
+ :tag: basic
+
+.. displayitem::
+ :header: Intermediate
+ :description: Learn about different distributed strategies, torchelastic and how to optimize communication layers.
+ :col_css: col-md-4
+ :button_link: gpu_intermediate.html
+ :height: 150
+ :tag: intermediate
+
+.. displayitem::
+ :header: Advanced
+ :description: Train 1 trillion+ parameter models with these techniques.
+ :col_css: col-md-4
+ :button_link: gpu_advanced.html
+ :height: 150
+ :tag: advanced
+
+.. displayitem::
+ :header: Expert
+ :description: Develop new strategies for training and deploying larger and larger models.
+ :col_css: col-md-4
+ :button_link: gpu_expert.html
+ :height: 150
+ :tag: expert
+
+.. displayitem::
+ :header: FAQ
+ :description: Frequently asked questions about GPU training.
+ :col_css: col-md-4
+ :button_link: gpu_faq.html
+ :height: 150
-.. code-block:: python
-
- #!/usr/bin/env python
- # setup.py
-
- from setuptools import setup, find_packages
-
- setup(
- name="src",
- version="0.0.1",
- description="Describe Your Cool Project",
- author="",
- author_email="",
- url="https://github.com/YourSeed", # REPLACE WITH YOUR OWN GITHUB PROJECT LINK
- install_requires=["pytorch-lightning"],
- packages=find_packages(),
- )
-
-2. Set up your project like so:
-
-.. code-block:: bash
-
- /project
- /src
- some_file.py
- /or_a_folder
- setup.py
-
-3. Install as a root-level package
-
-.. code-block:: bash
-
- cd /project
- pip install -e .
-
-You can then call your scripts from anywhere:
-
-.. code-block:: bash
-
- cd /project/src
- python some_file.py --accelerator 'gpu' --devices 8 --strategy 'ddp'
-
-
-Horovod
-^^^^^^^
-`Horovod `_ allows the same training script to be used for single-GPU,
-multi-GPU, and multi-node training.
-
-Like Distributed Data Parallel, every process in Horovod operates on a single GPU with a fixed
-subset of the data. Gradients are averaged across all GPUs in parallel during the backward pass,
-then synchronously applied before beginning the next step.
-
-The number of worker processes is configured by a driver application (`horovodrun` or `mpirun`). In
-the training script, Horovod will detect the number of workers from the environment, and automatically
-scale the learning rate to compensate for the increased total batch size.
-
-Horovod can be configured in the training script to run with any number of GPUs / processes as follows:
-
-.. code-block:: python
-
- # train Horovod on GPU (number of GPUs / machines provided on command-line)
- trainer = Trainer(strategy="horovod", accelerator="gpu", devices=1)
-
- # train Horovod on CPU (number of processes / machines provided on command-line)
- trainer = Trainer(strategy="horovod")
-
-When starting the training job, the driver application will then be used to specify the total
-number of worker processes:
-
-.. code-block:: bash
-
- # run training with 4 GPUs on a single machine
- horovodrun -np 4 python train.py
-
- # run training with 8 GPUs on two machines (4 GPUs each)
- horovodrun -np 8 -H hostname1:4,hostname2:4 python train.py
-
-See the official `Horovod documentation `_ for details
-on installation and performance tuning.
-
-
-Bagua
-^^^^^
-`Bagua `_ is a deep learning training acceleration framework which supports
-multiple advanced distributed training algorithms including:
-
-- `Gradient AllReduce `_ for centralized synchronous communication, where gradients are averaged among all workers.
-- `Decentralized SGD `_ for decentralized synchronous communication, where each worker exchanges data with one or a few specific workers.
-- `ByteGrad `_ and `QAdam `_ for low precision communication, where data is compressed into low precision before communication.
-- `Asynchronous Model Average `_ for asynchronous communication, where workers are not required to be synchronized in the same iteration in a lock-step style.
-
-By default, Bagua uses the *Gradient AllReduce* algorithm, which is also the algorithm implemented in Distributed Data Parallel and Horovod,
-but Bagua can usually produce higher training throughput due to its backend being written in Rust.
-
-.. code-block:: python
-
- # train on 4 GPUs (using Bagua mode)
- trainer = Trainer(strategy="bagua", accelerator="gpu", devices=4)
-
-
-By specifying the ``algorithm`` in the ``BaguaStrategy``, you can select more advanced training algorithms featured by Bagua:
-
-
-.. code-block:: python
-
- # train on 4 GPUs, using Bagua Gradient AllReduce algorithm
- trainer = Trainer(
- strategy=BaguaStrategy(algorithm="gradient_allreduce"),
- accelerator="gpu",
- devices=4,
- )
-
- # train on 4 GPUs, using Bagua ByteGrad algorithm
- trainer = Trainer(
- strategy=BaguaStrategy(algorithm="bytegrad"),
- accelerator="gpu",
- devices=4,
- )
-
- # train on 4 GPUs, using Bagua Decentralized SGD
- trainer = Trainer(
- strategy=BaguaStrategy(algorithm="decentralized"),
- accelerator="gpu",
- devices=4,
- )
-
- # train on 4 GPUs, using Bagua Low Precision Decentralized SGD
- trainer = Trainer(
- strategy=BaguaStrategy(algorithm="low_precision_decentralized"),
- accelerator="gpu",
- devices=4,
- )
-
- # train on 4 GPUs, using Asynchronous Model Average algorithm, with a synchronization interval of 100ms
- trainer = Trainer(
- strategy=BaguaStrategy(algorithm="async", sync_interval_ms=100),
- accelerator="gpu",
- devices=4,
- )
-
-To use *QAdam*, we need to initialize
-`QAdamOptimizer `_ first:
-
-.. code-block:: python
-
- import pytorch_lightning as pl
- from pytorch_lightning.strategies import BaguaStrategy
- from bagua.torch_api.algorithms.q_adam import QAdamOptimizer
-
-
- class MyModel(pl.LightningModule):
- ...
-
- def configure_optimizers(self):
- # initialize QAdam Optimizer
- return QAdamOptimizer(self.parameters(), lr=0.05, warmup_steps=100)
-
-
- model = MyModel()
- trainer = Trainer(
- accelerator="gpu",
- devices=4,
- strategy=BaguaStrategy(algorithm="qadam"),
- )
- trainer.fit(model)
-
-Bagua relies on its own `launcher `_ to schedule jobs.
-Below, find examples using ``bagua.distributed.launch``, which follows the ``torch.distributed.launch`` API:
-
-.. code-block:: bash
-
- # start training with 8 GPUs on a single node
- python -m bagua.distributed.launch --nproc_per_node=8 train.py
-
-If the ssh service is available with passwordless login on each node, you can launch the distributed job on a
-single node with ``baguarun``, which has a similar syntax to ``mpirun``. When starting the job, ``baguarun`` will
-automatically spawn new processes on each of the training nodes given by the ``--host_list`` option, where each node
-is described as an IP address followed by an SSH port.
-
-.. code-block:: bash
-
- # Run on node1 (or node2) to start training on two nodes (node1 and node2), 8 GPUs per node
- baguarun --host_list hostname1:ssh_port1,hostname2:ssh_port2 --nproc_per_node=8 --master_port=port1 train.py
-
-
-.. note:: You can also start training in the same way as Distributed Data Parallel. However, system optimizations like
- `Bagua-Net `_ and
- `Performance autotuning `_ can only be enabled through the Bagua
- launcher. It is worth noting that with ``Bagua-Net``, Distributed Data Parallel can also achieve
- better performance without modifying the training script.
-
-
-See `Bagua Tutorials `_ for more details on installation and advanced features.
-
-
-DP/DDP2 caveats
-^^^^^^^^^^^^^^^
-In DP and DDP2, each GPU within a machine sees a portion of a batch.
-DP and DDP2 roughly do the following:
-
-.. testcode::
-
- def distributed_forward(batch, model):
- batch = torch.Tensor(32, 8)
- gpu_0_batch = batch[:8]
- gpu_1_batch = batch[8:16]
- gpu_2_batch = batch[16:24]
- gpu_3_batch = batch[24:]
-
- y_0 = model_copy_gpu_0(gpu_0_batch)
- y_1 = model_copy_gpu_1(gpu_1_batch)
- y_2 = model_copy_gpu_2(gpu_2_batch)
- y_3 = model_copy_gpu_3(gpu_3_batch)
-
- return [y_0, y_1, y_2, y_3]
-
-So, when Lightning calls any of `training_step`, `validation_step`, or `test_step`,
-you will only be operating on one of those pieces.
-
-.. testcode::
-
- # the batch here is a portion of the FULL batch
- def training_step(self, batch, batch_idx):
- y_0 = batch
-
-For most metrics, this doesn't really matter. However, if you want
-to add something to your computational graph (like softmax)
-using all batch parts, you can use the `training_step_end` step.
-
-.. testcode::
-
-    def training_step_end(self, outputs):
-        # only use when on dp
-        outputs = torch.cat(outputs, dim=1)
-        softmax = torch.softmax(outputs, dim=1)
-        out = softmax.mean()
-        return out
-
-In pseudocode, the full sequence is:
-
-.. code-block:: python
-
- # get data
- batch = next(dataloader)
-
- # copy model and data to each gpu
- batch_splits = split_batch(batch, num_gpus)
- models = copy_model_to_gpus(model)
-
- # in parallel, operate on each batch chunk
- all_results = []
- for gpu_num in gpus:
- batch_split = batch_splits[gpu_num]
- gpu_model = models[gpu_num]
- out = gpu_model(batch_split)
- all_results.append(out)
-
- # use the full batch for something like softmax
- full_out = model.training_step_end(all_results)
-
-To illustrate why this is needed, let's look at DataParallel:
-
-.. testcode::
-
-    def training_step(self, batch, batch_idx):
-        x, y = batch
-        y_hat = self(x)
-
-        # on dp or ddp2, computing the softmax here would be wrong
-        # because `batch` is actually a piece of the full batch
-        return y_hat
-
-
-    def training_step_end(self, step_output):
-        # step_output has the outputs of each part of the batch
-
-        # do the softmax here, on the full batch
-        outputs = torch.cat(step_output, dim=1)
-        softmax = torch.softmax(outputs, dim=1)
-        out = softmax.mean()
-
-        return out
-
-If `training_step_end` is defined, it will be called regardless of the accelerator or strategy (TPU, DP, DDP, etc.),
-which means your code will behave the same no matter the backend.
-
-The validation and test steps have equivalent hooks (`validation_step_end`, `test_step_end`) when using DP.
-
-.. testcode::
-
- def validation_step_end(self, step_output):
- ...
-
-
- def test_step_end(self, step_output):
- ...
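-
-As a minimal sketch (mirroring the `training_step_end` example above, and assuming ``step_output``
-holds the per-GPU batch parts), the validation hook could aggregate like this; ``test_step_end``
-follows the same pattern:
-
-.. code-block:: python
-
-    def validation_step_end(self, step_output):
-        # combine the per-GPU parts, then compute the metric on the full batch
-        outputs = torch.cat(step_output, dim=1)
-        probs = torch.softmax(outputs, dim=1)
-        self.log("val_mean_prob", probs.mean())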
-
-
-Distributed and 16-bit precision
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Due to an issue between Apex and DataParallel (on the PyTorch and NVIDIA side), Lightning does
-not allow 16-bit precision with DP training. We tried to get this to work, but it's an issue on their end.
-
-Below are the possible configurations we support.
-
-+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
-| 1 GPU | 1+ GPUs | DP | DDP | 16-bit | command |
-+=======+=========+=====+=====+========+=======================================================================+
-| Y | | | | | `Trainer(accelerator="gpu", devices=1)` |
-+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
-| Y | | | | Y | `Trainer(accelerator="gpu", devices=1, precision=16)` |
-+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
-| | Y | Y | | | `Trainer(accelerator="gpu", devices=k, strategy='dp')` |
-+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
-| | Y | | Y | | `Trainer(accelerator="gpu", devices=k, strategy='ddp')` |
-+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
-| | Y | | Y | Y | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` |
-+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
-
-
-Implement Your Own Distributed (DDP) training
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-If you need your own way to init PyTorch DDP you can override :meth:`pytorch_lightning.strategies.ddp.DDPStrategy.init_dist_connection`.
-
-If you also need to use your own DDP implementation, override :meth:`pytorch_lightning.strategies.ddp.DDPStrategy.configure_ddp`.
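-
-As a minimal sketch (the ``print`` call and the bare ``DistributedDataParallel`` wrapping are
-placeholders for your own logic, and the hook signatures are kept generic on purpose):
-
-.. code-block:: python
-
-    import torch
-    from pytorch_lightning import Trainer
-    from pytorch_lightning.strategies import DDPStrategy
-
-
-    class CustomDDPStrategy(DDPStrategy):
-        def init_dist_connection(self, *args, **kwargs):
-            # set up the process group your own way; here we simply
-            # fall back to the default initialization
-            print("initializing distributed connection")
-            super().init_dist_connection(*args, **kwargs)
-
-        def configure_ddp(self):
-            # wrap the model with your own DistributedDataParallel variant
-            self.model = torch.nn.parallel.DistributedDataParallel(self.model)
-
-
-    trainer = Trainer(strategy=CustomDDPStrategy(), accelerator="gpu", devices=8)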
-
-
-Batch size
-----------
-When using distributed training, make sure to adjust your learning rate according to your effective
-batch size.
-
-Let's say you have a batch size of 7 in your dataloader.
-
-.. testcode::
-
- class LitModel(LightningModule):
- def train_dataloader(self):
- return DataLoader(..., batch_size=7)
-
-In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED, or Horovod, your effective batch size will be 7 * devices * num_nodes.
-
-.. code-block:: python
-
- # effective batch size = 7 * 8
- Trainer(accelerator="gpu", devices=8, strategy="ddp")
- Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn")
- Trainer(accelerator="gpu", devices=8, strategy="ddp_sharded")
- Trainer(accelerator="gpu", devices=8, strategy="horovod")
-
- # effective batch size = 7 * 8 * 10
- Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp")
- Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp_spawn")
- Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp_sharded")
- Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="horovod")
-
-In DDP2 or DP, your effective batch size will be 7 * num_nodes.
-The reason is that the full batch is visible to all GPUs on the node when using DDP2.
-
-.. code-block:: python
-
- # effective batch size = 7
- Trainer(accelerator="gpu", devices=8, strategy="ddp2")
- Trainer(accelerator="gpu", devices=8, strategy="dp")
-
- # effective batch size = 7 * 10
- Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp2")
- Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="dp")
-
-
-.. note:: Huge batch sizes are actually really bad for convergence. Check out:
- `Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour `_
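-
-A common heuristic from the paper above is the linear scaling rule: grow the learning rate with the number
-of processes (i.e. with the effective batch size). A minimal sketch, where the base learning rate of 0.1 and
-the linear rule itself are illustrative assumptions you should tune for your model:
-
-.. code-block:: python
-
-    import torch
-    from pytorch_lightning import LightningModule, Trainer
-
-
-    class LitModel(LightningModule):
-        def __init__(self, base_lr=0.1, num_devices=8, num_nodes=10):
-            super().__init__()
-            self.layer = torch.nn.Linear(32, 2)
-            # linear scaling rule: scale the base LR by the number of processes
-            self.scaled_lr = base_lr * num_devices * num_nodes
-
-        def configure_optimizers(self):
-            return torch.optim.SGD(self.parameters(), lr=self.scaled_lr)
-
-
-    model = LitModel(base_lr=0.1, num_devices=8, num_nodes=10)
-    trainer = Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp")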
-
-----------
-
-Torch Distributed Elastic
--------------------------
-Lightning supports the use of Torch Distributed Elastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the 'ddp' or 'ddp2' backend and the number of GPUs you want to use in the trainer.
-
-.. code-block:: python
-
- Trainer(accelerator="gpu", devices=8, strategy="ddp")
-
-To launch a fault-tolerant job, run the following on all nodes.
-
-.. code-block:: bash
-
-    python -m torch.distributed.run \
-        --nnodes=NUM_NODES \
-        --nproc_per_node=TRAINERS_PER_NODE \
-        --rdzv_id=JOB_ID \
-        --rdzv_backend=c10d \
-        --rdzv_endpoint=HOST_NODE_ADDR \
-        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)
-
-To launch an elastic job, run the following on at least ``MIN_SIZE`` nodes and at most ``MAX_SIZE`` nodes.
-
-.. code-block:: bash
-
-    python -m torch.distributed.run \
-        --nnodes=MIN_SIZE:MAX_SIZE \
-        --nproc_per_node=TRAINERS_PER_NODE \
-        --rdzv_id=JOB_ID \
-        --rdzv_backend=c10d \
-        --rdzv_endpoint=HOST_NODE_ADDR \
-        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)
-
-See the official `Torch Distributed Elastic documentation `_ for details
-on installation and more use cases.
-
-----------
-
-Jupyter Notebooks
------------------
-Unfortunately, none of the `ddp_*` strategies are supported in Jupyter notebooks. Please use `dp` for multiple GPUs. This is a known
-Jupyter issue. If you feel like taking a stab at adding this support, feel free to submit a PR!
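-
-For example, inside a notebook cell (a minimal sketch; the device count is arbitrary):
-
-.. code-block:: python
-
-    from pytorch_lightning import Trainer
-
-    # `dp` works interactively because it runs in a single process
-    trainer = Trainer(accelerator="gpu", devices=2, strategy="dp")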
-
-----------
-
-Pickle Errors
---------------
-Multi-GPU training sometimes requires your model to be pickled. If you run into an issue with pickling,
-try the following to figure out where it breaks:
-
-.. code-block:: python
-
- import pickle
-
- model = YourModel()
- pickle.dumps(model)
+.. raw:: html
-However, if you use `ddp`, there is no pickling requirement and you should be fine. If you use `ddp_spawn`, the
-pickling requirement remains. This is a limitation of Python.
+
+
diff --git a/docs/source/accelerators/gpu_advanced.rst b/docs/source/accelerators/gpu_advanced.rst
new file mode 100644
index 00000000000000..eadeb03edd7ce9
--- /dev/null
+++ b/docs/source/accelerators/gpu_advanced.rst
@@ -0,0 +1,16 @@
+:orphan:
+
+.. _gpu_advanced:
+
+GPU training (Advanced)
+=======================
+**Audience:** Users looking to scale massive models (i.e., 1 trillion+ parameters).
+
+----
+
+For experts pushing the state-of-the-art in model development, Lightning offers various techniques to enable training models at the trillion+ parameter scale.
+
+----
+
+..
+ .. include:: ../advanced/model_parallel.rst
diff --git a/docs/source/accelerators/gpu_basic.rst b/docs/source/accelerators/gpu_basic.rst
new file mode 100644
index 00000000000000..43be718180aa97
--- /dev/null
+++ b/docs/source/accelerators/gpu_basic.rst
@@ -0,0 +1,97 @@
+:orphan:
+
+.. _gpu_basic:
+
+GPU training (Basic)
+====================
+**Audience:** Users looking to save money and run large models faster using single or multiple GPUs.
+
+----
+
+What is a GPU?
+--------------
+A Graphics Processing Unit (GPU) is a specialized hardware accelerator designed to speed up mathematical computations used in gaming and deep learning.
+
+----
+
+Train on 1 GPU
+--------------
+
+Make sure you're running on a machine with at least one GPU. There's no need to specify any NVIDIA flags
+as Lightning will do it for you.
+
+.. testcode::
+ :skipif: torch.cuda.device_count() < 1
+
+ trainer = Trainer(accelerator="gpu", devices=1)
+
+----------------
+
+
+.. _multi_gpu:
+
+Train on multiple GPUs
+----------------------
+
+To use multiple GPUs, set the number of devices in the Trainer or the indices of the GPUs to use.
+
+.. code::
+
+ trainer = Trainer(accelerator="gpu", devices=4)
+
+Choosing GPU devices
+^^^^^^^^^^^^^^^^^^^^
+
+You can select the GPU devices using ranges, a list of indices, or a string containing
+a comma-separated list of GPU ids:
+
+.. testsetup::
+
+ k = 1
+
+.. testcode::
+ :skipif: torch.cuda.device_count() < 2
+
+ # DEFAULT (int) specifies how many GPUs to use per node
+ Trainer(accelerator="gpu", devices=k)
+
+ # Above is equivalent to
+ Trainer(accelerator="gpu", devices=list(range(k)))
+
+ # Specify which GPUs to use (don't use when running on cluster)
+ Trainer(accelerator="gpu", devices=[0, 1])
+
+ # Equivalent using a string
+ Trainer(accelerator="gpu", devices="0, 1")
+
+ # To use all available GPUs put -1 or '-1'
+ # equivalent to list(range(torch.cuda.device_count()))
+ Trainer(accelerator="gpu", devices=-1)
+
+The table below lists examples of possible input formats and how they are interpreted by Lightning.
+
++------------------+-----------+---------------------+---------------------------------+
+| `devices` | Type | Parsed | Meaning |
++==================+===========+=====================+=================================+
+| 3 | int | [0, 1, 2] | first 3 GPUs |
++------------------+-----------+---------------------+---------------------------------+
+| -1 | int | [0, 1, 2, ...] | all available GPUs |
++------------------+-----------+---------------------+---------------------------------+
+| [0] | list | [0] | GPU 0 |
++------------------+-----------+---------------------+---------------------------------+
+| [1, 3] | list | [1, 3] | GPUs 1 and 3 |
++------------------+-----------+---------------------+---------------------------------+
+| "3" | str | [0, 1, 2] | first 3 GPUs |
++------------------+-----------+---------------------+---------------------------------+
+| "1, 3" | str | [1, 3] | GPUs 1 and 3 |
++------------------+-----------+---------------------+---------------------------------+
+| "-1" | str | [0, 1, 2, ...] | all available GPUs |
++------------------+-----------+---------------------+---------------------------------+
+
+.. note::
+
+ When specifying the number of ``devices`` as an integer (``devices=k``), setting the trainer flag
+ ``auto_select_gpus=True`` will automatically help you find ``k`` GPUs that are not
+ occupied by other processes. This is especially useful when GPUs are configured
+ to be in "exclusive mode", such that only one process at a time can access them.
+ For more details see the :doc:`trainer guide <../common/trainer>`.
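+
+For example, a minimal sketch that asks Lightning to pick two free GPUs for you (the device count is arbitrary):
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+
+    # find 2 GPUs that are not occupied by other processes and train on them
+    trainer = Trainer(accelerator="gpu", devices=2, auto_select_gpus=True)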
diff --git a/docs/source/accelerators/gpu_expert.rst b/docs/source/accelerators/gpu_expert.rst
new file mode 100644
index 00000000000000..947850b13f65fe
--- /dev/null
+++ b/docs/source/accelerators/gpu_expert.rst
@@ -0,0 +1,23 @@
+:orphan:
+
+.. _gpu_expert:
+
+GPU training (Expert)
+=====================
+**Audience:** Experts creating new scaling techniques such as DeepSpeed or FSDP.
+
+----
+
+Lightning enables experts who research new ways of optimizing distributed training and inference to implement their own strategies and plug them into Lightning.
+
+For example, Lightning worked closely with the Microsoft team to develop a DeepSpeed integration, and with the Facebook (Meta) team to develop an FSDP integration.
+
+
+----
+
+.. include:: ../extensions/strategy.rst
+
+
+----
+
+.. include:: ../advanced/strategy_registry.rst
diff --git a/docs/source/accelerators/gpu_faq.rst b/docs/source/accelerators/gpu_faq.rst
new file mode 100644
index 00000000000000..c697b2ca7b3549
--- /dev/null
+++ b/docs/source/accelerators/gpu_faq.rst
@@ -0,0 +1,97 @@
+:orphan:
+
+.. _gpu_faq:
+
+GPU training (FAQ)
+==================
+
+******************************************************************
+How should I adjust the learning rate when using multiple devices?
+******************************************************************
+
+When using distributed training, make sure to adjust your learning rate according to your effective
+batch size.
+
+Let's say you have a batch size of 7 in your dataloader.
+
+.. testcode::
+
+ class LitModel(LightningModule):
+ def train_dataloader(self):
+ return DataLoader(..., batch_size=7)
+
+In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED, or Horovod, your effective batch size will be 7 * devices * num_nodes.
+
+.. code-block:: python
+
+ # effective batch size = 7 * 8
+ Trainer(accelerator="gpu", devices=8, strategy="ddp")
+ Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn")
+ Trainer(accelerator="gpu", devices=8, strategy="ddp_sharded")
+ Trainer(accelerator="gpu", devices=8, strategy="horovod")
+
+ # effective batch size = 7 * 8 * 10
+ Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp")
+ Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp_spawn")
+ Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp_sharded")
+ Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="horovod")
+
+In DDP2 or DP, your effective batch size will be 7 * num_nodes.
+The reason is that the full batch is visible to all GPUs on the node when using DDP2.
+
+.. code-block:: python
+
+ # effective batch size = 7
+ Trainer(accelerator="gpu", devices=8, strategy="ddp2")
+ Trainer(accelerator="gpu", devices=8, strategy="dp")
+
+ # effective batch size = 7 * 10
+ Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="ddp2")
+ Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy="dp")
+
+
+.. note:: Huge batch sizes are actually really bad for convergence. Check out:
+ `Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour