Add ColossalAI strategy #14224

Merged
merged 66 commits into from
Oct 11, 2022
66 commits
d78eb95
add colossalai strategy
ver217 Aug 16, 2022
5742d32
polish colossalai plugin's code (#2)
1SAA Aug 17, 2022
1ad8b63
[colossalai] add package available flag and testing conditions (#3)
1SAA Aug 19, 2022
8c27c94
[colossalai] add destroy in teardown
1SAA Aug 19, 2022
735e0dc
[test] add tests and fix bugs (#5)
ver217 Sep 6, 2022
13b4e90
format code and remove useless import
ver217 Sep 6, 2022
4a0e9a8
[tests] add basic tests (#7)
1SAA Sep 6, 2022
8c73106
reformat codes (#8)
ver217 Sep 6, 2022
d320c73
add compatibility check for gradient accumulation (#9)
1SAA Sep 7, 2022
5a0dd7d
[hotfix] fix the CI to run tests (#10)
1SAA Sep 8, 2022
6abecbd
[update] make the interfaces for felxible (#13)
1SAA Sep 9, 2022
91a0e99
fix import error
1SAA Sep 13, 2022
c22f87a
remove amp_level from colossalai precision
1SAA Sep 13, 2022
2a2391c
fix the import error of ColoInitContext (#29)
1SAA Sep 13, 2022
2936576
[update] polished code and updated state_dict function (#30)
1SAA Sep 14, 2022
c42f783
[hotfix] set device before optimizer ceration, add docstrings for mor…
1SAA Sep 14, 2022
d3403c3
add strategy_name, add assert in train, valid acc (#32)
1SAA Sep 15, 2022
453599e
add load_model_state_dict and its unit test (#33)
1SAA Sep 15, 2022
bab3a22
remove redundant on_load_checkpoint functions (#34)
1SAA Sep 15, 2022
ff835bf
update (#35)
1SAA Sep 19, 2022
372dc37
add reduce and all gather methods and fix ci installation
rohitgr7 Sep 29, 2022
6da5978
fix colossalai strategy
rohitgr7 Sep 30, 2022
11db725
Merge branch 'master' into feature/colossalai
rohitgr7 Sep 30, 2022
754b732
fix communication api errors (#41)
1SAA Sep 30, 2022
27aaa96
run gpu ci
rohitgr7 Sep 30, 2022
f4e4e69
run gpu ci
rohitgr7 Sep 30, 2022
402c3b0
run gpu ci
rohitgr7 Sep 30, 2022
f471f58
fix reduce op
rohitgr7 Sep 30, 2022
7b36eeb
fix precision plugin
rohitgr7 Sep 30, 2022
f34dccd
Merge branch 'master' into feature/colossalai
rohitgr7 Sep 30, 2022
e3dce2d
fix reduce function when reduce_op is 'avg' or 'mean' (#42)
1SAA Oct 1, 2022
b4c1205
update dockerfiles
rohitgr7 Oct 3, 2022
e675065
update requirements
rohitgr7 Oct 3, 2022
da59c60
basic docs
rohitgr7 Oct 3, 2022
671e36d
improvements
rohitgr7 Oct 3, 2022
9ead8ff
remove redundant reassignment
rohitgr7 Oct 3, 2022
18efd12
32 precision
rohitgr7 Oct 3, 2022
597368e
revert fp32 pr (#45)
1SAA Oct 3, 2022
739c4b7
add license
rohitgr7 Oct 4, 2022
56c21c8
improvements
rohitgr7 Oct 4, 2022
3016fe4
remove device reassignment
rohitgr7 Oct 4, 2022
ab344d5
mypy
rohitgr7 Oct 4, 2022
2d0b9ac
reviews
rohitgr7 Oct 5, 2022
b6e22ab
Merge branch 'master' into feature/colossalai
rohitgr7 Oct 5, 2022
b41d21e
try fix docker image
rohitgr7 Oct 6, 2022
6a8864e
Merge branch 'master' into feature/colossalai
rohitgr7 Oct 6, 2022
f98b816
fix
rohitgr7 Oct 6, 2022
d9046a7
fix
rohitgr7 Oct 6, 2022
879b17b
dockerfiles
Oct 7, 2022
56c34da
format
Oct 7, 2022
3c989f3
push docker images to hub. REVERT LATER
Oct 7, 2022
28cb217
Revert "push docker images to hub. REVERT LATER"
Oct 7, 2022
96c0955
Merge branch 'master' into feature/colossalai
otaj Oct 8, 2022
d5a9d14
Merge branch 'master' into feature/colossalai
otaj Oct 10, 2022
dc364f8
Merge branch 'master' into feature/colossalai
otaj Oct 10, 2022
9d4d3b5
guard importing colossalai so that it doesn't poision cuda forks
Oct 10, 2022
561ba0b
local imports
Oct 10, 2022
754cfd7
set type checking flag
Oct 10, 2022
10e8118
Merge branch 'master' into feature/colossalai
otaj Oct 10, 2022
5fef75b
Revert "set type checking flag"
Oct 10, 2022
481c372
resolve doc issue
Oct 10, 2022
1de5c47
Apply suggestions from code review
otaj Oct 10, 2022
5d7c9c1
resolve doc issue
Oct 10, 2022
f4d381f
less standalone
Oct 10, 2022
7d4805b
more suggestions
Oct 10, 2022
8a9afc7
one less standalone test
Oct 10, 2022
6 changes: 6 additions & 0 deletions .azure/gpu-tests.yml
@@ -97,6 +97,7 @@ jobs:
set -e
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'bagua' not in line] ; open(fname, 'w').writelines(lines)"
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)"

PYTORCH_VERSION=$(python -c "import torch; print(torch.__version__.split('+')[0])")
python ./requirements/pytorch/adjust-versions.py requirements/pytorch/base.txt ${PYTORCH_VERSION}
@@ -110,6 +111,11 @@ jobs:
CUDA_VERSION_BAGUA=$(python -c "print([ver for ver in [116,113,111,102] if $CUDA_VERSION_MM >= ver][0])")
pip install "bagua-cuda$CUDA_VERSION_BAGUA"

PYTORCH_VERSION_COLOSSALAI=$(python -c "import torch; print(torch.__version__.split('+')[0][:4])")
CUDA_VERSION_MM_COLOSSALAI=$(python -c "import torch ; print(''.join(map(str, torch.version.cuda)))")
CUDA_VERSION_COLOSSALAI=$(python -c "print([ver for ver in [11.3, 11.1] if $CUDA_VERSION_MM_COLOSSALAI >= ver][0])")
pip install "colossalai==0.1.10+torch${PYTORCH_VERSION_COLOSSALAI}cu${CUDA_VERSION_COLOSSALAI}" --find-links https://release.colossalai.org

pip list
env:
PACKAGE_NAME: pytorch
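
Reviewer note: the CI step above derives the ColossalAI wheel tag from the installed PyTorch and CUDA versions. The following is a minimal Python sketch of that selection logic, not code from this PR. The helper name `colossalai_requirement` and its parameters are hypothetical; the `0.1.10` pin and the `[11.3, 11.1]` CUDA list are copied from the shell lines. Unlike the `[:4]` slice in the shell version, the sketch splits on dots, and it assumes the supported CUDA list is ordered newest first.

```python
from typing import Sequence


def colossalai_requirement(
    torch_version: str,
    cuda_version: str,
    supported_cuda: Sequence[str] = ("11.3", "11.1"),
    colossalai_version: str = "0.1.10",
) -> str:
    """Build a pip requirement such as 'colossalai==0.1.10+torch1.12cu11.3'.

    Hypothetical helper that mirrors the shell one-liners; not part of Lightning.
    """
    # Drop any local build suffix ("1.12.1+cu113" -> "1.12.1"), then keep major.minor.
    release = torch_version.split("+")[0]
    major_minor = ".".join(release.split(".")[:2])

    def as_tuple(version: str) -> tuple:
        return tuple(int(part) for part in version.split(".")[:2])

    # Keep the supported CUDA builds that do not exceed the installed CUDA version;
    # the first one is the newest because `supported_cuda` is ordered newest first.
    candidates = [v for v in supported_cuda if as_tuple(cuda_version) >= as_tuple(v)]
    if not candidates:
        raise ValueError(f"No ColossalAI wheel listed for CUDA {cuda_version}")
    return f"colossalai=={colossalai_version}+torch{major_minor}cu{candidates[0]}"
```

The resulting string is what the shell step passes to `pip install ... --find-links https://release.colossalai.org`. Raising a `ValueError` for an unsupported CUDA version is a choice made for the sketch; the shell one-liner would fail with an index error instead.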
36 changes: 30 additions & 6 deletions dockers/base-cuda/Dockerfile
@@ -54,9 +54,10 @@ RUN \
libopenmpi-dev \
openmpi-bin \
ssh \
ninja-build \
libnccl2=$TO_INSTALL_NCCL \
libnccl-dev=$TO_INSTALL_NCCL && \
# Install python
# Install python
add-apt-repository ppa:deadsnakes/ppa && \
apt-get install -y \
python${PYTHON_VERSION} \
@@ -65,7 +66,7 @@ RUN \
&& \
update-alternatives --install /usr/bin/python${PYTHON_VERSION%%.*} python${PYTHON_VERSION%%.*} /usr/bin/python${PYTHON_VERSION} 1 && \
update-alternatives --install /usr/bin/python python /usr/bin/python${PYTHON_VERSION} 1 && \
# Cleaning
# Cleaning
apt-get autoremove -y && \
apt-get clean && \
rm -rf /root/.cache && \
@@ -82,14 +83,15 @@ RUN \
rm get-pip.py && \
pip install -q fire && \
# Disable cache \
CUDA_VERSION_MM=$(python -c "print(''.join('$CUDA_VERSION'.split('.')[:2]))") && \
export CUDA_VERSION_MM=$(python -c "print(''.join('$CUDA_VERSION'.split('.')[:2]))") && \
pip config set global.cache-dir false && \
# set particular PyTorch version
python ./requirements/pytorch/adjust-versions.py requirements/pytorch/base.txt ${PYTORCH_VERSION} && \
python ./requirements/pytorch/adjust-versions.py requirements/pytorch/extra.txt ${PYTORCH_VERSION} && \
python ./requirements/pytorch/adjust-versions.py requirements/pytorch/examples.txt ${PYTORCH_VERSION} && \
# Install all requirements \
pip install -r requirements/pytorch/devel.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html && \

# Install base requirements \
pip install -r requirements/pytorch/base.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html && \
rm assistant.py

ENV \
@@ -108,7 +110,7 @@ RUN \
export HOROVOD_BUILD_CUDA_CC_LIST=${HOROVOD_BUILD_CUDA_CC_LIST//"."/""} && \
echo $HOROVOD_BUILD_CUDA_CC_LIST && \
cmake --version && \
pip install --no-cache-dir -r ./requirements/pytorch/strategies.txt && \
pip install --no-cache-dir horovod && \
horovodrun --check-build

RUN \
@@ -136,6 +138,28 @@ RUN \
if [[ "$CUDA_VERSION_MM" = "$CUDA_VERSION_BAGUA" ]]; then python -c "import bagua_core; bagua_core.install_deps()"; fi && \
python -c "import bagua; print(bagua.__version__)"

RUN \
# install ColossalAI
SHOULD_INSTALL_COLOSSAL=$(python -c "import torch; print(1 if int(torch.__version__.split('.')[1]) > 9 else 0)") && \
if [[ "$SHOULD_INSTALL_COLOSSAL" = "1" ]]; then \
PYTORCH_VERSION_COLOSSALAI=$(python -c "import torch; print(torch.__version__.split('+')[0][:4])") ; \
CUDA_VERSION_MM_COLOSSALAI=$(python -c "import torch ; print(''.join(map(str, torch.version.cuda)))") ; \
CUDA_VERSION_COLOSSALAI=$(python -c "print([ver for ver in [11.3, 11.1] if $CUDA_VERSION_MM_COLOSSALAI >= ver][0])") ; \
pip install "colossalai==0.1.10+torch${PYTORCH_VERSION_COLOSSALAI}cu${CUDA_VERSION_COLOSSALAI}" --find-links https://release.colossalai.org ; \
python -c "import colossalai; print(colossalai.__version__)" ; \
fi

RUN \
# install rest of strategies
# remove colossalai from requirements since they are installed separately
SHOULD_INSTALL_COLOSSAL=$(python -c "import torch; print(1 if int(torch.__version__.split('.')[1]) > 9 else 0)") && \
if [[ "$SHOULD_INSTALL_COLOSSAL" = "0" ]]; then \
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
fi && \
echo "$SHOULD_INSTALL_COLOSSAL" && \
cat requirements/pytorch/strategies.txt && \
pip install -r requirements/pytorch/devel.txt -r requirements/pytorch/strategies.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html

COPY requirements/pytorch/check-avail-extras.py check-avail-extras.py
COPY requirements/pytorch/check-avail-strategies.py check-avail-strategies.py
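
Reviewer note: the `python -c` one-liner used in this Dockerfile (and in the CI config) rewrites `requirements/pytorch/strategies.txt` without the lines that mention a given package. A readable equivalent might look like the sketch below. The function name `drop_requirement` is hypothetical and the code is not part of this PR. It keeps the same plain substring match as the one-liner, so a comment line that mentions the package would be dropped too.

```python
from pathlib import Path


def drop_requirement(path: str, package: str) -> list:
    """Rewrite a requirements file without lines mentioning `package`.

    Hypothetical helper; returns the lines that were kept.
    """
    req_file = Path(path)
    # Keep line endings so the file is written back unchanged apart from the dropped lines.
    lines = req_file.read_text().splitlines(keepends=True)
    kept = [line for line in lines if package not in line]
    req_file.write_text("".join(kept))
    return kept
```

Calling `drop_requirement("requirements/pytorch/strategies.txt", "colossalai")` would correspond to the filter applied when `SHOULD_INSTALL_COLOSSAL` is `0`.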

6 changes: 5 additions & 1 deletion dockers/release/Dockerfile
@@ -41,7 +41,11 @@ RUN \
fi && \
# otherwise there is a collision between the folder name and the package name on PyPI
cd lightning && \
pip install .["extra","loggers","strategies"] --no-cache-dir && \
SHOULD_INSTALL_COLOSSAL=$(python -c "import torch; print(1 if int(torch.__version__.split('.')[1]) > 9 else 0)") && \
if [[ "$SHOULD_INSTALL_COLOSSAL" = "0" ]]; then \
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
fi && \
pip install .["extra","loggers","strategies"] --no-cache-dir --find-links https://release.colossalai.org && \
cd .. && \
rm -rf lightning
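
Reviewer note: the `SHOULD_INSTALL_COLOSSAL` gate checks `int(torch.__version__.split('.')[1]) > 9`, which looks only at the minor version component. For the 1.x releases this amounts to "PyTorch 1.10 or newer". The sketch below expresses the same gate by comparing `(major, minor)`, so a release with a higher major version and a low minor number would also pass. The function name and the `(1, 10)` threshold are assumptions inferred from the `> 9` check, not code from this PR.

```python
def should_install_colossalai(torch_version: str, min_version: tuple = (1, 10)) -> bool:
    """Return True when `torch_version` is at least `min_version` (major, minor).

    Hypothetical helper; the threshold is inferred from the shell check.
    """
    # Strip a local build suffix such as "+cu113" before parsing.
    release = torch_version.split("+")[0]
    major, minor = (int(part) for part in release.split(".")[:2])
    return (major, minor) >= min_version
```

For example, `should_install_colossalai("1.9.1")` is false and `should_install_colossalai("1.12.1+cu113")` is true, matching the shell check for those inputs.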

4 changes: 3 additions & 1 deletion docs/source-pytorch/api_references.rst
@@ -185,6 +185,7 @@ precision
:template: classtemplate.rst

ApexMixedPrecisionPlugin
ColossalAIPrecisionPlugin
DeepSpeedPrecisionPlugin
DoublePrecisionPlugin
FullyShardedNativeMixedPrecisionPlugin
@@ -285,7 +286,7 @@ strategies
:template: classtemplate.rst

BaguaStrategy
HivemindStrategy
ColossalAIStrategy
DDPFullyShardedNativeStrategy
DDPFullyShardedStrategy
DDPShardedStrategy
@@ -294,6 +295,7 @@ strategies
DDPStrategy
DataParallelStrategy
DeepSpeedStrategy
HivemindStrategy
HorovodStrategy
HPUParallelStrategy
IPUStrategy
1 change: 1 addition & 0 deletions docs/source-pytorch/extensions/plugins.rst
@@ -53,6 +53,7 @@ The full list of built-in precision plugins is listed below.
:template: classtemplate.rst

ApexMixedPrecisionPlugin
ColossalAIPrecisionPlugin
DeepSpeedPrecisionPlugin
DoublePrecisionPlugin
FullyShardedNativeMixedPrecisionPlugin
3 changes: 3 additions & 0 deletions docs/source-pytorch/extensions/strategy.rst
@@ -75,6 +75,9 @@ The below table lists all relevant strategies available in Lightning with their
* - collaborative
- :class:`~pytorch_lightning.strategies.HivemindStrategy`
- Strategy for training collaboratively on local machines or unreliable GPUs across the internet. :ref:`Learn more. <strategies/hivemind:Training on unreliable mixed GPUs across the internet>`
* - colossalai
- :class:`~pytorch_lightning.strategies.ColossalAIStrategy`
- Colossal-AI provides a collection of parallel components. It aims to let you write distributed deep learning models the same way you write a model on your laptop. `Learn more. <https://www.colossalai.org/>`__
* - fsdp_native
- :class:`~pytorch_lightning.strategies.DDPFullyShardedNativeStrategy`
- Strategy for Fully Sharded Data Parallel provided by PyTorch. :ref:`Learn more. <advanced/model_parallel:PyTorch Fully Sharded Training>`
1 change: 1 addition & 0 deletions requirements/pytorch/strategies.txt
@@ -1,6 +1,7 @@
# NOTE: the upper bound for the package version is only set for CI stability, and it is dropped while installing this package
# in case you want to preserve/enforce restrictions on the latest compatible version, add "strict" as an in-line comment

colossalai>=0.1.10
fairscale>=0.4.5, <=0.4.6
deepspeed>=0.6.0, <=0.7.0
# no need to install with [pytorch] as pytorch is already installed
2 changes: 2 additions & 0 deletions src/pytorch_lightning/plugins/__init__.py
@@ -5,6 +5,7 @@
from pytorch_lightning.plugins.io.hpu_plugin import HPUCheckpointIO
from pytorch_lightning.plugins.layer_sync import LayerSync, NativeSyncBatchNorm
from pytorch_lightning.plugins.precision.apex_amp import ApexMixedPrecisionPlugin
from pytorch_lightning.plugins.precision.colossalai import ColossalAIPrecisionPlugin
from pytorch_lightning.plugins.precision.deepspeed import DeepSpeedPrecisionPlugin
from pytorch_lightning.plugins.precision.double import DoublePrecisionPlugin
from pytorch_lightning.plugins.precision.fsdp_native_native_amp import FullyShardedNativeNativeMixedPrecisionPlugin
@@ -27,6 +28,7 @@
"XLACheckpointIO",
"HPUCheckpointIO",
"ApexMixedPrecisionPlugin",
"ColossalAIPrecisionPlugin",
"DeepSpeedPrecisionPlugin",
"DoublePrecisionPlugin",
"IPUPrecisionPlugin",
2 changes: 2 additions & 0 deletions src/pytorch_lightning/plugins/precision/__init__.py
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from pytorch_lightning.plugins.precision.apex_amp import ApexMixedPrecisionPlugin
from pytorch_lightning.plugins.precision.colossalai import ColossalAIPrecisionPlugin
from pytorch_lightning.plugins.precision.deepspeed import DeepSpeedPrecisionPlugin
from pytorch_lightning.plugins.precision.double import DoublePrecisionPlugin
from pytorch_lightning.plugins.precision.fsdp_native_native_amp import FullyShardedNativeNativeMixedPrecisionPlugin
@@ -26,6 +27,7 @@

__all__ = [
"ApexMixedPrecisionPlugin",
"ColossalAIPrecisionPlugin",
"DeepSpeedPrecisionPlugin",
"DoublePrecisionPlugin",
"FullyShardedNativeNativeMixedPrecisionPlugin",
90 changes: 90 additions & 0 deletions src/pytorch_lightning/plugins/precision/colossalai.py
@@ -0,0 +1,90 @@
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Any, Callable, Optional, Union

from lightning_utilities.core.rank_zero import WarningCache
from torch import Tensor
from torch.optim import Optimizer

import pytorch_lightning as pl
from lightning_lite.utilities.types import Steppable
from pytorch_lightning.plugins.precision.precision_plugin import PrecisionPlugin
from pytorch_lightning.utilities.enums import PrecisionType

warning_cache = WarningCache()


class ColossalAIPrecisionPlugin(PrecisionPlugin):
    """Precision plugin for ColossalAI integration.

    Args:
        precision: Half precision (16).

    Raises:
        ValueError:
            If precision is not 16.
    """

    def __init__(self, precision: Union[str, int] = 16) -> None:
        if not (precision == PrecisionType.HALF):
            raise ValueError(
                f"`Trainer(strategy='colossalai', precision={precision!r})` is not supported."
                " Consider setting `precision=16`."
            )
        super().__init__()
        self.precision = precision

    def backward(  # type: ignore[override]
        self,
        tensor: Tensor,
        model: "pl.LightningModule",
        optimizer: Optional[Steppable],
        optimizer_idx: Optional[int],
        *args: Any,
        **kwargs: Any,
    ) -> None:
        assert optimizer is not None
        optimizer.backward(tensor)

    def clip_grad_by_norm(self, optimizer: Optimizer, clip_val: Union[int, float]) -> None:
        optimizer.clip_grad_norm(None, clip_val)

    def clip_grad_by_value(self, optimizer: Optimizer, clip_val: Union[int, float]) -> None:
        raise NotImplementedError("`clip_grad_by_value` is not supported by `ColossalAI`")

    def optimizer_step(  # type: ignore[override]
        self,
        optimizer: Steppable,
        model: "pl.LightningModule",
        optimizer_idx: int,
        closure: Callable[[], Any],
        **kwargs: Any,
    ) -> Any:
        closure_result = closure()
        self._after_closure(model, optimizer, optimizer_idx)
        skipped_backward = closure_result is None
        if isinstance(model, pl.LightningModule) and model.automatic_optimization and skipped_backward:
            raise ValueError(
                "Skipping backward by returning `None` from your `training_step` is not supported by `ColossalAI`."
            )
        optimizer.step()

    def _track_grad_norm(self, trainer: "pl.Trainer") -> None:
        if trainer.track_grad_norm == -1:
            return
        # the gradients are not available in the model due to gradient partitioning in zero stage >= 2
        warning_cache.warn(
            f"You set `Trainer(track_grad_norm={trainer.track_grad_norm!r})` but this is not supported for ColossalAI."
            " The setting will be ignored."
        )
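
Reviewer note: `optimizer_step` in the plugin above runs the training closure first, refuses a skipped backward (a closure that returns `None`) under automatic optimization, and only then steps the optimizer. The stand-alone sketch below isolates that control flow so it can be run without Lightning or ColossalAI installed. `FakeOptimizer` and `step_with_closure` are hypothetical stand-ins, not classes or functions from either library.

```python
from typing import Any, Callable, Optional


class FakeOptimizer:
    """Stand-in for a wrapped optimizer; it only counts how often `step` is called."""

    def __init__(self) -> None:
        self.steps = 0

    def step(self) -> None:
        self.steps += 1


def step_with_closure(
    optimizer: FakeOptimizer,
    closure: Callable[[], Optional[Any]],
    automatic_optimization: bool = True,
) -> None:
    """Run the closure, then step, rejecting a skipped backward under automatic optimization."""
    closure_result = closure()
    # A closure that returns None signals that the training step skipped backward.
    if automatic_optimization and closure_result is None:
        raise ValueError("Skipping backward by returning `None` is not supported.")
    optimizer.step()
```

With a closure that returns a loss value the optimizer steps once; with a closure that returns `None` the call raises before `step` is reached, so no partial update happens.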
1 change: 1 addition & 0 deletions src/pytorch_lightning/strategies/__init__.py
@@ -13,6 +13,7 @@
# limitations under the License.
from lightning_lite.strategies.registry import _StrategyRegistry
from pytorch_lightning.strategies.bagua import BaguaStrategy # noqa: F401
from pytorch_lightning.strategies.colossalai import ColossalAIStrategy # noqa: F401
from pytorch_lightning.strategies.ddp import DDPStrategy # noqa: F401
from pytorch_lightning.strategies.ddp_spawn import DDPSpawnStrategy # noqa: F401
from pytorch_lightning.strategies.deepspeed import DeepSpeedStrategy # noqa: F401
13 changes: 13 additions & 0 deletions src/pytorch_lightning/strategies/bagua.py
@@ -1,3 +1,16 @@
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
from typing import Any, Dict, List, Optional, Union