chore: remove PyTorch 2.5.0 checks #1877

Merged
7 commits merged on Nov 18, 2024
5 changes: 3 additions & 2 deletions docs/source/tutorials/memory_optimizations.rst
@@ -108,8 +108,9 @@ tensors will be offloaded.

*Sounds great! How do I use it?*

To enable activation offloading, use ``enable_activation_offloading=True``. If you are on torch
version later than PyTorch 2.5.0, it will allow the usage of multiple CUDA streams automatically.
To enable activation offloading, use the ``enable_activation_offloading`` config entry or flag
in our lora finetuning single device recipe, e.g. ``enable_activation_offloading=True``. To allow
usage of streams, make sure you are on a torch version equal to or later than PyTorch.
Contributor:
Suggested change:
- usage of streams, make sure you are on a torch version equal to or later than PyTorch.
+ usage of streams, make sure you are on a torch version equal to or later than PyTorch 2.5.0.

Contributor:
bumping this

Collaborator:
I think we need a merge here? The docs in main read:

To enable activation offloading, use enable_activation_offloading=True. If you are on torch version later than PyTorch 2.5.0, it will allow the usage of multiple CUDA streams automatically.

Collaborator:
I believe we can just revert the changes here to leave the file as-is since it's been updated in another PR @JP-sDEV


.. _glossary_grad_accm:

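For reference, a minimal sketch of how an ``enable_activation_offloading`` flag could be wired to the offloading context manager inside a recipe. This is illustrative rather than the recipe's exact code, and it assumes ``OffloadActivations`` is exposed as ``torchtune.training.OffloadActivations``:

    import contextlib

    from torchtune import training

    enable_activation_offloading = True  # hypothetical value read from the recipe config

    # Offload activations when the flag is set; otherwise use a no-op context.
    activations_handling_ctx = (
        training.OffloadActivations(use_streams=True)
        if enable_activation_offloading
        else contextlib.nullcontext()
    )

    with activations_handling_ctx:
        pass  # the model forward pass would run here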
6 changes: 3 additions & 3 deletions recipes/lora_finetune_distributed.py
@@ -69,9 +69,9 @@ class LoRAFinetuneRecipeDistributed(FTRecipeInterface):
back during the backward pass. As always, there is a tradeoff--these savings in memory can
come at the cost of training performance and CPU resources. To recover some runtime cost,
we've added an option to enable offloading on a different stream to permit overlapping with
the computation. This option is currently only available on PyTorch 2.5 or later and will
be enabled by default if an acceptable torch version is found. Activation offloading can be
used in conjunction with activation checkpointing.
the computation. This option is currently only available on PyTorch 2.5.0 or later and will be
enabled by default if an acceptable torch version is found. Activation offloading can be used in
conjunction with activation checkpointing.

- Precision. Full fp32 and bf16 training are supported. Precision is controlled using the ``dtype``
flag. When ``dtype=bf16``, all activations, gradients and optimizer states are in bfloat16. In
4 changes: 2 additions & 2 deletions tests/torchtune/modules/test_attention_utils.py
@@ -84,7 +84,7 @@ def test_packed_block_causal_mask_sdpa(self, seq_lens):

    @pytest.mark.skipif(
Contributor:
OK, looks like we need to keep this check in case the hardware that runs the GPU tests on GitHub CI does not support flex attention.

        not _SUPPORTS_FLEX_ATTENTION,
        reason="Please install a nightly build of torch (>=2.5.0) to run this test.",
        reason="Hardware does not support Flex Attention.",
    )
    @gpu_test(gpu_count=1)
    def test_packed_block_causal_mask_flex(self):
@@ -100,7 +100,7 @@ def test_packed_block_causal_mask_flex(self):
class TestSDPAOrFlexAttention:
    @pytest.mark.skipif(
        not _SUPPORTS_FLEX_ATTENTION,
        reason="Please install a nightly build of torch (>=2.5.0) to run this test.",
        reason="Hardware does not support Flex Attention.",
    )
    @mock.patch("torchtune.modules.attention_utils.compile_friendly_flex_attention")
    @mock.patch(
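To show the new skip condition in isolation, a small sketch assuming the flag is importable from torchtune.utils._import_guard (as in the module changed below) and using a hypothetical test name:

    import pytest

    from torchtune.utils._import_guard import _SUPPORTS_FLEX_ATTENTION

    @pytest.mark.skipif(
        not _SUPPORTS_FLEX_ATTENTION,
        reason="Hardware does not support Flex Attention.",
    )
    def test_flex_attention_path():
        # Runs only when CUDA is available and the device is SM75 (Turing) or newer.
        ...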
6 changes: 3 additions & 3 deletions torchtune/modules/attention_utils.py
@@ -115,9 +115,9 @@ def packed_block_causal_mask(
    seq_lens: List[torch.Tensor],
SalmanMohammadi marked this conversation as resolved.
) -> _MaskType:
"""
Create a block causal document mask for a batch of packed sequences. If on
torch version >= 2.5.0, this is done by creating a mask_mod function with the
block causal logic and passing this into :func:`torch.nn.attention.flex_attention.create_block_mask`.
Create a block causal document mask for a batch of packed sequences. If
flex attention is supported by the current hardware, this is done by creating a mask_mod
function with the block causal logic and passing it into :func:`torch.nn.attention.flex_attention.create_block_mask`.
The resultant BlockMask is a compressed representation of the full block causal
mask. If on an older version, a standard 2D block causal mask is created and returned.

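A hedged usage sketch of the function documented above; the seq_lens values are made up for illustration:

    import torch

    from torchtune.modules.attention_utils import packed_block_causal_mask

    # Two packed samples, each made of two documents of the given lengths.
    seq_lens = [torch.tensor([3, 5]), torch.tensor([4, 4])]
    mask = packed_block_causal_mask(seq_lens=seq_lens)
    # Yields a BlockMask when flex attention is supported, otherwise a standard block causal mask.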
6 changes: 1 addition & 5 deletions torchtune/modules/common_utils.py
@@ -149,11 +149,7 @@ def _register_reparametrize_state_dict_hooks(
        RuntimeError: If the low RAM reparametrize hook is used on Windows or an incompatible torch version.
    """
    if _use_low_cpu_ram:
        if torch.__version__ < "2.5.0.dev20240906":
            raise RuntimeError(
                "Low RAM reparametrize_as_dtype_state_dict_post_hook requires PyTorch 2.5.0.dev20240906 or later."
            )
        elif sys.platform == "win32":
        if sys.platform == "win32":
            # mmap.MAP_SHARED is not supported on Windows but this change targets colab.
            raise RuntimeError(
                "Low RAM reparametrize_as_dtype_state_dict_post_hook is not supported on Windows."
19 changes: 6 additions & 13 deletions torchtune/training/_activation_offloading.py
@@ -5,7 +5,7 @@
# LICENSE file in the root directory of this source tree.

import contextlib
from typing import Optional, Union
from typing import Union
from warnings import warn

import psutil
@@ -38,9 +38,9 @@ class OffloadActivations(saved_tensors_hooks):
memory on the CPU. Pinned memory allows the Tensor to be moved back onto GPU more quickly
but is a limited resource. Default: True.

use_streams (Optional[bool]): Whether or not to use streams for performance optimization where
use_streams (bool): Whether or not to use streams for performance optimization where
the communications get overlapped with the computation. Requires a torch build
after torch-2.5.0.dev20240907. Default: True if a later torch build is found, else False.
after torch-2.5.0.]. Default: True.
Contributor:
Suggested change:
- after torch-2.5.0.]. Default: True.
+ after torch-2.5.0. Default: True.

Contributor:
bumping this


max_fwd_stash_size (int): The maximum size of the forward stash, or the maximum number of
consecutive activations to keep alive during the forward pass. This number must be at
@@ -67,15 +67,12 @@ class OffloadActivations(saved_tensors_hooks):
    def __init__(
        self,
        use_pin_memory: bool = True,
        use_streams: Optional[bool] = None,
        use_streams: bool = True,
        max_fwd_stash_size: int = 5,
        min_offload_size: int = 1024,
    ) -> None:
Contributor:
let's remove the check below and make use_streams: bool = True

@JP-sDEV (Contributor, Author) on Nov 14, 2024:
are you referring to this use_streams check found in __init__?

        if use_streams is None:
            # Default to True if an acceptable torch is installed (later nightly/version or from source)
            self.use_streams = torch.__version__ >= "2.5.0.dev20240907"
        else:
            self.use_streams = use_streams

or should it be changed like this, where use_streams is set to False?

        if use_streams is False:
            # Default to True if an acceptable torch is installed (later nightly/version or from source)
            self.use_streams = torch.__version__ >= "2.5.0.dev20240907"
        else:
            self.use_streams = use_streams

@RdoubleA (Contributor) on Nov 14, 2024:
yep! it would just be:

self.use_streams = use_streams

Contributor (Author):
I have noticed that OffloadActivations in torchtune/training/_activation_offloading.py still checks for torch version 2.5.0.dev20240907. Should this check also be removed?

        # for streaming
        if self.use_streams:
            if torch.__version__ < "2.5.0.dev20240907":
                raise RuntimeError(
                    "OffloadActivations with use_streams=True requires PyTorch 2.5.0.dev20240907 or later."
                )

Contributor:
Yes, good catch. Let's remove that as well. I believe these may have been added after I put the issue up, or maybe I just missed it

Contributor (Author):
sounds good, will update all 2.5.0.dev20240907 checks in the file

        if use_streams is None:
            # Default to True if an acceptable torch is installed (later nightly/version or from source)
            self.use_streams = torch.__version__ >= "2.5.0.dev20240907"
        else:
            self.use_streams = use_streams

        self.use_streams: bool = use_streams

        self.min_tensor_size_bytes = (
            min_offload_size  # we don't want to bother with small tensors
@@ -98,10 +95,6 @@ def __init__(

        # for streaming
        if self.use_streams:
            if torch.__version__ < "2.5.0.dev20240907":
                raise RuntimeError(
                    "OffloadActivations with use_streams=True requires PyTorch 2.5.0.dev20240907 or later."
                )
            self.s1 = torch.cuda.Stream()  # comms stream
            self.fwd_stash = {}  # tensor_id => (activation, ev1)
            if max_fwd_stash_size < 1:
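A minimal sketch of using OffloadActivations directly with the simplified default (use_streams=True). It assumes the class is exported from torchtune.training and that a CUDA device is available:

    import torch

    from torchtune.training import OffloadActivations

    model = torch.nn.Linear(1024, 1024, device="cuda")
    x = torch.randn(8, 1024, device="cuda", requires_grad=True)

    # saved_tensors_hooks context: activations saved for backward are moved to CPU
    # during the forward pass and brought back on demand during backward.
    with OffloadActivations(use_streams=True):
        loss = model(x).sum()
    loss.backward()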
27 changes: 9 additions & 18 deletions torchtune/training/_compile.py
@@ -17,7 +17,7 @@
)
from torchtune.modules.loss import CEWithChunkedOutputLoss
from torchtune.modules.model_fusion import DeepFusionModel
from torchtune.utils import get_logger, torch_version_ge
from torchtune.utils import get_logger

log = get_logger("INFO")

@@ -42,23 +42,14 @@ def compile_model(
    backend = os.environ.get("TORCH_COMPILE_BACKEND", "inductor")
    if isinstance(model, DeepFusionModel):
        model = model.decoder
    if torch_version_ge("2.5.0"):
        if verbose:
            log.info("Compiling model layers with torch.compile...")
        for m in reversed(list(model.modules())):
            if isinstance(m, TransformerSelfAttentionLayer) or isinstance(
                m, TransformerCrossAttentionLayer
            ):
                m.compile(backend=backend)
    else:
        if verbose:
            log.info(
                """
                Compiling full model with torch.compile...
                For faster compile times via per-layer compile, please run on PyTorch nightlies.
                """
            )
        model.compile(backend=backend)
    # Per-layer compilation by default
    if verbose:
        log.info("Compiling model layers with torch.compile...")
    for m in reversed(list(model.modules())):
        if isinstance(m, TransformerSelfAttentionLayer) or isinstance(
            m, TransformerCrossAttentionLayer
        ):
            m.compile(backend=backend)


def compile_loss(loss: nn.Module, verbose: bool = True) -> nn.Module:
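The new unconditional path boils down to per-layer compilation. A generic sketch of that pattern; compile_layers is a hypothetical helper name, and layer_types would be the torchtune transformer layer classes:

    import os

    import torch.nn as nn

    def compile_layers(model: nn.Module, layer_types: tuple) -> None:
        # Compile each matching layer rather than the whole model, mirroring the
        # per-layer path kept above, for faster compile times.
        backend = os.environ.get("TORCH_COMPILE_BACKEND", "inductor")
        for m in reversed(list(model.modules())):
            if isinstance(m, layer_types):
                m.compile(backend=backend)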
5 changes: 1 addition & 4 deletions torchtune/utils/_import_guard.py
@@ -5,11 +5,8 @@
# LICENSE file in the root directory of this source tree.

import torch
from torchtune.utils._version import torch_version_ge

# We can only use flex attention / BlockMask if torch version >= 2.5.0 and GPU is Turing / SM75 and above
_SUPPORTS_FLEX_ATTENTION = (
torch_version_ge("2.5.0")
and torch.cuda.is_available()
and torch.cuda.get_device_capability() >= (7, 5)
torch.cuda.is_available() and torch.cuda.get_device_capability() >= (7, 5)
)