
Integrate MS-AMP Support for FP8 Precision #2224

Closed
wants to merge 16 commits into from

Conversation

Collaborator

@muellerzr muellerzr commented Dec 6, 2023

Integrate MS-AMP to the Accelerator

What does this add?

This PR introduces an additional backend for FP8 support through MS-AMP, which has been shown to decrease memory usage and increase throughput when training in FP8 precision.

Who is it for?

Individuals training in FP8 (H100s, 4090s, etc.)

Issues linked to

Azure/MS-AMP#128

What parts of the API does this impact?

User-facing:

Two new arguments were added to the FP8RecipeKwargs:

  • enable_ms_amp (bool): Whether to use MS-AMP. True by default if it's available in the environment.
  • optimization_level (str): Should be one of "O1" or "O2". "O3" is for DeepSpeed, and we need to wait for them to update to DeepSpeed v0.9.3 to match what Accelerate supports.

General guideline to optimization levels:

  • O1: Weight gradients and all_reduce communications are done in FP8, reducing GPU
    memory usage and communication bandwidth.
  • O2: First-order optimizer states are in 8-bit and second-order states are in FP16.
    Only available when using Adam or AdamW. This maintains accuracy and can potentially save the most
    memory.
  • O3: Specifically for DeepSpeed; weights and master weights of models
    are stored in FP8. If FP8 is selected and DeepSpeed is enabled, this level will be used by default.
    (Not currently available.)

As a result, "O2" is the default. Here is an overview of each optimization level and what it does, taken from their docs:

| Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
|--------------------|--------------------|------|--------|---------------|-----------------|------------------|
| FP16 AMP           | FP16               | FP32 | FP32   | N/A           | FP32            | FP32+FP32        |
| Nvidia TE          | FP8                | FP32 | FP32   | N/A           | FP32            | FP32+FP32        |
| MS-AMP O1          | FP8                | FP8  | FP16   | N/A           | FP8             | FP32+FP32        |
| MS-AMP O2          | FP8                | FP8  | FP16   | N/A           | FP8             | FP8+FP16         |
| MS-AMP O3          | FP8                | FP8  | FP8    | FP16          | FP8             | FP8+FP16         |

Internal structure:

With how FP8 optimization works in MS-AMP, we get the best bang for our buck by combining MS-AMP and TransformerEngine. As a result, when preparing the model and optimizer we run through the same fix that exists for TPU optimizers, so that we can replace the new te.Linear layers with the equivalent MS-AMP ones, which increases throughput without decreasing performance. A sketch of this kind of layer swap follows below.

Basic Usage Example(s):

A user can either do:

accelerator = Accelerator(mixed_precision="fp8")

Or use the FP8RecipeKwargs:

# To disable MS-AMP if available in the environment
kwarg_handlers = [FP8RecipeKwargs(enable_ms_amp=False)]

# To change the optimization level
kwarg_handlers = [FP8RecipeKwargs(optimization_level="O1")]

accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=kwarg_handlers,
)
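
Putting this together, here is a minimal end-to-end sketch of a training step with the FP8 backend. The tiny model, optimizer, and dataloader are placeholders (not part of this PR), and FP8RecipeKwargs is assumed to be importable from accelerate.utils:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# Placeholder model, optimizer, and data; real training code would use its own.
model = torch.nn.Linear(128, 128)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # O2 requires Adam/AdamW
dataset = TensorDataset(torch.randn(64, 128), torch.randn(64, 128))
train_dataloader = DataLoader(dataset, batch_size=8)

accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[FP8RecipeKwargs(optimization_level="O2")],
)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for inputs, targets in train_dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)
    optimizer.step()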

Benchmarks

When running on bloomz-560m I saw the following speedups on the first 100 batches:

Batch size: 8
Max seq length: 256
GPU used: single 4090

| Model Configuration    | Time per Batch | Peak Memory |
|------------------------|----------------|-------------|
| FP16                   | 0.183s         | 20.62 GB    |
| BF16                   | 0.139s         | 15.41 GB    |
| Raw TransformersEngine | 0.129s         | 12.02 GB    |
| TE + MS-AMP            | 0.108s         | 10.66 GB    |

I also verified that the loss curves for the experiment were identical.
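
As a rough sketch only (not the script used for these numbers), per-batch time and peak memory can be measured like this, assuming the prepared model, optimizer, dataloader, and accelerator from the usage example above:

import time
import torch

torch.cuda.reset_peak_memory_stats()
step_times = []
for step, (inputs, targets) in enumerate(train_dataloader):
    if step == 100:  # first 100 batches
        break
    torch.cuda.synchronize()
    start = time.perf_counter()
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)
    optimizer.step()
    torch.cuda.synchronize()
    step_times.append(time.perf_counter() - start)

print(f"Time per batch: {sum(step_times) / len(step_times):.3f}s")
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")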

@muellerzr muellerzr linked an issue Dec 6, 2023 that may be closed by this pull request
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

"""

margin: int = 0
interval: int = 1
fp8_format: str = "E4M3"
amax_history_len: int = 1
amax_history_len: int = 1024
Collaborator Author

I noticed this default was different from the one NVIDIA uses, so I set it to 1024. Asking what is critical about it.
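
For reference, these fields roughly mirror the parameters of TransformerEngine's delayed-scaling recipe; a sketch of how they would map, assuming TE's recipe API (the exact wiring inside Accelerate may differ):

from transformer_engine.common import recipe

# Hypothetical mapping from the FP8RecipeKwargs fields above to a TE recipe object.
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    interval=1,
    fp8_format=recipe.Format.E4M3,
    amax_history_len=1024,  # the default this comment proposes, matching NVIDIA's
    amax_compute_algo="max",
)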

@casper-hansen

  • O3: Specifically for DeepSpeed; weights and master weights of models
    are stored in FP8. If FP8 is selected and DeepSpeed is enabled, this level will be used by default.
    (Not currently available.)

Can we not use DeepSpeed with O3 FP8? Storing gradients in FP8 is already a great step, but storing weights in FP8 would be an even better step, especially when we also enable ZeRO-3 to split the model.

bloomz-560m

A small model already seeing improvements is really nice! Do you by chance have the time and compute to do a Llama/Mistral 7B run to check the difference there?

@muellerzr
Collaborator Author

@casper-hansen as mentioned in the PR, we cannot until they update the DeepSpeed version they require.

Contributor

@pacman100 pacman100 left a comment


Hello Zach, thank you for working on adding MS-AMP FP8 support 🔥🚀✨! The experiments you performed with bloomz-560M are already showing nice improvements. Looking forward to experiments at larger scales.

Left a couple of suggestions/comments.


def __post_init__(self):
    self.fp8_format = self.fp8_format.upper()
    if self.fp8_format not in ["E4M3", "HYBRID"]:
        raise ValueError("`fp8_format` must be 'E4M3' or 'HYBRID'.")
    if self.amax_compute_algo not in ["max", "most_recent"]:
        raise ValueError("`amax_compute_algo` must be 'max' or 'most_recent'")
    if self.enable_ms_amp and not is_msamp_available():
        self.enable_ms_amp = False
Contributor

Raise a warning mentioning that MS-AMP is not available even though the flag is True.
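
A sketch of what that suggestion could look like inside __post_init__ (not the final implementation):

import warnings

if self.enable_ms_amp and not is_msamp_available():
    warnings.warn("`enable_ms_amp` was set to True, but MS-AMP is not available; disabling it.")
    self.enable_ms_amp = False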

@@ -1286,13 +1292,16 @@ def prepare(self, *args, device_placement=None):
        result = tuple(
            self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
        )
        if self.mixed_precision == "fp8" and self.fp8_recipe_handler.enable_ms_amp:
            # MS-AMP needs both model and optimizer
            args = self._prepare_ms_amp(*result)
Contributor


Suggested change
- args = self._prepare_ms_amp(*result)
+ result = self._prepare_ms_amp(*result)

Contributor


given that the second pass takes the result

@pacman100
Contributor

Also, what happens if we run this with the latest DeepSpeed?

Member

@SunMarc SunMarc left a comment


Thanks for adding MS-AMP support! This looks very good overall. I left a couple of comments. One thing that could be added is some docs about mixed-precision training in the How-to Guide and Concept guide. Correct me if I'm wrong, but I wasn't able to find much documentation about that. This can be done in a follow-up PR. LMK what you think. The table you linked in this PR is a very good summary.

    model: torch.nn.Module,
    device_placement: bool = None,
    evaluation_mode: bool = False,
    first_pass: bool = True,
Member


Add it to the docstring. Moreover, maybe change the default value to False to have the same behavior as _prepare_one.

    elif isinstance(obj, torch.optim.Optimizer):
        optimizer = self.prepare_optimizer(obj, device_placement=device_placement)
        return optimizer
    # Second pass of preparation: LR scheduler (which need the full list of optimizers)
    elif isinstance(obj, LRScheduler):
        scheduler = self.prepare_scheduler(obj)
        return scheduler
    # Second pass of preparation: FP8 with MS-AMP
    elif isinstance(obj, torch.nn.Module):
        return self.prepare_model(obj, device_placement=device_placement)
Member


Should first_pass be set to False here, since the default value is True?

@muellerzr muellerzr closed this Dec 7, 2023
@muellerzr
Collaborator Author

Closing, as there's a very odd performance bug with the final accuracy when training BERT with TransformerEngine. Revisiting this implementation in a separate PR that frames MS-AMP as an alternative framework instead of directly wrapping with TE.

Successfully merging this pull request may close these issues.

Feature Request: Support MS-AMP