[BUG] Peft Training with Zero.Init() and Zero3 will increase GPU memory every forward step #3002

dumpmemory · 2023-03-13T10:42:45Z

Describe the bug
when i using Peft LoRA to train a gpt2 model, the gpu memory increase with every forward step with Zero3 adn zero.init function. when i disable zero.init, it worked as normal.

To Reproduce

the all details can be found in GPT2 Training GPU Memory Increase with LoRA and Zero 3 huggingface/peft#161

Expected behavior
run with no gpu memory increasing
ds_report output
Please run ds_report to give us details about your setup.

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/miniconda/lib/python3.8/site-packages/torch']
torch version .................... 1.12.1
deepspeed install path ........... ['/opt/miniconda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.2, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

Screenshots
If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

GPU count and types 8x3080 ti

The text was updated successfully, but these errors were encountered:

dumpmemory · 2023-03-13T10:44:24Z

I have also try the tohtana/nested_zero_init branch, which did not fix it.

tohtana · 2023-04-22T07:33:07Z

@dumpmemory
I found that Zero3's all-gathered parameters are not freed for LoRA Linear modules.
The following fix prevented the memory leak in my environment. Can you try this?

$ git diff
diff --git a/src/peft/tuners/lora.py b/src/peft/tuners/lora.py
index 1d1680d..97f0a4e 100644
--- a/src/peft/tuners/lora.py
+++ b/src/peft/tuners/lora.py
@@ -484,7 +484,7 @@ class Linear(nn.Linear, LoraLayer):
                 self.unmerge()
             result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
         elif self.r[self.active_adapter] > 0 and not self.merged:
-            result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
+            result = torch.matmul(x, transpose(self.weight, not self.fan_in_fan_out)) + self.bias

             x = x.to(self.lora_A[self.active_adapter].weight.dtype)

Although Zero3 sets an empty tensor to self.weight.data, PyTorch does not free the memory in the original code.
The reference to the buffer for all-gathered parameters might be alive, but I couldn't write a simple repro using only PyTorch.

dumpmemory · 2023-04-22T15:58:19Z

@dumpmemory I found that Zero3's all-gathered parameters are not freed for LoRA Linear modules. The following fix prevented the memory leak in my environment. Can you try this?
$ git diff
diff --git a/src/peft/tuners/lora.py b/src/peft/tuners/lora.py
index 1d1680d..97f0a4e 100644
--- a/src/peft/tuners/lora.py
+++ b/src/peft/tuners/lora.py
@@ -484,7 +484,7 @@ class Linear(nn.Linear, LoraLayer):
                 self.unmerge()
             result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
         elif self.r[self.active_adapter] > 0 and not self.merged:
-            result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
+            result = torch.matmul(x, transpose(self.weight, not self.fan_in_fan_out)) + self.bias

             x = x.to(self.lora_A[self.active_adapter].weight.dtype)
Although Zero3 sets an empty tensor to self.weight.data, PyTorch does not free the memory in the original code. The reference to the buffer for all-gathered parameters might be alive, but I couldn't write a simple repro using only PyTorch.

I will test this, thanks for your help. I will update result later

dumpmemory · 2023-04-24T06:40:20Z

It wokred ! With peft commit 10a2a6db5dc9cabb63a36c0fb489aeb2b9a1e433 and modification above , deepspeed 0.9.1 and torch 2.0. Thanks for your help.

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/miniconda/lib/python3.8/site-packages/torch']
torch version .................... 2.0.0
deepspeed install path ........... ['/opt/miniconda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

dumpmemory · 2023-04-24T06:42:50Z

I will try to make a pr following your idea on peft . Thanks again.

tohtana · 2023-04-30T03:41:39Z

@dumpmemory This memory leak can also be fixed by setting memory_efficient_linear to false in the configuration of zero.

By default, DeepSpeed replaces PyTorch's linear with a different implementation. This might cause the memory leak. I will investigate what the memory_efficient_linear does.

dumpmemory · 2023-04-30T14:37:05Z

@dumpmemory This memory leak can also be fixed by setting memory_efficient_linear to false in the configuration of zero.

By default, DeepSpeed replaces PyTorch's linear with a different implementation. This might cause the memory leak. I will investigate what the memory_efficient_linear does.

Thanks for your work !

tjruwase · 2023-05-01T18:36:08Z

@dumpmemory, can you please try PR #3413 created by @tohtana? Thanks!

dumpmemory · 2023-05-02T07:45:06Z

@dumpmemory, can you please try PR #3413 created by @tohtana? Thanks!

Yes i can. Can i test it after my holiday ? Thanks

tjruwase · 2023-05-02T11:38:00Z

@dumpmemory, of course! By the way, the PR is merged so you can use the master branch when you are ready.

Happy holidays to you! Thanks for your help.

dumpmemory · 2023-05-05T03:46:27Z

It worked with peft(commit 10a2a6db5dc9cabb63a36c0fb489aeb2b9a1e433 ) and peft 3.0

dumpmemory added bug Something isn't working training labels Mar 13, 2023

tjruwase assigned tohtana Mar 24, 2023

dumpmemory mentioned this issue Apr 24, 2023

GPT2 Training GPU Memory Increase with LoRA and Zero 3 huggingface/peft#161

Closed

dumpmemory mentioned this issue Apr 24, 2023

try to fix Zero3 Memory Leak following @tohtana idea huggingface/peft#363

Closed

tohtana mentioned this issue Apr 30, 2023

Save tensors in context of memory_efficient_linear #3413

Merged

dumpmemory closed this as completed May 9, 2023

Andy666G mentioned this issue May 31, 2023

[BUG] Cannot free parameter with ZeRO3 + offload parameter in Pytorch1.9 #3646

Closed

kouroshHakha mentioned this issue Oct 26, 2023

[Templates] Finetuneing template 04 OOMS due to gpu memory leak when using LoRA + V100s ray-project/ray#40714

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Peft Training with Zero.Init() and Zero3 will increase GPU memory every forward step #3002

[BUG] Peft Training with Zero.Init() and Zero3 will increase GPU memory every forward step #3002

dumpmemory commented Mar 13, 2023 •

edited

Loading

dumpmemory commented Mar 13, 2023

tohtana commented Apr 22, 2023

dumpmemory commented Apr 22, 2023

dumpmemory commented Apr 24, 2023

dumpmemory commented Apr 24, 2023

tohtana commented Apr 30, 2023

dumpmemory commented Apr 30, 2023

tjruwase commented May 1, 2023

dumpmemory commented May 2, 2023

tjruwase commented May 2, 2023

dumpmemory commented May 5, 2023

[BUG] Peft Training with Zero.Init() and Zero3 will increase GPU memory every forward step #3002

[BUG] Peft Training with Zero.Init() and Zero3 will increase GPU memory every forward step #3002

Comments

dumpmemory commented Mar 13, 2023 • edited Loading

dumpmemory commented Mar 13, 2023

tohtana commented Apr 22, 2023

dumpmemory commented Apr 22, 2023

dumpmemory commented Apr 24, 2023

dumpmemory commented Apr 24, 2023

tohtana commented Apr 30, 2023

dumpmemory commented Apr 30, 2023

tjruwase commented May 1, 2023

dumpmemory commented May 2, 2023

tjruwase commented May 2, 2023

dumpmemory commented May 5, 2023

dumpmemory commented Mar 13, 2023 •

edited

Loading