
Adds Vera (Vector Based Random Matrix Adaption) #2 #1564

Merged: 29 commits merged into huggingface:main from the add-vera-2 branch on Apr 19, 2024

Conversation

@BenjaminBossan (Member, Author) commented Mar 14, 2024

Continuation of #1039.

Should now be 95% on par with that PR, with some minor changes on my part + resolving merge conflicts.

Examples and docs have not been included yet.

TODOS:

  • Add documentation
  • Add examples

https://arxiv.org/abs/2310.11454

Notable changes vis-à-vis #1039:

  • Some refactoring of how the initialization of VeraModel proceeds; it should be more straightforward now.
  • Added tests around saving and loading, which require some special considerations for VeRA.
  • Fixed some issues with multiple adapters; this requires more strictness (e.g. not allowing multiple different prng keys on the same model).
  • projection_prng_key now has a valid default value (0) in the config.
  • Removed support for Embedding to reduce complexity: Supporting Embedding layers with VeRA makes very little sense because their shape is always different from the linear layers' shapes. Therefore, they cannot share the vera_A and vera_B matrices, which results in an error (a sketch of this shape constraint follows below). The only conceivable way to support Embedding layers would be to target only that layer (and possibly the output layer if it shares the weight), but that more or less defeats the purpose of using VeRA. We may revisit support for Embeddings in the future, perhaps if we can enable vera_A and vera_B to have different shapes. Until then, let's support the most common use cases and simplify our lives.
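
To make the shape constraint concrete, here is a minimal, hypothetical sketch of the VeRA update (not the PEFT implementation; names and sizes are made up for illustration):

import torch

hidden = 768  # in_features == out_features for the adapted Linear layers in this sketch
r = 256

# One frozen random pair, shared by every adapted layer:
vera_A = torch.randn(r, hidden)   # shape (r, in_features)
vera_B = torch.randn(hidden, r)   # shape (out_features, r)

def vera_delta(lambda_d, lambda_b):
    # Per-layer trainable vectors scale the shared random matrices:
    # delta_W = diag(lambda_b) @ vera_B @ diag(lambda_d) @ vera_A
    return (lambda_b.unsqueeze(1) * vera_B) @ (lambda_d.unsqueeze(1) * vera_A)

# Works for any Linear whose weight has shape (hidden, hidden):
linear_weight = torch.randn(hidden, hidden)
delta = vera_delta(torch.full((r,), 0.1), torch.zeros(hidden))
assert delta.shape == linear_weight.shape

# An Embedding weight has shape (vocab_size, hidden), e.g. (50272, 768), so the
# shared pair above cannot produce an update of that shape.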

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BenjaminBossan and others added 6 commits March 14, 2024 18:17
* changes to support fsdp+qlora and dsz3+qlora

* address comments

* add example and start docs

* quality

* deepspeed fixes

* dsz3+qlora docs

* section link fix

* add fsdp+qlora docs

* Apply suggestions from code review

Co-authored-by: Benjamin Bossan <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>

* address comments

---------

Co-authored-by: Benjamin Bossan <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>
@BenjaminBossan BenjaminBossan changed the title [WIP] Initial commit [WIP] Adds Vera (Vector Based Random Matrix Adaption) #2 Mar 15, 2024
Needed to update hf-doc-builder
It was annoying that the default value was invalid and would raise an
error.
Same as for LoRA and IA3, these Deberta tests fail for some reason.
@BenjaminBossan (Member, Author):

To ensure that the vera_A and vera_B weights are shared (but not other tensors), I added some tests that check their corresponding data_ptr()s.
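
For illustration, such a check can be sketched roughly as follows, given the model and the VeraLayer import from the script below (the attribute and key names vera_A, vera_B, vera_lambda_d and "default" are assumptions on my part, not necessarily the exact internals):

vera_layers = [m for m in model.modules() if isinstance(m, VeraLayer)]

# The frozen random projections should all point at the same underlying storage ...
assert len({layer.vera_A["default"].data_ptr() for layer in vera_layers}) == 1
assert len({layer.vera_B["default"].data_ptr() for layer in vera_layers}) == 1

# ... while the trainable per-layer vectors must not be shared.
assert len({layer.vera_lambda_d["default"].data_ptr() for layer in vera_layers}) == len(vera_layers)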

Moreover, I wrote a small script to check the amount of memory taken by the model. For this, I used a very high rank of 10000, so that vera_A and vera_B should be quite large. Then I compared the GPU memory taken by a model with a single layer having a VeRA adapter vs a model with many layers having a VeRA adapter. We would expect both to take roughly the same amount of memory, since most parameters are shared.

Here is the script:

from transformers import AutoModelForCausalLM
from peft import get_peft_model, VeraConfig, LoraConfig
from peft.tuners.vera import VeraLayer
from peft.tuners.lora import LoraLayer
import gc
import torch

RANK = 10000
model_id = "facebook/opt-125m"

config_cls = VeraConfig  # swap in LoraConfig / LoraLayer here to run the LoRA comparison below
layer_cls = VeraLayer

def get_gpu_memory():
    torch.cuda.synchronize()  # Wait for all kernels to finish
    gpu_info = {
        'allocated': f"{torch.cuda.memory_allocated(0) / 2**30:.4f}GB",
        'reserved': f"{torch.cuda.memory_reserved(0) / 2**30:.4f}GB",
    }
    print(gpu_info)

print("before loading the base model")
get_gpu_memory()

model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
print("after loading the model")
get_gpu_memory()

config = config_cls(task_type="CAUSAL_LM", target_modules=["model.decoder.layers.0.self_attn.k_proj"], r=RANK)
model = get_peft_model(model, config)
num_vera_layers = len([m for m in model.modules() if isinstance(m, layer_cls)])
print(f"after adding {num_vera_layers} adapted layers with rank {RANK}")
get_gpu_memory()

del model
torch.cuda.empty_cache()
gc.collect()

print("after resetting")
get_gpu_memory()

model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
print("after loading the base model")
get_gpu_memory()

config = config_cls(task_type="CAUSAL_LM", target_modules=["v_proj", "q_proj"], r=RANK)
model = get_peft_model(model, config)
num_vera_layers = len([m for m in model.modules() if isinstance(m, layer_cls)])
print(f"after adding {num_vera_layers} adapted layers with rank {RANK}")
get_gpu_memory()

For VeRA, the results are:

before loading the base model
{'allocated': '0.0000GB', 'reserved': '0.0000GB'}
after loading the model
{'allocated': '0.4677GB', 'reserved': '0.5176GB'}
after adding 1 adapted layers with rank 10000
{'allocated': '0.5264GB', 'reserved': '0.5762GB'}
after resetting
{'allocated': '0.0000GB', 'reserved': '0.0000GB'}
after loading the base model
{'allocated': '0.4677GB', 'reserved': '0.5176GB'}
after adding 24 adapted layers with rank 10000
{'allocated': '0.5273GB', 'reserved': '0.5762GB'}

As we can see, when adapting 24 layers vs 1 layer, the memory used is almost identical. We expect a small increase because vera_lambda_b and vera_lambda_d are not shared, so this is in line with our expectations.

As a sanity check, if we do the same with LoRA instead of VeRA (i.e. setting config_cls = LoraConfig and layer_cls = LoraLayer in the script above), we see a big increase in memory used:

before loading the base model
{'allocated': '0.0000GB', 'reserved': '0.0000GB'}
after loading the model
{'allocated': '0.4677GB', 'reserved': '0.5176GB'}
after adding 1 adapted layers with rank 10000
{'allocated': '0.5263GB', 'reserved': '0.5762GB'}
after resetting
{'allocated': '0.0000GB', 'reserved': '0.0000GB'}
after loading the base model
{'allocated': '0.4677GB', 'reserved': '0.5176GB'}
after adding 24 adapted layers with rank 10000
{'allocated': '1.8740GB', 'reserved': '1.9238GB'}

All of this is a strong indicator to me that the memory sharing actually works. If anyone has ideas for more tests, let me know.

@BenjaminBossan (Member, Author):

@dkopi @vvvm23 I think I'm pretty much finished with the implementation itself; docs and examples are yet to come. Still, if you have time, I'd be happy about a review, or if you can run some tests to see whether the implementation performs as expected. The changes compared to the original PR are documented above; the core VeRA computation hasn't been changed, though.


@vvvm23 (Contributor) commented Mar 24, 2024

Hi @BenjaminBossan, I can do a review some time this week.

@dkopi (Contributor) left a comment:

Looks good 👌

@BenjaminBossan (Member, Author):

@vvvm23 Did you have time to take a look?

@vvvm23 (Contributor) left a comment:

Looks good to me, a few small nitpicks. Sorry for the delay on this!

adapter_name (`str`):
The adapter name.
"""
pass
Contributor:

[nit] why not raise NotImplementedError? Avoid silent failures if something incorrectly calls the hook.

BenjaminBossan (Member, Author):

Passing is a valid outcome here; if we raised, all non-VeRA adapters would suddenly error ;)
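
To illustrate the design choice (a generic sketch, not the actual PEFT code; class and method names are made up):

class BaseTunerSketch:
    def _post_injection_hook(self, model, adapter_name: str) -> None:
        # Intentionally a no-op: raising here would break every adapter
        # that has nothing to do at this point.
        pass

class VeraTunerSketch(BaseTunerSketch):
    def _post_injection_hook(self, model, adapter_name: str) -> None:
        # VeRA overrides the hook, e.g. to set up its shared buffers.
        print(f"preparing shared buffers for adapter {adapter_name!r}")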

pattern is not in the common layers pattern.
"""

r: int = field(default=8, metadata={"help": "Vera attention dimension"})
Contributor:

Perhaps we should increase the default value? 8 is rather small for VeRA (the paper used 256-1024 for its experiments).

BenjaminBossan (Member, Author):

Yes, makes sense, I'll go with 256.

},
)
vera_dropout: float = field(default=0.0, metadata={"help": "Vera dropout"})
d_initial: float = field(default=1.0, metadata={"help": "Initial init value for d vector."})
Contributor:

0.1 may be a better default value; see Table 6 in the paper.

BenjaminBossan (Member, Author):

Right, makes sense.

Comment on lines 154 to 155
if isinstance(module, Conv1D): # TODO: feels fragile, thoughts?
module_shape = module_shape[::-1]
Contributor:

Remove this TODO? I feel this behaviour is actually fine; the semantics of Conv1D are unlikely to change.

BenjaminBossan (Member, Author):

Done.
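
For context, the Conv1D handling discussed above exists because transformers' Conv1D (used e.g. by GPT-2) stores its weight transposed relative to nn.Linear, so the shape has to be flipped before comparing; a quick illustration:

import torch.nn as nn
from transformers.pytorch_utils import Conv1D

linear = nn.Linear(768, 3072)
conv1d = Conv1D(3072, 768)          # Conv1D(nf=out_features, nx=in_features)
print(linear.weight.shape)          # torch.Size([3072, 768])
print(conv1d.weight.shape)          # torch.Size([768, 3072]), i.e. transposed
print(conv1d.weight.shape[::-1])    # matches the Linear convention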

- better default for r
- better default for d_initial
- remove unnecessary comment
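
For reference, with these defaults applied a typical configuration might look like the following (a sketch; only r, d_initial and vera_dropout come from the discussion above, the other arguments are illustrative):

from peft import VeraConfig

config = VeraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
    r=256,           # new default, in line with the ranks used in the paper
    d_initial=0.1,   # new default for the d vector, see Table 6 of the paper
    vera_dropout=0.0,
)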
@BenjaminBossan (Member, Author) left a comment:

Thanks for the feedback, Alex, your comments should be addressed now.


@BenjaminBossan BenjaminBossan marked this pull request as ready for review April 15, 2024 11:51
@BenjaminBossan BenjaminBossan changed the title [WIP] Adds Vera (Vector Based Random Matrix Adaption) #2 Adds Vera (Vector Based Random Matrix Adaption) #2 Apr 15, 2024
@pacman100 (Contributor) left a comment:

Thank you @BenjaminBossan for all the work on Vera, continuing the efforts of @vvvm23; it all looks great with examples, documentation and tests! 🔥🚀✨

It would be great to add @vvvm23 and @dkopi as co-authors for all their guidance and work!

Left a minor nit.

>>> import transformers
>>> from peft import VeraConfig, PeftModel, get_peft_model

>>> target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "fc_in", "fc_out", "wte"]
Contributor:

I don't think all the target modules have the same shape, and this list also includes an embedding layer.

A few models that work with LoRA don't work with VeRA (yet) because the
weight shapes of the target layers are not identical.
@BenjaminBossan (Member, Author):

@pacman100 Thanks for the feedback, indeed I hadn't checked the docstring example. It is now changed to a working model.

Your comment also prompted me to take a look at the models that are pre-configured in TRANSFORMERS_MODELS_TO_VERA_TARGET_MODULES_MAPPING. This was just a copy of the LoRA settings. Unfortunately, not all models work; I had to exclude some popular ones like Mistral, Mixtral, Phi, and Gemma. The issue is again that the shapes of the target layer weights can differ. Hopefully, we can add support for multiple different weight shapes in the future.
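
As an illustration, a quick way to check whether a model's intended target layers share a single weight shape could look like this (a rough sketch; the module name endings are just examples):

from collections import Counter
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
shapes = Counter(
    tuple(module.weight.shape)
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear) and name.endswith(("q_proj", "v_proj"))
)
print(shapes)  # VeRA currently requires a single shape here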

It would be great to add @vvvm23 and @dkopi as co-authors for all their guidance and work!

Yes, that was indeed my plan. @vvvm23 @dkopi could you please let me know how you want to be added as co-authors?

@dkopi (Contributor) commented Apr 18, 2024

@BenjaminBossan You can add:
Co-authored-by: Dawid <[email protected]>
Thanks!

@vvvm23 (Contributor) commented Apr 18, 2024

Likewise, you can add
Co-authored-by: Alex McKinney <[email protected]>

Thanks @BenjaminBossan for bringing this PR to completion!

@BenjaminBossan BenjaminBossan merged commit 5a4b9ca into huggingface:main Apr 19, 2024
14 checks passed
@BenjaminBossan BenjaminBossan deleted the add-vera-2 branch April 19, 2024 08:56
@BenjaminBossan (Member, Author):

Done 🎉

Thanks again so much @vvvm23 for doing the majority of the work and @dkopi for your constant feedback.

Let's hope that VeRA gains traction in the community. For the future, I'll add this list of improvements for VeRA that have yet to be implemented (contributions are welcome):

  1. Make VeRA work with different weight shapes. This is IMO the biggest limitation right now. The most straightforward way would be to have one pair of fixed vera_A/vera_B weights per target weight shape (a rough sketch of this idea follows after this list).
    There are cases where the shapes are the same when transposed, e.g. (4096, 1024) vs (1024, 4096) (down and up projections, for instance); I wonder if we can use the same weights here and just transpose them.
  2. If this is done, implement VeRA for more layer types than just Linear. Right now, supporting Embedding, for instance, makes little sense, as it almost always has a different shape than Linear.
  3. Support quantized weights, most notably bnb.
  4. Support DoRA ("DVoRA")
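
To make improvement 1 a bit more concrete, here is a rough, hypothetical sketch (not part of this PR or of PEFT) of keeping one fixed vera_A/vera_B pair per distinct target weight shape:

import torch

def get_shared_pair(shared, shape, r, generator):
    # Return (vera_A, vera_B) for this weight shape, creating them only once per shape.
    out_features, in_features = shape
    if shape not in shared:
        vera_A = torch.randn(r, in_features, generator=generator)
        vera_B = torch.randn(out_features, r, generator=generator)
        shared[shape] = (vera_A, vera_B)
    return shared[shape]

shared = {}
gen = torch.Generator().manual_seed(0)  # would correspond to projection_prng_key
pair_1 = get_shared_pair(shared, (4096, 4096), r=256, generator=gen)
pair_2 = get_shared_pair(shared, (4096, 4096), r=256, generator=gen)
pair_3 = get_shared_pair(shared, (11008, 4096), r=256, generator=gen)
assert pair_1 is pair_2        # same shape -> same shared pair
assert pair_3 is not pair_1    # different shape -> its own pair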

@vvvm23 (Contributor) commented Apr 19, 2024

Thanks again @BenjaminBossan! Please tag me in issues and PRs related to improvements :)
