Adds Vera (Vector Based Random Matrix Adaption) #2 #1564
Should now be 95% on par with huggingface#1039, with some minor changes on my part + resolving merge conflicts. Examples have not been included yet.
changes to support fsdp+qlora and dsz3+qlora

* address comments
* add example and start docs
* quality
* deepspeed fixes
* dsz3+qlora docs
* section link fix
* add fsdp+qlora docs
* Apply suggestions from code review
* address comments

Co-authored-by: Benjamin Bossan <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>
Needed to update hf-doc-builder
Supporting Embedding layers with VeRA makes very little sense because an embedding's weight shape is always different from the linear layers' shapes. Therefore, they cannot share the vera_A and vera_B matrices, resulting in an error. The only conceivable way to support Embedding layers would be to target only that layer (and possibly the output layer if it shares the weight), but that more or less defeats the purpose of using VeRA. We may revisit support for Embeddings in the future, maybe if we can enable vera_A and vera_B to have different shapes. Until then, let's support the most common use cases and simplify our lives.
It was annoying that the default value was invalid and would raise an error.
Same as for LoRA and IA3, these Deberta tests fail for some reason.
To ensure that the weight sharing works as intended, I wrote a small script to check the amount of memory taken by the model. For this, I used a very high rank of 10000, so that the shared vera_A and vera_B matrices dominate the memory usage. Here is the script:

```python
from transformers import AutoModelForCausalLM

from peft import get_peft_model, VeraConfig, LoraConfig
from peft.tuners.vera import VeraLayer
from peft.tuners.lora import LoraLayer
import gc
import torch

RANK = 10000
model_id = "facebook/opt-125m"
config_cls = VeraConfig
layer_cls = VeraLayer


def get_gpu_memory():
    torch.cuda.synchronize()  # wait for all kernels to finish
    gpu_info = {
        'allocated': f"{torch.cuda.memory_allocated(0) / 2**30:.4f}GB",
        'reserved': f"{torch.cuda.memory_reserved(0) / 2**30:.4f}GB",
    }
    print(gpu_info)


print("before loading the base model")
get_gpu_memory()
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
print("after loading the model")
get_gpu_memory()

# adapt a single layer
config = config_cls(task_type="CAUSAL_LM", target_modules=["model.decoder.layers.0.self_attn.k_proj"], r=RANK)
model = get_peft_model(model, config)
num_vera_layers = len([m for m in model.modules() if isinstance(m, layer_cls)])
print(f"after adding {num_vera_layers} adapted layers with rank {RANK}")
get_gpu_memory()

# reset
del model
torch.cuda.empty_cache()
gc.collect()
print("after resetting")
get_gpu_memory()

# adapt all q_proj and v_proj layers
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
print("after loading the base model")
get_gpu_memory()
config = config_cls(task_type="CAUSAL_LM", target_modules=["v_proj", "q_proj"], r=RANK)
model = get_peft_model(model, config)
num_vera_layers = len([m for m in model.modules() if isinstance(m, layer_cls)])
print(f"after adding {num_vera_layers} adapted layers with rank {RANK}")
get_gpu_memory()
```

(To run the same check with LoRA, swap `config_cls = LoraConfig` and `layer_cls = LoraLayer`.)

For VeRA, the results are:
As we can see, when adapting 24 layers vs 1 layer, the memory used is almost identical. We expect a small increase because each adapted layer still adds its own small per-layer vectors, which are not shared.

As a sanity check, if we do the same with LoRA instead of VeRA, we see a big increase in memory used:

All of this is a strong indicator to me that the memory sharing actually works. If anyone has ideas for more tests, let me know.
@dkopi @vvvm23 I think I'm pretty much finished with the implementation itself, docs and examples are yet to come. Still, if you have time, I'd be happy with a review or if you can run some tests to see if the implementation performs as expected. The changes compared to the original PR are documented above, the core VeRA computation hasn't been changed, though.
Hi @BenjaminBossan, I can do a review some time this week.
Looks good 👌
@vvvm23 Did you have time to take a look?
Looks good to me, a few small nitpicks. Sorry for the delay on this!
```python
            adapter_name (`str`):
                The adapter name.
        """
        pass
```
[nit] why not raise `NotImplementedError`? Avoid silent failures if something incorrectly calls the hook.
Passing is a valid outcome here; if we raised here, all non-VeRA adapters would suddenly error ;)
src/peft/tuners/vera/config.py (outdated):

```python
            pattern is not in the common layers pattern.
    """

    r: int = field(default=8, metadata={"help": "Vera attention dimension"})
```
Perhaps we should increase the default value? 8 is rather small for VeRA (the paper used 256-1024 for their experiments).
Yes, makes sense, I'll go with 256.
src/peft/tuners/vera/config.py (outdated):

```python
        },
    )
    vera_dropout: float = field(default=0.0, metadata={"help": "Vera dropout"})
    d_initial: float = field(default=1.0, metadata={"help": "Initial init value for d vector."})
```
`0.1` may be a better default value, see Table 6 in the paper.
Right, makes sense.
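For reference, after these two changes the config discussed in this thread would look roughly like this. This is only a minimal sketch; the field names are taken from the snippets quoted above, and the values are spelled out purely for illustration:

```python
from peft import VeraConfig

# r=256 (instead of 8) and d_initial=0.1 (instead of 1.0) are the defaults agreed
# on above, so passing them explicitly is not required once they land.
config = VeraConfig(
    target_modules=["q_proj", "v_proj"],
    r=256,
    vera_dropout=0.0,
    d_initial=0.1,
)
```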
src/peft/tuners/vera/model.py (outdated):

```python
        if isinstance(module, Conv1D):  # TODO: feels fragile, thoughts?
            module_shape = module_shape[::-1]
```
remove this TODO? I feel this behaviour is actually fine, the semantics of Conv1D are unlikely to change
Done.
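For context on the snippet above: GPT-2-style `Conv1D` layers store their weight transposed relative to `nn.Linear`, which is why the shape is reversed. A small illustration (not part of the PR, just demonstrating the convention):

```python
from torch import nn
from transformers.pytorch_utils import Conv1D

linear = nn.Linear(768, 2304)
conv1d = Conv1D(2304, 768)  # Conv1D(nf=out_features, nx=in_features), as used in GPT-2

print(tuple(linear.weight.shape))  # (2304, 768) -> (out_features, in_features)
print(tuple(conv1d.weight.shape))  # (768, 2304) -> (in_features, out_features)

# Reversing the Conv1D weight shape recovers the same (out_features, in_features)
# convention as nn.Linear, which is what the reviewed code relies on.
module_shape = tuple(conv1d.weight.shape)
print(module_shape[::-1])  # (2304, 768)
```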
- better default for r
- better default for d_initial
- remove unnecessary comment
Thanks for the feedback, Alex, your comments should be addressed now.
Thank you @BenjaminBossan for all the work on Vera, continuing the efforts of @vvvm23; everything looks great with examples, documentation and tests! 🔥🚀✨
It would be great to add @vvvm23 and @dkopi as co-authors for all their guidance and work!
Left a minor nit.
src/peft/tuners/vera/model.py (outdated):

```python
        >>> import transformers
        >>> from peft import VeraConfig, PeftModel, get_peft_model

        >>> target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "fc_in", "fc_out", "wte"]
```
I don't think all the target modules have the same shape, and this list also includes an embedding layer.
A few models that work with LoRA don't work with VeRA (yet) because the weight shapes of the target layers are not identical.
@pacman100 Thanks for the feedback, indeed I hadn't checked the docstring example. It is now changed to a working model. Your comment also prompted me to take a look at the models that are pre-configured in the default target modules mapping.
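To illustrate the constraint with a hypothetical example (not from the PR): with OPT-125m, the attention projections all share the same weight shape and can therefore share a single pair of vera_A/vera_B matrices, while layers of other shapes cannot be mixed in:

```python
from transformers import AutoModelForCausalLM
from peft import VeraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# q_proj, k_proj, v_proj and out_proj all have weight shape (768, 768) in OPT-125m,
# so a single shared pair of vera_A/vera_B matrices fits all of them.
config = VeraConfig(task_type="CAUSAL_LM", target_modules=["q_proj", "k_proj", "v_proj", "out_proj"], r=256)
model = get_peft_model(model, config)
model.print_trainable_parameters()

# Adding e.g. "fc1" (weight shape (3072, 768)) to target_modules is rejected as of
# this PR, because its shape differs from the attention projections.
```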
Yes, that was indeed my plan. @vvvm23 @dkopi could you please let me know how you want to be added as co-authors?

@BenjaminBossan You can add:

Likewise, you can add:

Thanks @BenjaminBossan for bringing this PR to completion!
Done 🎉 Thanks again so much @vvvm23 for doing the majority of the work and @dkopi for your constant feedback. Let's hope that VeRA gains traction in the community. For the future, I'll add this list of improvements for VeRA that have yet to be implemented (contributions are welcome):
Thanks again @BenjaminBossan! Please tag me in issues and PRs related to improvements :)
Continuation of #1039.
Should now be 95% on par with that PR, with some minor changes on my part + resolving merge conflicts.
Examples and docs have not been included yet.
TODOs:

Paper: https://arxiv.org/abs/2310.11454

Notable changes vis-à-vis #1039:

- The way `VeraModel` proceeds should be more straightforward now.
- `projection_prng_key` now has a valid default value (0) in the config.
- Dropped support for `Embedding` to reduce complexity: Supporting Embedding layers with VeRA makes very little sense because an embedding's weight shape is always different from the linear layers' shapes. Therefore, they cannot share the vera_A and vera_B matrices, resulting in an error. The only conceivable way to support Embedding layers would be to target only that layer (and possibly the output layer if it shares the weight), but that more or less defeats the purpose of using VeRA. We may revisit support for Embeddings in the future, maybe if we can enable vera_A and vera_B to have different shapes. Until then, let's support the most common use cases and simplify our lives.
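For completeness, a minimal end-to-end sketch of how the adapter is meant to be used. The model name and hyperparameters are placeholders, not taken from the PR:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel, VeraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# projection_prng_key now defaults to a valid value (0); it is passed explicitly here
# only to show that the shared vera_A/vera_B matrices are seeded deterministically.
config = VeraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
    r=256,
    projection_prng_key=0,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small per-layer VeRA vectors are trainable

# ... train as usual ...

model.save_pretrained("opt-125m-vera")  # saves just the VeRA adapter weights
restored = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained("facebook/opt-125m"), "opt-125m-vera"
)
```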