Adding PiSSA as an optional initialization method of LoRA #1626
Conversation
Hello, I found that your …
Hello, this is because the weight matrix of the linear layer torch.nn.Linear(in_channel, out_channel) is actually stored transposed, i.e., W actually has the shape out_channel × in_channel. Normally, you would transpose W first, perform the singular value decomposition and initialize A and B, and then transpose back.
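For readers skimming this thread, a minimal sketch of that shape convention (my own illustration assuming plain PyTorch, not code from this PR):

```python
# nn.Linear stores its weight as (out_channel, in_channel), i.e. already
# "transposed" relative to the y = x @ W convention used in the math.
import torch

in_channel, out_channel, r = 64, 32, 8
linear = torch.nn.Linear(in_channel, out_channel, bias=False)
assert linear.weight.shape == (out_channel, in_channel)

# SVD of the stored weight; A and B just have to be built with a consistent orientation.
U, S, Vh = torch.linalg.svd(linear.weight.data, full_matrices=False)
lora_B = U[:, :r] * S[:r].sqrt()                # (out_channel, r)
lora_A = S[:r].sqrt().unsqueeze(1) * Vh[:r, :]  # (r, in_channel)

# The product reproduces the best rank-r approximation of the stored weight.
approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
assert torch.allclose(lora_B @ lora_A, approx, atol=1e-5)
```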
Let me know when this is ready for review. Also, please run …
I've run …
Thanks a lot for providing this useful method. The results from the paper are very promising, so I would be happy to have this added to PEFT.
I added a couple of comments to the PR. Please check them out.
There is one very big issue still, which is about saving and loading the model (check out my last comment). This is a very similar issue to what we have with LoftQ initialization (which itself is very similar to PiSSA IIUC). That has to be addressed most urgently or else users cannot correctly load LoRA weights trained with PiSSA.
Edit: It seems I have been mistaken here. But let's add a test that involves saving and loading to ensure that this works correctly.
Let's also add some documentation. This is important so that users can discover this new method and understand what it does and when to use it. For this, could you please:
- Add a section to the LoRA docs here: `peft/docs/source/developer_guides/lora.md` (line 91 in 31c884e).
- The docstring of the `LoraConfig` here. Extend the type annotation to include `Literal["gaussian", "loftq", "pissa"]` and add a sentence or two to the description. Don't forget to mention the possibility to pass multiple iterations.
- The help of `LoraConfig`: You can use the same explanation as above.
Furthermore, let's add some testing in this file. Specifically, let's check `init_lora_weights="pissa"`, the variant with multiple iterations, and the combination with bnb, as well as the error cases. If you need help with writing the tests, let me know.
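For readers following along, a rough usage sketch of the option being added (the string values are the ones discussed in this PR, and `facebook/opt-125m` plus the target modules are placeholders; check the merged docs for the final API):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],
    # "pissa" runs the exact SVD; "pissa_niter_4" would use the faster
    # approximate SVD with 4 subspace iterations (option names as proposed here).
    init_lora_weights="pissa",
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```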
Hey @fxmeng, after some internal discussion, we had some concerns about this line: The issue here is that the model base weights are modified when initializing with PiSSA. This can have side-effects for the user. For example, when they disable all adapters, they would normally expect the model output to be the same as the base model, but here it's not the case. Or when a user loads a PiSSA-LoRA adapter and another LoRA adapter, that other adapter will not work correctly because it was trained on the unmodified base weight.

It would be possible to add a lot of checks everywhere and raise errors if we detect that PiSSA is used and a user wants to disable the adapter or switch to another adapter. But that's very complicated and error prone, and at the end of the day also not very user friendly.

What I wonder is: How much performance would we lose if we keep the base weights unmodified? If this works almost as well, maybe we can keep the base weights and not have to add all those complications. Did you run experiments to test that?
Hi @BenjaminBossan, …
Oh nice, thanks, I think it would be great to integrate this functionality into PEFT. To be sure I understand: We first load the base model, then initialize the PEFT model with PiSSA turned on, then train the PiSSA-LoRA adapter, and then we can convert it to a normal LoRA adapter and share it with others. When someone loads this converted PiSSA-LoRA adapter, it works like a normal LoRA adapter, so there is no need to adjust the base model weights. This means we can disable it, combine it with other LoRA adapters, etc. Is that right? Regarding the linked script, can you explain this line (or refer to the part of the paper that explains it):
Looking forward to this.
We have explained the line you mentioned at https://github.com/fxmeng/peft/blob/7fabf84375092cc9b2d870188953602a02b9d8db/examples/pissa_finetuning/convert_pissa_to_lora.py#L26. We will include detailed instructions for converting PiSSA to LoRA in our documentation and in the next draft of the paper. Additionally, we have fixed a bug and tested the combination of the converted LoRA adapter with the base model to ensure its correctness.
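For anyone skimming the thread, here is my reading of the identity behind that conversion as a small self-contained check (a sketch, not the linked script; the `lora_alpha` scaling is assumed identical on both sides and therefore omitted):

```python
# Converting a trained PiSSA adapter into a plain rank-2r LoRA adapter that
# applies to the *unmodified* base weight W.
import torch

out_f, in_f, r = 32, 64, 4
B0, A0 = torch.randn(out_f, r), torch.randn(r, in_f)  # PiSSA initialization (principal part of W)
B, A = torch.randn(out_f, r), torch.randn(r, in_f)    # adapter values after fine-tuning

# PiSSA trains on top of the residual W_res = W - B0 @ A0, so the total update
# relative to the original W is (B @ A - B0 @ A0).
delta_pissa = B @ A - B0 @ A0

# The same update expressed as a single LoRA adapter of rank 2r:
A_lora = torch.cat([A, A0], dim=0)   # (2r, in_f)
B_lora = torch.cat([B, -B0], dim=1)  # (out_f, 2r)
assert torch.allclose(B_lora @ A_lora, delta_pissa, atol=1e-4)
```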
When I run your test above, I get the same or very similar values, except for T5 + 8bit:
Not sure why that is, perhaps it's best to just remove that specific test (in that case, add a comment that this combination may fail on some machines).
I admit that calculating MAE/MSE of logits is a bit flawed as a measure, this was chosen more from a practical viewpoint. I don't know this measure that you proposed and would need to read a bit more, but if you think it's superior, feel free to use it instead. But as mentioned, it would also be fine to remove this one specific test.
Maybe it's the ruff version? The version that the CI uses is
Ah yes, good point, then it should go to …
I have changed the method for measuring quantization errors from the MAE/MSE of the logits (which was chosen from a practical viewpoint) to the nuclear norm of all error matrices. This measure yields a fixed value for each model and has passed the tests in my local environment.
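As an illustration of that measure, a short sketch assuming PyTorch (not the actual test code):

```python
# Nuclear norm (sum of singular values) of the weight error introduced by quantization.
import torch

def quantization_error_nuc(w_original: torch.Tensor, w_dequantized: torch.Tensor) -> float:
    # Depends only on the weights, so it yields the same value on every machine
    # for a given model, unlike the MAE/MSE of logits.
    err = w_original.float() - w_dequantized.float()
    return torch.linalg.matrix_norm(err, ord="nuc").item()
```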
@fxmeng Did you check your local ruff version?
Hi @BenjaminBossan, …
I had misunderstood your suggestion to use …
Thanks for running ruff and for adjusting the test. There are a few issues that resulted from moving the test, but they should be easy to fix; please check my comments.
Thank you for your comments. I have fixed these points. If there are any other issues, please let me know.
Fantastic, just one small issue that made the new test fail for me.
Co-authored-by: Benjamin Bossan <[email protected]>
Thanks a lot for this excellent work @fxmeng. This now all looks good. I'll ask if one of the co-maintainers also wants to review before merging.
I also got email notifications with comments you added about experimenting without adjusting the base model weights, thanks for running these tests. However, when I click the links in the emails, the comments are not shown. Maybe that is because the original comment is on outdated code? Anyway, could you be so kind as to paste those comments into the main thread of this PR so that everyone who's interested in your results can read them?
Hi @BenjaminBossan,
Here is the experiment using the PiSSA adapter in conjunction with the base model for training. 1-10 steps:
90-100 steps:
The final results comparison is shown in the table below. As we can observe, there is a significant drop in performance when using PiSSA without updating the base weights.
Thanks a lot @fxmeng for this great PR. After internal discussion, we decided this is good to be merged.
I truly appreciate the constructive feedback and effort from all your team members. I am grateful to be able to contribute to the valuable PEFT project.
Thanks for your great PR!
Hi @fxmeng. May I know if PiSSA initialization is compatible with DeepSpeed ZeRO?
Hi @Con6924, …
Cool, looking forward to it!
In the paper "https://arxiv.org/pdf/2404.02948.pdf", we introduce a parameter-efficient fine-tuning (PEFT) method, Principal Singular values and Singular vectors Adaptation (PiSSA), which optimizes a significantly reduced parameter space while matching or surpassing the performance of full-parameter fine-tuning.
PiSSA is inspired by Intrinsic SAID, which suggests that pre-trained, over-parametrized models inhabit a space of low intrinsic dimension. Consequently, PiSSA represents a matrix $W\in\mathbb{R}^{m\times n}$ within the model by the product of two trainable matrices $A \in \mathbb{R}^{m\times r}$ and $B \in \mathbb{R}^{r\times n}$, where $r \ll \min(m, n)$, plus a residual matrix $W^{res}\in\mathbb{R}^{m\times n}$ for error correction. Singular value decomposition (SVD) is employed to factorize $W$: the principal singular values and vectors of $W$ are used to initialize $A$ and $B$, while the residual singular values and vectors initialize the residual matrix $W^{res}$, which is kept frozen during fine-tuning. Notably, PiSSA shares the same architecture as Low-Rank Adaptation (LoRA), which hypothesizes that the change in model parameters $\Delta W$ forms a low-rank matrix. However, LoRA approximates $\Delta W$ through the product of two matrices, $A$, initialized with Gaussian noise, and $B$, initialized with zeros, whereas PiSSA initializes $A$ and $B$ with the principal singular values and singular vectors of the original matrix $W$. Since the principal singular values and vectors capture the essence of a low-rank matrix, PiSSA can better approximate the outcome of full-parameter fine-tuning from the very beginning by updating the essential parts while freezing the "noisy" parts; in comparison, LoRA freezes the original matrix and updates the "noise". This distinction enables PiSSA to converge much faster than LoRA and to achieve better performance in the end. On five common benchmarks, PiSSA outperforms LoRA on all of them using exactly the same setup except for a different initialization. On GSM8K, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, outperforming LoRA's 67.7% by 5.16 percentage points.
Because it shares LoRA's architecture, PiSSA inherits many of LoRA's advantages, such as parameter efficiency and compatibility with quantization. Leveraging a fast SVD method, the initialization of PiSSA takes only a few seconds, so switching from LoRA to PiSSA comes at negligible cost.
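To make the initialization concrete, here is a minimal numerical sketch of the decomposition described above (my own illustration using the `(out, in)` weight layout of `nn.Linear`, not the library implementation):

```python
import torch

def pissa_init(W, r, niter=None):
    """Split W into a trainable principal part (B @ A) and a frozen residual W_res."""
    if niter is None:
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        U, S, V = U[:, :r], S[:r], Vh[:r, :].T
    else:
        # Fast randomized SVD; a few subspace iterations suffice in practice.
        U, S, V = torch.svd_lowrank(W, q=r, niter=niter)
    B = U * S.sqrt()                 # plays the role of lora_B, shape (out, r)
    A = S.sqrt().unsqueeze(1) * V.T  # plays the role of lora_A, shape (r, in)
    W_res = W - B @ A                # residual part, kept frozen during fine-tuning
    return A, B, W_res

W = torch.randn(256, 512)
A, B, W_res = pissa_init(W, r=16, niter=4)
assert torch.allclose(W_res + B @ A, W, atol=1e-5)  # W is reconstructed exactly
```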