Replies: 6 comments 5 replies
-
Hi there, is this with the code provided in the book? I.e., using the Llama 3.2 1B model implementation I provided here together with the LoRA code from the appendix? In general, I have a few tips here: Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)
-
Thank you for the quick reply! I will definitely study the blog post. Yes, I used the code to build Llama 3.2 1B from the book, the instruction dataset used to finetune GPT-2, and the LoRA implementation with alpha = 16. Starting from the 2nd batch, the logits computed inside calc_loss_batch(input_batch, target_batch, model, device) contain NaN values. I tried gradient clipping, but it didn't help.
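For what it's worth, a quick way to confirm where the non-finite values first show up is to check the logits right after the forward pass. The sketch below mirrors what the book's calc_loss_batch does (move the batch to the device, compute the logits, flatten, cross entropy) with an added finiteness check; the name debug_loss_batch is only illustrative:

```python
import torch

def debug_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    # Report NaN/Inf logits as soon as they appear
    if not torch.isfinite(logits).all():
        n_nan = torch.isnan(logits).sum().item()
        n_inf = torch.isinf(logits).sum().item()
        print(f"Non-finite logits detected: {n_nan} NaN, {n_inf} Inf")
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss
```

If gradient clipping stays in the mix, torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) between loss.backward() and optimizer.step() is the usual placement, although it cannot help once the logits are already NaN in the forward pass.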
-
It trains fine on the same dataset without using LoRA. I'm on an M1 Mac with a 16GB GPU, so I froze all layers except the last linear layer and trained only that. I want to compare this type of fine-tuning with LoRA. |
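For reference, a minimal sketch of that frozen baseline could look like the following; it assumes the model exposes its output projection as out_head, as in the book's GPT and Llama implementations, and the optimizer settings are just placeholders:

```python
import torch

# Freeze everything, then unfreeze only the final linear (output) layer
for param in model.parameters():
    param.requires_grad = False
for param in model.out_head.parameters():
    param.requires_grad = True

# Hand only the trainable parameters to the optimizer
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-5, weight_decay=0.1
)
```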
-
I am sorry to hear about your situation. Hope you have a quick recovery! It would be greatly appreciated if you could try it out when possible. In the meantime, I will explore subscribing to a Cloud GPU service and will provide an update regarding the training process. |
-
Update: The model trained fine when I replaced only the W_query and W_value layers with LinearWithLoRA. The LoRA configuration uses rank = 8 and alpha = 16.
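In case it helps others, a selective replacement along these lines could look like the sketch below. It assumes the attention modules expose their projections as W_query and W_value nn.Linear layers (as in the book's attention code) and reuses the appendix's LinearWithLoRA wrapper:

```python
import torch.nn as nn

def replace_qv_with_lora(model, rank, alpha):
    # Wrap only the query and value projections; all other Linear layers stay untouched
    for module in list(model.modules()):
        if isinstance(getattr(module, "W_query", None), nn.Linear):
            module.W_query = LinearWithLoRA(module.W_query, rank, alpha)
        if isinstance(getattr(module, "W_value", None), nn.Linear):
            module.W_value = LinearWithLoRA(module.W_value, rank, alpha)

replace_qv_with_lora(model, rank=8, alpha=16)
```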
-
Finally, I am able to finetune with LoRA on all layers. class LoRALayer(nn.Module): …
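Since the posted code is cut off above, here is a minimal reconstruction in the spirit of the book's appendix E implementation; the exact modifications made in that comment are not visible, so treat this as a sketch. The one deliberate change from the plain appendix version is scaling the LoRA update by alpha / rank instead of alpha alone, following the original LoRA paper, which keeps the update magnitude comparable across ranks and can help with numerical stability:

```python
import math
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A gets a small random init, B starts at zero, so the LoRA branch
        # is a no-op at the start of finetuning
        self.A = nn.Parameter(torch.empty(in_dim, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.rank = rank
        self.alpha = alpha

    def forward(self, x):
        # alpha / rank scaling instead of alpha alone (see note above)
        return (self.alpha / self.rank) * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear  # frozen pretrained layer
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)
```

With this in place, recursively swapping every nn.Linear for LinearWithLoRA (as the appendix's replace_linear_with_lora helper does) and leaving only the A and B matrices trainable gives LoRA on all layers.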
-
Dear Sebastian,
Thank you for your exceptional work in making LLMs more accessible.
I am currently fine-tuning the Llama 3.2 1B model using LoRA, but I’ve encountered an issue where train_loss becomes NaN. This appears to be caused by the logits in the forward pass containing NaN or Inf values.
Are there specific recommendations for adjusting the alpha parameter or initializing the matrices A and B to address this problem?