Replies: 6 comments 5 replies
-
Hi there, is this with the code provided in the book? I.e., using the Llama 3.2 1B model implementation I provided here together with the LoRA code from the appendix? In general, I have a few tips here: Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)
-
Thank you for the quick reply! I will definitely study the blog post. Yes, I used the code to build Llama 3.2 1B from the book, the instruction dataset used to finetune GPT-2, and the LoRA implementation with alpha = 16. Starting from the 2nd batch, the logits computed inside calc_loss_batch(input_batch, target_batch, model, device) contain NaN values. I tried gradient clipping, but it didn't help.
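For what it's worth, a quick way to confirm where the non-finite values first show up is to check the logits right after the forward pass. The sketch below mirrors what the book's calc_loss_batch does (move the batch to the device, compute the logits, flatten, cross entropy) with an added finiteness check; the name debug_loss_batch is only illustrative:

```python
import torch

def debug_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    # Report NaN/Inf logits as soon as they appear
    if not torch.isfinite(logits).all():
        n_nan = torch.isnan(logits).sum().item()
        n_inf = torch.isinf(logits).sum().item()
        print(f"Non-finite logits detected: {n_nan} NaN, {n_inf} Inf")
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss
```

If gradient clipping stays in the mix, torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) between loss.backward() and optimizer.step() is the usual placement, although it cannot help once the logits are already NaN in the forward pass.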
-
It trains fine on the same dataset without using LoRA. I'm on an M1 Mac with a 16GB GPU, so I froze all layers except the last linear layer and trained only that. I want to compare this type of fine-tuning with LoRA. |
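For reference, a minimal sketch of that frozen baseline could look like the following; it assumes the model exposes its output projection as out_head, as in the book's GPT and Llama implementations, and the optimizer settings are just placeholders:

```python
import torch

# Freeze everything, then unfreeze only the final linear (output) layer
for param in model.parameters():
    param.requires_grad = False
for param in model.out_head.parameters():
    param.requires_grad = True

# Hand only the trainable parameters to the optimizer
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-5, weight_decay=0.1
)
```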
-
I am sorry to hear about your situation. Hope you have a quick recovery! It would be greatly appreciated if you could try it out when possible. In the meantime, I will explore subscribing to a Cloud GPU service and will provide an update regarding the training process. |
-
Update: The model trained fine when I replaced only the W_query and W_value layers with LinearWithLoRA. The LoRA configuration uses rank = 8 and alpha = 16.
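In case it helps others, a selective replacement along these lines could look like the sketch below. It assumes the attention modules expose their projections as W_query and W_value nn.Linear layers (as in the book's attention code) and reuses the appendix's LinearWithLoRA wrapper:

```python
import torch.nn as nn

def replace_qv_with_lora(model, rank, alpha):
    # Wrap only the query and value projections; all other Linear layers stay untouched
    for module in list(model.modules()):
        if isinstance(getattr(module, "W_query", None), nn.Linear):
            module.W_query = LinearWithLoRA(module.W_query, rank, alpha)
        if isinstance(getattr(module, "W_value", None), nn.Linear):
            module.W_value = LinearWithLoRA(module.W_value, rank, alpha)

replace_qv_with_lora(model, rank=8, alpha=16)
```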
-
Finally, I am able to finetune with LoRA on all layers. class LoRALayer(nn.Module): …
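Since the posted code is cut off above, here is a minimal reconstruction in the spirit of the book's appendix E implementation; the exact modifications made in that comment are not visible, so treat this as a sketch. The one deliberate change from the plain appendix version is scaling the LoRA update by alpha / rank instead of alpha alone, following the original LoRA paper, which keeps the update magnitude comparable across ranks and can help with numerical stability:

```python
import math
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A gets a small random init, B starts at zero, so the LoRA branch
        # is a no-op at the start of finetuning
        self.A = nn.Parameter(torch.empty(in_dim, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.rank = rank
        self.alpha = alpha

    def forward(self, x):
        # alpha / rank scaling instead of alpha alone (see note above)
        return (self.alpha / self.rank) * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear  # frozen pretrained layer
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)
```

With this in place, recursively swapping every nn.Linear for LinearWithLoRA (as the appendix's replace_linear_with_lora helper does) and leaving only the A and B matrices trainable gives LoRA on all layers.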
-
Dear Sebastian,
Thank you for your exceptional work in making LLMs more accessible.
I am currently fine-tuning the Llama 3.2 1B model using LoRA, but I’ve encountered an issue where train_loss becomes NaN. This appears to be caused by the logits in the forward pass containing NaN or Inf values.
Are there specific recommendations for adjusting the alpha parameter or initializing the matrices A and B to address this problem?