Training loss while fine-tuning Llama 3.1 with LoRA is very high compared to RTX 3090 #721
Labels: bug
System Info
Who can help?
@michaelbenayoun
Reproduction (minimal, reproducible, runnable)
I am following the tutorial here: https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm
I have been using the training script found here: https://github.com/huggingface/optimum-neuron/blob/main/docs/source/training_tutorials/sft_lora_finetune_llm.py
I used a trn1.2xlarge instance with 2 Neuron cores to train Llama 3.1 8B with LoRA, using tensor parallelism with a degree of 2. However, the training loss is very high compared to the same model trained with the same parameters on a single RTX 3090. The training losses look like this:
I ran these experiments using databricks/databricks-dolly-15k and timdettmers/openassistant-guanaco.
I also changed the `tokenize` function under `_prepare_non_packed_dataloader` in `trl/trainer/sft_trainer` so that it pads every sample to `max_length`, so it behaves the same as optimum-neuron (a sketch of that change follows the RTX 3090 script reference below).
My training script for the trn1.2xlarge instance (for the dolly dataset; for the openassistant dataset I change the formatting function so that it just returns `examples["text"]` directly): train.py
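Condensed, that train.py follows the linked tutorial script and looks roughly like this (the hyperparameter values here are placeholders rather than my exact settings, and the optimum-neuron class/argument names are taken from the tutorial, so they may differ between versions):

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer
from optimum.neuron.distributed import lazy_load_for_parallelism

model_id = "meta-llama/Meta-Llama-3.1-8B"
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")


def format_dolly(examples):
    # Batched formatting function using the tutorial's prompt template.
    # For openassistant-guanaco I return examples["text"] directly instead.
    output = []
    for i in range(len(examples["instruction"])):
        instruction = f"### Instruction\n{examples['instruction'][i]}"
        context = f"### Context\n{examples['context'][i]}" if examples["context"][i] else None
        response = f"### Answer\n{examples['response'][i]}"
        output.append("\n\n".join(p for p in (instruction, context, response) if p))
    return output


tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Shard the model over the 2 Neuron cores (tensor parallelism degree 2).
with lazy_load_for_parallelism(tensor_parallel_size=2):
    model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

sft_config = NeuronSFTConfig(
    output_dir="llama31_lora",
    max_seq_length=1024,
    packing=False,
    tensor_parallel_size=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    bf16=True,
    logging_steps=10,
)

trainer = NeuronSFTTrainer(
    model=model,
    args=sft_config,
    peft_config=lora_config,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=format_dolly,
)
trainer.train()
```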
My bash script for graph compilation: compile.sh
and my bash script for training: train.sh
The script I use to train on the RTX 3090: train.py
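The trl change mentioned above, so that the GPU run pads every sample to `max_length` like optimum-neuron, is roughly the following edit to the inner `tokenize` function of `_prepare_non_packed_dataloader` (paraphrased; the surrounding code differs between trl versions):

```python
# Inside trl/trainer/sft_trainer.py, _prepare_non_packed_dataloader():
# stock trl tokenizes with padding=False; I switch it to padding="max_length"
# so every sample is padded to max_seq_length, as optimum-neuron does.
def tokenize(element):
    outputs = tokenizer(
        element[dataset_text_field] if formatting_func is None else formatting_func(element),
        add_special_tokens=add_special_tokens,
        truncation=True,
        padding="max_length",  # was: padding=False
        max_length=max_seq_length,
        return_overflowing_tokens=False,
        return_length=False,
    )
    return {"input_ids": outputs["input_ids"], "attention_mask": outputs["attention_mask"]}
```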
Disabling embedding parallelization on the Trainium instance lowers the training loss, but it is still consistently higher than the loss on the RTX 3090. Also, with embedding parallelization enabled the model is saved incorrectly: the trained model has the additional layers `base_model.model.lm_head.weight` and `base_model.model.model.embed_tokens.weight`. Additionally, only half of `base_model.model.model.embed_tokens.weight` is saved (the shape is (64128, 4096) instead of (128256, 4096)), but perhaps that should be a separate issue.
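For reference, this is roughly how I inspect the saved adapter (the checkpoint path and file name here are illustrative and depend on how the adapter gets saved):

```python
# List every tensor in the saved LoRA adapter to see which keys were written
# and with what shapes. Adjust the path to wherever the trainer saved the adapter.
from safetensors.torch import load_file

state_dict = load_file("llama31_lora/adapter_model.safetensors")
for name, tensor in sorted(state_dict.items()):
    print(name, tuple(tensor.shape))

# With embedding parallelization enabled, the listing includes the unexpected
#   base_model.model.lm_head.weight
#   base_model.model.model.embed_tokens.weight  -> shape (64128, 4096), i.e. half of (128256, 4096)
```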
Expected behavior
I expect the training loss to be much closer to the loss I get when I train the model on an RTX 3090 instead of 2 Trainium Neuron cores.