fullfinetune leaves model unusable #648
Comments
I think this is related to FSDP: huggingface/transformers#26498
We had a bad experience with lit-gpt when finetuning on multi-GPU: #652
@windprak Is there a guide on how to use DeepSpeed with Fabric, or some example? I'm trying to do it and keep failing to load the model weights.
We have improved things a lot in recent months and now also provide configuration files for good out-of-the-box performance, e.g., see https://github.com/Lightning-AI/litgpt/tree/main/config_hub/finetune. Please feel free to reopen this issue and the discussion if you have any follow-up questions or concerns.
I prepared a dataset in Alpaca style and fine-tuned LLaMA 7B on it with the original Alpaca code. That worked fine and gave usable results. I then switched to lit-gpt, followed the tutorials, and started training Llama 2 on the same dataset. It already fails at the first validation step, generating only gibberish, whereas training on the original Alpaca dataset does not. Also, generate/chat works fine. I don't understand how my dataset breaks the model. Here is what I observed:
```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
[rank: 3] Seed set to 1337
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
[rank: 0] Seed set to 1337
[rank: 2] Seed set to 1337
[rank: 1] Seed set to 1337
[rank: 2] Seed set to 1339
[rank: 3] Seed set to 1340
[rank: 0] Seed set to 1337
[rank: 1] Seed set to 1338
/root/miniconda3/envs/lit/lib/python3.10/site-packages/lightning/fabric/wrappers.py:176: You are calling the method `GPT.set_kv_cache()` from outside the model. This will bypass the wrapper from the strategy and result in incorrect behavior in `.backward()`. You should pass your inputs through `GPT.forward()`.
{'eval_interval': 1000, 'save_interval': 2000, 'eval_iters': 100, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 4, 'learning_rate': 5e-05, 'batch_size': 1.0, 'micro_batch_size': 1, 'gradient_accumulation_iters': 1.0, 'epoch_size': 14771, 'num_epochs': 3, 'max_iters': 11078, 'weight_decay': 0.0, 'warmup_steps': 22156.0}
Loading model '/home/exstorage/meta-llama/Llama-2-13b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-13b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 40, 'n_head': 40, 'n_embd': 5120, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 40, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 13824, 'rope_condense_ratio': 1, 'rope_base': 10000, 'head_size': 128, 'rope_n_elem': 128}
Number of trainable parameters: 13,015,864,320
The longest sequence length in the train data is 4096, the model's maximum sequence length is 4096 and context length is 4096
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.
Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.
Response:OOOOOOOOOOOOOOOOOOOOtOO!!OOOO!O speakOO andOO -OOOOO
Estimated TFLOPs: 1554.39
Measured TFLOPs: 1428.29
iter 0 step 1: loss 9.8921, iter time: 3720.26ms (optimizer.step)
iter 1 step 2: loss 9.8023, iter time: 2078.05ms (optimizer.step)
```
I downloaded the model again and reinstalled everything, but the results are the same, and the final fine-tuned model also only produces garbage.
I really don't know where else to look for a solution. Has anyone experienced this?