
RuntimeError: PytorchStreamReader failed reading file data/0: invalid header or archive is corrupted #1156

Open
vip-china opened this issue Jan 20, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@vip-china

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

The correct LoRA adapter is generated after training completes.

Current behaviour

Merging the model with the following command produces an error:

python3 -m axolotl.cli.merge_lora sft_34b.yml --lora_model_dir="/workspace/axolotl/output/Yi-34B/ljf-yi-34b-lora" --output_dir=/data1/ljf2/data-check-test
[screenshot: RuntimeError: PytorchStreamReader failed reading file data/0: invalid header or archive is corrupted]
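For context, PyTorch's current checkpoint format is a ZIP archive, so this error usually means the file on disk is not a valid ZIP (for example, a truncated or partial write). A minimal stdlib check, with a hypothetical path, might look like:

```python
import zipfile

def looks_like_torch_archive(path: str) -> bool:
    # torch.save (zipfile-based format) writes a ZIP archive;
    # PytorchStreamReader raises "invalid header or archive is
    # corrupted" when the file is not one (e.g. a truncated write).
    return zipfile.is_zipfile(path)

# Hypothetical path; substitute the real adapter file:
# looks_like_torch_archive("/workspace/axolotl/output/Yi-34B/ljf-yi-34b-lora/adapter_model.bin")
```

If this returns False for the adapter file, the file itself is damaged and no loader settings will fix it.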

Steps to reproduce

The following parameter has no effect:
save_safetensors: true
What is actually generated is adapter_model.bin rather than adapter_model.safetensors.

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@vip-china vip-china added the bug Something isn't working label Jan 20, 2024
@winglian
Collaborator

Is the issue that the merge doesn't work, or that specifying save_safetensors produces a pytorch_model.bin, or both?

@vip-china
Author

Specifying save_safetensors produces a pytorch_model.bin.

@NanoCode012
Collaborator

Merging the model with the following command produces an error:

Hey, it seems like the PeftModel loading failed. Can you check that the files in lora_model_dir are valid (i.e. not just a few KB)?

Did you run out of space during training?

Since the training had a few checkpoints, could you try pointing the model_dir to one of those and see what happens?
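The size check suggested above can be scripted. This is a sketch with hypothetical file names (the usual PEFT adapter outputs) and a hypothetical size threshold; a healthy 34B LoRA adapter should be far larger than a few KB:

```python
import os

# Assumed adapter file names; adjust to what is actually in lora_model_dir.
ADAPTER_FILES = ("adapter_model.bin", "adapter_model.safetensors")

def suspicious_adapters(lora_model_dir: str, min_bytes: int = 1024 * 1024):
    """Return adapter files that look truncated. A file of only a few KB
    usually means the save failed, e.g. from running out of disk space."""
    flagged = []
    for name in ADAPTER_FILES:
        path = os.path.join(lora_model_dir, name)
        if os.path.exists(path) and os.path.getsize(path) < min_bytes:
            flagged.append(name)
    return flagged
```

Running this over lora_model_dir and each checkpoint directory would quickly show which saves (if any) completed successfully.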

The following parameter has no effect:
save_safetensors: true

I'm a bit confused by this, since you said the merge failed.

@vip-china
Author

Now I am resuming SFT training from a checkpoint and hitting this error again.

I have configured these parameters:
use_reentrant: true
resume_from_checkpoint: /workspace/axolotl-main/checkpoint-5865

[screenshots: the PytorchStreamReader error raised again while resuming from the checkpoint]

@winglian
Collaborator

Resuming from a "PEFT checkpoint" is not the same as resuming from a regular checkpoint. You'll want to set lora_model_dir to point to the checkpoint directory, IIRC. @NanoCode012 does that sound right?
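A sketch of what that config change might look like, using the checkpoint path from this thread. Treat the exact keys as an assumption to verify against the axolotl config reference:

```yaml
# Load the LoRA adapter from the PEFT checkpoint directory instead of
# using trainer-state resume (assumed usage, per the suggestion above).
lora_model_dir: /workspace/axolotl-main/checkpoint-5865
```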
