HFSaveCheckpoint does not work with deepspeed #273
Thanks for the report! This is tricky: when using DeepSpeed stage 3, checkpoints are saved as shards rather than a single checkpoint file. One thing you might benefit from is saving normally using Lightning, then using https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/utilities/deepspeed.py#L52 to combine the shards into one checkpoint. After this, load the combined checkpoint into the model. It would be nice to have an automated system to do this, but I'm worried that adding too much automation will cause overhead. Let me know if this solution works for you in the meantime!
@SeanNaren, thanks, that solution worked nicely. I added this code to the end of the script in the original post:

```python
from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

aggregated_checkpoint_path = "checkpoints/aggregated"
hf_checkpoint_path = "checkpoints/hf_checkpoint"

# Convert the sharded checkpoint files into a single checkpoint file
# https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/utilities/deepspeed.py#L52
convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_callback.best_model_path,
    aggregated_checkpoint_path
)

# Load the best model from the aggregated checkpoint file
best_model = TextClassificationTransformer.load_from_checkpoint(
    aggregated_checkpoint_path
)

# Save the model and tokenizer as a HF checkpoint
best_model.model.save_pretrained(hf_checkpoint_path)
tokenizer.save_pretrained(hf_checkpoint_path)
```
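As a sanity check, the exported directory should then load with plain transformers; a minimal sketch (the Auto* classes are assumed from the text-classification task used above):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reload the exported HF checkpoint directory with vanilla transformers
hf_model = AutoModelForSequenceClassification.from_pretrained("checkpoints/hf_checkpoint")
hf_tokenizer = AutoTokenizer.from_pretrained("checkpoints/hf_checkpoint")
```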
@SeanNaren one follow-up on this: although the code above worked in a local environment (1 GPU on 1 node), I run into an error when running on two nodes. It looks like the conversion is only seeing the checkpoint files from one of the two nodes. Is there a way to run the checkpoint conversion function in a multinode environment, or should I just be running it in a separate script/process?
There is probably a more elegant solution, but what ended up working for me was to have the root process of each node upload the files to Azure immediately after training:

```python
# Run one upload process per node to make sure we capture all the sharded checkpoint files
if os.environ.get("LOCAL_RANK") == "0":
    print(f"Uploading best model checkpoints (sharded) to {checkpoints_upload_path}")
    datastore.upload(
        src_dir=checkpoint_callback.best_model_path,
        target_path=checkpoints_upload_path,
        overwrite=True
    )
```

In a separate process I can then download all these sharded checkpoint files from Azure and run the code above to convert them to a single checkpoint file.
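A rough sketch of that separate download-and-convert process, assuming an Azure ML workspace with a default datastore; the directory names and the `prefix`/`target_path` arguments mirror the upload snippet above and are illustrative, not from the original post:

```python
import os

from azureml.core import Workspace
from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Same remote path the per-node upload processes wrote to (assumed name)
checkpoints_upload_path = "best_model_checkpoints"
local_shards_dir = "downloaded_checkpoints"
aggregated_checkpoint_path = "checkpoints/aggregated"

# Pull down every shard uploaded from every node
datastore.download(
    target_path=local_shards_dir,
    prefix=checkpoints_upload_path,
    overwrite=True,
)

# Combine the shards into a single fp32 checkpoint, as in the earlier snippet
convert_zero_checkpoint_to_fp32_state_dict(
    os.path.join(local_shards_dir, checkpoints_upload_path),
    aggregated_checkpoint_path,
)
```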
Getting this error while loading a language model:

```
RuntimeError: Error(s) in loading state_dict for LanguageModelingTransformer:
    Missing key(s) in state_dict: "model.lm_head.weight".
```

Any leads?
🐛 Bug
HFSaveCheckpoint does not save a HuggingFace checkpoint when the model is trained with deepspeed. No message or warning appears to indicate that the HF checkpoint did not save.
To Reproduce
Use the HFSaveCheckpoint callback when training with deepspeed. I encountered this in both a multinode (Azure) and a single-node (local) environment.
Code sample
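A minimal sketch of a setup along these lines, assuming lightning-transformers 0.2.x import paths for `HFSaveCheckpoint` and the text-classification task; the model name, dataset, and paths are illustrative, not from the original report:

```python
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.plugins.checkpoint import HFSaveCheckpoint
from lightning_transformers.task.nlp.text_classification import (
    TextClassificationDataModule,
    TextClassificationTransformer,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TextClassificationTransformer(
    pretrained_model_name_or_path="bert-base-uncased", num_labels=2
)
dm = TextClassificationDataModule(
    batch_size=8,
    dataset_name="glue",
    dataset_config_name="sst2",
    tokenizer=tokenizer,
)

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    dirpath="checkpoints", monitor="val_loss", save_top_k=1
)

# HFSaveCheckpoint is expected to also write a HuggingFace-format checkpoint,
# but with the deepspeed strategy no HF checkpoint appears and no warning is raised.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    strategy="deepspeed_stage_3",
    precision=16,
    max_epochs=1,
    callbacks=[checkpoint_callback],
    plugins=HFSaveCheckpoint(model=model),
)
trainer.fit(model, dm)
```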
Expected behavior
Either a warning is thrown or the HF model saves properly.
Environment
Lightning transformers 0.2.1