
Support for gradient_checkpointing #9

Open
Richar-Du opened this issue Jul 13, 2023 · 3 comments
@Richar-Du

Thanks for your awesome work! There is a small problem: when I fine-tune long_llama with gradient_checkpointing, it raises an error:
[screenshot of the error traceback]
Could you please update the code in transformers so that long_llama supports gradient_checkpointing? I think it would be useful for the community.
@CStanKonrad

@CStanKonrad
Owner

Hi, thanks for the request. In a recent commit, I have added initial support for gradient checkpointing (it simply skips the memory layers). As of writing, it is not yet present in the Hugging Face repository, so to use it you can download the code from the src directory of this repository and write something like this:

import torch
from transformers import LlamaTokenizer
# modeling_longllama.py downloaded from the src directory of this repository
# and placed next to this script (plain import instead of a relative one)
from modeling_longllama import LongLlamaForCausalLM

MODEL_PATH = "syzymon/long_llama_3b"

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LongLlamaForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float32)
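(Not part of the original reply: a minimal sketch of how gradient checkpointing could then be enabled on the loaded model, assuming the standard Hugging Face gradient_checkpointing_enable() API is wired up by the commit mentioned above.)

# Sketch only: standard Hugging Face calls, assumed to be supported
# by the LongLlama code after the recent commit.
model.gradient_checkpointing_enable()   # recompute activations during backward
model.config.use_cache = False          # KV caching conflicts with checkpointing

inputs = tokenizer("Hello, LongLLaMA!", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()                 # memory savings apply to this backward pass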

@Richar-Du
Author

Richar-Du commented Jul 14, 2023

Thanks for your commit!

Now I would like to fine-tune longllama, but the sequences are too long and training runs into CUDA OOM (4x 80 GB). I wonder if I could fine-tune longllama with a regular framework that has no support for long contexts (e.g. the training framework of Alpaca or Vicuna). If not, could you please release the fine-tuning code for longllama?

@CStanKonrad
Owner

I apologize for the late response. We have recently published code that allows fine-tuning the model on a single A100 80GB GPU. We use a total context size of 2048, with last_context_length set to 1024. For shorter inputs, we randomly decide how much of the data will be present in memory; we achieve this by randomly padding the input.
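(Illustration only, not the released training code: a rough sketch of the random-padding idea described above. The constants, the helper name randomly_pad, and the choice of left padding are assumptions.)

import random
import torch

TOTAL_CONTEXT = 2048        # total context size used in fine-tuning
LAST_CONTEXT_LENGTH = 1024  # tokens handled by the local (non-memory) context

def randomly_pad(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # Illustrative only: left-pad a short example by a random amount so that
    # a varying number of real tokens ends up in the memory layers.
    seq_len = input_ids.shape[-1]
    max_pad = max(TOTAL_CONTEXT - seq_len, 0)
    pad_len = random.randint(0, max_pad)
    padding = torch.full((pad_len,), pad_token_id, dtype=input_ids.dtype)
    return torch.cat([padding, input_ids])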

You can try the instruction+chat fine-tuned model in the Colab.

For the Colab model, we provide the fine-tuning config and a log of the training loss.
