Replication of finetuning code #6

Open
VilhelmHovland opened this issue Jun 5, 2024 · 17 comments
Labels
question Further information is requested

Comments

@VilhelmHovland

Hello, I want to try finetuning your model with my own data, but I have two questions:

  1. I am trying to replicate your finetuning code, but when I try finetuning the larger version of FLAN-T5 I run into memory capacity issues. I am just using the wordnet dataset from huggingface, training one epoch with a batch size of 1 and reduced lengths. It does not appear to run on multiple nodes. How could I solve this?
  2. How should I format my data in order to use it for further finetuning?

Thank you for any assistance here.

@jacklanda

I suggest giving up on the reproduction, my friend.

Code tastes bitter and Truth goes opaque.

@akutuzov
Member

akutuzov commented Jun 6, 2024

Hi @VilhelmHovland

  1. What exact version of FLAN-T5 are you using, and what fine-tuning parameters? To fine-tune FLAN-T5 Large, 40 GB of GPU RAM should be enough (probably even 24). For the XL version, you'll need more. We did not fine-tune on multiple nodes, but used multiple GPUs on one node to increase the global batch size, and it worked fine. To be more precise, using 8 GPUs with 64 GB of RAM each allowed us to fine-tune FLAN-T5 XL with a global batch size of 32 (we also set gradient_accumulation_steps=4 and truncated the maximum input length to 160 tokens). You can probably go even beyond that by using reduced precision.
  2. Our fine-tuning code assumes your training dataset is a tab-separated file with two columns: examples and definitions. The validation dataset should be in the same format, of course.
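
As a rough illustration of how the numbers in point 1 fit together, here is a minimal sketch of the usual global batch size arithmetic. The function name is made up for this example, and the per-device batch size of 1 is an inference from the figures above (8 GPUs, gradient_accumulation_steps=4, global batch size 32), assuming the quoted global batch size already includes gradient accumulation.

```python
def global_batch_size(per_device_batch_size: int,
                      num_gpus: int,
                      gradient_accumulation_steps: int = 1) -> int:
    """Number of training examples that contribute to one optimizer update."""
    return per_device_batch_size * num_gpus * gradient_accumulation_steps


# Figures from the reply above: 8 GPUs and gradient_accumulation_steps=4.
# A per-device batch size of 1 is an assumption that makes the product 32.
print(global_batch_size(per_device_batch_size=1,
                        num_gpus=8,
                        gradient_accumulation_steps=4))
```

With the 4 GPUs mentioned later in the thread, the same product would need either a larger per-device batch size or more accumulation steps to reach 32.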

Any other questions are welcome.

@akutuzov
Member

akutuzov commented Jun 6, 2024

I suggest giving up on the reproduction, my friend.
Code tastes bitter and Truth goes opaque.

@jacklanda I am not sure what you mean by that.

@glnmario
Collaborator

glnmario commented Jun 6, 2024

I'm guessing poetry generation 🍷

@VilhelmHovland
Author

VilhelmHovland commented Jun 12, 2024

@akutuzov Thank you for your response. The parameters I have been testing with have been:
--model_name_or_path="google/flan-t5-xl"
--cache_dir="/vilhelm/.cache/"
--do_train
--do_eval
--dataset_name="marksverdhei/wordnet-definitions-en-2021"
--output_dir="/vilhelm/finetune_output/"
--overwrite_output_dir
--evaluation_strategy=epoch
--logging_strategy=epoch
--per_device_train_batch_size=1
--per_device_eval_batch_size=1
--predict_with_generate
--save_total_limit=5
--max_source_length=5
--max_target_length=5
--fp16=True
--num_train_epochs=1
--save_strategy=epoch
--load_best_model_at_end=True
--metric_for_best_model=eval_rouge1
--ddp_find_unused_parameters=False
--optim=adafactor

I have been running it using 4 32GB V100 gpus at the Puhti supercomputer, on a single node.

@akutuzov
Member

@VilhelmHovland I believe the root of your troubles is this line:
--dataset_name="marksverdhei/wordnet-definitions-en-2021"

You are trying to use the Wordnet dataset directly as it is on HF. We didn't try that, and I doubt the fine-tuning script deals with this well. As mentioned before, we fine-tune on tab-separated files with two columns: examples and definitions, without directly using the datasets library. This allows much more flexibility. You should point to the training and validation data files with these arguments:

--train_file ${TRAIN_DATASET} \
--validation_file ${VAL_DATASET} \

(see the example here)

Note that the examples should be already augmented with the instruction prompt ("What is the definition of TARGET_WORD?" or whatever prompt you are using).
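
A minimal sketch of what "already augmented with the instruction prompt" could look like in practice. The helper names, the `example`/`definition` header and the input triples are assumptions made for illustration; they are not taken from the fine-tuning script itself.

```python
import io

# Prompt wording quoted in the comment above; TARGET_WORD becomes {target}.
PROMPT_TEMPLATE = "What is the definition of {target}?"


def augment(usage: str, target: str) -> str:
    """Append the instruction prompt to a usage example."""
    return f"{usage.strip()} {PROMPT_TEMPLATE.format(target=target)}"


def write_tsv(handle, triples) -> None:
    """Write (usage, target, definition) triples as two tab-separated columns."""
    handle.write("example\tdefinition\n")  # header names are an assumption
    for usage, target, definition in triples:
        handle.write(f"{augment(usage, target)}\t{definition.strip()}\n")


# In-memory demo; a file from open(path, "w", encoding="utf-8") works the same way.
buffer = io.StringIO()
write_tsv(buffer, [("an easy job", "easy", "posing no difficulty")])
print(buffer.getvalue(), end="")
```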

@akutuzov akutuzov added the question Further information is requested label Jun 13, 2024
@VilhelmHovland
Author

@akutuzov I see, thank you. Is the exact data you used available anywhere, or do I need to process the CoDWoE and naacl data?

@akutuzov
Member

"naacl data" means datasets from Ishiwatari et al. 2019, right?
Then yes, you'll have to convert them to the tab-separated format I described above. Same with CoDWoE - it comes as json files, but it's trivial to convert them to .tsv.

We did not publish our converted versions, since we felt it would not be polite to re-distribute datasets created by others (simply saved in another format). Again, it should be trivial to convert these datasets to .tsv and add the instruction prompt.
If you encounter any difficulties with that, get in touch with me, I'll share our preprocessed files privately.
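
For readers attempting the conversion themselves, a hedged sketch of turning a JSON list of entries into the two-column format. The key names (`word`, `example`, `gloss`) are placeholders and may not match the actual dataset files, so check them against your data before use.

```python
import json

PROMPT = "What is the definition of {word}?"


def clean(text: str) -> str:
    """Collapse whitespace so stray tabs or newlines cannot break the TSV."""
    return " ".join(text.split())


def entries_to_tsv(entries, word_key="word", example_key="example",
                   definition_key="gloss") -> str:
    """Convert a list of dicts into TSV text; the key names are assumptions."""
    lines = ["example\tdefinition"]
    for entry in entries:
        word = entry.get(word_key)
        example = entry.get(example_key)
        definition = entry.get(definition_key)
        if not (word and example and definition):
            continue  # skip incomplete entries
        source = f"{clean(example)} {PROMPT.format(word=word)}"
        lines.append(f"{source}\t{clean(definition)}")
    return "\n".join(lines) + "\n"


# Hypothetical input; the second entry is skipped because it is incomplete.
raw = ('[{"word": "easy", "example": "an easy job", '
       '"gloss": "posing no difficulty"}, {"word": "cranial"}]')
print(entries_to_tsv(json.loads(raw)), end="")
```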

@VilhelmHovland
Author

Hello again, I have now changed my data, but I am still getting the same error. I am using the same parameters except with direct data files. I formatted them like this, in .tsv files, does it look correct? What else could be causing issues?

example	definition
cranial pressure What is the definition of cranial?	of or relating to the cranium which encloses the brain
an easy job What is the definition of easy?	posing no difficulty
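
One way to answer "does it look correct?" mechanically is to check that every line splits into exactly two non-empty tab-separated fields. This is only a sketch with a made-up helper name; it does not reproduce whatever checks the fine-tuning script performs.

```python
def check_tsv_lines(lines, expected_columns=2):
    """Return (line_number, message) pairs for malformed lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != expected_columns:
            problems.append(
                (number, f"expected {expected_columns} fields, found {len(fields)}")
            )
        elif any(not field.strip() for field in fields):
            problems.append((number, "empty field"))
    return problems


sample = [
    "example\tdefinition\n",
    "an easy job What is the definition of easy?\tposing no difficulty\n",
    # The next line uses only spaces, so it counts as a single field.
    "an easy job What is the definition of easy? posing no difficulty\n",
]
print(check_tsv_lines(sample))  # [(3, 'expected 2 fields, found 1')]
```

A file whose columns were separated by spaces instead of tabs (easy to do by accident when copying from a web page) would be flagged on every line.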

@akutuzov
Member

@VilhelmHovland did you try fine-tuning a smaller model (flan-t5-base, for example), and/or removing the --fp16=True argument?

@akutuzov
Member

@VilhelmHovland I've just tried to fine-tune the flan-t5-base model on the few lines you quoted above. I repeated them multiple times, so that in the end I got a file with 12 instances (the file is here).

On this toy dataset, fine-tuning with batch size 4 and 2 epochs completed without any issues. I used one A100 GPU with 40GB of RAM. Here is the exact command:

python3 finetune_flan.py \
    --model_name_or_path google/flan-t5-base \
    --do_train \
    --do_eval \
    --train_file example_dataset.tsv \
    --validation_file example_dataset.tsv \
    --output_dir test_model \
    --overwrite_output_dir \
    --evaluation_strategy=epoch \
    --logging_strategy=epoch \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --predict_with_generate \
    --save_total_limit=5 \
    --max_source_length=192 \
    --max_target_length=128 \
    --bf16=False \
    --num_train_epochs=2 \
    --save_strategy=epoch \
    --load_best_model_at_end=True \
    --metric_for_best_model=eval_rouge1 \
    --ddp_find_unused_parameters=False \
    --optim=adafactor \
    --report_to=none

@VilhelmHovland
Author

Okay, I tried as well, it does work now, thank you. What would be the bottleneck for finetuning the larger models then? Is there any way I could get it to work for those as well?

@akutuzov
Member

Well, the usual procedure: set the per-device batch size to 1, and then increase it until you hit an out-of-memory error again. This will be your ceiling in terms of RAM. Often, you can increase the effective batch size even more by using gradient accumulation (at the cost of slower training).
Using more than one GPU (within one node) will also naturally allow you to have a larger global batch size, which is usually a good thing.
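
The search procedure described above can be sketched as a small loop. Everything here is illustrative: `try_batch` stands for "run one training step at this per-device batch size", `MemoryError` is a stand-in for the CUDA out-of-memory error a real run would raise, and doubling the size at each step is one possible choice rather than something prescribed in this thread.

```python
def find_max_batch_size(try_batch, start=1, limit=1024):
    """Grow the batch size until try_batch fails; return the largest size that worked.

    Returns 0 if even the starting size does not fit.
    """
    best = 0
    size = start
    while size <= limit:
        try:
            try_batch(size)
        except MemoryError:  # stand-in for a GPU out-of-memory error
            break
        best = size
        size *= 2
    return best


def fake_training_step(batch_size):
    """Pretend that anything above 6 examples per device runs out of memory."""
    if batch_size > 6:
        raise MemoryError("simulated out-of-memory")


print(find_max_batch_size(fake_training_step))  # 4
```

Once the per-device ceiling is known, gradient accumulation and additional GPUs raise the global batch size without raising per-device memory use.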

@VilhelmHovland
Author

@akutuzov Hello, and thank you very much for the help earlier. I was hoping you could give me some more advice, as I am still working with this model. The model does not seem to be learning: from what I can see in the logging, the loss starts and stays at 0.0 (though the logging seems to be very limited and only shows a single epoch). I therefore suspect the issue is still with the training data. Attached are the batch script and a data sample (in tsv format).

ft_flan.job.txt
data.txt

@akutuzov
Member

Hi @VilhelmHovland
Are you trying to fine-tune the already fine-tuned model ltg/flan-t5-definition-en-base? Where do the examples in your data file come from? If they come from CoDWoE, WordNet or Oxford, then no wonder the loss is zero: the model has already seen these definitions during our fine-tuning.

Otherwise, your SLURM script and data look good (of course I hope in reality you train on more examples, since in the attached file I see only 2, so it won't work with batch size 4 specified in the SLURM script).

@VilhelmHovland
Author

Yes, I am fine-tuning the already fine-tuned model, using definitions from the Historical Thesaurus of English from the Oxford English Dictionary, so I was expecting a very high loss. And yes, that was just sample data to show the structure; the dataset I have is fairly large.

@akutuzov
Member

Try to remove --fp16=True from the arguments of your run script.
Mixed precision can cause problems sometimes.
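
As background on why half precision can misbehave, the sketch below shows the numeric limits of the 16-bit float format using only the Python standard library. It illustrates the number format in general and is not a diagnosis of the zero loss reported above; the helper names are made up for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def overflows_fp16(x: float) -> bool:
    """True if x is too large in magnitude to be stored as a finite fp16 value."""
    try:
        struct.pack("<e", x)
    except (OverflowError, struct.error):
        return True
    return False


print(to_fp16(65504.0))          # 65504.0, the largest finite fp16 value
print(overflows_fp16(100000.0))  # True: does not fit in fp16
print(to_fp16(1e-8))             # 0.0: very small values underflow to zero
```

Values outside this range, such as large activations or tiny gradients, are the usual suspects when a run behaves differently with and without --fp16=True.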


4 participants