Bugs when fine tuning the gpt2 #12965
Pinging @sgugger
It's hard to investigate more without having the data. Adding padding when fine-tuning GPT-2, which does not have a padding token, is a very bad idea, and it shouldn't be necessary. Could you provide us with a reproducer that includes the data?
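For context, here is a minimal sketch (not from the original thread) showing that the stock GPT-2 tokenizer ships without a padding token; reusing the EOS token as a pad token is shown only as a commonly used assumption, not as advice from this thread:

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer has no padding token out of the box.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.pad_token)  # None

# Common workaround (an assumption here, not something this thread recommends):
# reuse the end-of-sequence token as the padding token.
tokenizer.pad_token = tokenizer.eos_token
print(tokenizer.pad_token)  # '<|endoftext|>'
```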
Thanks for your suggestion. I will check whether my data fits the default fine-tuning setup.
If it's not done by the tokenizer, then yes, it should be.
@sgugger Hello, I tried to reproduce this error. The texts above are the samples used for fine-tuning GPT-2; they come from the 'text' column of the csv files.
When I use another dataset, which has longer sentences than this one, there is no error and the fine-tuning process runs fine.
I also tried a sentiment analysis dataset, which likewise consists of relatively short sentences. The error came up there too.
I tried another way of organising the training corpus, as a txt file:
The same error occurs.
Yes, this all points to your corpus being too short to form a full batch. You should use a smaller batch size or a smaller block size.
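To make the "too short to form a full batch" point concrete, below is a simplified sketch (paraphrased, not the script's exact code) of how the run_clm examples chunk the tokenized corpus into fixed-size blocks. When --block_size is not passed, the script falls back to the model's maximum context length, which is 1024 for GPT-2, so a very short corpus produces zero blocks and hence an empty dataloader:

```python
# Simplified sketch of the grouping step used by the run_clm examples (paraphrased).
def group_texts(token_ids, block_size=1024):
    # Drop the trailing remainder that does not fill a whole block.
    total_length = (len(token_ids) // block_size) * block_size
    return [token_ids[i : i + block_size] for i in range(0, total_length, block_size)]

# A corpus of only ~800 tokens cannot fill a single 1024-token block:
print(len(group_texts(list(range(800)))))                  # 0 -> empty dataset, empty dataloader
# Lowering the block size recovers some training examples:
print(len(group_texts(list(range(800)), block_size=128)))  # 6
```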
Transformers Version: 4.8.2
Torch Version: 1.8.0
I am using the official script to fine-tune GPT-2 on csv files.
The script:
https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py
train and validation file makeup:
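The original file sample was not captured here; a purely hypothetical illustration of the layout described in this issue (a single 'text' column, one fairly short sample per row) could be built like this:

```python
import pandas as pd

# Hypothetical stand-in for the train/validation csv files described in the issue:
# a single 'text' column with one short sample per row.
pd.DataFrame(
    {"text": ["a short training sentence", "another short training sentence"]}
).to_csv("train.csv", index=False)
```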
My shell command:
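The exact command was not captured here either; a hedged reconstruction of a typical invocation of run_clm_no_trainer.py on csv files (the file names and hyperparameter values are assumptions) might look like:

```bash
python run_clm_no_trainer.py \
    --model_name_or_path gpt2 \
    --train_file train.csv \
    --validation_file validation.csv \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --num_train_epochs 3 \
    --output_dir ./gpt2-finetuned
```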
where the csv files contain a column named 'text' used for fine-tuning the model.
However, there are always errors like the ones below, which seem to be related to the length of the dataloader:
The next time I ran it, it returned a similar error:
Then I modified the input parameters of the tokenizer:
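The exact change is not shown in the issue; a plausible sketch of such a modification to the script's tokenization step is below (the padding/truncation arguments and the max_length value are assumptions). Forcing padding and a fixed max_length like this is what the maintainer advises against earlier in the thread, and padding short samples with the EOS token would also be consistent with the report that the generated texts come out short afterwards:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Assumption: GPT-2 has no pad token, so the EOS token is reused for padding.
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Hypothetical modification: force every sample to a fixed length.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

# Usage sketch: this is meant to be mapped over the dataset in batches,
# e.g. dataset.map(tokenize_function, batched=True)
print(len(tokenize_function({"text": ["a short sample"]})["input_ids"][0]))  # 128
```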
This seems to fix the problem. However, the generated texts are quite short after this change.
Any suggestions?