[s2s examples] convert existing scripts to run_seq2seq.py from finetune_trainer.py #10036
Comments
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi there, I think there might be a typo in your script. I'm not sure whether I'm correct, as I'm totally new to fine-tuning.
Examples are just that, and there is no guarantee that they are still the same 2.5 years after this thread was started. Chances are very high that many things discussed years ago won't work now, so you have to either (1) read the current version and adapt to it, or (2) use the transformers version from the date this thread was started, in which case it should work as discussed in this thread.
As `transformers` examples are evolving, it seems that the good old `finetune_trainer.py` is going to be moved into the unmaintained `examples/legacy/` area, and `run_seq2seq.py` is to be the new king, so let's automate this process.

Assuming your cmd script is `process.txt` (replace it with the file names that you have, one or many), let's auto-adjust it:

otherwise the results would be terrible.
a. you need to convert a normal dataset into jsonlines (unless the data is already on the `datasets` hub)
instructions are: https://huggingface.co/docs/datasets/loading_datasets.html#json-files
b. new arguments:
instead of
now you need:
Here is an example conversion script for the `wmt_en_ro` dataset:

Or if you find an existing dataset in `datasets`, you can supply it instead of the `--data_dir` arg as follows:

Here is the full conversion table for the 4 previously recommended datasets in the `examples/seq2seq` folder:

- `--data_dir wmt_en_de` => `--dataset_name wmt14 --dataset_config "de-en"`, or if you want the highest score use `--dataset_name wmt14-en-de-pre-processed`
- `--data_dir wmt_en_ro` => `--dataset_name wmt16 --dataset_config "ro-en"`
- `--data_dir cnn_dm` => `--dataset_name cnn_dailymail --dataset_config "3.0.0"`
- `--data_dir xsum` => `--dataset_name xsum`
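
The conversion table above can also be applied mechanically. The following is only a minimal sketch, not part of the original scripts: `convert_command` and `DATA_DIR_TO_DATASET` are hypothetical names, and it handles just the script name and the `--data_dir` argument. Any other renamed or dropped arguments still have to be adjusted by hand.

```python
import re

# Mapping copied from the conversion table above.
# (For wmt_en_de the thread also mentions wmt14-en-de-pre-processed
# as the higher-scoring alternative; it is not used here.)
DATA_DIR_TO_DATASET = {
    "wmt_en_de": '--dataset_name wmt14 --dataset_config "de-en"',
    "wmt_en_ro": '--dataset_name wmt16 --dataset_config "ro-en"',
    "cnn_dm": '--dataset_name cnn_dailymail --dataset_config "3.0.0"',
    "xsum": "--dataset_name xsum",
}


def convert_command(cmd: str) -> str:
    """Rewrite one finetune_trainer.py command line for run_seq2seq.py.

    Hypothetical helper: only the script name and --data_dir are touched.
    """
    cmd = cmd.replace("finetune_trainer.py", "run_seq2seq.py")

    def swap(match: re.Match) -> str:
        # Use the last path component so "data/xsum/" is treated as "xsum".
        name = match.group(1).rstrip("/").split("/")[-1]
        # Unknown data dirs are left untouched for manual conversion.
        return DATA_DIR_TO_DATASET.get(name, match.group(0))

    return re.sub(r"--data_dir[ =](\S+)", swap, cmd)


if __name__ == "__main__":
    old = "python finetune_trainer.py --data_dir wmt_en_ro --output_dir out"
    print(convert_command(old))
```

A `--data_dir` value that is not in the table is passed through unchanged, so a custom dataset still needs the jsonlines route described in (a).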
You will find more details here
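
For step (a), converting a line-aligned parallel dataset into jsonlines might look roughly like the sketch below. It rests on assumptions that are not confirmed in this thread: that the old data directory holds line-aligned files such as `train.source` and `train.target`, and that each output record should have the form `{"translation": {"en": ..., "ro": ...}}`. `to_jsonlines` is a hypothetical helper, so check the record layout against the version of `run_seq2seq.py` you are using.

```python
import json


def to_jsonlines(src_path, tgt_path, out_path, src_lang="en", tgt_lang="ro"):
    """Merge two line-aligned text files into one JSON lines file.

    Hypothetical helper; the {"translation": {...}} layout is an assumption.
    Returns the number of records written.
    """
    with open(src_path, encoding="utf-8") as f:
        src_lines = [line.rstrip("\n") for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = [line.rstrip("\n") for line in f]

    # A length mismatch means the files are not line-aligned.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{src_path} has {len(src_lines)} lines, "
            f"{tgt_path} has {len(tgt_lines)}"
        )

    with open(out_path, "w", encoding="utf-8") as out:
        for src, tgt in zip(src_lines, tgt_lines):
            record = {"translation": {src_lang: src, tgt_lang: tgt}}
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

    return len(src_lines)
```

Calling it once per split (for example train, validation and test) yields one file per split.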
t5-specific changes: from #10133 (comment)