
[s2s examples] convert existing scripts to run_seq2seq.py from finetune_trainer.py #10036

Closed
4 tasks done
stas00 opened this issue Feb 5, 2021 · 4 comments
stas00 commented Feb 5, 2021

As the transformers examples evolve, it seems that the good old finetune_trainer.py is going to be moved into the unmaintained examples/legacy/ area, and run_seq2seq.py is to be the new king, so let's automate the conversion process.

Assuming your command script is saved as process.txt (substitute the file name(s) you actually have, one or many), let's auto-adjust it:

1. Renames
# main name and args rename
perl -pi -e 's|finetune_trainer|run_seq2seq|g; s#--n_(train|val)#--max_$1_samples#g; \
s|--src_lang|--source_lang|g; s|--tgt_lang|--target_lang|g; s|--eval_beams|--num_beams|'  process.txt

# drop no longer supported args
perl -pi -e 's|--freeze_embeds||; s|--test_max_target_length[ =]+\d+||;' process.txt
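If you'd rather preview the renames before editing files in place, the same substitutions can be expressed with Python's re module. This is just a sketch equivalent to the perl one-liners above; the example command fragment is hypothetical:

```python
import re

# The same renames and drops as the perl one-liners above, expressed in Python
# so a command line can be sanity-checked before editing files in place.
RENAMES = [
    (r"finetune_trainer", "run_seq2seq"),
    (r"--n_(train|val)", r"--max_\1_samples"),
    (r"--src_lang", "--source_lang"),
    (r"--tgt_lang", "--target_lang"),
    (r"--eval_beams", "--num_beams"),
]
DROPPED = [r"--freeze_embeds", r"--test_max_target_length[ =]+\d+"]

def convert_cmd(cmd: str) -> str:
    for pat, repl in RENAMES:
        cmd = re.sub(pat, repl, cmd)
    for pat in DROPPED:
        cmd = re.sub(pat, "", cmd)
    return cmd

# hypothetical example command fragment
print(convert_cmd("python finetune_trainer.py --n_val 500 --src_lang en --tgt_lang ro"))
# python run_seq2seq.py --max_val_samples 500 --source_lang en --target_lang ro
```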

2. The T5 prefix is no longer added automatically, so you need to add it manually, e.g.:
--source_prefix "translate English to Romanian: "

otherwise the results would be terrible.
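Conceptually, the prefix is simply prepended to every source sentence before tokenization. A minimal sketch of the idea (the function name and structure here are illustrative, not the actual script internals):

```python
# Minimal sketch of what --source_prefix does during preprocessing:
# the prefix string is prepended to every source example before tokenization.
# The names below are illustrative, not run_seq2seq.py internals.
source_prefix = "translate English to Romanian: "

def add_prefix(examples):
    # examples: list of raw source sentences
    return [source_prefix + text for text in examples]

print(add_prefix(["The cat sat on the mat."]))
# ['translate English to Romanian: The cat sat on the mat.']
```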

3. Datasets are different

a. You need to convert a regular dataset into jsonlines format (unless the data is already on the datasets hub).
Instructions are here: https://huggingface.co/docs/datasets/loading_datasets.html#json-files

b. new arguments:

instead of

            --data_dir {data_dir}

now you need:

            --train_file {data_dir}/train.json
            --validation_file {data_dir}/val.json

Here is an example conversion script for the wmt_en_ro dataset:

# convert.py
# Converts the {split}.source / {split}.target text-file pairs into the
# jsonlines format expected by run_seq2seq.py (one JSON record per line).
import json
import re

src_lang, tgt_lang = ["en", "ro"]

for split in ["train", "val", "test"]:
    recs = []
    fout = f"{split}.json"
    with open(fout, "w", encoding="utf-8") as f:
        for kind in ["source", "target"]:
            fin = f"{split}.{kind}"
            with open(fin, encoding="utf-8") as fh:
                recs.append([line.strip() for line in fh])
        for src, tgt in zip(*recs):
            out = {"translation": {src_lang: src, tgt_lang: tgt}}
            # flatten the indented dump onto a single line
            x = json.dumps(out, indent=0, ensure_ascii=False)
            x = re.sub(r"\n", " ", x, 0, re.M)
            f.write(x + "\n")
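For reference, each emitted line is a single JSON object. A quick standalone way to check the format, building one record the same way the script does (no input files needed; the sample sentences are made up):

```python
import json
import re

# Build one record the same way convert.py does and show the resulting line.
out = {"translation": {"en": "Hello", "ro": "Salut"}}
x = json.dumps(out, indent=0, ensure_ascii=False)
x = re.sub(r"\n", " ", x, 0, re.M)
print(x)
# { "translation": { "en": "Hello", "ro": "Salut" } }
```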

Or if you find an existing dataset in datasets, you can supply it instead of the --data_dir arg as follows:

--dataset_name wmt16 --dataset_config_name ro-en 

Here is the full conversion table for the 4 previously recommended datasets in the examples/seq2seq folder:

  • --data_dir wmt_en_de => --dataset_name wmt14 --dataset_config_name "de-en", or if you want the highest score use: --dataset_name wmt14-en-de-pre-processed
  • --data_dir wmt_en_ro => --dataset_name wmt16 --dataset_config_name "ro-en"
  • --data_dir cnn_dm => --dataset_name cnn_dailymail --dataset_config_name "3.0.0"
  • --data_dir xsum => --dataset_name xsum

You will find more details here


t5-specific changes: from #10133 (comment)

1. Use the same dataset
2. if using T5, manually pass the `prefix` argument
3. manually copy the `task_specific_params` to `config`
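A sketch of step 3. T5 configs carry a `task_specific_params` dict keyed by task name (with entries such as `prefix`, `max_length`, `num_beams`); here plain dicts stand in for the transformers config object, and the helper function is hypothetical:

```python
# Sketch of copying task-specific parameters into the top-level config.
# A real transformers T5 config exposes `task_specific_params`; plain dicts
# stand in for the config object here, and apply_task_params is hypothetical.
def apply_task_params(config: dict, task: str) -> dict:
    params = (config.get("task_specific_params") or {}).get(task, {})
    config.update(params)  # e.g. prefix, max_length, num_beams
    return config

config = {
    "num_beams": 1,
    "task_specific_params": {
        "translation_en_to_ro": {
            "prefix": "translate English to Romanian: ",
            "max_length": 300,
            "num_beams": 4,
        }
    },
}
apply_task_params(config, "translation_en_to_ro")
print(config["num_beams"])  # 4
```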
@stas00 stas00 self-assigned this Feb 5, 2021
@stas00 stas00 changed the title [conversion] convert existing scripts to run_seq2seq.py from finetune_trainer.py [s2s examples] convert existing scripts to run_seq2seq.py from finetune_trainer.py Feb 5, 2021
@stas00 stas00 added the Examples Which is related to examples in general label Feb 5, 2021

stas00 commented Mar 16, 2021

run_seq2seq.py didn't survive for long; it's no longer in master, so yet another automatic conversion for translation scripts is:

perl -pi -e 's|run_seq2seq.py|run_translation.py|g; s|--task translation_(\w\w)_to_(\w\w)|--source_lang $1 --target_lang $2|;' process.txt
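As with the earlier renames, the same transform can be checked in Python before running the perl one-liner on your files (the example command is hypothetical):

```python
import re

# Python equivalent of the perl one-liner above: rename the script and turn
# --task translation_xx_to_yy into explicit --source_lang/--target_lang args.
def convert(cmd: str) -> str:
    cmd = cmd.replace("run_seq2seq.py", "run_translation.py")
    return re.sub(r"--task translation_(\w\w)_to_(\w\w)",
                  r"--source_lang \1 --target_lang \2", cmd)

print(convert("python run_seq2seq.py --task translation_en_to_ro"))
# python run_translation.py --source_lang en --target_lang ro
```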

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@edwardzjl

Hi there, I found there might be a typo in your script.

The n_val param was renamed to max_eval_samples in examples/pytorch/translation/run_translation.py, not max_val_samples.

I'm not sure whether I'm correct, as I'm totally new to fine-tuning.


stas00 commented Nov 6, 2023

Examples are just that, and there is no guarantee that they are still the same 2.5 years after this thread was started. Chances are very high that many of the things discussed back then won't work now, so you have to either (1) read the current version and adapt to it, or (2) use the transformers version from the date this thread was started, in which case it will work as discussed in this thread.
