
[s2s examples] convert existing scripts to run_seq2seq.py from finetune_trainer.py #10036

Closed
4 tasks done
stas00 opened this issue Feb 5, 2021 · 4 comments
stas00 commented Feb 5, 2021

As the transformers examples evolve, it seems that the good old finetune_trainer.py is going to be moved into the unmaintained examples/legacy/ area, and run_seq2seq.py is to be the new king, so let's automate the conversion process.

Assuming your command script is saved as process.txt (substitute the file name(s) you actually have, one or many), let's auto-adjust it:

1. Renames
# main name and args rename
perl -pi -e 's|finetune_trainer|run_seq2seq|g; s#--n_(train|val)#--max_$1_samples#g; \
s|--src_lang|--source_lang|g; s|--tgt_lang|--target_lang|g; s|--eval_beams|--num_beams|'  process.txt

# drop no longer supported args
perl -pi -e 's|--freeze_embeds||; s|--test_max_target_length[ =]+\d+||;' process.txt
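If you'd rather preview the renames before editing files in place, the same substitutions can be expressed with Python's re module. This is just a sketch equivalent to the perl one-liners above; the example command fragment is hypothetical:

```python
import re

# The same renames and drops as the perl one-liners above, expressed in Python
# so a command line can be sanity-checked before editing files in place.
RENAMES = [
    (r"finetune_trainer", "run_seq2seq"),
    (r"--n_(train|val)", r"--max_\1_samples"),
    (r"--src_lang", "--source_lang"),
    (r"--tgt_lang", "--target_lang"),
    (r"--eval_beams", "--num_beams"),
]
DROPPED = [r"--freeze_embeds", r"--test_max_target_length[ =]+\d+"]

def convert_cmd(cmd: str) -> str:
    for pat, repl in RENAMES:
        cmd = re.sub(pat, repl, cmd)
    for pat in DROPPED:
        cmd = re.sub(pat, "", cmd)
    return cmd

# hypothetical example command fragment
print(convert_cmd("python finetune_trainer.py --n_val 500 --src_lang en --tgt_lang ro"))
# python run_seq2seq.py --max_val_samples 500 --source_lang en --target_lang ro
```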

2. The T5 prefix is no longer added automatically, so you need to add it manually, e.g.:
--source_prefix "translate English to Romanian: "

otherwise the results would be terrible.
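Conceptually, the prefix is simply prepended to every source sentence before tokenization. A minimal sketch of the idea (the function name and structure here are illustrative, not the actual script internals):

```python
# Minimal sketch of what --source_prefix does during preprocessing:
# the prefix string is prepended to every source example before tokenization.
# The names below are illustrative, not run_seq2seq.py internals.
source_prefix = "translate English to Romanian: "

def add_prefix(examples):
    # examples: list of raw source sentences
    return [source_prefix + text for text in examples]

print(add_prefix(["The cat sat on the mat."]))
# ['translate English to Romanian: The cat sat on the mat.']
```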

3. Datasets are different

a. You need to convert a regular dataset into jsonlines format (unless the data is already on the datasets hub).
Instructions are here: https://huggingface.co/docs/datasets/loading_datasets.html#json-files

b. new arguments:

instead of

            --data_dir {data_dir}

now you need:

            --train_file {data_dir}/train.json
            --validation_file {data_dir}/val.json

Here is an example conversion script for the wmt_en_ro dataset:

# convert.py
# Converts the {split}.source / {split}.target text-file pairs into the
# jsonlines format expected by run_seq2seq.py (one JSON record per line).
import json
import re

src_lang, tgt_lang = ["en", "ro"]

for split in ["train", "val", "test"]:
    recs = []
    fout = f"{split}.json"
    with open(fout, "w", encoding="utf-8") as f:
        for kind in ["source", "target"]:
            fin = f"{split}.{kind}"
            with open(fin, encoding="utf-8") as fh:
                recs.append([line.strip() for line in fh])
        for src, tgt in zip(*recs):
            out = {"translation": {src_lang: src, tgt_lang: tgt}}
            # flatten the indented dump onto a single line
            x = json.dumps(out, indent=0, ensure_ascii=False)
            x = re.sub(r"\n", " ", x, 0, re.M)
            f.write(x + "\n")
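For reference, each emitted line is a single JSON object. A quick standalone way to check the format, building one record the same way the script does (no input files needed; the sample sentences are made up):

```python
import json
import re

# Build one record the same way convert.py does and show the resulting line.
out = {"translation": {"en": "Hello", "ro": "Salut"}}
x = json.dumps(out, indent=0, ensure_ascii=False)
x = re.sub(r"\n", " ", x, 0, re.M)
print(x)
# { "translation": { "en": "Hello", "ro": "Salut" } }
```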

Or if you find an existing dataset in datasets, you can supply it instead of the --data_dir arg as follows:

--dataset_name wmt16 --dataset_config_name ro-en 

Here is the full conversion table for the 4 previously recommended datasets in the examples/seq2seq folder:

  • --data_dir wmt_en_de => --dataset_name wmt14 --dataset_config_name "de-en", or if you want the highest score use: --dataset_name wmt14-en-de-pre-processed
  • --data_dir wmt_en_ro => --dataset_name wmt16 --dataset_config_name "ro-en"
  • --data_dir cnn_dm => --dataset_name cnn_dailymail --dataset_config_name "3.0.0"
  • --data_dir xsum => --dataset_name xsum

You will find more details here


t5-specific changes: from #10133 (comment)

1. Use the same dataset
2. if using T5, manually pass the `prefix` argument
3. manually copy the `task_specific_params` to `config`
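A sketch of step 3. T5 configs carry a `task_specific_params` dict keyed by task name (with entries such as `prefix`, `max_length`, `num_beams`); here plain dicts stand in for the transformers config object, and the helper function is hypothetical:

```python
# Sketch of copying task-specific parameters into the top-level config.
# A real transformers T5 config exposes `task_specific_params`; plain dicts
# stand in for the config object here, and apply_task_params is hypothetical.
def apply_task_params(config: dict, task: str) -> dict:
    params = (config.get("task_specific_params") or {}).get(task, {})
    config.update(params)  # e.g. prefix, max_length, num_beams
    return config

config = {
    "num_beams": 1,
    "task_specific_params": {
        "translation_en_to_ro": {
            "prefix": "translate English to Romanian: ",
            "max_length": 300,
            "num_beams": 4,
        }
    },
}
apply_task_params(config, "translation_en_to_ro")
print(config["num_beams"])  # 4
```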
@stas00 stas00 self-assigned this Feb 5, 2021
@stas00 stas00 changed the title [conversion] convert existing scripts to run_seq2seq.py from finetune_trainer.py [s2s examples] convert existing scripts to run_seq2seq.py from finetune_trainer.py Feb 5, 2021
@stas00 stas00 added the Examples Which is related to examples in general label Feb 5, 2021

stas00 commented Mar 16, 2021

run_seq2seq.py didn't survive for long; it's no longer in master, so yet another automatic conversion for translation scripts is:

perl -pi -e 's|run_seq2seq.py|run_translation.py|g; s|--task translation_(\w\w)_to_(\w\w)|--source_lang $1 --target_lang $2|;' process.txt
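As with the earlier renames, the same transform can be checked in Python before running the perl one-liner on your files (the example command is hypothetical):

```python
import re

# Python equivalent of the perl one-liner above: rename the script and turn
# --task translation_xx_to_yy into explicit --source_lang/--target_lang args.
def convert(cmd: str) -> str:
    cmd = cmd.replace("run_seq2seq.py", "run_translation.py")
    return re.sub(r"--task translation_(\w\w)_to_(\w\w)",
                  r"--source_lang \1 --target_lang \2", cmd)

print(convert("python run_seq2seq.py --task translation_en_to_ro"))
# python run_translation.py --source_lang en --target_lang ro
```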

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@edwardzjl

Hi there, I found there might be a typo in your script.

The n_val param was renamed to max_eval_samples in examples/pytorch/translation/run_translation.py, not max_val_samples.

I'm not sure whether I'm correct, as I'm totally new to fine-tuning.


stas00 commented Nov 6, 2023

Examples are just that, and there is no guarantee that they are still the same 2.5 years after this thread was started. Chances are very high that many of the things discussed back then won't work now, so you have to either (1) read the current version and adapt to it, or (2) use the transformers version from the date this thread was started, in which case it will work as discussed in this thread.
