
[examples/run_s2s] remove task_specific_params and update rouge computation #10133

Merged: 5 commits merged into huggingface:master on Feb 12, 2021

Conversation

@patil-suraj (Contributor) commented Feb 11, 2021

What does this PR do?

  • correctly handle task_specific_params and prefix
    The current script tries to access the prefix from config.task_specific_params.prefix, which will always be None because task_specific_params is a nested dict whose keys are task names. This PR retrieves the task_specific_params entry from the config using the task name (data_args.task), updates the config with the retrieved params (this is needed for T5), and then accesses the prefix via config.prefix.

    @stas00, as you reported offline, the BLEU score from the new script differed from the old script for T5 on the en-ro task. This was because the old script used the task_specific_params and the new script didn't. This update should resolve that issue.

  • Update rouge score computation.
    The rougeLsum metric expects newlines between each sentence, this is usually the score reported in papers. This PR

    1. adds newlines between the sentences in preds and labels using nltk, so rougeLsum is computed correctly
    2. passes use_stemmer=True to metric.compute so the metrics match the old script.
  • Add test_file argument to DataTrainingArguments to load custom test dataset.
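The rougeLsum newline fix described in the second bullet can be sketched roughly as follows. This is a minimal, dependency-free illustration: a simple regex splitter stands in for the nltk sentence tokenizer the script actually uses, and the helper name is hypothetical.

```python
import re

def add_sentence_newlines(texts):
    # rougeLsum scores summaries sentence-by-sentence and expects a
    # newline between sentences, so split each text into sentences and
    # rejoin with "\n" before calling metric.compute.
    # (The actual script uses nltk.sent_tokenize; this regex is a
    # stand-in for illustration only.)
    return ["\n".join(re.split(r"(?<=[.!?])\s+", t.strip())) for t in texts]

preds = add_sentence_newlines(["The cat sat. It purred."])
# preds is now ["The cat sat.\nIt purred."], ready for
# metric.compute(predictions=preds, references=labels, use_stemmer=True)
```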

@sgugger (Collaborator) left a comment

Thanks a lot for fixing the script!

examples/seq2seq/run_seq2seq.py
Comment on lines 359 to 365
# update config with task specific params
task_specific_params = model.config.task_specific_params
if task_specific_params is not None:
    params = task_specific_params.get(data_args.task, {})
    logger.info(f"Updating model.config with task specific params for {data_args.task}:\n {params}")
    logger.info("Note: command line args may override some of these.")
    model.config.update(params)
Collaborator:

This was the thing @patrickvonplaten told me to remove, so just pinging him here so you two can fight :-)

@patil-suraj (Author):

This is mostly for T5 and for reproducing the metrics. I don't have any strong opinion here. If we decide to remove this, then we should remove all mentions of task_specific_params from the script and use the prefix only if the user has specified it.

Contributor:

I don't care if you want to change this, as long as we can accomplish the same in a new way.

I have a bit of a hard time understanding the intention behind removing functionality. Is it bad functionality? Is it not useful?

As I mentioned several times in this let's-rewrite-things context: as long as I have a reliable, sensitive tool that can help me detect quality regressions over small dataset samples, I'm not attached to any specific approach.

@stas00 (Contributor) left a comment

Thank you, @patil-suraj for this functionality sync with the old script. That's wonderful!

I see there is a potential conflict with restoring functionality that was purposefully removed, so let's see what @patrickvonplaten says.

@patrickvonplaten (Contributor) commented Feb 12, 2021

Context:

Here is some context on the task_specific_params config param. In the beginning, T5 was the only model used as the default for both the translation and the summarization pipeline. @thomwolf and I wanted a nice general design that, depending on the specific task (e.g. summarization, translation), automatically sets the correct parameter set, so we added a task_specific_params parameter to the config that holds the correct parameters per task. This is why the config of T5 is so long and looks like this:

{
...
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  ...
}

=> So this design was chosen only for the pipelines, and essentially only for T5 version 1, since it is the only model we have that needs task-specific params (especially due to the different required prefixes per task). Up until now there have been too many problems with this mechanism, so IMO its benefit is outweighed by its disadvantages, which are:

1) It blows up the config a lot and is not scalable (what do you do with many-to-many translation models? you would need an entry for every translation_..._to_... combination).

2) No one understood anymore what was happening under the hood. IMO, having such a mechanism is a bit too "magical" because it creates a whole other logical layer to the already complicated mechanism that we have for the config params. In short, we currently have the following logic in pipelines:

i) The function argument is used (such as max_length); if not given, then
ii) the config's task_specific_params entry (such as config.task_specific_params["summarization"]["max_length"]) is used; if not set, then
iii) the normal config param is used, such as config.max_length; if not set, then
iv) the default PretrainedConfig param is used.
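The four-step precedence above can be made concrete with a small sketch. Plain dicts stand in for the real config object, and the function name and signature are hypothetical; the fallback chain itself is the one described in steps i) through iv).

```python
def resolve_param(name, task, call_kwargs, config, library_default=None):
    # i) an explicit function argument wins
    if name in call_kwargs:
        return call_kwargs[name]
    # ii) the config's task_specific_params entry for the active task
    task_params = (config.get("task_specific_params") or {}).get(task, {})
    if name in task_params:
        return task_params[name]
    # iii) the top-level config param, else iv) the library default
    return config.get(name, library_default)

config = {
    "max_length": 20,  # top-level value, shadowed by the task entry below
    "task_specific_params": {"summarization": {"max_length": 200}},
}
resolve_param("max_length", "summarization", {}, config, library_default=20)
# -> 200 (step ii wins over the top-level 20)
```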

=> It is obvious that this is a very complicated and somewhat "magical" logic, and a lot of people internally didn't even really understand it. This is why I really would like to remove the second step. It's confusing to see multiple max_length parameters in the config, and IMO it's just not worth it.

3) So far T5 is the only model that really requires this "magical" mechanism and that's mostly because it has a very special constraint in the sense that it was primed during training on cues such as translation from X to Y: ... which is definitely not something general that we would expect future models to have as well. We might very well have models in the future that have task-specific params like max_length and beam_search (It can very well be that a GPT3-like model that can do everything wants to adapt those params depending on the task), but those params are usually things that people are aware of and adjust themselves during evaluation IMO. E.g. if one is evaluating a model on summarization, setting the correct max_length, num_beams and maybe repetition_penalty is IMO something people should do themselves and not expect to be set correctly automatically.

4) It makes the pipelines in general very inflexible. E.g. when importing the pipeline classes directly, say the TranslationPipeline (which is what we did for a long time for the inference API - and maybe still do - not so sure anymore @julien-c @Narsil), there is no way of knowing that we should pass a task=... arg to the init to correctly load the task_specific_params. To be more precise, imagine you want to directly import the TranslationPipeline here:

class TranslationPipeline(Text2TextGenerationPipeline):
where you don't see any task param. But in order to correctly load T5's translation params for TranslationPipeline, you actually have to manually pass task="translation_en_to_de" to the init. (Also note that it's not as easy as adding a class attribute self.task = "translation_en_to_de", because the same pipeline is also used for EN->RO translation, in which case the class attribute would be wrong.) This created a lot of problems, eventually leading to @julien-c (I think) hard-coding the correct task name for T5 into the inference API code, which then kind of defeated the purpose of having this mechanism.

Conclusion

That being said, I see two solutions in general:

  1. Eventually completely remove this mechanism (which I prefer)
  2. Keep this mechanism for the pipelines only. Since things like the pipelines or AutoNLP are not meant to be built for researchers I'm ok with having some "under-the-hood" magic / very abstracted logic there, but I definitely don't want to have it anywhere else.

=> This means that I really don't think we should use this param in run_seq2seq.py. It creates more confusion than it helps and is not in line with our motivation to have examples that are "easy to tweak and to understand" by the user. As @sgugger has already said multiple times, the example scripts should not follow a "one-command-fits-all-cases" approach, but rather should be easy to understand and to tweak for the specific task. This is why I'm quite strongly against using task_specific_params here. However, @patil-suraj @stas00, I think you are completely correct that we should try not to have a regression in performance here. So I would then actually prefer to hard-code T5's prefixes in the script. Something like:

T5_PREFIX = {
    "summary": ...,
    "translation_en_to_de": ...,
}

Sorry for the long text, but I think this is actually an important mechanism not too many people are aware of and we should think about a more general solution for how to continue with task_specific_params. Actually also pinging @LysandreJik on this one to hear his opinion.

Happy to hear your opinions on what I wrote above :-)
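The hard-coded-prefix alternative sketched above could look roughly like this. The prefix values are taken from the T5 config shown earlier in this thread; the dict name, function name, and model_type check are hypothetical illustration, not the actual implementation.

```python
# Hypothetical sketch: hard-coding T5 prefixes in the example script
# instead of reading them from config.task_specific_params.
T5_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_ro": "translate English to Romanian: ",
}

def preprocess_inputs(texts, task, model_type):
    # Only T5 was primed on task prefixes; other models get raw inputs.
    prefix = T5_PREFIXES.get(task, "") if model_type == "t5" else ""
    return [prefix + t for t in texts]

preprocess_inputs(["Hello"], "translation_en_to_ro", "t5")
# -> ["translate English to Romanian: Hello"]
```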

@patil-suraj (Author):
Thanks a lot for the context @patrickvonplaten

Regarding the script: to follow the examples philosophy, let's just remove it completely. If a model requires a prefix, it should be passed explicitly, and related params should be copied to the config manually if one wants to reproduce certain metrics.

@stas00 (Contributor) commented Feb 12, 2021

Thank you for the detailed explanation, @patrickvonplaten - that was very awesome of you to write it all out in such clarity.

I'm totally fine with your proposal, yet I think it'd be important to document how one reproduces the same behavior with the new script and the new T5 config.

I already started an issue documenting the nuances of porting from ./finetune_trainer.py (#10036), so perhaps this can belong there. Once the notes have been compiled we can put them into seq2seq/README.md to help users transition before ./finetune_trainer.py moves into unmaintained territory.

Should you decide to remove this mechanism completely, the T5 models on the hub should probably be updated at some future point to reflect that, so that there is no baggage to carry forward. Perhaps a few release cycles after the cut is made? Surely, users on older transformers versions should still be able to run their scripts normally for quite some time; I'd imagine that's where model-file versioning could come in.

@patil-suraj (Author):
@stas00

To reproduce the same behavior with the new script

  1. Use the same dataset.
  2. If using T5, manually pass the prefix argument.
  3. Manually copy the task_specific_params to the config.

Again, this is just for T5; the rest of the models should give similar results. So I'm going to merge this PR, and let's update the README in the clean-up PR #10136.
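The reproduction steps above can be sketched with plain dicts standing in for the transformers config object. The values come from the T5 config shown earlier in the thread; everything else is illustrative.

```python
task = "translation_en_to_ro"
config = {
    "num_beams": 1,
    "task_specific_params": {
        task: {
            "early_stopping": True,
            "max_length": 300,
            "num_beams": 4,
            "prefix": "translate English to Romanian: ",
        },
    },
}

# Step 3: copy the task params onto the config manually...
params = dict((config.get("task_specific_params") or {}).get(task, {}))
# Step 2: ...except the prefix, which is now passed explicitly as an argument
prefix = params.pop("prefix", "")
config.update(params)
# config["num_beams"] is now 4, and prefix holds the string to pass
# to the script's prefix argument.
```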

@patil-suraj patil-suraj changed the title [examples/run_s2s] fix task_specific_params and update rouge computation [examples/run_s2s] remove task_specific_params and update rouge computation Feb 12, 2021
@patil-suraj patil-suraj merged commit f51188c into huggingface:master Feb 12, 2021
@patil-suraj patil-suraj deleted the fix-run-s2s branch February 12, 2021 11:48