Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent TextGenerationPipeline._sanitize_parameters from overriding previously provided parameters #30362

Merged
merged 3 commits into from
May 3, 2024

Conversation

yting27
Copy link
Contributor

@yting27 yting27 commented Apr 20, 2024

What does this PR do?

Below is the code example I used. The max_length and truncation (as well as padding and add_special_tokens) parameters passed to the pipeline method are overridden when generating output.

from transformers import pipeline


template_inst = "Instruct: {prompt}\nOutput:"
max_length = 25

text_generator = pipeline("text-generation", model="openai-gpt", max_length=max_length, max_new_tokens=30, truncation=True)
results = text_generator(template_inst.format(prompt="The personal life of William Shakespeare is somewhat of a mystery. There are two primary sources that provide historians with an outline of his life. One is his work, and the other is official documentation such as church and court records. However, these provide only brief sketches of specific events in his life and yield little insight into the man himself. When Was Shakespeare Born?"))

print(results)

Print out preprocess_params variables in Pipeline.__call__() method:
image

Kwargs: {}
self._preprocess_params: {'add_special_tokens': False, 'truncation': True, 'padding': False, 'max_length': 25, 'max_new_tokens': 30}
preprocess_params: {'add_special_tokens': False, 'truncation': None, 'padding': False, 'max_length': None}
preprocess_params (after): {'add_special_tokens': False, 'truncation': None, 'padding': False, 'max_length': None, 'max_new_tokens': 30}

As shown in the output, after self._sanitize_parameters() is called, both truncation and max_length are set to None.

I know we can pass the preprocess parameters when calling text_generator method (see code above). However, it just doesn't make sense to me that the previous parameters provided should be overridden. Hence I made the change. Feel free to improve if needed 😃

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts
Copy link
Collaborator

cc @gante @zucchini-nlp

@zucchini-nlp
Copy link
Member

I think it makes sense and the changes are consistent with the text2text pipeline, where "truncation" is handled similarly

Copy link
Member

@gante gante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you for making transformers better @yting27 💛

Context for @amyeroberts [not sure what's your pipeline expertise, so I'm going with None]: when we init a pipeline, we store kwargs such as tokenizer options to act as default values. Before this PR, preprocess_params (which would hold some of the __call__-time kwargs) was being initialized with a set of default values, and those default values would override the kwargs set at __init__ time. By initializing preprocess_params as an empty dict we avoid that, fixing the issue.

🔍 I've checked the other pipelines: this is the only pipeline that is NOT following the correct pattern on main.

@gante gante requested a review from amyeroberts May 3, 2024 13:53
Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fair to me as well! Thanks for your contribution @yting27

@LysandreJik LysandreJik merged commit deb7605 into huggingface:main May 3, 2024
17 checks passed
itazap pushed a commit that referenced this pull request May 14, 2024
… previously provided parameters (#30362)

* Fixed TextGenerationPipeline._sanitize_parameters default params

* removed empty spaces

---------

Co-authored-by: Ng, Yen Ting <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants