Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 17128 #17277

Closed
wants to merge 6 commits into from
Closed

Issue 17128 #17277

wants to merge 6 commits into from

Conversation

mygithubid1
Copy link
Contributor

@mygithubid1 mygithubid1 commented May 16, 2022

What does this PR do?

Fixes #17128

Before submitting

  • [N/A] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Here's the link
  • [N/A] Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? I didn't write a custom test. Ran the following commands run to ensure local tests pass
  1. RUN_PIPELINE_TESTS=yes python -m unittest discover -s tests/pipelines -p "test_pipelines_question_answering.py" -t . -v -f
  2. python -m unittest discover -s . -p "test_tokenization_wav2vec2.py" -t . -v -f

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LysandreJik

VisibleDeprecationWarning is addressed by specifying dtype=object when creating numpy array.
VisibleDeprecationWarning is addressed by specifying dtype=object when creating numpy array.
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 16, 2022

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

ilikedata2 and others added 2 commits May 16, 2022 17:08
VisibleDeprecationWarning is addressed by specifying dtype=object when creating numpy array.
@mygithubid1 mygithubid1 marked this pull request as ready for review May 16, 2022 12:11
@LysandreJik LysandreJik requested a review from Narsil May 18, 2022 14:40
Copy link
Contributor

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, I hate annoying warnings too !

I would like to propose a different change which tries to avoid touching the core tokenization part which should be entirely separate from the pipelines themselves (so fixing a warning in pipelines should not require a tokenizer modification 99% of the time).

The proposed fix is this:

Since we're not using real tensors (dtype==object) then we can actually use real raw lists directly ( removing return_tensors="np").

This however makes 1 tensor be passed as a list (token_type_ids) which in terms break the auto batcher of the pipeline (Things that should batch should be tensors, not lists). The fix consists simply on adding token_type_ids to the special name to transform into tensors before sending into the next step of the pipeline even though it's not used by the model, we could discard it but it doesn't seem necessary actually)

Here is the diff:

diff --git a/src/transformers/pipelines/question_answering.py b/src/transformers/pipelines/question_answering.py
index bbffa3471..6f4c0e985 100644
--- a/src/transformers/pipelines/question_answering.py
+++ b/src/transformers/pipelines/question_answering.py
@@ -279,7 +279,6 @@ class QuestionAnsweringPipeline(ChunkPipeline):
                 truncation="only_second" if question_first else "only_first",
                 max_length=max_seq_len,
                 stride=doc_stride,
-                return_tensors="np",
                 return_token_type_ids=True,
                 return_overflowing_tokens=True,
                 return_offsets_mapping=True,
@@ -294,12 +293,10 @@ class QuestionAnsweringPipeline(ChunkPipeline):
 
             # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)
             # We put 0 on the tokens from the context and 1 everywhere else (question and special tokens)
-            p_mask = np.asarray(
-                [
-                    [tok != 1 if question_first else 0 for tok in encoded_inputs.sequence_ids(span_id)]
-                    for span_id in range(num_spans)
-                ]
-            )
+            p_mask = [
+                [tok != 1 if question_first else 0 for tok in encoded_inputs.sequence_ids(span_id)]
+                for span_id in range(num_spans)
+            ]
 
             features = []
             for span_idx in range(num_spans):
@@ -344,7 +341,7 @@ class QuestionAnsweringPipeline(ChunkPipeline):
         for i, feature in enumerate(features):
             fw_args = {}
             others = {}
-            model_input_names = self.tokenizer.model_input_names + ["p_mask"]
+            model_input_names = self.tokenizer.model_input_names + ["p_mask", "token_type_ids"]
 
             for k, v in feature.__dict__.items():
                 if k in model_input_names:

What do you think about this other change ?

Also and totally unrelated to your PR I saw that QA pipeline did not test batch_size argument which can sometimes be a problem so I think we should add a few tests here (not in this PR #17330)

for span_id in range(num_spans)
]
)
p_mask = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ! it doesn't make any sense to make this a numpy array.

Comment on lines 348 to 351
if isinstance(v, np.ndarray) and v.dtype == object:
tensor = tf.constant(v, dtype=tf.int32)
else:
tensor = tf.constant(v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big fan of this since dtype == object does not necessarily mean dtype=tf.int32 logically (I know here it does make sense)

if not is_tensor(value):
tensor = as_tensor(value)
if as_tensor == np.asarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really refrain from touching src/transformers/tokenization_utils_base.py if possible. It's quite a core file, and usage are many. Especially the is_rectangular check, while correct I think is tricky to keep.

For the current fix I think we can find a smaller change that will still fix the warning for the pipeline.

VisibleDeprecationWarning is addressed by specifying dtype=object when creating numpy array.
Update code based on review feedback.
Copy link
Contributor Author

@mygithubid1 mygithubid1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I'm ok with not touching tokenization_utils_base as long as QA pipeline is fixed for now. Hopefully, other modules that depend on this will be gradually fixed.

Please review the update.

Copy link
Contributor

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

Thanks for updating.

@mygithubid1
Copy link
Contributor Author

Thank you, Narsil. I don't have write access. So, please merge to main.

@Narsil
Copy link
Contributor

Narsil commented May 19, 2022

I will ping a wait for a core maintainer second eye.

@LysandreJik can you get a look ?

@LysandreJik
Copy link
Member

Look good @mygithubid1! I see there are some blank line changes in the tokenization_utils_base.py. Could you revert those?

@mygithubid1
Copy link
Contributor Author

Please ignore this. My mistake.

@mygithubid1 mygithubid1 reopened this May 19, 2022
@mygithubid1 mygithubid1 mentioned this pull request May 19, 2022
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fixing VisibleDeprecationWarning
5 participants