Issue 17128 #17277
Conversation
Thanks for working on this, I hate annoying warnings too!

I would like to propose a different change which avoids touching the core tokenization part, which should be entirely separate from the pipelines themselves (so fixing a warning in pipelines should not require a tokenizer modification 99% of the time).

The proposed fix is this:

Since we're not using real tensors (dtype == object), we can actually use raw lists directly (removing return_tensors="np").

This however makes one tensor be passed as a list (token_type_ids), which in turn breaks the auto batcher of the pipeline (things that should batch should be tensors, not lists). The fix consists simply of adding token_type_ids to the special names to transform into tensors before sending them to the next step of the pipeline (even though it's not used by the model, we could discard it, but that doesn't seem necessary).

Here is the diff:
```diff
diff --git a/src/transformers/pipelines/question_answering.py b/src/transformers/pipelines/question_answering.py
index bbffa3471..6f4c0e985 100644
--- a/src/transformers/pipelines/question_answering.py
+++ b/src/transformers/pipelines/question_answering.py
@@ -279,7 +279,6 @@ class QuestionAnsweringPipeline(ChunkPipeline):
                 truncation="only_second" if question_first else "only_first",
                 max_length=max_seq_len,
                 stride=doc_stride,
-                return_tensors="np",
                 return_token_type_ids=True,
                 return_overflowing_tokens=True,
                 return_offsets_mapping=True,
@@ -294,12 +293,10 @@ class QuestionAnsweringPipeline(ChunkPipeline):
 
             # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)
             # We put 0 on the tokens from the context and 1 everywhere else (question and special tokens)
-            p_mask = np.asarray(
-                [
-                    [tok != 1 if question_first else 0 for tok in encoded_inputs.sequence_ids(span_id)]
-                    for span_id in range(num_spans)
-                ]
-            )
+            p_mask = [
+                [tok != 1 if question_first else 0 for tok in encoded_inputs.sequence_ids(span_id)]
+                for span_id in range(num_spans)
+            ]
 
             features = []
             for span_idx in range(num_spans):
@@ -344,7 +341,7 @@ class QuestionAnsweringPipeline(ChunkPipeline):
 
         for i, feature in enumerate(features):
             fw_args = {}
             others = {}
-            model_input_names = self.tokenizer.model_input_names + ["p_mask"]
+            model_input_names = self.tokenizer.model_input_names + ["p_mask", "token_type_ids"]
             for k, v in feature.__dict__.items():
                 if k in model_input_names:
```
What do you think about this other change?

Also, totally unrelated to your PR: I saw that the QA pipeline did not test the batch_size argument, which can sometimes be a problem, so I think we should add a few tests for it (not in this PR; see #17330).
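A minimal sketch of the kind of batch_size test this suggests (hypothetical code, not part of this PR; the model name is just an example):

```python
from transformers import pipeline

# Build a QA pipeline and feed it several examples with an explicit
# batch_size, so the pipeline's internal batching path is actually exercised.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
examples = [
    {"question": "Where do I live?", "context": "My name is Wolfgang and I live in Berlin."}
] * 4

# batch_size controls how many features are forwarded through the model at once.
answers = qa(examples, batch_size=2)
assert len(answers) == 4
assert all("answer" in a and "score" in a for a in answers)
```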
Comment on these lines of the diff:

```python
                    for span_id in range(num_spans)
                ]
            )
            p_mask = [
```
Yes! It doesn't make any sense to make this a numpy array.
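For context, the warning being fixed comes from NumPy's handling of ragged nested sequences; a minimal reproduction (assuming spans of different lengths):

```python
import numpy as np

# When overflowing spans differ in length, the nested p_mask list is ragged.
# NumPy can only represent a ragged list as a dtype=object array:
ragged = [[0, 1, 1], [0, 1]]
arr = np.asarray(ragged)
# NumPy >= 1.20 emits "VisibleDeprecationWarning: Creating an ndarray from
# ragged nested sequences ..."; NumPy >= 1.24 raises a ValueError instead,
# unless dtype=object is passed explicitly. Keeping p_mask a plain list
# sidesteps the issue entirely.
```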
Comment on these lines:

```python
if isinstance(v, np.ndarray) and v.dtype == object:
    tensor = tf.constant(v, dtype=tf.int32)
else:
    tensor = tf.constant(v)
```
Not a big fan of this, since dtype == object does not necessarily mean dtype=tf.int32 logically (I know here it does make sense).
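To illustrate that point, a hypothetical counterexample (not code from this PR):

```python
import numpy as np

# dtype == object only says "this array holds Python objects"; it does not
# imply the contents are integers.
ints = np.asarray([[1, 2], [3, 4]], dtype=object)
strings = np.asarray([["a", "b"], ["c", "d"]], dtype=object)

# Both arrays have dtype == object, but force-casting the second one to
# tf.int32 would be wrong: the object dtype alone cannot tell you the
# element type.
assert ints.dtype == object and strings.dtype == object
```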
Comment on these lines:

```python
if not is_tensor(value):
    tensor = as_tensor(value)
    if as_tensor == np.asarray:
```
I would really refrain from touching src/transformers/tokenization_utils_base.py if possible. It's quite a core file, and its usages are many. The is_rectangular check especially, while correct, is I think tricky to keep.

For the current fix I think we can find a smaller change that will still fix the warning for the pipeline.
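For readers without the original diff at hand, a rough sketch of what such an is_rectangular check could look like (an assumption on my part; the PR's actual implementation may differ):

```python
def is_rectangular(nested):
    """Return True if every row of a nested list/tuple has the same length,
    i.e. the structure could become a regular NumPy array rather than a
    dtype=object one. Sketch only; not the PR's actual code."""
    if not isinstance(nested, (list, tuple)) or not nested:
        return True
    first = nested[0]
    if not isinstance(first, (list, tuple)):
        return True
    return all(isinstance(row, (list, tuple)) and len(row) == len(first) for row in nested)
```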
Thanks for the feedback. I'm OK with not touching tokenization_utils_base as long as the QA pipeline is fixed for now. Hopefully, other modules that depend on this will be gradually fixed.

Please review the update.
LGTM! Thanks for updating.
Thank you, Narsil. I don't have write access, so please merge.
I will ping and wait for a core maintainer's second eye. @LysandreJik can you take a look?
Looks good @mygithubid1! I see there are some blank line changes in the ...
Please ignore this. My mistake.
What does this PR do?
Fixes #17128
Before submitting

Did you read the contributor guidelines, Pull Request section?

Tests run:

RUN_PIPELINE_TESTS=yes python -m unittest discover -s tests/pipelines -p "test_pipelines_question_answering.py" -t . -v -f
python -m unittest discover -s . -p "test_tokenization_wav2vec2.py" -t . -v -f
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@LysandreJik