Issue 17128 #17277
Conversation
Thanks for working on this, I hate annoying warnings too!

I would like to propose a different change which avoids touching the core tokenization part, which should be entirely separate from the pipelines themselves (so fixing a warning in pipelines should not require a tokenizer modification 99% of the time).

The proposed fix is this:

Since we're not using real tensors (dtype == object), we can actually use raw lists directly (removing return_tensors="np").

This however makes one tensor be passed as a list (token_type_ids), which in turn breaks the auto batcher of the pipeline (things that should batch should be tensors, not lists). The fix consists simply of adding token_type_ids to the special names to transform into tensors before sending them to the next step of the pipeline (even though it's not used by the model, we could discard it, but that doesn't seem necessary).

Here is the diff:
```diff
diff --git a/src/transformers/pipelines/question_answering.py b/src/transformers/pipelines/question_answering.py
index bbffa3471..6f4c0e985 100644
--- a/src/transformers/pipelines/question_answering.py
+++ b/src/transformers/pipelines/question_answering.py
@@ -279,7 +279,6 @@ class QuestionAnsweringPipeline(ChunkPipeline):
                 truncation="only_second" if question_first else "only_first",
                 max_length=max_seq_len,
                 stride=doc_stride,
-                return_tensors="np",
                 return_token_type_ids=True,
                 return_overflowing_tokens=True,
                 return_offsets_mapping=True,
@@ -294,12 +293,10 @@ class QuestionAnsweringPipeline(ChunkPipeline):
 
             # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)
             # We put 0 on the tokens from the context and 1 everywhere else (question and special tokens)
-            p_mask = np.asarray(
-                [
-                    [tok != 1 if question_first else 0 for tok in encoded_inputs.sequence_ids(span_id)]
-                    for span_id in range(num_spans)
-                ]
-            )
+            p_mask = [
+                [tok != 1 if question_first else 0 for tok in encoded_inputs.sequence_ids(span_id)]
+                for span_id in range(num_spans)
+            ]
 
             features = []
             for span_idx in range(num_spans):
@@ -344,7 +341,7 @@ class QuestionAnsweringPipeline(ChunkPipeline):
 
         for i, feature in enumerate(features):
             fw_args = {}
             others = {}
-            model_input_names = self.tokenizer.model_input_names + ["p_mask"]
+            model_input_names = self.tokenizer.model_input_names + ["p_mask", "token_type_ids"]
             for k, v in feature.__dict__.items():
                 if k in model_input_names:
```
What do you think about this other change?

Also, totally unrelated to your PR: I saw that the QA pipeline did not test the batch_size argument, which can sometimes be a problem, so I think we should add a few tests for it (not in this PR; see #17330).
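A minimal sketch of the kind of batch_size test this suggests (hypothetical code, not part of this PR; the model name is just an example):

```python
from transformers import pipeline

# Build a QA pipeline and feed it several examples with an explicit
# batch_size, so the pipeline's internal batching path is actually exercised.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
examples = [
    {"question": "Where do I live?", "context": "My name is Wolfgang and I live in Berlin."}
] * 4

# batch_size controls how many features are forwarded through the model at once.
answers = qa(examples, batch_size=2)
assert len(answers) == 4
assert all("answer" in a and "score" in a for a in answers)
```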
Comment on these lines of the diff:

```python
                    for span_id in range(num_spans)
                ]
            )
            p_mask = [
```
Yes! It doesn't make any sense to make this a numpy array.
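For context, the warning being fixed comes from NumPy's handling of ragged nested sequences; a minimal reproduction (assuming spans of different lengths):

```python
import numpy as np

# When overflowing spans differ in length, the nested p_mask list is ragged.
# NumPy can only represent a ragged list as a dtype=object array:
ragged = [[0, 1, 1], [0, 1]]
arr = np.asarray(ragged)
# NumPy >= 1.20 emits "VisibleDeprecationWarning: Creating an ndarray from
# ragged nested sequences ..."; NumPy >= 1.24 raises a ValueError instead,
# unless dtype=object is passed explicitly. Keeping p_mask a plain list
# sidesteps the issue entirely.
```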
Comment on these lines:

```python
if isinstance(v, np.ndarray) and v.dtype == object:
    tensor = tf.constant(v, dtype=tf.int32)
else:
    tensor = tf.constant(v)
```
Not a big fan of this, since dtype == object does not necessarily mean dtype=tf.int32 logically (I know here it does make sense).
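To illustrate that point, a hypothetical counterexample (not code from this PR):

```python
import numpy as np

# dtype == object only says "this array holds Python objects"; it does not
# imply the contents are integers.
ints = np.asarray([[1, 2], [3, 4]], dtype=object)
strings = np.asarray([["a", "b"], ["c", "d"]], dtype=object)

# Both arrays have dtype == object, but force-casting the second one to
# tf.int32 would be wrong: the object dtype alone cannot tell you the
# element type.
assert ints.dtype == object and strings.dtype == object
```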
Comment on these lines:

```python
if not is_tensor(value):
    tensor = as_tensor(value)
    if as_tensor == np.asarray:
```
I would really refrain from touching src/transformers/tokenization_utils_base.py if possible. It's quite a core file, and its usages are many. The is_rectangular check especially, while correct, is I think tricky to keep.

For the current fix I think we can find a smaller change that will still fix the warning for the pipeline.
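For readers without the original diff at hand, a rough sketch of what such an is_rectangular check could look like (an assumption on my part; the PR's actual implementation may differ):

```python
def is_rectangular(nested):
    """Return True if every row of a nested list/tuple has the same length,
    i.e. the structure could become a regular NumPy array rather than a
    dtype=object one. Sketch only; not the PR's actual code."""
    if not isinstance(nested, (list, tuple)) or not nested:
        return True
    first = nested[0]
    if not isinstance(first, (list, tuple)):
        return True
    return all(isinstance(row, (list, tuple)) and len(row) == len(first) for row in nested)
```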
Thanks for the feedback. I'm OK with not touching tokenization_utils_base as long as the QA pipeline is fixed for now. Hopefully, other modules that depend on this will be gradually fixed.

Please review the update.
LGTM! Thanks for updating.
Thank you, Narsil. I don't have write access, so please merge.
I will ping and wait for a core maintainer's second eye. @LysandreJik can you take a look?
Looks good @mygithubid1! I see there are some blank line changes in the ...
Please ignore this. My mistake.
What does this PR do?
Fixes #17128
Before submitting

Did you read the contributor guidelines, Pull Request section?

Tests run:

RUN_PIPELINE_TESTS=yes python -m unittest discover -s tests/pipelines -p "test_pipelines_question_answering.py" -t . -v -f
python -m unittest discover -s . -p "test_tokenization_wav2vec2.py" -t . -v -f
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@LysandreJik