fix: make ExtractiveReader handle situations where token_to_chars returns None instead of a (start, end) tuple #6382
Related Issues

ExtractiveReader fails with proper input Documents #6098

Proposed Changes:
Some tokenizers seem to return None from their token_to_chars function on some inputs, even though, according to its own docstring, the function should always return a tuple as long as the token is in the context. Since this is most likely a tokenizer bug, and to prevent it from crashing the reader, I propose using None as the start/end position whenever it occurs. Concretely, if the starting token maps to position None, the answer span starts at the beginning of the document, and if the end token maps to position None, it extends to the end of the document. Such occurrences are logged.
Alternatively, we could raise an error, but the source of this behavior is unclear and likely tokenizer-dependent; see huggingface/transformers#1662 and huggingface/transformers#8209.
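To illustrate the idea (this is a minimal sketch, not the actual diff): the _token_to_char_span helper name is hypothetical, and it assumes a Hugging Face-style encoding object whose token_to_chars returns either a (start, end) span or None.

```python
import logging
from typing import Optional, Tuple

logger = logging.getLogger(__name__)


def _token_to_char_span(encoding, token_index: int) -> Tuple[Optional[int], Optional[int]]:
    """Return the character span of a token, or (None, None) if the
    tokenizer unexpectedly reports no span for it."""
    span = encoding.token_to_chars(token_index)
    if span is None:
        # Likely a tokenizer bug (see the linked transformers issues).
        # Returning None lets the caller fall back to the document
        # boundaries instead of crashing.
        logger.warning("token_to_chars returned None for token %d", token_index)
        return None, None
    start, end = span  # works for both a plain tuple and a CharSpan namedtuple
    return start, end
```

A convenient side effect of propagating None is that Python slicing treats None bounds as the string boundaries, so something like text[start:end] naturally falls back to the start or end of the document.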
How did you test it?
Added a new unit test.
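As a rough illustration of the shape of such a test (not the actual test added in this PR), the sketch below reuses the hypothetical _token_to_char_span helper from above and mocks an encoding whose token_to_chars returns None:

```python
from unittest.mock import MagicMock


def test_reader_survives_token_to_chars_returning_none():
    # Stand-in encoding that reproduces the tokenizer bug.
    encoding = MagicMock()
    encoding.token_to_chars.return_value = None

    start, end = _token_to_char_span(encoding, token_index=3)

    assert start is None and end is None
    # None slice bounds fall back to the document boundaries.
    assert "full document text"[start:end] == "full document text"
```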
Notes for the reviewer
Checklist

The PR title follows one of the conventional commit types: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.