fix: make `ExtractiveReader` handle situations where `token_to_chars` returns None instead of a (start, end) tuple #6382

ZanSara · 2023-11-22T16:07:43Z

Related Issues

fixes ExtractiveReader fails with proper input Documents #6098

Proposed Changes:

Some tokenizers seem to return None from their `token_to_chars` function on some inputs, although according to the function's own docstring it should always return a tuple as long as the token is in the context.

Since this is likely a bug in the tokenizer, and to prevent such issues from crashing the reader, I propose to use None as the start/end position when this happens. With this choice, a start token that maps to None matches the start of the document, and an end token that maps to None matches the end of the document. Such occurrences are logged.

Alternatively, we could raise an error, but the source of this behavior is unclear and likely tokenizer-dependent; see huggingface/transformers#1662 and huggingface/transformers#8209.
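For illustration, the fallback described above can be sketched roughly as follows. This is not the actual diff: `FakeEncoding` and `answer_char_span` are hypothetical names invented for this example, and the fake encoding only mimics a `token_to_chars` that sometimes returns None. It relies on the fact that Python slicing treats a None bound as "start of the string" or "end of the string".

```python
import logging
from typing import List, Optional, Tuple

logger = logging.getLogger(__name__)


class FakeEncoding:
    """Hypothetical stand-in for a tokenizer encoding (not a real library class)."""

    def __init__(self, spans: List[Optional[Tuple[int, int]]]):
        self._spans = spans

    def token_to_chars(self, token_index: int) -> Optional[Tuple[int, int]]:
        # Returns a (start, end) character span, or None to mimic the buggy case.
        return self._spans[token_index]


def answer_char_span(
    encoding, start_token: int, end_token: int
) -> Tuple[Optional[int], Optional[int]]:
    """Map a token span to character positions, tolerating None results."""
    start_chars = encoding.token_to_chars(start_token)
    end_chars = encoding.token_to_chars(end_token)

    if start_chars is None or end_chars is None:
        # Assumption: a warning is enough; the PR only says occurrences are logged.
        logger.warning(
            "token_to_chars returned None for token span (%s, %s); "
            "falling back to the document boundary.",
            start_token,
            end_token,
        )

    # None as a slice bound means "from the start" or "to the end".
    start = start_chars[0] if start_chars is not None else None
    end = end_chars[1] if end_chars is not None else None
    return start, end


text = "The capital of France is Paris."
encoding = FakeEncoding([(0, 3), (4, 11), None])

# The end token maps to None, so the slice runs to the end of the text.
start, end = answer_char_span(encoding, 0, 2)
print(text[start:end])
```

The trade-off is that the answer span may be wider than the model intended, in exchange for never raising a `TypeError` when indexing into a None result.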

How did you test it?

Added a new unit test.

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

julian-risch

LGTM to prevent the reader from crashing. I am a bit concerned that we don't have a better understanding of why the issues with the tokenizers occur. It's outside of the Haystack code base, so it's not an issue with the reader itself, which is good.

ZanSara added 2 commits November 22, 2023 16:36

fix reader bug

77d90ca

add test

ea00c40

github-actions bot added the topic:tests and 2.x (Related to Haystack v2.0) labels Nov 22, 2023

ZanSara changed the title ~~Reader bug~~ fix: make ExtractiveReader handle situations where token_to_chars returns None instead of a (start, end) tuple Nov 22, 2023

ZanSara added the ignore-for-release-notes (PRs with this flag won't be included in the release notes) label Nov 22, 2023

log

cf4ae45

ZanSara marked this pull request as ready for review November 22, 2023 16:10

ZanSara requested a review from a team as a code owner November 22, 2023 16:10

ZanSara requested review from silvanocerza and removed request for a team November 22, 2023 16:10

ZanSara changed the title ~~fix: make ExtractiveReader handle situations where token_to_chars returns None instead of a (start, end) tuple~~ fix: make ExtractiveReader handle situations where token_to_chars returns None instead of a (start, end) tuple Nov 22, 2023

ZanSara marked this pull request as draft November 22, 2023 16:12

ZanSara removed the request for review from silvanocerza November 22, 2023 16:12

fix logging

4a9b71e

ZanSara marked this pull request as ready for review November 22, 2023 16:16

ZanSara requested a review from julian-risch November 22, 2023 16:17

improve error message

42ee5a8

github-actions bot added the type:documentation (Improvements on the docs) label Nov 23, 2023

julian-risch approved these changes Nov 23, 2023


julian-risch merged commit c45d8c3 into main Nov 24, 2023
22 checks passed

julian-risch deleted the reader-bug branch November 24, 2023 08:08
