Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: make ExtractiveReader handle situations where token_to_chars returns None instead of a (start, end) tuple #6382

Merged
merged 5 commits into from
Nov 24, 2023

Conversation

ZanSara
Copy link
Contributor

@ZanSara ZanSara commented Nov 22, 2023

Related Issues

Proposed Changes:

Some tokenizers seems to have their token_to_chars function returning None on some inputs, while the function should always return a tuple as long as the token is in the context, according to its own docstring.

Considering that this is likely a bug of the tokenizer, and to prevent such issues from crashing the reader, I propose to use None as start/end positions when this bug occurs. This choice means that if the starting token points to position None, it will match the start of the document, and if the end token points to position None, it will match the end of the document. Such occurrences are logged.

Alternatively we could raise an error message, but the source of this behavior is unclear and likely tokenizer-dependent, see huggingface/transformers#1662, huggingface/transformers#8209.

How did you test it?

Added a new unit test.

Notes for the reviewer

Checklist

@github-actions github-actions bot added topic:tests 2.x Related to Haystack v2.0 labels Nov 22, 2023
@ZanSara ZanSara changed the title Reader bug fix: make ExtractiveReader handle situations where token_to_chars returns None instead of a (start, end) tuple Nov 22, 2023
@ZanSara ZanSara added the ignore-for-release-notes PRs with this flag won't be included in the release notes. label Nov 22, 2023
@ZanSara ZanSara marked this pull request as ready for review November 22, 2023 16:10
@ZanSara ZanSara requested a review from a team as a code owner November 22, 2023 16:10
@ZanSara ZanSara requested review from silvanocerza and removed request for a team November 22, 2023 16:10
@ZanSara ZanSara changed the title fix: make ExtractiveReader handle situations where token_to_chars returns None instead of a (start, end) tuple fix: make ExtractiveReader handle situations where token_to_chars returns None instead of a (start, end) tuple Nov 22, 2023
@ZanSara ZanSara marked this pull request as draft November 22, 2023 16:12
@ZanSara ZanSara removed the request for review from silvanocerza November 22, 2023 16:12
@ZanSara ZanSara marked this pull request as ready for review November 22, 2023 16:16
@ZanSara ZanSara requested a review from julian-risch November 22, 2023 16:17
@github-actions github-actions bot added the type:documentation Improvements on the docs label Nov 23, 2023
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to prevent the reader from crashing. I am bit concerned that we don't have a better understanding why the issues with the tokenizers occur. It's outside of the Haystack code base, so no issue with the reader, which is good.

@julian-risch julian-risch merged commit c45d8c3 into main Nov 24, 2023
22 checks passed
@julian-risch julian-risch deleted the reader-bug branch November 24, 2023 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.x Related to Haystack v2.0 ignore-for-release-notes PRs with this flag won't be included in the release notes. topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ExtractiveReader fails with proper input Documents
2 participants