
make tokenizer emit at least one empty token on empty strings #5532

Open · wants to merge 3 commits into main

Conversation

@trinity-1686a (Contributor) commented Oct 30, 2024

Description

always emit at least one token when indexing/querying with an empty field

How was this PR tested?

Tested manually. TODO: add an integration test.

@@ -316,6 +316,7 @@ mod tests {
use crate::{create_default_quickwit_tokenizer_manager, BooleanOperand};

#[test]
#[ignore]
@trinity-1686a (Contributor, Author) commented Oct 30, 2024

The behavior this test checks for has changed; I don't think it makes sense to keep it. Note that it doesn't test an empty query, but the query body:"", so it makes sense that it is no longer a MatchAll. An empty query, by contrast, emits no FullTextQuery and is still a MatchAll.

Benchmark results:

On SSD:

Average search latency is 0.996x that of the reference (lower is better).
Ref run id: 4045, ref commit: 105aa7d

On GCS:

Average search latency is 0.925x that of the reference (lower is better).
Ref run id: 4087, ref commit: 826f10f

@PSeitz (Contributor) commented Oct 31, 2024

Why do we need this?

@@ -128,6 +132,69 @@ pub fn get_quickwit_fastfield_normalizer_manager() -> &'static TokenizerManager
&QUICKWIT_FAST_FIELD_NORMALIZER_MANAGER
}

#[derive(Debug, Clone)]
struct EmitEmptyTokenizer<T>(T);
A contributor commented:

We need a comment to express the intent of this kind of tokenizer. It is very difficult for future readers to infer the point of it otherwise.

A contributor commented:

This tokenizer emits the empty token whenever the underlying tokenizer emits no token.
I am not sure this is the behavior we want.

For instance, a stop word would result in the emission of the empty token.

A stricter behavior, emitting an empty token iff the text is effectively empty, might actually make more sense. I suspect the code would be simpler too.
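
For concreteness, here is a minimal sketch of that stricter variant, written against tantivy's Tokenizer/TokenStream traits. The EmitEmptyTokenizer name comes from this PR, but the state handling below is illustrative and need not match the PR's EmitEmptyState machine:

use tantivy::tokenizer::{Token, TokenStream, Tokenizer};

/// Wraps a tokenizer so that an empty input produces exactly one empty
/// token instead of none, making {"field": ""} distinguishable from {}.
#[derive(Debug, Clone)]
struct EmitEmptyTokenizer<T>(T);

impl<T: Tokenizer> Tokenizer for EmitEmptyTokenizer<T> {
    type TokenStream<'a> = EmitEmptyTokenStream<T::TokenStream<'a>>;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        EmitEmptyTokenStream {
            // Stricter rule: trigger only when the text itself is empty, so
            // inputs that tokenize to nothing (e.g. stop words) are untouched.
            pending_empty: text.is_empty(),
            on_empty: false,
            empty_token: Token {
                offset_from: 0,
                offset_to: 0,
                position: 0,
                text: String::new(),
                position_length: 1,
            },
            inner: self.0.token_stream(text),
        }
    }
}

struct EmitEmptyTokenStream<S> {
    pending_empty: bool,
    on_empty: bool,
    empty_token: Token,
    inner: S,
}

impl<S: TokenStream> TokenStream for EmitEmptyTokenStream<S> {
    fn advance(&mut self) -> bool {
        if self.pending_empty {
            // Yield the single empty token exactly once, then fall back to
            // the wrapped stream (which is itself empty for empty text).
            self.pending_empty = false;
            self.on_empty = true;
            true
        } else {
            self.on_empty = false;
            self.inner.advance()
        }
    }

    fn token(&self) -> &Token {
        if self.on_empty { &self.empty_token } else { self.inner.token() }
    }

    fn token_mut(&mut self) -> &mut Token {
        if self.on_empty { &mut self.empty_token } else { self.inner.token_mut() }
    }
}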


fn token(&self) -> &Token {
match self.state {
EmitEmptyState::Start => unreachable!(),
A contributor commented:

It is reachable.

Suggested change
-            EmitEmptyState::Start => unreachable!(),
+            EmitEmptyState::Start => panic!("token() should not be called before a first call to advance"),

@fulmicoton (Contributor) left a comment:

We need to explain the intent in the commit message and in a comment on the tokenizer.

A unit test checking for the behavior end to end (indexing + query side) should be added too.

We also need to think about what the behavior should be for a multivalued field
["hello", ""] (does it match or not), and add a unit test to validate this behavior.

@trinity-1686a (Contributor, Author) commented:

> Why do we need this?

Today {"field": ""} and {} produce the same tokens for most tokenizers (all except raw): no tokens at all. It's not possible to search for empty strings only; this change allows doing just that.
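
To illustrate the claim (assuming tantivy's tokenizer API), the default word tokenizer yields nothing for an empty string, so an empty value and an absent field index identically:

use tantivy::tokenizer::{SimpleTokenizer, TokenStream, Tokenizer};

fn main() {
    let mut tokenizer = SimpleTokenizer::default();
    // An empty field value contributes zero tokens, exactly like a missing
    // field, so the two documents cannot be told apart at query time.
    let mut stream = tokenizer.token_stream("");
    assert!(!stream.advance());
}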

@PSeitz (Contributor) commented Oct 31, 2024

> > Why do we need this?
>
> Today {"field": ""} and {} produce the same tokens for most tokenizers (all except raw): no tokens at all. It's not possible to search for empty strings only; this change allows doing just that.

Tokenization is already quite expensive during indexing; I think this change may add some non-negligible overhead there.
