
make tokenizer emit at least one empty token on empty strings #5532

Open · wants to merge 3 commits into main

Conversation

@trinity-1686a (Contributor) commented Oct 30, 2024

Description

always emit at least one token when indexing/querying with an empty field

How was this PR tested?

Tested manually. TODO: add an integration test.

@@ -316,6 +316,7 @@ mod tests {
use crate::{create_default_quickwit_tokenizer_manager, BooleanOperand};

#[test]
#[ignore]
@trinity-1686a (Contributor, Author) commented Oct 30, 2024

The behavior this test checks for has changed; I don't think it makes sense to keep it. Note that it doesn't test an empty query, but the query body:"", so it makes sense that it is no longer a MatchAll. An empty query, by contrast, emits no FullTextQuery and is still a MatchAll.

Benchmark results:

On SSD:

Average search latency is 0.996x that of the reference (lower is better).
Ref run id: 4045, ref commit: 105aa7d

On GCS:

Average search latency is 0.925x that of the reference (lower is better).
Ref run id: 4087, ref commit: 826f10f

@PSeitz (Contributor) commented Oct 31, 2024

Why do we need this?

@@ -128,6 +132,69 @@ pub fn get_quickwit_fastfield_normalizer_manager() -> &'static TokenizerManager
&QUICKWIT_FAST_FIELD_NORMALIZER_MANAGER
}

#[derive(Debug, Clone)]
struct EmitEmptyTokenizer<T>(T);
A contributor commented:

We need a comment to express the intent of this kind of tokenizer. It is very difficult for future readers to infer the point of it otherwise.

A contributor commented:

This tokenizer emits the empty token whenever the underlying tokenizer emits no token.
I am not sure this is the behavior we want.

For instance, a stop word would result in the emission of the empty token.

A stricter behavior, emitting an empty token iff the text is effectively empty, might actually make more sense. I suspect the code would be simpler too.
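
For concreteness, here is a minimal sketch of that stricter variant, written against tantivy's Tokenizer/TokenStream traits. The EmitEmptyTokenizer name comes from this PR, but the state handling below is illustrative and need not match the PR's EmitEmptyState machine:

use tantivy::tokenizer::{Token, TokenStream, Tokenizer};

/// Wraps a tokenizer so that an empty input produces exactly one empty
/// token instead of none, making {"field": ""} distinguishable from {}.
#[derive(Debug, Clone)]
struct EmitEmptyTokenizer<T>(T);

impl<T: Tokenizer> Tokenizer for EmitEmptyTokenizer<T> {
    type TokenStream<'a> = EmitEmptyTokenStream<T::TokenStream<'a>>;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        EmitEmptyTokenStream {
            // Stricter rule: trigger only when the text itself is empty, so
            // inputs that tokenize to nothing (e.g. stop words) are untouched.
            pending_empty: text.is_empty(),
            on_empty: false,
            empty_token: Token {
                offset_from: 0,
                offset_to: 0,
                position: 0,
                text: String::new(),
                position_length: 1,
            },
            inner: self.0.token_stream(text),
        }
    }
}

struct EmitEmptyTokenStream<S> {
    pending_empty: bool,
    on_empty: bool,
    empty_token: Token,
    inner: S,
}

impl<S: TokenStream> TokenStream for EmitEmptyTokenStream<S> {
    fn advance(&mut self) -> bool {
        if self.pending_empty {
            // Yield the single empty token exactly once, then fall back to
            // the wrapped stream (which is itself empty for empty text).
            self.pending_empty = false;
            self.on_empty = true;
            true
        } else {
            self.on_empty = false;
            self.inner.advance()
        }
    }

    fn token(&self) -> &Token {
        if self.on_empty { &self.empty_token } else { self.inner.token() }
    }

    fn token_mut(&mut self) -> &mut Token {
        if self.on_empty { &mut self.empty_token } else { self.inner.token_mut() }
    }
}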


fn token(&self) -> &Token {
match self.state {
EmitEmptyState::Start => unreachable!(),
A contributor commented:

It is reachable.

Suggested change
-            EmitEmptyState::Start => unreachable!(),
+            EmitEmptyState::Start => panic!("token() should not be called before a first call to advance"),

@fulmicoton (Contributor) left a comment:

We need to explain the intent in the commit message and in a comment on the tokenizer.

A unit test checking for the behavior end to end (indexing + query side) should be added too.

We also need to think about what the behavior should be for a multivalued field
["hello", ""] (does it match or not), and add a unit test to validate this behavior.

@trinity-1686a (Contributor, Author) commented:

> Why do we need this?

Today {"field": ""} and {} produce the same tokens for most tokenizers (all except raw): no tokens at all. It's not possible to search for empty strings only; this change allows doing just that.
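
To illustrate the claim (assuming tantivy's tokenizer API), the default word tokenizer yields nothing for an empty string, so an empty value and an absent field index identically:

use tantivy::tokenizer::{SimpleTokenizer, TokenStream, Tokenizer};

fn main() {
    let mut tokenizer = SimpleTokenizer::default();
    // An empty field value contributes zero tokens, exactly like a missing
    // field, so the two documents cannot be told apart at query time.
    let mut stream = tokenizer.token_stream("");
    assert!(!stream.advance());
}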

@PSeitz (Contributor) commented Oct 31, 2024

> > Why do we need this?
>
> Today {"field": ""} and {} produce the same tokens for most tokenizers (all except raw): no tokens at all. It's not possible to search for empty strings only; this change allows doing just that.

Tokenization is already quite expensive during indexing; I think this change may add some non-negligible overhead there.
