Provide true real-time indexing for Lucene based text index #13504

itschrispeck · 2024-06-27T21:57:00Z

Problem

Currently, Pinot's RealtimeLuceneTextIndex relies on Lucene's near real-time (NRT) indexing functionality. Some effort has already been made to reduce the refresh delay. However, because of how the implementation works, true real-time indexing is still missing.

This behavior shows up in a couple of ways:

text_match(col, '"abcd"') -> forward match misses the most recent docs
NOT text_match(col, '"abcd"') -> inverse match fails to exclude the most recent docs, so users will see docs containing abcd
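
Both symptoms can be reproduced with a toy model. The sketch below is not Pinot or Lucene code: `ToyNrtIndex`, `refresh`, `not_text_match`, and the doc contents are hypothetical names for illustration only. It assumes the inverse match is computed as a complement over all ingested docs, while the text index only sees docs up to its last refresh.

```python
class ToyNrtIndex:
    """Toy stand-in for an NRT text index (hypothetical, not Pinot code).

    Docs become searchable only after refresh() is called.
    """

    def __init__(self):
        self.docs = []        # all ingested docs; doc id = list position
        self.searchable = 0   # docs [0, searchable) are visible to searches

    def ingest(self, text):
        self.docs.append(text)

    def refresh(self):
        self.searchable = len(self.docs)

    def text_match(self, term):
        # Only docs covered by the last refresh can match.
        return {i for i in range(self.searchable) if term in self.docs[i]}


def not_text_match(index, term):
    # Assumption: the inverse match is the complement over ALL ingested docs.
    all_docs = set(range(len(index.docs)))
    return all_docs - index.text_match(term)


index = ToyNrtIndex()
index.ingest("abcd first")
index.refresh()
index.ingest("abcd second")   # ingested, but not yet refreshed
index.ingest("other")

forward = index.text_match("abcd")      # misses doc 1
inverse = not_text_match(index, "abcd")  # wrongly includes doc 1
```

Here doc 1 contains abcd but is absent from the forward match and present in the inverse match until the next refresh.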

Missing results for upsert, for example:

t0: doc A ingested/doc A is the valid doc based on upsert latest docs
t1: doc A text indexed, doc A searchable w/ text index
t2: doc B ingested/doc B is the valid doc based on upsert latest docs
<text_match query returns doc A, but upsert invalidated doc A, no results>
t3: doc B text indexed, doc B searchable w/ text index
<text_match query returns doc B, doc B is searchable w/ text index and a valid doc, expected results>
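
The timeline above can be sketched as a set intersection. This is a simplified model under the assumption that a query returns only docs that are both searchable in the text index and valid according to upsert; `query` and the doc labels are hypothetical.

```python
def query(searchable_hits, valid_docs):
    """Toy model: result = text index hits that upsert still considers valid."""
    return searchable_hits & valid_docs


# (time, docs searchable via the text index, docs valid per upsert)
timeline = [
    ("t1", {"A"}, {"A"}),        # doc A indexed and valid
    ("t2", {"A"}, {"B"}),        # doc B ingested, not yet indexed
    ("t3", {"A", "B"}, {"B"}),   # doc B indexed
]

results = {t: query(hits, valid) for t, hits, valid in timeline}
```

Between t2 and t3 the result set is empty even though a valid matching doc (doc B) has already been ingested.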

With the delay minimized, we can use Lucene primitives to provide a small, in-memory, true real-time index that bridges the gap between what the NRT index has made searchable and the docs already ingested in Pinot.
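
A rough sketch of the idea, as a toy model rather than a design: the refreshed snapshot serves docs up to the last refresh, and a small in-memory structure serves the unrefreshed tail. `ToyHybridIndex` and its methods are hypothetical, and the substring scan stands in for whatever Lucene primitive the real implementation would use.

```python
class ToyHybridIndex:
    """Toy model: NRT snapshot plus a small in-memory tail (hypothetical)."""

    def __init__(self):
        self.docs = []       # all ingested docs; doc id = list position
        self.refreshed = 0   # docs [0, refreshed) are served by the NRT snapshot

    def ingest(self, text):
        self.docs.append(text)

    def refresh(self):
        # After a refresh the tail is empty; the snapshot covers everything.
        self.refreshed = len(self.docs)

    def _nrt_match(self, term):
        return {i for i in range(self.refreshed) if term in self.docs[i]}

    def _tail_match(self, term):
        # Stand-in for the small real-time index over unrefreshed docs.
        tail = range(self.refreshed, len(self.docs))
        return {i for i in tail if term in self.docs[i]}

    def text_match(self, term):
        return self._nrt_match(term) | self._tail_match(term)


hybrid = ToyHybridIndex()
hybrid.ingest("abcd first")
hybrid.refresh()
hybrid.ingest("abcd second")   # not yet refreshed, served by the tail
hybrid.ingest("other")

forward = hybrid.text_match("abcd")
inverse = set(range(len(hybrid.docs))) - forward
```

Because the tail only ever holds docs ingested since the last refresh, it stays small as long as the refresh delay is short.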

Alternatives considered:

bound the most recent doc considered during query execution based on index refresh delay
- For the V1 query engine, I think this can be done in FilterOperatorUtils by 'adjusting' numDocs if the data source has a text index.
- This does not solve the freshness issues (inconsistencies w/ query response metadata), but will avoid the correctness issues seen by inverse match.
- This does not solve the upsert case, but changes the scope of the issue from results = {correct, extraneous, missing} to results = {correct, missing}
rewrite NOT text_match(col, '"abcd"') to text_match(col, '/.*/ AND NOT "abcd"')
- this carries some unwanted performance implications, but could be used to guarantee query correctness (i.e. don't include results that should be excluded)
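
Both alternatives can be sketched in a few lines. This is a toy illustration, not Pinot code: `bounded_not_match`, `rewrite_not_text_match`, and `last_refreshed_doc_id` are hypothetical names, and the regex only handles the simple single-predicate form shown above.

```python
import re


def bounded_not_match(hits, num_docs, last_refreshed_doc_id):
    """Alternative 1: complement only over docs the text index has seen."""
    bound = min(num_docs, last_refreshed_doc_id + 1)
    return set(range(bound)) - hits


_NOT_TEXT_MATCH = re.compile(
    r"NOT\s+text_match\(\s*(\w+)\s*,\s*'([^']*)'\s*\)", re.IGNORECASE
)


def rewrite_not_text_match(filter_expr):
    """Alternative 2: push the negation into the Lucene query string."""
    return _NOT_TEXT_MATCH.sub(
        lambda m: f"text_match({m.group(1)}, '/.*/ AND NOT {m.group(2)}')",
        filter_expr,
    )


# Three docs ingested, only doc 0 refreshed; doc 0 and doc 1 contain abcd.
bounded = bounded_not_match(hits={0}, num_docs=3, last_refreshed_doc_id=0)
rewritten = rewrite_not_text_match("NOT text_match(col, '\"abcd\"')")
```

With the bound applied, the unrefreshed doc containing abcd is no longer returned, but the unrefreshed doc that should match the inverse query is missing, which is the {correct, missing} trade-off described above.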


itschrispeck added ingestion real-time labels Jun 27, 2024

itschrispeck self-assigned this Jun 27, 2024

itschrispeck mentioned this issue Jul 5, 2024

Improve realtime Lucene text index freshness/cpu/disk io usage #13503

Merged

hpvd mentioned this issue Nov 4, 2024

Messures and Doc for index freshness/update #14371

Open
