Provide true real-time indexing for Lucene based text index #13504

Open
itschrispeck opened this issue Jun 27, 2024 · 0 comments

Problem

Pinot's RealtimeLuceneTextIndex currently relies on Lucene's near-real-time (NRT) indexing. Some effort has already gone into reducing the refresh delay, but because newly ingested docs only become searchable after the index is refreshed, true real-time indexing is still missing.

This shows up in a few ways:

  1. text_match(col, '"abcd"') -> forward match misses the most recent docs
  2. NOT text_match(col, '"abcd"') -> inverse match fails to exclude the most recent docs, so users will see docs containing abcd
  3. Missing results for upsert, for example:
    t0: doc A ingested/doc A is the valid doc based on upsert latest docs
    t1: doc A text indexed, doc A searchable w/ text index
    t2: doc B ingested/doc B is the valid doc based on upsert latest docs
    <text_match matches only doc A, but upsert has invalidated doc A, so the query returns no results>
    t3: doc B text indexed, doc B searchable w/ text index
    <text_match query returns doc B, doc B is searchable w/ text index and a valid doc, expected results>
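
The first two cases can be reproduced with a toy model. The sketch below is not Pinot or Lucene code: the class, its fields, and `refresh()` are hypothetical stand-ins, assuming only that the text index sees docs up to the last refresh while the inverse match is computed over every ingested doc.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a consuming segment whose text index lags behind ingestion.
// All names are hypothetical; this is not Pinot code.
class LaggingTextIndexDemo {
    final List<String> docs = new ArrayList<>(); // every ingested doc, by docId
    int indexedUpTo = 0; // docIds in [0, indexedUpTo) are visible to the text index

    void ingest(String text) {
        docs.add(text);
    }

    // Stands in for an NRT refresh: ingested docs become searchable.
    void refresh() {
        indexedUpTo = docs.size();
    }

    // text_match: only docs the index has already seen can match.
    List<Integer> textMatch(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < indexedUpTo; docId++) {
            if (docs.get(docId).contains(term)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    // NOT text_match: complement of the match set over ALL ingested docs.
    List<Integer> notTextMatch(String term) {
        List<Integer> matched = textMatch(term);
        List<Integer> result = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            if (!matched.contains(docId)) {
                result.add(docId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LaggingTextIndexDemo seg = new LaggingTextIndexDemo();
        seg.ingest("abcd first");
        seg.ingest("other text");
        seg.refresh();
        seg.ingest("abcd second"); // ingested, but not yet refreshed

        System.out.println(seg.textMatch("abcd"));    // [0]    -> doc 2 is missing
        System.out.println(seg.notTextMatch("abcd")); // [1, 2] -> doc 2 wrongly included
    }
}
```

Doc 2 contains abcd but is absent from the forward match and present in the inverse match until the next refresh.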
    

With the delay minimized, we can use Lucene primitives to provide a small, in-memory, true real-time index that bridges the gap between what the NRT index has made searchable and the docs already ingested in Pinot.
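
A rough illustration of the bridging idea, under heavy assumptions: the class below is hypothetical, uses a plain term-to-docId map rather than any Lucene primitive, and matches single whitespace-separated terms only. It shows the shape of the approach (a small tail structure covering docs ingested since the last refresh, unioned with hits from the lagging index), not a proposed implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: a tiny in-memory term index covers the docs that the
// lagging (NRT) index cannot see yet. Not Pinot or Lucene code.
class BridgedTextIndexDemo {
    private final List<String> docs = new ArrayList<>();
    private int indexedUpTo = 0; // lagging index sees docIds in [0, indexedUpTo)
    // term -> docIds, only for docs ingested since the last refresh
    private final Map<String, TreeSet<Integer>> tail = new HashMap<>();

    void ingest(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.split("\\s+")) {
            tail.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // After a refresh the lagging index covers everything, so the tail is dropped.
    void refresh() {
        indexedUpTo = docs.size();
        tail.clear();
    }

    // Union of hits from the lagging index and from the in-memory tail.
    TreeSet<Integer> textMatch(String term) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (int docId = 0; docId < indexedUpTo; docId++) {
            if (Arrays.asList(docs.get(docId).split("\\s+")).contains(term)) {
                hits.add(docId);
            }
        }
        hits.addAll(tail.getOrDefault(term, new TreeSet<>()));
        return hits;
    }
}
```

In this toy the tail stays small because it is cleared on every refresh. A real implementation would also have to make the handoff between the two structures safe under concurrent ingestion and queries, which this single-threaded sketch ignores.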

Alternatives considered:

  • bound the most recent doc considered during query execution based on index refresh delay

    • For the V1 query engine, I think this can be done in FilterOperatorUtils by 'adjusting' numDocs if the data source has a text index.
    • This does not solve the freshness issues (inconsistencies w/ query response metadata), but will avoid the correctness issues seen by inverse match.
    • This does not solve the upsert case, but changes the scope of the issue from results = {correct, extraneous, missing} to results = {correct, missing}
  • rewrite NOT text_match(col, '"abcd"') to text_match(col, '/.*/ AND NOT "abcd"')

    • this carries some unwanted performance implications, but could be used to guarantee query correctness (i.e. don't include results that should be excluded)
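
To make the first alternative concrete, here is a minimal sketch of how bounding the doc range changes the inverse match. The helper and variable names are hypothetical and do not correspond to FilterOperatorUtils or any other Pinot API; the only assumption is that the inverse match is a complement over docIds in [0, numDocs).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: evaluate the inverse match only over docs the text
// index has already seen. Names are illustrative, not Pinot APIs.
class BoundedNumDocsDemo {
    // Complement of `matched` over docIds in [0, numDocs).
    static List<Integer> notMatch(List<Integer> matched, int numDocs) {
        List<Integer> result = new ArrayList<>();
        for (int docId = 0; docId < numDocs; docId++) {
            if (!matched.contains(docId)) {
                result.add(docId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int ingestedDocs = 3; // docs 0..2 ingested; doc 2 contains "abcd"
        int indexedDocs = 2;  // the text index has only seen docs 0..1
        List<Integer> matched = List.of(0); // text_match hits from the lagging index

        System.out.println(notMatch(matched, ingestedDocs)); // [1, 2] -> extraneous doc 2
        System.out.println(notMatch(matched, indexedDocs));  // [1]    -> doc 2 deferred, not wrong
    }
}
```

With the bound applied, doc 2 is simply missing until it is indexed rather than being returned incorrectly. The rewrite alternative would presumably have a similar effect on results, since the negation is then evaluated inside the index and can only range over indexed docs.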