Currently, Pinot's RealtimeLuceneTextIndex uses Lucene's near real-time (NRT) indexing functionality. Some effort has already been made to reduce the delay. However, due to the nature of the implementation, true real-time indexing is still missing.
This behavior presents in a few ways:
text_match(col, '"abcd"') -> forward match misses the most recent docs
NOT text_match(col, '"abcd"') -> inverse match fails to exclude the most recent docs, so users will see docs containing abcd
Missing results for upsert, for example:
t0: doc A ingested/doc A is the valid doc based on upsert latest docs
t1: doc A text indexed, doc A searchable w/ text index
t2: doc B ingested/doc B is the valid doc based on upsert latest docs
<text_match query returns doc A, but upsert invalidated doc A, no results>
t3: doc B text indexed, doc B searchable w/ text index
<text_match query returns doc B, doc B is searchable w/ text index and a valid doc, expected results>
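The three symptoms above can be reproduced with a small toy model. This is only a sketch under assumptions, not Pinot code: `ToySegment`, `refreshed`, and the substring-based `text_match` are hypothetical stand-ins for a consuming segment, the NRT refresh point, and a Lucene phrase query. The inverse match is modeled as the complement over all ingested docs.

```python
from dataclasses import dataclass, field


@dataclass
class ToySegment:
    """Hypothetical stand-in for a consuming segment with a lagging text index."""
    docs: list = field(default_factory=list)         # ingested text, list index == docId
    refreshed: int = 0                               # docs [0, refreshed) are visible to the text index
    valid: set = field(default_factory=set)          # docIds currently valid under upsert
    latest_by_key: dict = field(default_factory=dict)

    def ingest(self, key, text):
        """Ingest a doc; under upsert the newest doc for a key invalidates the older one."""
        doc_id = len(self.docs)
        self.docs.append(text)
        previous = self.latest_by_key.get(key)
        if previous is not None:
            self.valid.discard(previous)
        self.latest_by_key[key] = doc_id
        self.valid.add(doc_id)
        return doc_id

    def refresh(self):
        """Simulate the NRT refresh: everything ingested so far becomes searchable."""
        self.refreshed = len(self.docs)

    def text_match(self, term):
        """Forward match: only sees docs the text index has refreshed."""
        return {i for i in range(self.refreshed) if term in self.docs[i]}

    def not_text_match(self, term):
        """Inverse match: complement over ALL ingested docs, refreshed or not."""
        return set(range(len(self.docs))) - self.text_match(term)

    def upsert_text_match(self, term):
        """Forward match restricted to the upsert-valid docs."""
        return self.text_match(term) & self.valid


seg = ToySegment()
doc_a = seg.ingest("k1", "abcd v1")               # t0: doc A ingested and valid
seg.refresh()                                     # t1: doc A searchable
doc_b = seg.ingest("k1", "abcd v2")               # t2: doc B ingested, doc A invalidated

forward_at_t2 = seg.text_match("abcd")            # {0}: misses doc B
inverse_at_t2 = seg.not_text_match("abcd")        # {1}: doc B contains abcd but is returned
upsert_at_t2 = seg.upsert_text_match("abcd")      # set(): doc A matched but is invalid

seg.refresh()                                     # t3: doc B searchable
upsert_at_t3 = seg.upsert_text_match("abcd")      # {1}: expected result
```

Between t2 and t3 the model shows all three problems at once: the forward match is stale, the inverse match leaks doc B, and the upsert query returns nothing.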
With the refresh delay minimized, we can use Lucene primitives to provide a small, in-memory, true real-time index that bridges the gap between the docs the NRT index has made searchable and the docs already ingested in Pinot.
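As a rough illustration of the bridging idea, the toy sketch below unions matches from the refreshed part of the index with matches from the not-yet-refreshed tail. Everything here is an assumption made for illustration: `nrt_match`, `tail_match`, and `bridged_text_match` are hypothetical names, and the substring scan over the tail merely stands in for whatever Lucene primitive the real in-memory index would use.

```python
def nrt_match(docs, refreshed, term):
    """Matches visible through the NRT index: docIds in [0, refreshed)."""
    return {i for i in range(refreshed) if term in docs[i]}


def tail_match(docs, refreshed, term):
    """Stand-in for the small in-memory index over docIds in [refreshed, numDocs)."""
    return {i for i in range(refreshed, len(docs)) if term in docs[i]}


def bridged_text_match(docs, refreshed, term):
    """Union of both sources, so every ingested doc is considered."""
    return nrt_match(docs, refreshed, term) | tail_match(docs, refreshed, term)


docs = ["abcd one", "efgh", "abcd two"]   # docId 2 was ingested after the last refresh
refreshed = 2

nrt_only = nrt_match(docs, refreshed, "abcd")            # {0}
bridged = bridged_text_match(docs, refreshed, "abcd")    # {0, 2}
inverse = set(range(len(docs))) - bridged                # {1}
```

Because the tail only holds docs ingested since the last refresh, it stays small as long as the refresh delay is short, which is why minimizing the delay comes first.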
Alternatives considered:
bound the most recent doc considered during query execution based on index refresh delay
For the V1 query engine, I think this can be done in FilterOperatorUtils by 'adjusting' numDocs if the data source has a text index.
This does not solve the freshness issues (inconsistencies w/ query response metadata), but will avoid the correctness issues seen by inverse match.
This does not solve the upsert case, but changes the scope of the issue from results = {correct, extraneous, missing} to results = {correct, missing}
rewrite NOT text_match(col, '"abcd"') to text_match(col, '/.*/ AND NOT "abcd"')
this carries some unwanted performance implications, but could be used to guarantee query correctness (i.e. don't include results that should be excluded)
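Both alternatives can be sketched in the same toy style. The code below is a hypothetical illustration, not the FilterOperatorUtils change or a real query rewriter: `bounded_not_text_match` takes the complement only over docs the text index has refreshed, and `rewrite_not_text_match` is a naive regex rewrite that assumes the simple `NOT text_match(col, '...')` shape shown above, with no nested single quotes.

```python
import re


def bounded_not_text_match(docs, refreshed, term):
    """Inverse match with numDocs 'adjusted' down to the refreshed doc count."""
    matches = {i for i in range(refreshed) if term in docs[i]}
    return set(range(refreshed)) - matches


# Naive pattern: NOT text_match(<column>, '<query>'), assumed to have no nested single quotes.
_NOT_TEXT_MATCH = re.compile(
    r"NOT\s+text_match\(\s*(\w+)\s*,\s*'([^']*)'\s*\)", re.IGNORECASE
)


def rewrite_not_text_match(sql):
    """Rewrite NOT text_match(col, 'q') to text_match(col, '/.*/ AND NOT q')."""
    return _NOT_TEXT_MATCH.sub(
        lambda m: f"text_match({m.group(1)}, '/.*/ AND NOT {m.group(2)}')", sql
    )


docs = ["abcd one", "efgh", "abcd two"]   # docId 2 was ingested after the last refresh
refreshed = 2

unbounded = set(range(len(docs))) - {i for i in range(refreshed) if "abcd" in docs[i]}
bounded = bounded_not_text_match(docs, refreshed, "abcd")

rewritten = rewrite_not_text_match("""NOT text_match(col, '"abcd"')""")
```

Here `unbounded` is `{1, 2}` (doc 2 is extraneous) while `bounded` is `{1}`, which matches the shift from {correct, extraneous, missing} to {correct, missing}. The rewrite reaches the same outcome by pushing the negation into the text index, so only refreshed docs can ever be returned.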