Cache and search solution space #2248
Replies: 3 comments 1 reply
-
This is great research and summary. I lean towards improving the current implementation; however, if meili/tantivy can be embedded nicely, that would be a great boost. To be more specific about "embedded nicely": we need to AND text search results with other results. If the integration allows us to do this, then it's nice.
-
Part of the question is what we want to support and how to expose it in the API layer. We currently have

    {
      "textfield": {
        "$contains": /* <search expression> */
      }
    }

where the search expression can take these forms:

    "a phrase"  // matches "this string has a phrase in the middle"
    "prefix*"   // matches "this word is prefixed"
    "*infix*"   // matches "the word fooinfixbar does not exist"
    "*suffix"   // matches "and foosuffix doesn't either"

These could of course be combined using `|`.
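A minimal sketch of how those four expression forms could be evaluated against a field value. The function name `contains_match` and the word-splitting strategy are assumptions for illustration, not the actual implementation:

```python
def contains_match(expr: str, text: str) -> bool:
    """Evaluate a $contains-style search expression (hypothetical helper).

    "prefix*"  -> some word starts with "prefix"
    "*suffix"  -> some word ends with "suffix"
    "*infix*"  -> some word contains "infix"
    anything else -> the text contains it as a literal substring
    """
    words = text.split()  # stand-in for proper unicode segmentation
    if expr.startswith("*") and expr.endswith("*"):
        return any(expr[1:-1] in w for w in words)
    if expr.endswith("*"):
        return any(w.startswith(expr[:-1]) for w in words)
    if expr.startswith("*"):
        return any(w.endswith(expr[1:]) for w in words)
    return expr in text

# The four examples from the comment above:
assert contains_match("a phrase", "this string has a phrase in the middle")
assert contains_match("prefix*", "this word is prefixed")
assert contains_match("*infix*", "the word fooinfixbar does not exist")
assert contains_match("*suffix", "and foosuffix doesn't either")
```

Note that prefix search maps cleanly onto a sorted inverted index (range scan), while `*infix*` and `*suffix` generally require scanning all tokens or a dedicated structure such as an n-gram or reversed-token index.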
-
There is one more limitation of the current cache: fixing it would require more rigorous query planning in the cache (maybe we can just put datafusion in front of LMDB to do the query planning).
-
Querying
Almost all applications that consume an API will want to do some filtering. In its most basic form, this filtering can use the following building blocks:

- Exact matching — for example, `a = 3` or `b = 'Hello, world'`.
- Range queries — for example, `a BETWEEN 3 AND 6`.
- Text search:
  - Single word: `'The quick brown fox jumps over the lazy dog' LIKE '%brown%'`
  - Substring: `'The quick brown fox jumps over the lazy dog' LIKE '%brown fox%'`
  - Full-text search: `'The quick brown fox jumps over the lazy dog' @@ 'fox & dog'`
- Combinators:
  - Conjunction: `a AND b`
  - Disjunction: `a OR b`
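One way to see why the combinators compose cleanly with the other building blocks: if each predicate evaluates to a set of matching primary keys, conjunction is set intersection and disjunction is set union. A sketch with hypothetical in-memory records (the helper names are illustrative, not part of any real API):

```python
# Hypothetical records keyed by primary key.
records = {
    1: {"a": 3, "b": "Hello, world"},
    2: {"a": 5, "b": "Goodbye"},
    3: {"a": 7, "b": "Hello again"},
}

def exact(field, value):
    """Exact matching: a = 3"""
    return {pk for pk, r in records.items() if r[field] == value}

def between(field, lo, hi):
    """Range query: a BETWEEN lo AND hi"""
    return {pk for pk, r in records.items() if lo <= r[field] <= hi}

def substring(field, needle):
    """Substring text search: b LIKE '%needle%'"""
    return {pk for pk, r in records.items() if needle in r[field]}

# Conjunction is intersection, disjunction is union.
conj = between("a", 3, 6) & substring("b", "Hello")  # AND
disj = exact("a", 7) | substring("b", "Goodbye")     # OR
```

This is essentially what an index-backed query engine does, except that each predicate's key set comes from an index scan rather than a full table scan.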
Current situation
The cache is currently a collection of LMDB databases. Each endpoint has one primary database, which is a KV mapping from primary key to record. Secondary indexes are databases defining multi-maps from the secondary index field to the primary key. There is basic full-text search, which tokenizes text by word (basic unicode segmentation) and builds an inverted index from it. A query with a `$contains` predicate is treated as a single-word search. Substring and FTS are not implemented.
Conjunction is implemented; disjunction is not.
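A toy sketch of the structure described above: a primary KV database plus an inverted index, with single-word lookup and conjunction via posting-set intersection. The data and the use of plain `dict`/`set` are stand-ins for the LMDB databases:

```python
from collections import defaultdict

# Primary database: primary key -> record (here just a text field).
primary = {
    1: "The quick brown fox",
    2: "the lazy dog",
    3: "A quick dog",
}

# Inverted index: token -> set of primary keys (a multi-map, analogous
# to the secondary-index LMDB databases).
inverted = defaultdict(set)
for pk, text in primary.items():
    for token in text.lower().split():  # stand-in for unicode segmentation
        inverted[token].add(pk)

# A single-word $contains query is one index lookup...
quick = inverted["quick"]          # {1, 3}
# ...and conjunction intersects the posting sets of each predicate.
quick_dogs = inverted["quick"] & inverted["dog"]  # {3}
```

Disjunction would just be the union of posting sets, but as noted it is not implemented today; the harder part is planning which index to scan first when predicates are mixed.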
LMDB pros
Possible cache/search implementations
SQLite cache
Pros
Full power of SQL for querying and fts5 for text search. Automatic index building.
Cons
Cannot index in the background. Ingest and query performance slower than LMDB. Need to rebuild entire cache (but don't have to reimplement a lot of the current features ourselves).
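To make the fts5 option concrete, here is a small sketch using Python's stdlib `sqlite3` with an in-memory database. It assumes the bundled SQLite was compiled with the FTS5 extension (true for most distributions, but not guaranteed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 builds and maintains the full-text index automatically.
con.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
con.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [
        ("The quick brown fox jumps over the lazy dog",),
        ("a lazy afternoon",),
    ],
)

# Boolean full-text queries come for free, e.g. the 'fox & dog'
# example from the Querying section:
rows = con.execute(
    "SELECT body FROM docs WHERE docs MATCH 'fox AND dog'"
).fetchall()
```

This illustrates the "automatic index building" pro: inserts into the virtual table keep the inverted index up to date, with no separate maintenance code, at the cost of doing the indexing synchronously on the write path.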
Keep LMDB and use meilisearch's milli for full-text indexes
Pros
Automatic full-text-search index building. Most powerful search of all the options. Milli is built on LMDB as well.
Cons
Search capabilities probably overkill. Still need to implement index maintenance and other kinds of indexes. Optimized for large numbers of large documents.
Add Tantivy on top of current cache
Pros
Powerful search. Easier to use than milli.
Cons
Difficult to store documents in LMDB.
Custom on current cache
Pros
Most control. Fewest changes to current code.
Cons
Difficult to implement anything beyond most basic search.