Cache and search solution space #2248
Replies: 3 comments 1 reply
-
This is great research and summary. I lean towards improving the current implementation; however, if meili/tantivy can be embedded nicely, that would be a great boost. To be more specific about "embedded nicely": we need to AND text search results with other results. If the integration allows us to do this, then it's nice.
-
Part of the question is what we want to support and how to expose it in the API layer. We currently have

    {
      "textfield": {
        "$contains": /* <search expression> */
      }
    }

where the search expression can take these forms:

    "a phrase"  // matches "this string has a phrase in the middle"
    "prefix*"   // matches "this word is prefixed"
    "*infix*"   // matches "the word fooinfixbar does not exist"
    "*suffix"   // matches "and foosuffix doesn't either"

These could of course be combined using `|`.
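A minimal sketch of how those four expression forms could be evaluated against a field value. The function name `contains_match` and the word-splitting strategy are assumptions for illustration, not the actual implementation:

```python
def contains_match(expr: str, text: str) -> bool:
    """Evaluate a $contains-style search expression (hypothetical helper).

    "prefix*"  -> some word starts with "prefix"
    "*suffix"  -> some word ends with "suffix"
    "*infix*"  -> some word contains "infix"
    anything else -> the text contains it as a literal substring
    """
    words = text.split()  # stand-in for proper unicode segmentation
    if expr.startswith("*") and expr.endswith("*"):
        return any(expr[1:-1] in w for w in words)
    if expr.endswith("*"):
        return any(w.startswith(expr[:-1]) for w in words)
    if expr.startswith("*"):
        return any(w.endswith(expr[1:]) for w in words)
    return expr in text

# The four examples from the comment above:
assert contains_match("a phrase", "this string has a phrase in the middle")
assert contains_match("prefix*", "this word is prefixed")
assert contains_match("*infix*", "the word fooinfixbar does not exist")
assert contains_match("*suffix", "and foosuffix doesn't either")
```

Note that prefix search maps cleanly onto a sorted inverted index (range scan), while `*infix*` and `*suffix` generally require scanning all tokens or a dedicated structure such as an n-gram or reversed-token index.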
-
There is one more limitation of the current cache: fixing it would require more rigorous query planning in the cache (maybe we can just put datafusion in front of LMDB to do the query planning).
-
Querying
Almost all applications that consume an API will want to do some filtering. In its most basic form, this filtering can use the following building blocks:

- Exact matching — for example, `a = 3` or `b = 'Hello, world'`.
- Range queries — for example, `a BETWEEN 3 AND 6`.
- Text search:
  - Single word: `'The quick brown fox jumps over the lazy dog' LIKE '%brown%'`
  - Substring: `'The quick brown fox jumps over the lazy dog' LIKE '%brown fox%'`
  - Full-text search: `'The quick brown fox jumps over the lazy dog' @@ 'fox & dog'`
- Combinators:
  - Conjunction: `a AND b`
  - Disjunction: `a OR b`
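One way to see why the combinators compose cleanly with the other building blocks: if each predicate evaluates to a set of matching primary keys, conjunction is set intersection and disjunction is set union. A sketch with hypothetical in-memory records (the helper names are illustrative, not part of any real API):

```python
# Hypothetical records keyed by primary key.
records = {
    1: {"a": 3, "b": "Hello, world"},
    2: {"a": 5, "b": "Goodbye"},
    3: {"a": 7, "b": "Hello again"},
}

def exact(field, value):
    """Exact matching: a = 3"""
    return {pk for pk, r in records.items() if r[field] == value}

def between(field, lo, hi):
    """Range query: a BETWEEN lo AND hi"""
    return {pk for pk, r in records.items() if lo <= r[field] <= hi}

def substring(field, needle):
    """Substring text search: b LIKE '%needle%'"""
    return {pk for pk, r in records.items() if needle in r[field]}

# Conjunction is intersection, disjunction is union.
conj = between("a", 3, 6) & substring("b", "Hello")  # AND
disj = exact("a", 7) | substring("b", "Goodbye")     # OR
```

This is essentially what an index-backed query engine does, except that each predicate's key set comes from an index scan rather than a full table scan.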
Current situation
The cache is currently a collection of LMDB databases. Each endpoint has one primary database, which is a KV mapping from primary key to record. Secondary indexes are databases defining multi-maps from the secondary index field to the primary key. There is basic full-text search, which tokenizes text by word (basic unicode segmentation) and builds an inverted index from it. A query with a `$contains` predicate is treated as a single-word search. Substring and FTS are not implemented.
Conjunction is implemented; disjunction is not.
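A toy sketch of the structure described above: a primary KV database plus an inverted index, with single-word lookup and conjunction via posting-set intersection. The data and the use of plain `dict`/`set` are stand-ins for the LMDB databases:

```python
from collections import defaultdict

# Primary database: primary key -> record (here just a text field).
primary = {
    1: "The quick brown fox",
    2: "the lazy dog",
    3: "A quick dog",
}

# Inverted index: token -> set of primary keys (a multi-map, analogous
# to the secondary-index LMDB databases).
inverted = defaultdict(set)
for pk, text in primary.items():
    for token in text.lower().split():  # stand-in for unicode segmentation
        inverted[token].add(pk)

# A single-word $contains query is one index lookup...
quick = inverted["quick"]          # {1, 3}
# ...and conjunction intersects the posting sets of each predicate.
quick_dogs = inverted["quick"] & inverted["dog"]  # {3}
```

Disjunction would just be the union of posting sets, but as noted it is not implemented today; the harder part is planning which index to scan first when predicates are mixed.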
LMDB pros
Possible cache/search implementations
SQLite cache
Pros
Full power of SQL for querying and fts5 for text search. Automatic index building.
Cons
Cannot index in the background. Ingest and query performance slower than LMDB. Need to rebuild entire cache (but don't have to reimplement a lot of the current features ourselves).
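To make the fts5 option concrete, here is a small sketch using Python's stdlib `sqlite3` with an in-memory database. It assumes the bundled SQLite was compiled with the FTS5 extension (true for most distributions, but not guaranteed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 builds and maintains the full-text index automatically.
con.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
con.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [
        ("The quick brown fox jumps over the lazy dog",),
        ("a lazy afternoon",),
    ],
)

# Boolean full-text queries come for free, e.g. the 'fox & dog'
# example from the Querying section:
rows = con.execute(
    "SELECT body FROM docs WHERE docs MATCH 'fox AND dog'"
).fetchall()
```

This illustrates the "automatic index building" pro: inserts into the virtual table keep the inverted index up to date, with no separate maintenance code, at the cost of doing the indexing synchronously on the write path.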
Keep LMDB and use meilisearch's milli for full-text indexes
Pros
Automatic full-text-search index building. Most powerful search of all the options. Milli is built on LMDB as well.
Cons
Search capabilities probably overkill. Still need to implement index maintenance and other kinds of indexes. Optimized for large numbers of large documents.
Add Tantivy on top of current cache
Pros
Powerful search. Easier to use than milli.
Cons
Difficult to store documents in LMDB.
Custom on current cache
Pros
Most control. Fewest changes to current code.
Cons
Difficult to implement anything beyond most basic search.