[META] Support Hybrid Disk/Memory Algorithms #1134

Closed
5 tasks
jmazanec15 opened this issue Sep 13, 2023 · 5 comments
Labels
Features Introduces a new unit of functionality that satisfies a requirement v2.17.0

Comments

@jmazanec15
Member

Overview

One common problem with dense vector search is the amount of memory required to implement a solution that delivers high performance and recall. The problem is grounded in the fact that many ANN algorithms require fast, random access to the full floating-point representations of the vectors as well as auxiliary data structures (in-memory algorithms). HNSW is one example of such an algorithm. While there are techniques, like Product Quantization, that allow the memory footprint of the algorithm to be greatly reduced, it can be difficult to achieve high recall with them (refer to a blog I wrote a while ago comparing IVFPQ and HNSW for billion-scale search: https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/). Anecdotally, users typically want a recall above 0.9, where recall is defined as the fraction of returned neighbors that are ground-truth nearest neighbors. Similarly, techniques such as scalar quantization and dimensionality reduction can also be employed, but some users may want a more significant reduction in memory.
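To make the 0.9 target concrete, the recall definition above can be sketched in a few lines (function name and the example IDs are illustrative, not from any particular library):

```python
def recall_at_k(retrieved, ground_truth):
    """Fraction of retrieved neighbor ids that appear in the
    ground-truth nearest-neighbor set for the same query."""
    return len(set(retrieved) & set(ground_truth)) / len(retrieved)

# 9 of the 10 retrieved ids are among the true top-10 neighbors -> recall 0.9
recall_at_k(list(range(9)) + [42], list(range(10)))
```

In benchmarks this is averaged over many queries; a configuration is usually judged by that mean recall alongside latency and memory.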

Another approach that has gained a lot of interest is utilizing fast-access disks (e.g., SSDs) to extend memory. I will refer to these algorithms as hybrid algorithms. Hybrid algorithms keep some kind of small representation of the index in memory and the rest on disk. Then, during search, they intelligently select a few items to read from disk, with the goal of optimizing the query latency/working-set size/recall tradeoff. A few popular examples are:

  1. DiskANN — a hybrid graph-traversal approach — https://suhasjs.github.io/files/diskann_neurips19.pdf
  2. faiss IVF on disk — mapping the IVF algorithm to store vectors on disk — https://github.com/facebookresearch/faiss/blob/main/benchs/distributed_ondisk/README.md
  3. SPANN — a partitioning based approach similar to IVF — https://arxiv.org/pdf/2111.08566.pdf
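To illustrate the hybrid idea behind approaches 2 and 3 (not the actual faiss or SPANN implementations — this is a toy sketch with made-up names): only the coarse centroids stay in memory, while the full vectors live in a file, and a query seeks into that file for just the few lists it probes.

```python
import math
import struct

def l2(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class DiskIVF:
    """Toy IVF-style hybrid index: centroids in memory, vectors on disk."""

    def __init__(self, centroids, path):
        self.centroids = centroids   # small in-memory representation
        self.path = path             # file holding the full float vectors
        self.dim = len(centroids[0])
        # per-list metadata: (byte offset in file, vector id)
        self.offsets = {i: [] for i in range(len(centroids))}

    def add(self, vec_id, vec):
        # assign the vector to its nearest centroid, append it to the file
        lst = min(range(len(self.centroids)),
                  key=lambda i: l2(vec, self.centroids[i]))
        with open(self.path, "ab") as f:
            off = f.tell()
            f.write(struct.pack(f"{self.dim}f", *vec))
        self.offsets[lst].append((off, vec_id))

    def search(self, query, k=1, nprobe=1):
        # probe only the nprobe closest lists; read just those vectors
        probed = sorted(range(len(self.centroids)),
                        key=lambda i: l2(query, self.centroids[i]))[:nprobe]
        cands = []
        with open(self.path, "rb") as f:
            for lst in probed:
                for off, vec_id in self.offsets[lst]:
                    f.seek(off)
                    vec = struct.unpack(f"{self.dim}f", f.read(4 * self.dim))
                    cands.append((l2(query, vec), vec_id))
        return [vid for _, vid in sorted(cands)[:k]]
```

The key tradeoff the real systems tune is visible even here: memory holds only the centroids and offset table, while `nprobe` trades disk reads (latency) against recall. Production designs like DiskANN additionally cache compressed vectors in memory to prune disk reads further.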

For OpenSearch, given the potential benefit, we should investigate adding support. There are several approaches we could take:

  1. Integrate already existing functionality from faiss
  2. Add new engines to support new algorithms
  3. Enhance existing engines with new/modified algorithms
  4. Do nothing

For each approach, there are many factors to consider (integration feasibility with OpenSearch, support for auxiliary features such as filtering, etc.). At the moment, we do not have enough data to determine which route to take, so we will need to investigate existing approaches. This is a general META issue that tracks the project as a whole. All feedback and collaboration are welcome. If there are approaches you have tried that worked well (or not so well), please share your experiences!

Tasks

(Task list is subject to evolve)

  • Research approaches and compile a comparison report covering performance and feasibility with respect to OpenSearch [ISSUE: TBD]
  • (If applicable) Create an RFC suggesting an integration plan
  • (If applicable) Create a proof-of-concept integration with OpenSearch
  • (If applicable) Conduct a performance evaluation against OpenSearch
  • (If applicable) Productionize the code

Related Issues

Related issue: #758

@jmazanec15 jmazanec15 added the Features Introduces a new unit of functionality that satisfies a requirement label Sep 13, 2023
@navneet1v
Collaborator

@jmazanec15 Thanks for creating this issue. I broadly align on the high-level tasks, but would love to see more of a breakdown of these tasks with time-bound investigation and scope.

Is there any specific feedback you are looking for here?

@vamshin
Member

vamshin commented Sep 14, 2023

Please +1 if you wish to have this feature prioritized

@vamshin vamshin moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Sep 14, 2023
@jmazanec15
Member Author

Thanks @navneet1v

but would love to see more of a breakdown of these tasks with time-bound investigation and scope.

Sure, will evolve this as we go. Mainly put out the issue for tracking purposes.

Is there any specific feedback you are looking for here?

No specific feedback — I would like to see if anyone has used a disk-based approach and wants to share their opinion. This is a tracking issue, not an RFC. There will be more issues to come seeking more direct feedback.

@luyuncheng
Collaborator

+1 LGTM

@sam-herman

Adding the resources below; it looks like a similar issue came up in Lucene:
https://foojay.io/today/jvector-1-0/
apache/lucene#12615

credit to @reta for bringing those to my attention

@vamshin vamshin moved this from Backlog (Hot) to Now(This Quarter) in Vector Search RoadMap Oct 5, 2023
@vamshin vamshin moved this from Now(This Quarter) to Next (Next Quarter) in Vector Search RoadMap Nov 20, 2023
@vamshin vamshin moved this from Now(This Quarter) to Backlog (Hot) in Vector Search RoadMap Apr 1, 2024
@vamshin vamshin moved this from Backlog (Hot) to 2.17.0 in Vector Search RoadMap Jul 2, 2024
@github-project-automation github-project-automation bot moved this from 2.17.0 to ✅ Done in Vector Search RoadMap Sep 18, 2024
Projects
Status: Done
Development

No branches or pull requests

6 participants