[META] Support Hybrid Disk/Memory Algorithms #1134

Closed
5 tasks
jmazanec15 opened this issue Sep 13, 2023 · 5 comments
Labels
Features Introduces a new unit of functionality that satisfies a requirement v2.17.0

Comments

@jmazanec15
Member

Overview

One common problem with dense vector search is the amount of memory required to implement a solution that delivers high performance and recall. The problem is grounded in the fact that many ANN algorithms require fast, random access to the full floating-point representations of the vectors as well as auxiliary data structures (in-memory algorithms). HNSW is one example of such an algorithm. While there are techniques, like Product Quantization, that allow the memory footprint of the algorithm to be greatly reduced, it can be difficult to achieve high recall with them (refer to a blog I wrote a while ago comparing IVFPQ and HNSW for billion-scale search: https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/). Anecdotally, users typically want a recall above 0.9, where recall is defined as the fraction of returned neighbors that are ground-truth nearest neighbors. Similarly, techniques such as scalar quantization and dimensionality reduction can also be employed, but some users may want a more significant reduction in memory.
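To make the 0.9 target concrete, the recall definition above can be sketched in a few lines (function name and the example IDs are illustrative, not from any particular library):

```python
def recall_at_k(retrieved, ground_truth):
    """Fraction of retrieved neighbor ids that appear in the
    ground-truth nearest-neighbor set for the same query."""
    return len(set(retrieved) & set(ground_truth)) / len(retrieved)

# 9 of the 10 retrieved ids are among the true top-10 neighbors -> recall 0.9
recall_at_k(list(range(9)) + [42], list(range(10)))
```

In benchmarks this is averaged over many queries; a configuration is usually judged by that mean recall alongside latency and memory.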

Another approach that has gained a lot of interest is utilizing fast-access disks (e.g., SSDs) to extend memory. I will refer to these algorithms as hybrid algorithms. Hybrid algorithms keep some kind of small representation of the index in memory and the rest on disk. Then, during search, they intelligently select a few items to read from disk, with the goal of optimizing the query latency/working-set size/recall tradeoff. A few popular examples are:

  1. DiskANN — a hybrid graph-traversal approach — https://suhasjs.github.io/files/diskann_neurips19.pdf
  2. faiss IVF on disk — mapping the IVF algorithm to store vectors on disk — https://github.com/facebookresearch/faiss/blob/main/benchs/distributed_ondisk/README.md
  3. SPANN — a partitioning based approach similar to IVF — https://arxiv.org/pdf/2111.08566.pdf
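To illustrate the hybrid idea behind approaches 2 and 3 (not the actual faiss or SPANN implementations — this is a toy sketch with made-up names): only the coarse centroids stay in memory, while the full vectors live in a file, and a query seeks into that file for just the few lists it probes.

```python
import math
import struct

def l2(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class DiskIVF:
    """Toy IVF-style hybrid index: centroids in memory, vectors on disk."""

    def __init__(self, centroids, path):
        self.centroids = centroids   # small in-memory representation
        self.path = path             # file holding the full float vectors
        self.dim = len(centroids[0])
        # per-list metadata: (byte offset in file, vector id)
        self.offsets = {i: [] for i in range(len(centroids))}

    def add(self, vec_id, vec):
        # assign the vector to its nearest centroid, append it to the file
        lst = min(range(len(self.centroids)),
                  key=lambda i: l2(vec, self.centroids[i]))
        with open(self.path, "ab") as f:
            off = f.tell()
            f.write(struct.pack(f"{self.dim}f", *vec))
        self.offsets[lst].append((off, vec_id))

    def search(self, query, k=1, nprobe=1):
        # probe only the nprobe closest lists; read just those vectors
        probed = sorted(range(len(self.centroids)),
                        key=lambda i: l2(query, self.centroids[i]))[:nprobe]
        cands = []
        with open(self.path, "rb") as f:
            for lst in probed:
                for off, vec_id in self.offsets[lst]:
                    f.seek(off)
                    vec = struct.unpack(f"{self.dim}f", f.read(4 * self.dim))
                    cands.append((l2(query, vec), vec_id))
        return [vid for _, vid in sorted(cands)[:k]]
```

The key tradeoff the real systems tune is visible even here: memory holds only the centroids and offset table, while `nprobe` trades disk reads (latency) against recall. Production designs like DiskANN additionally cache compressed vectors in memory to prune disk reads further.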

For OpenSearch, given the potential benefit, we should investigate adding support. There are several approaches we could take:

  1. Integrate already existing functionality from faiss
  2. Add new engines to support new algorithms
  3. Enhance existing engines with new/modified algorithms
  4. Do nothing

For each approach, there are many factors to consider (integration feasibility with OpenSearch, support for auxiliary features such as filtering, etc.). At the moment, we do not have enough data to determine which route to take, so we will need to investigate existing approaches. This is a general META issue that tracks the project as a whole. All feedback and collaboration are welcome. If there are approaches you have tried that worked well (or not so well), please share your experiences!

Tasks

(Task list is subject to evolve)

  • Research approaches and compile a comparison report covering performance and feasibility with respect to OpenSearch [ISSUE: TBD]
  • (If applicable) Create an RFC suggesting an integration plan
  • (If applicable) Create a proof-of-concept integration with OpenSearch
  • (If applicable) Conduct a performance evaluation against OpenSearch
  • (If applicable) Productionize the code

Related Issues

Related issue: #758

@jmazanec15 jmazanec15 added the Features Introduces a new unit of functionality that satisfies a requirement label Sep 13, 2023
@navneet1v
Collaborator

@jmazanec15 Thanks for creating this issue. I broadly align on the high-level tasks, but would love to see more of a breakdown of these tasks with time-bound investigation and scope.

Is there any specific feedback you are looking for here?

@vamshin
Member

vamshin commented Sep 14, 2023

Please +1 if you wish to have this feature prioritized

@vamshin vamshin moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Sep 14, 2023
@jmazanec15
Member Author

Thanks @navneet1v

but would love to see more of a breakdown of these tasks with time-bound investigation and scope.

Sure, will evolve this as we go. Mainly put out the issue for tracking purposes.

Is there any specific feedback you are looking for here?

No specific feedback — I would like to see if anyone has used a disk-based approach and wants to share their opinion. This is a tracking issue, not an RFC. There will be more issues to come seeking more direct feedback.

@luyuncheng
Collaborator

+1 LGTM

@sam-herman

Adding the resources below; it looks like a similar issue came up in Lucene:
https://foojay.io/today/jvector-1-0/
apache/lucene#12615

credit to @reta for bringing those to my attention

@vamshin vamshin moved this from Backlog (Hot) to Now(This Quarter) in Vector Search RoadMap Oct 5, 2023
@vamshin vamshin moved this from Now(This Quarter) to Next (Next Quarter) in Vector Search RoadMap Nov 20, 2023
@vamshin vamshin moved this from Now(This Quarter) to Backlog (Hot) in Vector Search RoadMap Apr 1, 2024
@vamshin vamshin moved this from Backlog (Hot) to 2.17.0 in Vector Search RoadMap Jul 2, 2024
@github-project-automation github-project-automation bot moved this from 2.17.0 to ✅ Done in Vector Search RoadMap Sep 18, 2024
Projects
Status: Done
Development

No branches or pull requests

6 participants