[RFC] Optimized Disk-Based Vector Search #1779
Comments
Quantized ANN in memory with full precision on disk: this optimization looks like DiskANN, and it would reduce memory as much as possible. If we want to apply this to the graph algorithm, I think we would need to change the IndexFlat storage to be disk-based. It would also take many IOPS (thousands of IOPS per query), so we could restructure the storage layout on disk.
@luyuncheng I think DiskANN may be worth exploring in the future. However, I am a bit skeptical about the effectiveness of the approach in higher dimensions, which are common for many existing embedding models. For instance, for embeddings with 1024 dimensions, the full-precision vectors would consume 4 KB of space on disk. At this size, nothing else would fit in the block that is read from disk, so the locality in reads offered by DiskANN would not be helpful. Thus, the number of IOPS would not decrease unless the number of candidates being re-scored was significantly less than the number fetched during DiskANN's graph traversal.
What did you mean by this?
I misunderstood the RFC. I had supposed that we could design disk-based storage in Faiss and change the IndexFlat storage, but I realized we just use the existing flat storage on disk for re-scoring.
@luyuncheng
Introduction
This document outlines a high level proposal for providing efficient, yet easy to use k-NN in OpenSearch in low-memory environments. Many more details to come in individual component RFCs.
Problem Statement
Vector search has become very popular in recent years due to advancements in LLMs. Typically, embedding models such as Amazon Titan, Cohere, or OpenAI output fp32 embeddings in a high-dimensional space (>= 768). These embeddings are then ingested into vector search systems to provide semantic search. While semantic search can provide major benefits in search quality, it has historically come at a much higher cost than full-text search. This is mainly due to the size of the embeddings produced, along with the high in-memory requirements of traditional Approximate Nearest Neighbor algorithms, such as HNSW.
With that, a major user ask is cost reduction, and many new solutions/algorithms are being released to reduce cost. General strategies include leveraging disk to extend memory and/or using quantization techniques to reduce the overall memory footprint per vector. Currently, OpenSearch offers the following quantization techniques at the 2x, 4x, and 8x+ compression levels (assuming input vectors are fp32 d-dimensional vectors, 2x compression stores each vector in 16*d bits, and so on; a rough worked example follows the list):
2. [lucene] byte/8-bit Scalar Quantization (coming soon)
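To make the compression arithmetic concrete, here is a rough worked example (our own numbers, for illustration only) of per-vector storage for a 768-dimensional fp32 vector, where full precision takes 32 × 768 = 24,576 bits (3 KB) and the 32x level corresponds to the 1-bit-per-dimension binary quantization discussed later in this doc:

```latex
% Per-vector storage at each compression level, d = 768, fp32 input (illustrative)
\begin{aligned}
\text{2x (16 bits/dim):}  \quad & 16 \times 768 = 12{,}288 \text{ bits} \approx 1.5\ \text{KB} \\
\text{4x (8 bits/dim):}   \quad & 8  \times 768 = 6{,}144  \text{ bits} = 768\ \text{B} \\
\text{8x (4 bits/dim):}   \quad & 4  \times 768 = 3{,}072  \text{ bits} = 384\ \text{B} \\
\text{32x (1 bit/dim):}   \quad & 1  \times 768 = 768      \text{ bits} = 96\ \text{B}
\end{aligned}
```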
Many users want 8x+ compression, meaning that the current scalar quantization techniques are not viable options for them. Product quantization can be employed to meet this requirement. However, the current product quantization implementation has a few drawbacks:
Embedding models have also recognized the issue around memory and are starting to generate embeddings that use byte-per-dimension and/or bit-per-dimension representations, while maintaining competitive search quality. See the Cohere blog for byte/bit embeddings.
In OpenSearch, we currently support byte vectors through the Lucene engine. We are planning on supporting byte vectors in Faiss in 2.16 (see #1659). Binary vector support has not yet been added. This has also been a highly sought-after feature for some time, with use cases outside of semantic search as well (see #81).
That being said, in order to continue to provide optimal vector search support, OpenSearch should:
Requirements
Functional
Non-functional
Requirement #1 is covered in #1767. We will focus on Requirement #2 for the remainder of this doc.
Proposed Solution
At a high level, we propose a flexible, easy-to-use, two-phased search to give performant k-NN search in low-memory environments. The two-phase search works by executing an initial k-NN search over an in-memory quantized index, with the number of results > k, and then re-scoring those results by re-computing the distances with full-precision vectors. In the second phase, the full-precision vectors are lazily loaded from disk. In short, this approach lets users reduce their memory footprint and control the tradeoff between recall and search latency.
Ingestion
Two-phase Search
In fact, it is possible to do a two-phased workflow like this in OpenSearch today. See Appendix A: Executing Two-Phased Search with Current Functionality. To evaluate the feasibility of this approach, we ran several PoC experiments using product quantization and re-scoring. See Appendix B: Baseline Re-score Experiments for details. From the experiments, we found that this approach can substantially increase the recall of the k-NN search on quantized vectors, while still providing low-latency vector search in low-memory environments. This implementation does have a couple of drawbacks, though:
We want to provide users with an improved experience that does not require a separate training stage and has a simple, intuitive interface. To do this, we are going to provide quantization during ingestion, improved full-precision re-scoring and a refreshed interface.
Proposed Interface
Disclaimer: the interface is subject to change, and feedback is encouraged and very valuable. We want to provide an intuitive user experience that is simple out of the box, with the flexibility for users to fine-tune for their performance needs. This is the general vision; more details will come in future RFCs.
Out of the box, we propose a simple interface that allows users to specify their workload intent, which we will then take and optimize for. The end-to-end workflow will look something like this:
Index Creation
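As a rough sketch of the intended out-of-the-box simplicity (the mode parameter name follows this RFC, while the value on_disk, the index/field names, and the dimension are placeholders of ours, not finalized API), index creation might look like:

```
// Hypothetical example: parameter values are placeholders, not finalized API
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768,
        "mode": "on_disk"
      }
    }
  }
}
```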
Parameters:
Ingestion (no change)
Search (no change)
The long-term vision will then be to consistently improve the out-of-the-box experience for a given mode by making more intelligent configuration decisions.
Fine tuning
As a tenet, we will never introduce functionality for improved tuning via mode that cannot be overridden by a parameter; thus, we will not step on any power users' toes. As a consequence, we will need to expose the set of defaults/constraints for each mode in documentation so that users who need to fine-tune know where to start.
Index Creation
To support intuitive quantization, we will add a new mapping parameter called compression_level.
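For illustration, a mapping that pins the compression level might look like the sketch below (the accepted values, such as "32x", and whether compression_level composes with mode are assumptions on our part):

```
// Hypothetical example: "32x" and the combination with "mode" are assumptions
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768,
        "mode": "on_disk",
        "compression_level": "32x"
      }
    }
  }
}
```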
Parameters:
Ingestion (no change)
Search
To support two-phased search, we will add a new parameter called rescore_factor.
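For illustration, a query using this parameter might look like the sketch below. The name rescore_factor follows this RFC; its placement inside the knn query body and the semantics of the value (fetch roughly rescore_factor * k candidates from the quantized index, then re-score them with full precision) are assumptions of ours, and the query vector is truncated for brevity:

```
// Hypothetical example: placement and semantics of rescore_factor are assumptions
GET /my-vector-index/_search
{
  "size": 10,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [0.1, 0.2, 0.3],
        "k": 10,
        "rescore_factor": 2.0
      }
    }
  }
}
```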
Parameters:
Quantization Framework
One of the issues with our implementation of PQ is that it requires a training stage and model management. This is cumbersome for users. So, to provide an improved experience, we need to give users the ability to achieve 8x+ compression without a training stage.
To accomplish this, we are going to build a quantization framework that will configure the quantizer during ingestion. It will be responsible for sampling incoming data and performing analysis to determine how to best quantize the data.
Initially, we will onboard binary quantization into this framework. Binary quantization has shown tremendous success (see the blogs in Appendix C: References) in achieving a high compression rate while sacrificing minimal search quality. So, we think it is a good place to start. For this, we will leverage the work done for binary vector support (#1767). We will continue to add more quantizers as we go.
Re-scoring
With the quantization framework, we will still store the full-precision vectors in a flat file structure. With this, we will implement the functionality to take the results from the search over the quantized index and re-score them with higher precision using vectors read from a secondary data store, such as disk.
Alternatives Considered
Supporting new algorithm, such as DiskANN or SPANN
We could integrate another library's implementation of an algorithm like SPANN or DiskANN directly into the plugin. However, adding another library would come with a high maintenance cost, as well as a high cost to integrate it in a way that takes full advantage of features such as remote storage.
Right now, there is not enough evidence to show that such algorithms/implementations would outperform an optimized two-phased search over an HNSW or IVF structure with quantization (see the PoC done by the Lucene community: apache/lucene#12615).
While we are not adding support for new algorithms/libraries now, this feature does not close the door on them; this is a route we may take in the future. With this in mind, we will be building the components in a generic manner so that they can be re-used.
Enhance Product Quantization to avoid training step
Product quantization is a powerful technique for reducing memory consumption. However, it has an upfront requirement for training that leads to a more complex user experience that we want to avoid. While it may be possible to abstract this step into the ingestion workflow, it would result in sub-optimal ingestion performance because the quantizer could not be shared across shards and would potentially need to be recreated per segment.
Product quantization should have better memory vs. recall tradeoffs than binary quantization. However, due to binary quantization's ingestion efficiency and encouraging empirical results, we are focusing on binary quantization for now.
In the future, we could onboard product quantization into the online training framework so it can be used as well. In addition, re-scoring components will be available for existing product quantization indices.
Appendix A: Executing Two-Phased Search with Current Functionality
It is possible to run a two-phased k-NN search in OpenSearch today with the script scoring functionality (ref). All you have to do is build an index with a particular quantizer and then add a k-NN script score query as part of the re-score block, which will re-score the top window_size results with full-precision vectors after the search over the shard finishes. For example:
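The original example is not reproduced here, but a request along these lines can be built today with the k-NN plugin's exact-search script score (knn_score) inside a standard rescore block; setting query_weight to 0 makes the final score come entirely from the full-precision distances. Index/field names, k, window_size, the space type, and the (truncated) query vector below are illustrative:

```
// Illustrative request: names, sizes, and space type are placeholders
GET /my-quantized-index/_search
{
  "size": 10,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [0.1, 0.2, 0.3],
        "k": 100
      }
    }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
              "field": "my_vector",
              "query_value": [0.1, 0.2, 0.3],
              "space_type": "innerproduct"
            }
          }
        }
      },
      "query_weight": 0.0,
      "rescore_query_weight": 1.0
    }
  }
}
```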
This is what was done to run the experiments for Appendix B: Baseline Re-score Experiments.
Appendix B: Baseline Re-score Experiments
In order to evaluate the feasibility of the disk-based two-phase search approach, we executed a series of tests with the OSB cohere-10M (768-d, inner product) vector data set. To test the system in a memory-constrained environment, we used a single-node OpenSearch Docker configuration where the container was resource-restricted (see here for details). We ran the experiments on r6gd.12xlarge (SSD/instance store) AWS instances. To focus on off-heap memory consumption, we fixed the heap at 32 GB.
We defined a couple of different memory environments. The baseline was determined from the memory recommended for full-precision HNSW graphs:
We ran the search workload with differing numbers of candidates and re-scoring window sizes under the memory constraints. This led to the following results for IVFPQ:
Appendix C: References
Several blogs on the success of binary quantization: