[RFC]: Boosting OpenSearch Vector Engine Performance using GPUs #2293
Labels
enhancement
indexing
indexing-improvements
Introduction
As the popularity and adoption of generative AI continues to grow, the demand for high-performance vector search databases has skyrocketed. Customers are now dealing with vector datasets that have scaled from hundreds of millions to tens of billions of vectors, and they require these vector databases to handle continuous updates and queries at massive scale. Building vector indices at such massive scales takes significant time and it is imperative that we significantly improve the performance and scalability of the OpenSearch Vector Engine to handle such workloads.
This document outlines a strategic plan to leverage GPU acceleration to dramatically boost the indexing performance of the OpenSearch vector engine. By harnessing the massive parallelism of GPUs, we aim to achieve faster index build times compared to CPU-based solutions, all while maintaining a compelling price-performance ratio. This will enable OpenSearch to meet the growing demands of our customers.
The key elements of this RFC include:
By implementing this GPU-powered approach, we will significantly enhance the indexing capabilities of the OpenSearch vector engine and better position it to meet the evolving needs of our customers in the generative AI era.
Opportunity
Index Build Acceleration
In 2024, the Vector Engine team made significant investments to improve the performance of the OpenSearch vector engine. This includes adding AVX512 SIMD support, implementing segment replication, transitioning to the more efficient KNNVectorsFormat, employing iterative graph builds during merges to reduce memory footprint, and introducing greedy graph builds. These optimizations have greatly improved the overall performance and stability of the vector engine.
However, we believe we have now reached the limits of what can be achieved with CPU-based hardware alone. The next logical step is to leverage GPUs and other purpose-built hardware to further accelerate vector indexing. Our experiments with Nvidia's GPU-accelerated vector indexing algorithm, CAGRA, have shown very promising results, achieving 10-30x faster index build times compared to CPU-based solutions while maintaining a compelling price-performance ratio. A snapshot of the results is included in Appendix C.
The bar graph illustrates the substantial percentage improvements in index build time achieved by leveraging GPU-based infrastructure compared to CPU-based solutions for different datasets.
These performance gains are critical as our customers continue to scale their vector workloads into the billions of vectors. Faster indexing will allow them to more efficiently build and manage their vector indices.
Feasibility Study
To ensure full compatibility of GPU-created indices with CPU machines, multiple experiments were conducted. One crucial aspect that was validated is the ability to search a Faiss index built on a GPU machine (with the CAGRA algorithm) using a CPU machine. The performance benchmarks mentioned in the previous section were set up so that the CAGRA index was built on a GPU with Faiss, converted to a CPU-compatible index using Faiss, and then searched using Faiss. This proves that we can build a Faiss index on a GPU machine and search it on a CPU machine. The CAGRA graph, before serialization, will be converted to an HNSW-based graph (IndexHNSWCagra), which can be searched on a CPU. This conversion is supported internally by Faiss and is maintained and contributed by the Nvidia team. The benchmarking code and the code to convert a CAGRA index to an HNSW index are available in the VectorSearchForge repository.
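For illustration, here is a minimal sketch of this build-and-convert flow, assuming a Faiss build with GPU/CAGRA support (e.g., a recent faiss-gpu build with RAFT/cuVS enabled); the dataset size, dimension, and HNSW M value are placeholder assumptions:

```python
import numpy as np
import faiss

# Placeholder dataset: 100k random 128-dimensional vectors.
d = 128
xb = np.random.random((100_000, d)).astype("float32")

# Build the CAGRA graph on the GPU.
res = faiss.StandardGpuResources()
config = faiss.GpuIndexCagraConfig()
gpu_index = faiss.GpuIndexCagra(res, d, faiss.METRIC_L2, config)
gpu_index.train(xb)  # graph construction happens here, on the GPU

# Convert to an HNSW-based index (IndexHNSWCagra) that a CPU-only
# machine can search, then serialize it with the standard Faiss writer.
cpu_index = faiss.IndexHNSWCagra(d, 16, faiss.METRIC_L2)
gpu_index.copyTo(cpu_index)
faiss.write_index(cpu_index, "cagra_as_hnsw.faiss")

# Any CPU machine can now load and search the serialized index.
index = faiss.read_index("cagra_as_hnsw.faiss")
distances, ids = index.search(xb[:5], 10)
```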
Proposed Architecture
Below is the high-level, simplified flow for the improved vector index build. The overall flow goes like this:
Note: This is an oversimplified architecture and provides a 10K ft. view of the overall system.
Components Definition
Pros:
Cons:
Alternatives Considered
Alternative 1: Hosting the IndexBuild Service on GPU-based Data Node Instances in the OpenSearch Cluster
The idea here is that, rather than having a separate service/component outside the OpenSearch cluster, we have GPU-based nodes in the cluster itself that can be used to build the index. On exploring this idea further, GPU-based instances can come in two forms:
Alternative 2: Using Node-to-Node Communication to Transfer the Vectors from Data Nodes to the IndexBuildService
The idea here is that, rather than using intermediate storage, we directly transfer the vectors from the data nodes to the index build service on the GPU machines, where the actual index build happens. This completely removes the need to build support for different remote stores.
Pros:
Cons:
Vector Index Build Component
Below is a sneak peek into the vector index build node and its components. These nodes will be responsible for building the vector index, and the Vector Engine will be calling/interacting with them to build the index. The OpenSearch Project will provide a Docker image containing the different components, which can be deployed on-prem or in the cloud to create/run/manage the Vector Index Build Service. Below is a very high-level diagram and flow of how the create index API flow will happen.
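To make the interaction concrete, here is a hypothetical client-side sketch of that flow. The service address, endpoint paths, and payload fields are illustrative assumptions only, not a finalized API; the actual contract will be defined in the component-level design:

```python
import time
import requests

BUILD_SERVICE = "http://index-build-service:8080"  # assumed address

# 1. The data node has already uploaded the segment's vectors to the
#    remote store (illustrative path below).
job = requests.post(
    f"{BUILD_SERVICE}/jobs",
    json={
        "vector_path": "s3://vectors-bucket/segment-123/vectors.bin",
        "dimension": 1536,
        "engine": "faiss",
        "algorithm": "cagra",
    },
).json()

# 2. Poll until the GPU build (including the CAGRA -> HNSW conversion)
#    completes or fails.
while True:
    status = requests.get(f"{BUILD_SERVICE}/jobs/{job['job_id']}").json()
    if status["state"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

# 3. On success, the service has written the CPU-searchable index back to
#    the remote store; the data node downloads it into the segment.
print(status.get("index_path"))
```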
Components Definition
Below are the high-level responsibilities and purposes of the components added in the above diagram. These are still high-level roles and responsibilities; during the detailed low-level design we will further explore and cement them.
Pros:
Cons:
Alternative Considered
The alternatives below are mainly from the standpoint of how the above software is distributed. The discussion of the internal components will happen as part of a separate RFC.
Alternative 1: Provide the whole software as a zip/tar rather than a container
The idea here is that, rather than containerizing the service, we directly provide .zip/.tar archives containing the software, just like the OpenSearch tar.gz distribution.
Pros:
Cons:
Next steps
Overall, the high-level design is divided among various RFCs/issues. The division will look like this:
Security
Since there are multiple moving pieces in the design, details on the security aspects/certifications will be covered in each component-level design/RFC.
FAQs
Why are we focusing on improving indexing time and not necessarily search time?
The acceleration provided by GPUs comes especially when there is a high number of vectors, whether for indexing or for search. Since indexing always buffers vectors in segments and then creates the graph, indexing can take huge advantage of GPUs.
For the search use case, on the other hand, we get just one query vector at a time, and there is currently no capability in OpenSearch to batch/buffer query vectors at the segment/shard level. So, at least initially, we want to introduce GPU acceleration for indexing. We also believe that when batch vector search queries become a use case, we can start leveraging GPUs for search as well.
Why did we choose GPUs rather than purpose-built vector search hardware?
One of the biggest reasons is that Nvidia's hardware is popular and already available in most clouds. Another is that Nvidia's k-NN algorithm is open source and compatible with CPUs and Faiss.
How can a GPU-based Faiss index (built using CAGRA) be searched on a CPU machine?
The CAGRA graph itself cannot be searched on a CPU machine. That is why, once the index has been built on the GPU machine and before serialization, the IndexBuild Service converts the CAGRA index to an IndexHNSWCagra index, which can then be searched on a CPU (see the sketch in the Feasibility Study section).
Does CAGRA support fp16 and byte quantization, along with a binary index for disk-based vector search?
We had a discussion with the Nvidia team and got confirmation that CAGRA supports fp16 and byte quantization. A binary index is also available with CAGRA. But as of Tuesday, November 5, this support is only present in the RAFT library and not yet exposed via Faiss. The Nvidia team will be working on enabling support for these quantizations in Faiss.
How much time will it take to upload the vectors to the remote store?
As of the writing of this RFC, actual upload times have not been measured. But if we look at the raw numbers for build time improvement (see Appendix C) for 30GB of raw vectors (OpenAI 1536D), we see a difference of close to 90 minutes. So even if we spend 5-10 minutes on uploads and downloads (~250 Megabit/sec), we would still save a huge amount of time. Also, when the index is built on the GPU, only the HNSW structure will be uploaded to S3, which will further reduce the total upload time of the index (see this code for how we will do this).
Appendix
Appendix A: Reference Documents
Appendix B
Current Architecture
The OpenSearch k-NN plugin supports three different types of engines to perform Approximate Nearest Neighbor (ANN) search: Lucene (Java implementation), Faiss (C++ implementation), and Nmslib (C++ implementation). These engines serve as an abstraction layer over the downstream libraries used to implement the nearest neighbor search functionality.
Specifically, the plugin supports the following algorithms within each engine:
At a high level, OpenSearch index data is stored in shards, which are essentially Lucene indices. Each shard is further divided into immutable segments, created during the indexing process.
For indices with k-NN fields, the k-NN plugin leverages the same underlying Lucene architecture. During segment creation, in addition to the standard Lucene data structures (e.g., FST, BKDs, DocValues), the plugin also generates the necessary vector-specific data structures for each vector field. These vector-related files are then written to disk as part of the segment files and managed by Lucene.
When performing ANN search, the plugin loads the relevant vector data structures from disk into memory (but not the JVM heap) and uses the respective library (Lucene, Faiss, or Nmslib) to execute the search operation.
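For context, declaring a k-NN vector field looks like the following (a minimal sketch assuming a local, security-disabled cluster; the index name, dimension, and HNSW parameters are placeholders). The engine choice in the mapping selects which downstream library builds and searches the vector data structures:

```python
import requests

# Minimal index mapping with a Faiss-backed HNSW vector field.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "my_vector": {
                "type": "knn_vector",
                "dimension": 128,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {"m": 16, "ef_construction": 100},
                },
            }
        }
    },
}
requests.put("http://localhost:9200/my-vector-index", json=index_body)
```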
Appendix C
Comparison between g5.2xlarge and r6g.2xlarge
While maintaining similar recall and search time, the table below provides the index build time improvement numbers.
Legends:
CPU Build Time: Total time, in seconds, required to build and serialize the graph on the CPU
GPU Build Time: Total time, in seconds, required to build the graph on the GPU, convert it to a CPU-compatible graph, and serialize it
Observations
Appendix D
Prices for GPU and CPU machines.
EC2 machine prices