-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENHANCEMENT] Partial vector search datastrcuture loading in memory for doing search #1693
Comments
Seems plausible. My initial opinion is to map the portion of the faiss file into memory that has the flat index and use that. In regards to binary doc values vs lucene vectors values vs native engine vectors, I think we would end up writing a lot of glue code in order to get binary doc values/lucene vectors to work, so may be good to wait on this. In long run, I think each engine should implement Lucene's KnnVectorsFormat. Then, a particular engine can decide how it wants to handle its vectors - i.e. delegate to lucene flatformat or store things on its own. With this, we could support both fp16, binary vectors, etc. In terms of dynamic search parameters, I think these would be able to be passed through KnnCollector. |
i have been tried rewrite an new IndexFlat in faiss as new VectorStore without read all vector into memory. but the iops and latency are not good. also i test Lucene's KnnVectorsFormat performance dependents on page cache. |
if we have an jni access the KnnVectorsFormat we can extends i am wondering which is better: 1. implements an new IndexFlat vector store in Faiss ( also can easy support fp16 wich SQuantizer)or 2. implements jni interface to Lucene#KnnVectorsFormat |
BTW i wanna talk about diskann. and it shows great memory reduce, and less disk round trips (which iops and io throughput less than KnnVectorsFormat) |
@jmazanec15 and @luyuncheng thanks for all your comments. I created this issue to discuss on the specific part around making our current HNSW disk based with Faiss without even bringing any new algorithm.
As you know there is only 1 file. So either we need to build something in faiss that can read HNSW graph separately and then map vectors from same file partially in memory, or we can actually during serialization put vectors and HNSW structures in separate files. I don't have a strong opinion on this. whatever is feasible my vote is on that. Whether we should KNNVectorFormat/BinaryDocValues/FlatIndices, my problem with all 3 of them is they are not optimized for reads as HNSW algo progresses. So, my thinking would be to take some ideas from DiskANN implementation and put vectors and its neighbors side by side. This can reduce iops.
@jmazanec15 on this I think you mean an interface like KNNVectorsFormat where serialization/de-serialization is handled by Engine and how to read from the file. right?
@luyuncheng I think we are aware of diskANN and what benefits it bring in. There are multiple implementations diskANN too like JVector, MS DiskANN etc. The question which we need to ans is which algorithm we want to main in this plugin for diskANN ? |
I see. Yes, then I think we need to focus on hnsw disk access patterns and what we can do (i.e. vector ordering/caching strategy). As @luyuncheng mentioned, I have also seen performance very slow when using lucene's hnsw in memory constrained environment and its because of excessive disk reads - Im guessing faiss will have same issue. So Im not sure itd be worth adding disk-access support before figuring out how to make it more efficient.
Im saying that each engine (faiss, nmslib) should implement KnnVectorsFormat. Then, internally they can decide how they want to handle the vectors (i.e. use Lucene's flatvectors writers and consume it or use the engines to create files directly). Doing this, I think will make it easier to extend and experiment. Basically, extension point is a KnnVectorsFormat, an engine name, and a structure of search/index/train params. It will also logically isolate engines from each other. |
Yeah, I think its the same thing which I am thinking. But I don't want to rely on an interface from Lucene as they can change those interfaces without considering things which we would have done in the plugin. Hence I like the idea of having a common interface, but not exactly extending KnnVectorsFormat. KNNEngine interface has some promise, but it is bit overloaded. We can think over it. |
@navneet1v sounds good - maybe we can come up with a proposal |
Yes exactly. The way I want to break this problem is in 2 part:
Yeah I think the reason why it happens is the FlatIndex and KNNVectorFormats are just plain list of vectors without any optimization. |
Sure I can come up with a proposal. |
After resolved this issue (#1951), will present few ideas. We can discuss pros and cons from there. |
Currently drafting the first edition of design document on this. |
Description
Currently k-NN plugin supports 3 engines, Nmslib, Faiss and Lucene. The first 2 being native engines requires the whole vector related data structures(aka HNSW graphs here) to be loaded into memory for these engines to work. For large workloads the cost can soon become high if some quantization techniques are not applied.
If we look closely a HNSW graph mainly consist of below things:
From the above items, main memory is used by these full precision vectors(4bytes * number of vectors * number of dim).
The way Faiss stores these vectors is in a Flat Index and during serialization and deserialization these vectors are written and read to/from the file and put in the main memory which increases the memory consumption
Proposal
The proposal I am putting here is we can avoid loading this Flat Index into memory and as required we can read full precision vectors from disk on demand/or from cache(limited sized cache). The cache can be tiered cache too. Ref: opensearch-project/OpenSearch#10024
This will definitely have an impact on the search latency, but overall it will reduce the cost.
More Details Needed
What alternatives have you considered?
Given that Lucene doesn't load all the vectors in memory but mmap the .vec file, Lucene engine is able to work with low memory footprint, but it takes a hit on the Latency.
But using Lucene Engine posses other challenges like no support of fp16, dynamic search parameters, SIMD instruction set etc.
The text was updated successfully, but these errors were encountered: