Investigate migrating custom codec from BinaryDocValuesFormat to KnnVectorsFormat #1087
Overview

When we added KNN support using the native engine libraries (Faiss, NMSLib), there was no KnnVectorsFormat available in the Lucene library. So we used BinaryDocValuesFormat to store the vector data and built the native engine KNN index while indexing the doc values. For searching, we implemented our own query that uses the native engine index for ANN and the doc value data for exact search. Now that KnnVectorsFormat is available in the Lucene library, we are exploring an opportunity to move from DocValuesFormat to KnnVectorsFormat to see whether there is any benefit in doing so. With KnnVectorsFormat, we cannot simply store the vector data and add the native engine index alongside it as we did with DocValuesFormat, because that would also create the Lucene vector index, which would serve no purpose and only consume resources. Therefore, we need to override all of the implementations extending the KnnVectorsFormat class for both indexing and querying. After comparing the pros and cons of migration, we want to hold off on it until we have more data points to support the migration, given the effort it requires.

Current Behavior

A few points to note:
Indexing

Searching

For the native engine, we have our own KNNQuery class. KNNQuery has its own logic to switch to exact search. For exact search to work, the doc value is needed; ANN search works without it. The native engine library supports vector data access by id, but we need to test how efficient that is compared to doc values.

Compound file handling

Scoring script

Others

Migration detail

Mapper

We need to create a FieldType using the setVectorAttributes method of FieldType, similar to KnnFloatVectorField and KnnByteVectorField, so that the field type is treated as a vector type. To create the doc value, the code can be shared between KNNVectorFieldMapper and LuceneFieldMapper.
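As a rough illustration of this Mapper change, here is a minimal sketch of building such a FieldType, assuming Lucene 9.5-era APIs. Only FieldType.setVectorAttributes, VectorEncoding, and VectorSimilarityFunction are real Lucene APIs here; the helper class name, method name, and the choice of similarity are illustrative assumptions, not the plugin's actual code.

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;

// Hypothetical helper: builds a FieldType whose vector attributes are set, so that the
// indexing chain routes the field to the per-field KnnVectorsFormat instead of doc values.
public final class NativeEngineFieldTypeFactory {

    public static FieldType vectorFieldType(int dimension, VectorSimilarityFunction similarity) {
        FieldType type = new FieldType();
        // Marks the field as a float32 vector field of the given dimension, similar to what
        // KnnFloatVectorField does internally for its own FieldType.
        type.setVectorAttributes(dimension, VectorEncoding.FLOAT32, similarity);
        type.freeze();
        return type;
    }
}
```

The actual mapper would then pair this FieldType with the parsed vector value when building the Lucene document.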
Query

We might either extend AbstractKnnVectorQuery or just use KnnFloatVectorQuery for the native engine.
However, if we want to migrate the exact search implementation later, we can extend AbstractKnnVectorQuery and override it. The createVectorScorer method is used only inside exactSearch, so if we override exactSearch we do not need to implement createVectorScorer.
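For the simpler of the two options, a minimal sketch of issuing a KnnFloatVectorQuery directly is below. The field name "my_vector", the index path argument, and k=10 are illustrative assumptions; keeping the plugin's own exact-search fallback would still require extending AbstractKnnVectorQuery and overriding exactSearch as described above.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class KnnFloatVectorQueryExample {
    public static void main(String[] args) throws Exception {
        float[] queryVector = {0.1f, 0.2f, 0.3f, 0.4f}; // illustrative query vector

        try (Directory dir = FSDirectory.open(Paths.get(args[0]));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // ANN search against a field that was indexed through a KnnVectorsFormat;
            // "my_vector" and k=10 are placeholders, not plugin constants.
            TopDocs topDocs = searcher.search(new KnnFloatVectorQuery("my_vector", queryVector, 10), 10);
            System.out.println("total hits: " + topDocs.totalHits);
        }
    }
}
```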
Codec

We need to define our own NativeEngineVectorsFormat extending KnnVectorsFormat. In BasePerFieldKnnVectorsFormat, we return the appropriate vector format based on the engine type: NativeEngineVectorsFormat or Lucene95HnswVectorsFormat. Here, we also need to implement a KnnVectorsWriter and KnnVectorsReader for the native engine.
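A minimal sketch of what such a format could look like is below, assuming Lucene 9.5-era APIs. To keep the sketch compilable it simply delegates to Lucene95HnswVectorsFormat; the real implementation would instead return a native-engine KnnVectorsWriter/KnnVectorsReader that build and open the Faiss/NMSLib index, which are not shown here.

```java
import java.io.IOException;

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Hypothetical skeleton of the proposed format. The delegate exists only so this sketch compiles;
// the actual writer would build the native (Faiss/NMSLib) index at flush/merge time and the
// reader would open it for ANN search while exposing vector values for exact search.
public class NativeEngineVectorsFormat extends KnnVectorsFormat {

    private final KnnVectorsFormat delegate = new Lucene95HnswVectorsFormat();

    public NativeEngineVectorsFormat() {
        super("NativeEngineVectorsFormat"); // format name used for SPI lookup when reading segments
    }

    @Override
    public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state); // placeholder for a native-engine KnnVectorsWriter
    }

    @Override
    public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state); // placeholder for a native-engine KnnVectorsReader
    }
}
```

A real format would also need to be registered through Lucene's SPI (a META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat entry) so that segments written with it can be read back.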
Pros
Cons
Other points
Will close for now as investigation is complete. We may revisit this topic later.
I'm going to keep this issue open. I think we need to do this work at some point. A couple of issues where it could potentially be helpful:
Looking at the cons: (2) and (3) shouldn't be listed. It's possible to just disable doc values for this field and use the vectors directly. Also, for (2), we could implement a wrapper around the native engine storage of vectors and not store by default. (4) I don't think this will be the case (after our custom codec is deprecated). (6) I think this may be true, but in the long run we will be able to reduce effort. Additionally, our codec right now has not been updated in a while and the versioning is not scalable. This might help make it more maintainable.
Hi @jmazanec15 @navneet1v @heemin32, I have an idea about this topic. In some scenarios, we want to:
So I propose using the doc_values field for the vector fields, like:
And for this I rewrite the
Optimize result: For the continued deep dive, I think I can create a new issue and PR for the code and report the details.
@luyuncheng Not storing vector fields in _source is always an option for the user. But this comes with its own limitations:
If we can support this kind of retrieval then it will be really awesome, because for users who don't have re-indexing use cases, update by query can still be used. When I tried running a similar query some time back, I got an error that the field doesn't support SortedBinaryDocValues. So I am wondering whether we might just need to ensure that the binary doc values implement the sorted doc values interface.
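For reference, a generic sketch of pulling a vector back out of BinaryDocValues at the leaf level is below. It assumes the bytes are consecutive little-endian float32 values, which is an illustrative assumption and not necessarily the plugin's actual serialization; the class and method names are made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper: reads a per-document binary payload and decodes it as float32 values.
public final class VectorDocValuesReader {

    public static float[] readVector(LeafReader leafReader, String field, int docId) throws IOException {
        BinaryDocValues values = DocValues.getBinary(leafReader, field);
        if (!values.advanceExact(docId)) {
            return null; // document has no value for this field
        }
        BytesRef bytes = values.binaryValue();
        // Assumes little-endian float32 payload; the real plugin format may differ.
        ByteBuffer buffer = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }
}
```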
@navneet1v if we would support |
@navneet1v I will create a PR to show how I rewrite the
Yeah, that's right, but we don't. I am particularly against removing the vector field from _source, as it's more index-mapping driven. It's just that we should be aware of the pitfalls.
Can you also explore the sorted doc values? I would like to ensure that we are not over-engineering a solution.
@luyuncheng In terms of source storage, there is an issue over here: opensearch-project/OpenSearch#6356
@jmazanec15 @navneet1v In #1571 I created a PR, which is a WIP, and it shows how my idea works.
@jmazanec15, @heemin32, @luyuncheng I created a POC for moving to KNNFloatVectorValues: navneet1v@990a58f

Things that are working:
Things not working:
I will work on fixing the things that are not working in the POC. This code alone will not work; we need changes in OpenSearch and Lucene. The Lucene PR is raised.
What do you mean by this?
So right now, if someone is creating a training index, they can set index.knn: false to ensure graphs are not created for the index, since it is just a training index, and our custom codec is not picked. But as of now, the way I implemented it, from the KNNFieldMapper standpoint I am migrating all indices greater than a specific version to use FloatVectorField. This FloatVectorField by default will create the k-NN graphs (this is on the Lucene side); hence I was saying it is not optimal. I was trying to add more complexity to the logic of when to pick LuceneVectorField vs. OurKNNVectorField, but then the exact search use cases will not be optimal, because exact search will keep on using BinaryDocValues rather than KNNFloatVectorsValues. I am thinking we should add a capability in Lucene where someone can configure that they just want FlatVectors and no HNSW graphs, but I am not sure how well that will fly in the Lucene world. I will think more about whether there are some other sophisticated conditions with which we can do all this in OpenSearch only.
Another option could be creating another codec for the training index (index.knn: false) which does not create a graph but still uses the Lucene vector format.
@heemin32 The problem is that our Codec classes will not get hit if index.knn: false. So I am trying to see what we can do here.
I was able to come up with a way to work around this problem. I provided the solution here: #1079 (comment)
Addressing in #1853 |
Currently, we integrate our native libraries with OpenSearch through Lucene's DocValuesFormat. At the time we built that integration, Lucene did not have KnnVectorsFormat (which was released in 9.0).
Now that it exists, I am wondering if we should move to use KnnVectorsFormat. KnnVectorsFormat has a KnnVectorsWriter and KnnVectorsReader. Migrating to KnnVectorsFormat would allow us to:
In general, it would make the native library integrations more in line with the Lucene architecture, which would have long-term benefits for maintainability and extensibility.
All that being said, we need to do a deep dive into what switching means in terms of backwards compatibility and also scope out how much work needs to be done.
Tracking list of benefits of moving to KnnVectorsFormat