Expose flat vectors in "user space" #13468

Closed · msokolov opened this issue Jun 7, 2024 · 9 comments

@msokolov (Contributor) commented Jun 7, 2024

Description

There are use cases where we want to store medium-dimensional vectors (i.e. embedding-space vectors from ML models), retrieve them, compute distances among them, and perform KNN search, but we don't want HNSW or any other special index-time support. If we search, we'll do it using an index scan. For example, this could happen if we partition the index by some key and then rank the resulting documents by their vector distance. Currently, if you make a KnnFloatVectorField or a KnnByteVectorField you get an HNSW graph even if you don't want it. We have all the tools to support this use case, but the API doesn't allow it.

My question is: how should the API look? I started to familiarize myself with the flat vectors support we now have, and I see it was done so that we now have KnnVectorsFormat and FlatVectorsFormat as separate formats that do not share any common ancestor. I wonder what you all would think about folding FlatVectorsFormat into KnnVectorsFormat? The only difference today is the search() method, which I would like to support over flat vectors. Otherwise, I guess we could add search() to FlatVectorsFormat? But in that case, how would we select this format for a field? I'd rather avoid plumbing a whole new format through IndexWriter when it is effectively a flavor of a format we already have. But I may be missing the rationale behind this format forking ... was there some discussion about it you could point me to? I might have been sleeping, sorry!
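
For concreteness, a minimal sketch of the status quo, assuming the Lucene 9.x field API (the field name and vector values are made up):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

// Adding a vector field today always routes it through the KNN vectors format,
// which builds an HNSW graph at flush/merge time whether or not we want one.
Document doc = new Document();
float[] embedding = {0.12f, -0.03f, 0.88f}; // stand-in for a real embedding
doc.add(new KnnFloatVectorField("embedding", embedding, VectorSimilarityFunction.COSINE));
// There is no "flat only" flavor of this field: vectors are either HNSW-indexed
// or stored outside the vector API entirely (stored fields / binary doc values).
```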

@jpountz (Contributor) commented Jun 7, 2024

FlatVectorsFormat is an internal abstraction layer for vectors formats that helps configure the way vectors are stored (e.g. quantized or not) independently from how they're indexed.

I'm not a fan of reusing KnnVectorsFormat for flat (unindexed) vectors. To draw a parallel with other data types: if you want to index a number, it should go to points, but if you only want to store it, it should go to doc values or stored fields. I'd like vectors to be no different: KnnVectorsFormat is for indexing them; if you only want to store them, you can still use stored fields or binary doc values?

E.g. maybe we could have FloatVectorDocValuesField that indexes vectors as binary doc values the same way as FloatField indexes floats as numeric doc values. And then query factories, e.g. newSlowSimilarityQuery(String field, float[] queryVector) similarly to SortedNumericDocValuesField#newSlowRangeQuery(String field, long min, long max).
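
To illustrate the proposal, something like the following (both FloatVectorDocValuesField and newSlowSimilarityQuery are hypothetical; neither exists in Lucene today):

```java
// Hypothetical: store the vector as binary doc values, mirroring how FloatField
// stores floats as numeric doc values.
Document doc = new Document();
doc.add(new FloatVectorDocValuesField("embedding", new float[] {0.12f, -0.03f, 0.88f}));

// Hypothetical query factory, analogous to SortedNumericDocValuesField#newSlowRangeQuery:
// ranks documents matched elsewhere by their similarity to the query vector.
float[] queryVector = {0.10f, 0.00f, 0.90f};
Query similarity = FloatVectorDocValuesField.newSlowSimilarityQuery("embedding", queryVector);
```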

@msokolov (Contributor, Author) commented Jun 7, 2024

What I want to do is index float vectors, have them quantized, and score them using the quantized form. I just want the quantization part of indexing, not the graph-building part.

@msokolov (Contributor, Author) commented Jun 7, 2024

#13469 just plumbs things through to show a possible way forward.

@jpountz (Contributor) commented Jun 7, 2024

Thanks, I had missed the quantization requirement and that you were ok with configuring a codec on the IndexWriter.

@msokolov (Contributor, Author) commented Jun 7, 2024

I was thinking this could be used by PerFieldKnnVectorsFormat, since with this change a FlatVectorsFormat is a KnnVectorsFormat.
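
Roughly the plumbing that would enable, as a sketch (the flat format class name below is hypothetical until the PR settles; Lucene99Codec and Lucene99HnswVectorsFormat are the existing 9.x classes):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.index.IndexWriterConfig;

// Per-field format selection via the codec: one field gets flat (optionally
// quantized) storage with no graph, everything else keeps HNSW as today.
Codec codec = new Lucene99Codec() {
  @Override
  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
    if (field.equals("rerank_embedding")) {
      return new FlatKnnVectorsFormat();    // hypothetical flat-as-KnnVectorsFormat
    }
    return new Lucene99HnswVectorsFormat(); // graph-indexed, as today
  }
};

IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setCodec(codec);
```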

@navneet1v (Contributor)

> Currently if you make a KnnFloatVectorField or a KnnByteVectorField you get an HNSW graph even if you don't want it. We have all the tools to support this use case, but the API doesn't allow it.

+1 on this. Currently, the way I was thinking of achieving this was by creating my own KnnVectorsFormat on top of FlatVectorsFormat and plumbing that vectors format in for a field via the Codec. Having this support in Lucene will go a long way.

Personally, I think there should also be a search API on the flat vectors format, which does nothing but a brute-force/exact search over all the vectors.

@msokolov (Contributor, Author)

See #13469. This still leaves search() throwing UnsupportedOperationException, but it enables scoring using quantized vectors. I think the typical use case will be to drive retrieval using some other search mechanism and then score/rank using the vector score. Users that want to rank all documents by vector score can easily implement that on top of this?
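
For reference, a sketch of that rerank pattern against flat vectors using the existing 9.x reader API (the helper name is made up, and this reads the raw float vectors; the quantized-scoring path would go through the format's scorer instead):

```java
import java.io.IOException;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.VectorSimilarityFunction;

// Score one already-retrieved document by its similarity to the query vector,
// reading its stored vector directly instead of searching a graph.
static float vectorScore(LeafReader reader, String field, int docId, float[] queryVector)
    throws IOException {
  FloatVectorValues vectors = reader.getFloatVectorValues(field);
  if (vectors == null || vectors.advance(docId) != docId) {
    return 0f; // field missing, or no vector indexed for this doc
  }
  return VectorSimilarityFunction.COSINE.compare(queryVector, vectors.vectorValue());
}
```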

@benwtrent (Member)

Maybe this is done?

@msokolov (Contributor, Author)

Yes, thanks - I'll resolve
