LUCENE-10471 Increase max dims for vectors to 2048 #874
Increase the maximum number of dims for KNN vectors to 2048. The current maximum is 1024, but in practice we see a number of models that produce vectors with more than 1024 dimensions, especially for image encoding (e.g., mobilenet_v2 uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing max dims to `2048` will satisfy these use cases. We should not recommend further increase of vector dims.
I'm curious how common such large (to me) models are in practice, or will be in the near future, in the IR area.
My concerns are on the JIRA issue, and I don't want them to be forgotten: https://issues.apache.org/jira/browse/LUCENE-10471 I don't know how we can say "we will not recommend further increase". What happens when the latest trendy dataset comes out with 4096 dimensions? I want to understand why so many dimensions are really needed for search purposes. What is the concrete benefit in terms of quality? We already know what the performance hit is going to be.
I understand that, in general, the more features you have in an embedding vector, the more detail the model returns from the classification. In my case I used Fixed Average and it worked fine for the ELMo model, as mentioned here in "3 Alternative Weighting Schemes". Another option: if I'm not mistaken, this git repository is capable of supporting vectors larger than 1024.
Should we punish and exclude customers who cannot complete the requisite steps of dimensional reduction, or allow them to explore with very expensive compute? Many popular large language models surpass the current threshold, for better or worse.
The performance with e.g. 768 is incredibly painful: hours and hours to index just 1M documents. It already doesn't scale with the current limit!
I think slow indexing throughput is a pain that customers ought to surface. If they find that they mostly use vectors for use cases that don't have NRT-scaling and replication requirements, that should drive our decision on limiting the maximum number of dimensions. I have seen multiple OpenAI and Hugging Face customers flock to other search engines because we impose this limit. 4096 is the number that keeps getting thrown around, but I have seen one case of more. On the other hand, if there are stability concerns at a particular level of dimensionality, we should cap there. Not all customers have equivalent needs for indexing throughput. Plus, we can work on indexing throughput in the future as an incremental improvement to the feature.
I don't agree. I think the problems are flaws in HNSW and can't be worked around. It's too slow already at 768, and in fact the current limit overpromises and underdelivers by allowing you to even do this.
Please don't do this. If somebody is not able to reduce the number of dimensions before indexing, he or she should also not use vector search at all, because it will just produce huge indexes that are slow as hell. If you understand your data, you can also reduce dimensions. If not, it is the wrong tool for you.
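For a rough sense of the index sizes at stake, raw vector storage can be estimated as documents × dimensions × 4 bytes. The sketch below assumes 32-bit float components and ignores the HNSW graph and every other index structure, so it is only a lower bound. The function name is hypothetical, and the 1M-document figure is borrowed from the indexing complaint earlier in the thread.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each vector component is stored as a 32-bit float


def raw_vector_bytes(num_docs: int, dims: int) -> int:
    """Raw storage for the vectors alone, excluding graph and other index overhead."""
    return num_docs * dims * BYTES_PER_FLOAT32


if __name__ == "__main__":
    # Compare the dimension counts mentioned in this thread for 1M documents.
    for dims in (768, 1024, 2048, 4096):
        gb = raw_vector_bytes(1_000_000, dims) / 1e9
        print(f"{dims:>5} dims: {gb:.2f} GB for 1M documents")
```

Under these assumptions the lower bound grows linearly with the dimension count, so raising the limit from 1024 to 2048 doubles it.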
Neither of you is wrong. In this case, we have a world of people excited about a new thing, willing to take actions that go against science because vendors have told them it is right. While I am personally confident that the number of dimensions that is useful for the search use case ought not to exceed 768, the hard and fast rule boxes us out of a fabulous amount of exploratory compute. I never want Lucene to be perceived as legacy software. On this point, I will stand down, especially because if users want to change it they can. We’re open source. Reliability and performance of the unchanged system are more important.
If, as you say, an entire document, regardless of its length, content and so on, can be represented by a vector of 768 floats, why is it then that GPT-4, which internally represents each token with a vector of more than 8192 dimensions, still inaccurately recalls information about entities? Do you see the flaw in your reasoning here? If the real issue is that HNSW isn't suitable for this, and not that higher-dimensionality embeddings lack value, then the solution isn't to withhold the feature, but to switch to a technology more suitable for the type of applications that people use Lucene for: search over large amounts of data. If you need this functionality, then you have no reason to use anything other than FAISS. If bringing in FAISS is too drastic, then its implementation should be studied and integrated instead. Fast, efficient vector functionality is a must; if Lucene doesn't support this, then it and anything that builds on it is doomed.
I think this comment actually supports @MarcusSorealheis's argument? That is, what's the point of indexing 8K dimensions if it isn't much better at recall than 768?
I may be wrong, but it seems like this is where most of the Lucene committers here are settling?

Over a decade ago I wanted a high-dimension index for some facial recognition and surveillance applications I was working on. I rejected Lucene at first only because it was written in Java, and I personally felt something like C++ was a better fit for the high-dimension job (no garbage collection to worry about). So I wrote a high-dimension indexer for MongoDB inspired by RTree (for the record, its implementation is based on XTree) using C++14 preview features (lambda functions were the new hotness on the block and Java didn't even have them yet). Even in C++ back then, SIMD wasn't very well supported natively by the compiler, so I had to add all sorts of compiler tricks to squeeze out every ounce of vector parallelization to make it performant. C++ has gotten better since then, but I think Java still lags in this area? Even JEP 426 is a ways off (maybe because OpenJDK is holding these things hostage)? So maybe Java is still not the right fit here?

I wonder, though: does that mean Lucene shouldn't provide dimensionality higher than an arbitrary 1024? Maybe not. I agree dimensional reduction techniques like PCA should be considered to reduce the storage volume. The problem with that argument is that dimensionality reduction fails when features are weakly correlated. You can't capture the majority of the signal in the first N components and therefore need higher dimensionality. But does that mean that 1024 is still too low to make Lucene a viable option? Aside from conjecture, does anyone have empirical examples where 1024 is too low, and what specific Lucene capabilities (e.g., scoring?) would make adding support for dimensions higher than 1024 really worth considering? If Lucene doesn't do this, does it really risk the project becoming irrelevant? That sounds a bit like sensationalism.
Even if higher dimensionality is added to the current vector implementation (I'd actually argue we should explore converting BKD to support higher dimensions instead), are we convinced it will perform without JEP 426 or the better SIMD support that's only available in newer JDKs? I know Pinecone (and others) have blogged about their love for Rust for these kinds of applications. Should Lucene just leave this job to alternative search APIs? Maybe something like Tantivy or Rucene? Or is it time we explore a new optional Lucene vector module that supports cutting-edge JDK features through Gradle tooling for optimizing the vector use case? Interested in what others think.
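The dimensionality-reduction point in this comment can be made concrete with a small sketch. PCA keeps the components with the largest covariance eigenvalues, and the fraction of variance retained is the sum of the kept eigenvalues over the total. The two spectra below are made-up illustrations, not measurements from any real embedding model: one flat (uncorrelated, equal-variance features) and one decaying geometrically (strongly correlated features). The 2048 and 768 figures are taken from the thread; everything else is an assumption.

```python
def retained_variance(eigenvalues, k):
    """Fraction of total variance kept by the k largest principal components."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


DIMS = 2048  # original dimensionality (the limit proposed in this PR)
KEEP = 768   # target dimensionality after reduction

# Hypothetical spectrum 1: uncorrelated features with equal variance (flat spectrum).
flat = [1.0] * DIMS

# Hypothetical spectrum 2: strongly correlated features, variance decaying 1% per component.
decaying = [0.99 ** i for i in range(DIMS)]

if __name__ == "__main__":
    print(f"flat spectrum:     {retained_variance(flat, KEEP):.3f}")
    print(f"decaying spectrum: {retained_variance(decaying, KEEP):.3f}")
```

With the flat spectrum, keeping 768 of 2048 components retains only 768/2048 = 37.5% of the variance, while the decaying spectrum retains more than 99%. Which regime a given embedding model falls into is an empirical question that this sketch does not answer.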
Closing this in favour of #12436
#11507