-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase Lucene max dimension limit to 16,000 #1346
Increase Lucene max dimension limit to 16,000 #1346
Conversation
Signed-off-by: Junqiu Lei <[email protected]>
fb79843
to
44febd3
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #1346 +/- ##
============================================
- Coverage 85.15% 85.02% -0.13%
- Complexity 1216 1242 +26
============================================
Files 160 161 +1
Lines 4958 5069 +111
Branches 457 473 +16
============================================
+ Hits 4222 4310 +88
- Misses 538 554 +16
- Partials 198 205 +7 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Does it make sense to add a BWC for this change? Have you try to ingest data with this higher dimension value while doing rolling upgrade? |
src/main/java/org/opensearch/knn/index/codec/KNN950Codec/KNN950PerFieldKnnVectorsFormat.java
Show resolved
Hide resolved
16k Dimension Sanity TestCluster ConfigurationUsed a single node cluster with OpenSearch 3.0 installed and plugin from https://github.com/junqiu-lei/k-NN/tree/higher_dimension installed. Nodes were ran on AWS EC2 instances
Data setGenerated a random HDF5 data set with 100K train and 10K test vectors with dimension 16K. Used a random uniform distribution between -10K and 10K. Script can be found here: https://github.com/jmazanec15/k-NN-1/blob/dimension-test-16k/benchmarks/osb/scripts/generate-dataset.py. Algorithm ConfigurationsTested hnsw algorithm on lucene.
Test Tool
Result
|
can you post a comparison with other engines here too, which was done last time. |
src/main/java/org/opensearch/knn/index/codec/BasePerFieldKnnVectorsFormat.java
Show resolved
Hide resolved
I don't think we need BWC test for this, lower versions indices have <=1024 dimension settings. Will have a look on the rolling upgrade scenario. |
Sure, but last time was with 3 data nodes and some other differences. |
Signed-off-by: Junqiu Lei <[email protected]>
This is limit change during a new index creation. Existing bwc test should be enough to catch any issue with existing index. |
can we then rerun the tests for the nmslib and faiss with your setup? |
Overall code looks good to me. Please just paste the comparison with nmslib and faiss in this PR and also on the github issue. Once that is done I can approve the PR. |
Thanks, will re-run nmslib and faiss tests |
I have updated the table with new tests from different engines. #1346 (comment) |
* Increase Lucene max dimension limit to 16,000 Signed-off-by: Junqiu Lei <[email protected]> (cherry picked from commit 083ea2b)
@junqiu-lei can we add Faiss(HNSW)? Also can we do it on a smaller instance (r6gd.4xl) |
@junqiu-lei We need a document update as well. |
Sure, will add. |
* Increase Lucene max dimension limit to 16,000 Signed-off-by: Junqiu Lei <[email protected]> (cherry picked from commit 083ea2b) Co-authored-by: Junqiu Lei <[email protected]>
@vamshin Updated test results to have Faiss(HNSW) on r6gd.4xl instance. |
Description
Since Lucene moved the max dimension limit to codec, we are able to change it in k-NN Lucene codec. This PR updates the Lucene engine dimension max limit to 16k which is consistent with other k-NN engines.
Issues Resolved
Closes #925
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.