Adds section on product quantization for docs #6926

Merged
merged 38 commits into from
Apr 16, 2024

Member

@jmazanec15 jmazanec15 commented Apr 9, 2024

Description

Adds a section on product quantization to the vector quantization docs. The section contains tips for using product quantization as well as memory estimates. Along with this, I changed some formatting to make the docs easier to write.

I decided to include the fully accurate formula for the memory estimate, with a note about the typical number of segments.

We added a section on scalar quantization in 2.13, but it did not include product quantization. Related comment: https://github.com/opensearch-project/documentation-website/pull/6249/files#r1529479186. This should be backported to 2.13.

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Adds section in vector quantization docs for product quantization. In
it, it contains tips for using it as well as memory estimations. Along
with this, changed some formatting to make docs easier to write.

Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 force-pushed the knn-pq-improved-docs branch from 58058f6 to 4f1bd63 Compare April 9, 2024 17:32
@jmazanec15 jmazanec15 requested a review from vamshin April 9, 2024 17:32
Member

@vamshin vamshin left a comment

LGTM! Thanks

@hdhalter hdhalter added the labels 4 - Doc review (PR: Doc review in progress) and backport 2.13 (PR: Backport label for 2.13) Apr 10, 2024
@Naarcha-AWS Naarcha-AWS self-assigned this Apr 10, 2024
Fix formatting

Signed-off-by: Melissa Vagi <[email protected]>
Define abbreviation on first mention

Signed-off-by: Melissa Vagi <[email protected]>
Contributor

@vagimeli vagimeli left a comment

Doc review complete. Please let me know if you have any questions about my changes. Once you've addressed my feedback, I'll approve the PR as ready for editorial. Thank you.

_search-plugins/knn/knn-index.md Outdated Show resolved Hide resolved
@@ -10,22 +10,42 @@ has_math: true

# k-NN vector quantization

By default, the k-NN plugin supports the indexing and querying of vectors of type `float`, where each dimension of the vector occupies 4 bytes of memory. For use cases that require ingestion on a large scale, keeping `float` vectors can be expensive because OpenSearch needs to construct, load, save, and search graphs (for native `nmslib` and `faiss` engines). To reduce the memory footprint, you can use vector quantization.
By default, the k-NN plugin supports the indexing and querying of vectors of type `float`, where each dimension of the
Contributor

Please fix the line break formatting of lines 13--16.

Member Author

I added the line breaks so that editing would be easier, and they don't affect rendering (i.e., the source wouldn't be one line that runs off the screen). Is this incorrect?

Contributor

@vagimeli vagimeli Apr 12, 2024

Yes, it's incorrect to enter line breaks. The site and OpenSearch Project doc team follow a specific formatting guide. I'll handle formatting the doc before moving it into editorial. https://github.com/opensearch-project/documentation-website/blob/main/FORMATTING_GUIDE.md

_search-plugins/knn/knn-vector-quantization.md Outdated Show resolved Hide resolved
Collaborator

@natebower natebower left a comment

@jmazanec15 @vagimeli Please see my comments and changes and let me know if you have any questions. Thanks!

_search-plugins/knn/knn-index.md Outdated Show resolved Hide resolved

In OpenSearch, the training vectors need to be present in an index. In general, the amount of training data will depend on which ANN algorithm will be used and how much data will go into the index. For IVF-based indices, a good number of training vectors to use is `max(1000*nlist, 2^code_size * 1000)`. For HNSW-based indexes, a good number is `2^code_size*1000` training vectors. See [Faiss's documentation](https://github.com/facebookresearch/faiss/wiki/FAQ#how-many-training-points-do-i-need-for-k-means) for more details about the methodology behind calculating these figures.

For PQ, the two parameters that need to be selected are _m_ and _code_size_. _m_ determines how many sub-vectors the vectors should be split to encode separately. Consequently, the _dimension_ needs to be divisible by _m_. _code_size_ determines how many bits each sub-vector will be encoded with. In general, a good place to start is setting `code_size = 8` and then tuning _m_ to get the desired trade-off between memory footprint and recall.
Collaborator

I'm not following the second sentence here. Do we mean something like "m determines the number of subvectors into which vectors should be split for separate encoding"? In the fourth sentence, is "with" the correct preposition, or should it be "into"?

Contributor

Yes, your rewrite is correct. I revised the following sentence to read: _code_size_ determines the number of bits used to encode each subvector.
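
For readers following this thread, the sizing rules in the quoted paragraphs can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the PR or from the k-NN plugin: the function names `recommended_training_vectors` and `pq_code_bytes` are hypothetical, and the byte count covers only the encoded vector payload (`m` subvectors at `code_size` bits each), not graph structures, codebooks, or other index overhead.

```python
from typing import Optional


def recommended_training_vectors(code_size: int, nlist: Optional[int] = None) -> int:
    """Rule-of-thumb training set size taken from the quoted paragraph.

    Pass nlist for IVF-based indexes; leave it as None for HNSW-based indexes.
    (Hypothetical helper, named here for illustration only.)
    """
    pq_floor = (2 ** code_size) * 1000
    if nlist is None:
        # HNSW-based index: 2^code_size * 1000
        return pq_floor
    # IVF-based index: max(1000 * nlist, 2^code_size * 1000)
    return max(1000 * nlist, pq_floor)


def pq_code_bytes(dimension: int, m: int, code_size: int) -> float:
    """Bytes taken by the PQ code of a single vector (encoded payload only)."""
    if dimension % m != 0:
        # The quoted text requires the dimension to be divisible by m.
        raise ValueError("dimension must be divisible by m")
    return m * code_size / 8


if __name__ == "__main__":
    print(recommended_training_vectors(code_size=8))              # 256000
    print(recommended_training_vectors(code_size=8, nlist=1024))  # 1024000
    print(pq_code_bytes(dimension=256, m=32, code_size=8))        # 32.0
    print(256 * 4)  # 1024 bytes for the same vector stored as 4-byte floats
```

With `code_size = 8`, each subvector takes exactly one byte, so tuning `m` changes the encoded size of a vector in one-byte steps, which is one way to read the suggestion to fix `code_size` first and then tune `m`.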

_search-plugins/knn/knn-vector-quantization.md Outdated Show resolved Hide resolved
vagimeli and others added 6 commits April 16, 2024 08:47
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
vagimeli and others added 11 commits April 16, 2024 08:49
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Address editorial feedback

Signed-off-by: Melissa Vagi <[email protected]>
@vagimeli
Contributor

@jmazanec15 @vagimeli Please see my comments and changes and let me know if you have any questions. Thanks!

@natebower Thank you for the review. I accepted your edits and addressed the rewrite comments.

Contributor

@vagimeli vagimeli left a comment

Doc review and editorial review completed

@vagimeli vagimeli removed the label 5 - Editorial review (PR: Editorial review in progress) Apr 16, 2024
@vagimeli vagimeli merged commit 9a6bb8a into opensearch-project:main Apr 16, 2024
7 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 16, 2024
* Adds section on product quantization for docs

Adds section in vector quantization docs for product quantization. In
it, it contains tips for using it as well as memory estimations. Along
with this, changed some formatting to make docs easier to write.

Signed-off-by: John Mazanec <[email protected]>

* Update knn-vector-quantization.md

Fix formatting

Signed-off-by: Melissa Vagi <[email protected]>

* Update knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update knn-vector-quantization.md

Define abbreviation on first mention

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: John Mazanec <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-index.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update knn-index.md

Formatting and copyedits

Signed-off-by: Melissa Vagi <[email protected]>

* Update knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-index.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-index.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-index.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-index.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-index.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update _search-plugins/knn/knn-vector-quantization.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>

* Update knn-vector-quantization.md

Address editorial feedback

Signed-off-by: Melissa Vagi <[email protected]>

---------

Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit 9a6bb8a)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.13 PR: Backport label for 2.13
6 participants