[#23752] DocDB: Integrating hnswlib into the vector indexing framework and hnsw_tool

Summary:
Adding a wrapper around Hnswlib as another implementation of the vector index interface.

The biggest required change to the framework is that many classes are now parameterized with an additional template argument for the distance calculation result type, rather than assuming that a distance is always float. For certain data types, e.g. uint8_t, in combination with distance types such as inner product or L2 square, it can be entirely reasonable to calculate distances as e.g. int32_t or uint32_t rather than float. For example, Hnswlib uses int32_t as the distance type for uint8_t vectors.

Another complication is that usearch does not support the uint8_t type yet, although we plan to add it there. Currently, in order to run usearch on the SIFT 1B dataset, we cast vectors to float before adding them to our usearch index. This gives rise to the distinction between "input vector" and "indexed vector" types, and to the corresponding potential difference between the distance result types computed by a given distance function for those two vector/coordinate types. E.g. the Euclidean distance of float vectors is float, but the Euclidean distance of int vectors is a (potentially wider) integer type. This framework turns out to be usable in hnsw_tool, but it is still not clear whether we will need those vector type conversions in the production codepath if we implement uint8_t support in usearch.

Other changes:
- Renaming TestThreadHolder to ThreadHolder and moving it to the util library, since it is used for parallelizing index load and validation in hnsw_tool.
- Supporting a sharded index in hnsw_tool: specifying num_index_shards > 1 automatically creates that many copies of the Usearch/Hnswlib index, which sometimes achieves higher throughput.
- Enums: adding an operator >> to read an enum element from a stream. The parsing is case-insensitive and allows the input string to omit the "k" prefix. This is used for parsing enum-typed command line options in the Boost program options based command-line tool framework.
- Removed the dependency of the yb_vector library on yb_docdb -- the dependency should be the other way around.

Jira: DB-12655

Test Plan:
Jenkins
Manual testing with hnsw_tool

Reviewers: sergei, aleksandr.ponomarenko, tnayak

Reviewed By: aleksandr.ponomarenko

Subscribers: svc_phabricator, yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37632
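
To illustrate the distance result type parameterization and the input/indexed vector distinction described in the summary, here is a minimal standalone sketch. It is not the code from this commit: the names DistanceResultFor, SquaredL2 and CastVector are hypothetical, and the only detail taken from the summary is that uint8_t vectors can use int32_t as their distance type while float vectors use float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical trait: maps a coordinate type to the type of a distance result.
// By default, distances are float.
template <class Coordinate>
struct DistanceResultFor {
  using type = float;
};

// For uint8_t coordinates, use an integer distance type (the summary notes
// that Hnswlib uses int32_t for uint8_t vectors).
template <>
struct DistanceResultFor<uint8_t> {
  using type = int32_t;
};

static_assert(std::is_same<DistanceResultFor<uint8_t>::type, int32_t>::value,
              "uint8_t vectors should produce int32_t distances");

// Squared L2 distance, parameterized by the distance result type instead of
// assuming float.
template <class Coordinate,
          class DistanceResult = typename DistanceResultFor<Coordinate>::type>
DistanceResult SquaredL2(const std::vector<Coordinate>& a,
                         const std::vector<Coordinate>& b) {
  assert(a.size() == b.size());
  DistanceResult sum = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    const DistanceResult diff =
        static_cast<DistanceResult>(a[i]) - static_cast<DistanceResult>(b[i]);
    sum += diff * diff;
  }
  return sum;
}

// Converts an "input vector" to an "indexed vector" with another coordinate
// type, e.g. uint8_t -> float for an index that only accepts float.
template <class To, class From>
std::vector<To> CastVector(const std::vector<From>& v) {
  return std::vector<To>(v.begin(), v.end());
}
```

With this sketch, SquaredL2 on two std::vector<uint8_t> values returns int32_t, while the same vectors passed through CastVector<float> yield a float distance, which mirrors the two distance result types the summary talks about.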