[#23752] DocDB: Integrating hnswlib into the vector indexing framework and hnsw_tool

Summary:
Adding a wrapper around Hnswlib as another implementation of the vector index interface. The biggest framework change required here is parameterizing many classes with an additional template argument for the distance calculation result type, rather than assuming that a distance is always float. For certain data types such as uint8_t, combined with distance types such as inner product or L2 square, it can be entirely reasonable to calculate distances as int32_t or uint32_t rather than float. For example, Hnswlib uses int32_t as the distance type for uint8_t vectors.
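
To illustrate the idea of a per-coordinate-type distance result, here is a minimal sketch. The names DistanceResultFor and L2SquareDistance are hypothetical and are not taken from this commit; the sketch only assumes that uint8_t coordinates map to an int32_t result and everything else defaults to float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical trait (not from the commit): maps a coordinate type to the
// type used for distance results. Defaults to float.
template <typename Scalar>
struct DistanceResultFor {
  using type = float;
};

// For uint8_t coordinates, accumulate squared differences in int32_t.
template <>
struct DistanceResultFor<uint8_t> {
  using type = int32_t;
};

static_assert(std::is_same_v<DistanceResultFor<uint8_t>::type, int32_t>);
static_assert(std::is_same_v<DistanceResultFor<float>::type, float>);

// Squared L2 distance whose result type is a template argument instead of
// being hard-coded as float.
template <typename Scalar,
          typename DistanceResult = typename DistanceResultFor<Scalar>::type>
DistanceResult L2SquareDistance(const std::vector<Scalar>& a,
                                const std::vector<Scalar>& b) {
  assert(a.size() == b.size());
  DistanceResult sum = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    const DistanceResult diff =
        static_cast<DistanceResult>(a[i]) - static_cast<DistanceResult>(b[i]);
    sum += diff * diff;
  }
  return sum;
}
```

With this shape, a caller that works with uint8_t vectors gets an integer distance without any float conversion, while float vectors keep a float result.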

Another complication arises because usearch does not support the uint8_t type yet, although we plan to add it there. Currently, in order to run usearch on the SIFT 1B dataset, we cast vectors to float before adding them to our usearch index. This gives rise to the distinction between "input vector" and "indexed vector" types, and to the corresponding potential difference between the distance result types computed by a given distance function for those two vector/coordinate types. E.g. the Euclidean distance of float vectors is float, but the Euclidean distance of int vectors is a (potentially wider) integer type. This framework turns out to be usable in hnsw_tool, but it is still not clear whether we will need those vector type conversions in the production codepath if we implement uint8_t support in usearch.
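
The input-to-indexed conversion described above can be sketched as follows. ToIndexedVector is a hypothetical helper name used only for illustration; the actual conversion code in this commit may look different.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the commit): converts an "input vector" into
// the coordinate type that the index actually stores ("indexed vector").
template <typename IndexedScalar, typename InputScalar>
std::vector<IndexedScalar> ToIndexedVector(const std::vector<InputScalar>& input) {
  std::vector<IndexedScalar> result;
  result.reserve(input.size());
  for (const InputScalar v : input) {
    // For uint8_t -> float this is a lossless widening conversion.
    result.push_back(static_cast<IndexedScalar>(v));
  }
  return result;
}
```

For example, a uint8_t vector read from the dataset would be passed through ToIndexedVector<float> before being added to an index that only accepts float coordinates.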

Other changes:
- Renaming TestThreadHolder to ThreadHolder and moving it to the util library, since it is used for parallelizing index load and validation in hnsw_tool.
- Supporting a sharded index in hnsw_tool: specifying num_index_shards > 1 will automatically create that many copies of the Usearch/Hnswlib index, which sometimes achieves higher throughput.
- Enums: adding an operator >> to read an enum element from a stream. The parsing is case-insensitive and allows the input string to omit the "k" prefix. Use this for parsing enum-typed command line options in the Boost program options based command-line tool framework.
- Removed the dependency of yb_vector library on yb_docdb -- the dependency should be the other way around.
Jira: DB-12655
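
As an illustration of the enum parsing behavior listed above, here is a minimal standalone sketch. The IndexLibrary enum and ParseIndexLibrary helper are hypothetical; the real change is built on the code base's own enum utilities, which are not shown in this excerpt.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical enum for illustration; not an enum from the YugabyteDB code base.
enum class IndexLibrary { kUsearch, kHnswlib };

namespace {

std::string ToLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return s;
}

}  // namespace

// Reads one whitespace-delimited token and matches it case-insensitively
// against the element names, with or without the leading "k".
std::istream& operator>>(std::istream& in, IndexLibrary& value) {
  static const std::array<std::pair<const char*, IndexLibrary>, 2> kNames = {{
      {"kusearch", IndexLibrary::kUsearch},
      {"khnswlib", IndexLibrary::kHnswlib},
  }};
  std::string token;
  if (!(in >> token)) {
    return in;
  }
  const std::string lower = ToLower(token);
  for (const auto& [name, candidate] : kNames) {
    // name + 1 skips the "k" prefix of the lowercased element name.
    if (lower == name || lower == name + 1) {
      value = candidate;
      return in;
    }
  }
  // Unknown name: signal a parse failure to the caller.
  in.setstate(std::ios::failbit);
  return in;
}

// Convenience wrapper: returns true and sets *value if text names an element.
bool ParseIndexLibrary(const std::string& text, IndexLibrary* value) {
  std::istringstream in(text);
  return static_cast<bool>(in >> *value);
}
```

Boost program options parses values of custom types through a stream extraction operator by default, so an operator >> of this kind is what lets an enum-typed option accept inputs such as "hnswlib" or "kHnswlib".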

Test Plan:
Jenkins

Manual testing with hnsw_tool

Reviewers: sergei, aleksandr.ponomarenko, tnayak

Reviewed By: aleksandr.ponomarenko

Subscribers: svc_phabricator, yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37632
mbautin committed Sep 6, 2024
1 parent a28b3ec commit bc28ee8
Showing 31 changed files with 1,210 additions and 431 deletions.
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -607,6 +607,7 @@ include_directories(src)

include_directories("src/inline-thirdparty/usearch")
include_directories("src/inline-thirdparty/fp16")
+include_directories("src/inline-thirdparty/hnswlib")


enable_testing()
15 changes: 1 addition & 14 deletions src/yb/common/vector_types.h
@@ -18,6 +18,7 @@

#include <cstdint>
#include <vector>

#include "yb/gutil/integral_types.h"
#include "yb/gutil/macros.h"

@@ -28,18 +29,4 @@ using Int32Vector = std::vector<int32_t>;
using UInt64Vector = std::vector<uint64_t>;
using UInt8Vector = std::vector<uint8_t>;

-// This MUST match the Vector struct definition in
-// src/postgres/third-party-extensions/pgvector/src/vector.h.
-struct YSQLVector {
-  // Commented out as this field is not transferred over the wire for all
-  // Varlens.
-  // int32 vl_len_; /* varlena header (do not touch directly!) */
-  int16 dim; /* number of dimensions */
-  int16 unused;
-  float elems[];
-
- private:
-  DISALLOW_COPY_AND_ASSIGN(YSQLVector);
-};
-
} // namespace yb
4 changes: 2 additions & 2 deletions src/yb/docdb/pgsql_operation.cc
@@ -1953,7 +1953,7 @@ Result<size_t> PgsqlReadOperation::ExecuteVectorSearch(

auto query_vec = request_.vector_idx_options().vector().binary_value();

-auto ysql_query_vec = pointer_cast<const YSQLVector*>(query_vec.data());
+auto ysql_query_vec = pointer_cast<const vectorindex::YSQLVector*>(query_vec.data());

SCHECK_EQ(ysql_query_vec->dim, dims, InvalidArgument, "Vector dimensions mismatch");

@@ -1999,7 +1999,7 @@ Result<size_t> PgsqlReadOperation::ExecuteVectorSearch(
if (!vec_value.has_value()) continue;
// Add the vector to the ANN store
auto vec = VERIFY_RESULT(VectorANN<FloatVector>::GetVectorFromYSQLWire(
-*pointer_cast<const YSQLVector*>(vec_value->binary_value().data()),
+*pointer_cast<const vectorindex::YSQLVector*>(vec_value->binary_value().data()),
vec_value->binary_value().size()));
auto doc_iter = down_cast<DocRowwiseIterator*>(table_iter_.get());
ann_store->Add(vec, doc_iter->GetRowKey());
4 changes: 1 addition & 3 deletions src/yb/tools/CMakeLists.txt
@@ -237,6 +237,4 @@ ADD_YB_TEST_LIBRARY(yb-backup-test_base
DEPS ${YB_BACKUP_TEST_BASE_DEPS})

add_executable(hnsw_tool hnsw_tool.cc)
-# We use yb_test_util here for TestThreadHolder.
-# hnsw_tool is a test tool in a way, so this is OK.
-target_link_libraries(hnsw_tool boost_program_options yb_util yb_docdb yb_vector yb_test_util)
+target_link_libraries(hnsw_tool boost_program_options yb_util yb_docdb yb_vector)