[DocDB] Make dummy ordered read path for single-table vector indexes #22825

Closed
1 task done
tanujnay112 opened this issue Jun 11, 2024 · 0 comments
Labels
area/docdb YugabyteDB core features kind/new-feature This is a request for a completely new feature priority/medium Medium priority issue

tanujnay112 commented Jun 11, 2024

Jira Link: DB-11724

Description

We can make a dummy implementation of an ordered read path for a single-table vector index. For now, we can materialize all rows of a tablet in memory and find the top-K vectors closest to the query vector. This will lay some foundational read-path logic before we make vector indexes persistent.
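A minimal sketch of the brute-force approach described above, assuming squared Euclidean distance as the metric. `Row`, `SquaredL2`, and `TopK` are hypothetical names used only for illustration; they are not YugabyteDB types or functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type: a vector plus an opaque row key.
struct Row {
  std::string key;
  std::vector<float> vec;
};

// Squared Euclidean distance; assumes both vectors have the same dimension.
inline float SquaredL2(const std::vector<float>& a, const std::vector<float>& b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Returns the keys of the k rows closest to `query`, nearest first.
// Ties on distance are broken by key so the order is deterministic.
inline std::vector<std::string> TopK(const std::vector<Row>& rows,
                                     const std::vector<float>& query,
                                     std::size_t k) {
  std::vector<std::pair<float, std::string>> scored;
  scored.reserve(rows.size());
  for (const Row& row : rows) {
    scored.emplace_back(SquaredL2(row.vec, query), row.key);
  }
  k = std::min(k, scored.size());
  // Only the first k positions need to be fully ordered.
  std::partial_sort(scored.begin(), scored.begin() + k, scored.end());
  std::vector<std::string> result;
  result.reserve(k);
  for (std::size_t i = 0; i < k; ++i) {
    result.push_back(scored[i].second);
  }
  return result;
}
```

Scoring every row costs time linear in the number of rows per query, which is acceptable for a placeholder but is the reason a real ANN index is expected to replace it.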

Issue Type

kind/new-feature

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@tanujnay112 tanujnay112 added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jun 11, 2024
@yugabyte-ci yugabyte-ci added kind/new-feature This is a request for a completely new feature priority/medium Medium priority issue labels Jun 11, 2024
tanujnay112 added a commit that referenced this issue Sep 5, 2024
Summary:
This change lays some foundations for the read path of vector indexes. It relies on a dummy implementation on the DocDB side that materializes all of the tablet's rows in memory and then iterates over them in increasing order of (distance from query vector, ybctid), starting with the rows closest to the query vector. This dummy in-memory logic is added in `vectorann.cc` in the `DummyANN` class. This class implements the `VectorANN` interface, which more sophisticated ANN index algorithms are expected to implement later.

The `VectorANN` class takes pairs of Vectors and Values. The Values are expected to be RocksDB keys that can be used to read RocksDB rows. A `VectorANNIterator` is instantiated with a query vector and yields the rows stored in the `VectorANN` in increasing order of `(distance from query vector, ybctid)`. The `Iterator` interface has an `ANNPagingState` that can be used to page through the iterator. This paging state holds a (distance from query vector, ybctid) pair that tracks the current position within the iteration and enables paged responses.
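The paging scheme can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the diff: `SketchANN`, `SketchEntry`, `SketchPagingState`, and `NextPage` are hypothetical names, the metric is assumed to be squared Euclidean distance, and a plain string stands in for the ybctid.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical paging state: the (distance, key) of the last row returned.
struct SketchPagingState {
  float distance;
  std::string key;
};

struct SketchEntry {
  float distance;   // distance from the query vector
  std::string key;  // stands in for the ybctid / RocksDB key
};

// Orders entries by (distance, key).
inline bool EntryLess(const SketchEntry& a, const SketchEntry& b) {
  return std::tie(a.distance, a.key) < std::tie(b.distance, b.key);
}

class SketchANN {
 public:
  void Add(std::vector<float> vec, std::string key) {
    rows_.emplace_back(std::move(vec), std::move(key));
  }

  // Returns up to `limit` entries strictly after `resume` in (distance, key)
  // order. Pass nullptr as `resume` to fetch the first page.
  std::vector<SketchEntry> NextPage(const std::vector<float>& query,
                                    const SketchPagingState* resume,
                                    std::size_t limit) const {
    std::vector<SketchEntry> all;
    all.reserve(rows_.size());
    for (const auto& row : rows_) {
      all.push_back({Distance(row.first, query), row.second});
    }
    std::sort(all.begin(), all.end(), EntryLess);
    auto first = all.begin();
    if (resume != nullptr) {
      // Distances are recomputed the same way on every call, so the
      // comparison against the saved state is exact.
      const SketchEntry last{resume->distance, resume->key};
      first = std::upper_bound(all.begin(), all.end(), last, EntryLess);
    }
    const std::size_t remaining = static_cast<std::size_t>(all.end() - first);
    const auto count = static_cast<std::ptrdiff_t>(std::min(limit, remaining));
    return std::vector<SketchEntry>(first, first + count);
  }

 private:
  // Squared Euclidean distance; assumes equal dimensions.
  static float Distance(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
      const float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  std::vector<std::pair<std::vector<float>, std::string>> rows_;
};
```

A caller would build the next paging state from the last entry of the page it just received. Including the key in the state is what keeps paging correct when several rows share the same distance.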

For each result of the vector search, DocDB needs to serialize the distance (score) between the result vector and the query vector so that results can be merged at the Pggate layer. For this reason, this diff adds a `distances` field to `PgsqlResponsePB`. The ith value of this field is the distance from the query vector to the ith response row. The merging logic on Pggate will be implemented in a follow-up diff, which is why this diff forces DummyANN vector index tables to have just one tablet.
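The Pggate merge is left to a follow-up diff, so the following is only a sketch of how per-row distances could drive such a merge. It assumes each tablet response is already sorted by distance; `SketchResponse` and `MergeByDistance` are hypothetical names, and the use of `double` for distances is an assumption rather than the actual protobuf field type.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical per-tablet response: distances[i] belongs to rows[i],
// and rows are assumed to arrive sorted by increasing distance.
struct SketchResponse {
  std::vector<std::string> rows;
  std::vector<double> distances;
};

// K-way merge of sorted responses, returning at most `limit` rows in
// increasing order of distance.
inline std::vector<std::string> MergeByDistance(
    const std::vector<SketchResponse>& responses, std::size_t limit) {
  // Cursor = (distance, response index, row index within that response).
  using Cursor = std::tuple<double, std::size_t, std::size_t>;
  std::priority_queue<Cursor, std::vector<Cursor>, std::greater<Cursor>> heap;
  for (std::size_t r = 0; r < responses.size(); ++r) {
    if (!responses[r].rows.empty()) {
      heap.emplace(responses[r].distances[0], r, std::size_t{0});
    }
  }
  std::vector<std::string> merged;
  while (!heap.empty() && merged.size() < limit) {
    const Cursor top = heap.top();
    heap.pop();
    const std::size_t r = std::get<1>(top);
    const std::size_t i = std::get<2>(top);
    merged.push_back(responses[r].rows[i]);
    // Advance the cursor of the response we just consumed from.
    if (i + 1 < responses[r].rows.size()) {
      heap.emplace(responses[r].distances[i + 1], r, i + 1);
    }
  }
  return merged;
}
```

In this sketch, ties on distance are broken by response index rather than by ybctid; a real merge would need a tie-break consistent with the (distance, ybctid) order used on the DocDB side.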

Note that `DummyANN` is not MVCC-aware. This is not a problem for this change, as all visible rows are loaded into a `DummyANN` at read time.

The new logic fetches a batch of RocksDB keys from a `VectorANN` and does point lookups in RocksDB. This closely resembles what `ExecuteBatchYbctids` did before this change. To reuse that logic, this change abstracts the source of ybctids in `ExecuteBatchYbctids` by passing the method an iterable `KeyProvider` class. The method was also renamed to `ExecuteBatchKeys`, since in the future vector index keys might not just be a `ybctid`.
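The idea of abstracting the key source can be sketched as below. This is not the actual `ExecuteBatchKeys` signature: `LookupBatchKeys`, `SketchStore`, and `AnnKeyProvider` are hypothetical, and a `std::map` stands in for RocksDB point lookups.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the storage engine: key -> row contents.
using SketchStore = std::map<std::string, std::string>;

// Performs one point lookup per key yielded by `keys`. Any type that can be
// iterated with a range-based for loop over string keys works as a provider.
template <typename KeyProvider>
std::vector<std::string> LookupBatchKeys(const SketchStore& store,
                                         const KeyProvider& keys) {
  std::vector<std::string> rows;
  for (const auto& key : keys) {
    const auto it = store.find(key);
    if (it != store.end()) {
      rows.push_back(it->second);  // keys with no matching row are skipped
    }
  }
  return rows;
}

// Hypothetical provider that yields keys in the order an ANN search returned
// them, so looked-up rows come back nearest first.
class AnnKeyProvider {
 public:
  explicit AnnKeyProvider(std::vector<std::string> ordered_keys)
      : keys_(std::move(ordered_keys)) {}

  std::vector<std::string>::const_iterator begin() const { return keys_.begin(); }
  std::vector<std::string>::const_iterator end() const { return keys_.end(); }

 private:
  std::vector<std::string> keys_;
};
```

With this shape, a plain `std::vector<std::string>` of request keys and an `AnnKeyProvider` can both be passed to the same lookup routine, which is the kind of reuse the rename is meant to enable.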

Other changes are:
- Removed the dependency of the yb_vector library on yb_docdb; the dependency should be the other way around.
- Renamed TestThreadHolder to ThreadHolder and moved it to the util library, since it is used for parallelizing index load and validation in hnsw_tool.

**Upgrade/Rollback safety:**
This adds vector index protobuf fields that should not be in use by any production customer right now.

Jira: DB-11724

Test Plan: Jenkins: test regex:  .*TestPgRegressThirdPartyExtensionsPgvector.*

Reviewers: sergei, mbautin

Reviewed By: sergei, mbautin

Subscribers: svc_phabricator, robert, yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D35708
jasonyb pushed a commit that referenced this issue Sep 6, 2024
Summary:
 c587efd [docs] minor edit (#23796)
 31e09f3 [PLAT-15029] yba installer split data and software directory setup
 f5ba17d [PLAT-15039]: Fix bootstrap on bi-directional xCluster config creation
 578248a [#23770] YSQL: Deterministically populate catalog cache in tests with Connection Manager enabled
 7b1f22a [#23799] test: Fixed PgTableSizeTest.SharedTableSize test for pg15
 788434a [#18771, #21352] docdb: Fix LightweightMessage max size when parsing
 02ced43 [#22821] YSQL: Preserve local limit in a multi-page read
 50ff737 [#23741] docdb: Fix cloning of colocated databases with only parent table
 Excluded: 9889df7 [#23706] YSQL: Add table-level catcache Prometheus metrics
 c770d79 [#23747] MetaCache: Callback should not be called while holding the lock
 Excluded: 40689bc [#22150] YSQL, QueryDiagnostics:  EXPLAIN (ANALYZE, DIST) support for queryDiagnostics
 1655e69 [PLAT-15148]: Set XCluster Table Status to DroppedFromTarget if table in replication is dropped from target only
 16262f7 [#22519] YSQL: Simplify API of the ExplicitRowLockBuffer class
 6614afb [PLAT-14958][PLAT-14959] Make ssh fields optional if skipProvisioning is true
 1153b56 [PLAT-14867] Make sure restart alerts don't trigger for small time updates during NTP sync
 bf1c7bc [PLAT-12226] Add connection pooling status to universe health check
 38d8ae8 [PLAT-14805]Support adding EAR configs
 a180bef [#19134] YSQL, ASH: Setting ASH circular buffer size based on the number of cores
 7d8fc76 Adjust heading link (#23807)
 4c6cf5a [PLAT-4899]Basic validation of certificates
 f24eb10 [#23787] YSQL: Avoid executing conn mgr guc variables hooks for parallel workers
 ee18df8 [PLAT-13921] [K8] [UI] Universe action tasks are disabled after a failed shrink rr node task
 Excluded: dcf1821 [#23797] YSQL: Modify some tests to run in single connection mode with Connection Manager
 0ac22cd [Docs] Remove Drift chat bot (#23802)
 0e91003 [#22825] DocDB: Vector Index General Read Path with DummyANN
 e8f09b5 [PLAT-15175] Make runtime conf for skipping cluster consistency check public
 ee479ee Versionwarning (#23781)
 a05c6a3 [DB-12681] yugabyted-ui: Add Voyager commands to different Voyager phases in the UI.
 cc80d59 [#23777] yugabyted: updating the pg parity testcase to reflect the new gflags enabled for the pg parity feature.

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D37822
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Oct 21, 2024