Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for vector db semantic convention #1231
base: main
Are you sure you want to change the base?
Proposal for vector db semantic convention #1231
Changes from 3 commits
fa8ee30
7068720
5e12a86
3b61784
828bacc
53d82d4
a3330ff
e5ff387
da6649b
d99ec10
61b0f2c
ceae2ca
6db7ec5
e5e0d9d
ae0e066
03b67bf
93d2cbe
f411554
daa0a14
bc8a63c
a5f8661
d996cd9
a10e75f
fd0f2e7
1c6bd00
9feb74d
2357766
81dca47
ff03da1
523bcb9
fd891f6
bc2ddb1
fc90f3f
24cc812
3d50dc1
06a67d4
949b198
a88ac32
df19d76
94c2ca1
210ecb9
765a4a8
ccd11f7
4def570
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
Check failure on line 556 in model/registry/db.yaml
GitHub Actions / yamllint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should come up with a more common attribute not specific to vector dbs.
Many databases allow to limit number of returned rows:
Suggesting
db.query.max_returned_items
. The actual returned count could be even better -db.query.item_count
could mean items inserted or returned depending on the operation.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova I see the similarity here but I think the
db.vector.query.top-k
is more specific from a semantic point of view and more related to vectors, since it specifies the top k results in order, starting from the most similar. In semantic search we have this similarity value that is always present in any result that we don't have in standard database. Thelimit
parameter of SQL returns the first k results but not in order, it depends on how you build the query (e.g. usingORDER BY
).I personally think we should keep
db.vector.query.top-k
and potentially add adb.query.limit
(ordb.query.max_returned_items
as you suggested) in a separate PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of the few dbs I checked, they use
limit
in vector searchlimit
- same with cosmos and other sql databasesSo we're saying that DB instrumentations will need to detect if query is related to vector search or not and depending on this populate top-k or limit. That's difficult or impossible, but most importantly inconsistent and depends on instrumentation capabilities.
I.e. instrumentations that don't have vector-db specifics and those that do will use different attributes for the same thing.
So, I'd still prefer
db.query.limit
or something similar (and it should be under the same condition asdb.query.text
- we cannot require instrumentations to do query parsing)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova do you agree that
top-k
andlimit
are two different concepts, based on my previous comment? If they are I think we cannot use a single attribute (e.g.db.query.limit
) to manage both.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova just a reminder for this, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezimuel I see that databases use both terms to describe the same thing (see my comment above).
Let's say you have a postgres query like
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
- the general-purpose DB instrumentation can reportlimit
. If you make it understand vector search syntax, it may be able to usetop_k
instead, but that's would be inconsistent and unfamiliar for those who use vector search in postgres.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova I see your point but I think
top-k
has a different meaning fromlimit
. If you are using a relation database as vector db,limit
is fine since you are building an SQL statement and you specify the order. But, if you are using a native vector database (e.g. Qdrant), thetop-k
is more relevant sincetop
implies the order, using a similarity metric.I think we should add both:
db.query.limit
db.vector.query.top-k
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova I found that the
SELECT LIMIT
is not part of the SQL standard. A statement compliant with SQL standard isFETCH FIRST
. Moreover, I discovered that we already havedb.operation.parameter,<key>
in the semantic convention. This means, we can havedb.operation.parameter.fetch-first
. I don't think we need to add alimit
orfetch-first
indb.query
. What do you think?At the same time, I believe we should include
db.vector.query.top-k
. As noted earlier, it differs in meaning fromfetch-first
and, additionally, this parameter is much easier to retrieve in a vector database because it's not part of any statement.Check failure on line 595 in model/registry/db.yaml
GitHub Actions / yamllint