Proposal for vector db semantic convention #1231

ezimuel · 2024-07-10T15:14:58Z

This is a proposal for a vector db semantic convention (see #936). I tried to extend the db semantic convention by adding some db.vector attributes, focusing on the basic needs of a general-purpose vector database.

I proposed the following experimental attributes (updated with the feedback from this PR):

db.search.similarity_metric: specify the metric used in similarity search (e.g. cosine)
db.record.id: the ID of the record (e.g. the ID of the vector)
db.vector.field_name: the name of the field that holds the vector embedding
db.vector.dimension_count: the dimension of the vector (e.g. 1536)
db.query.limit: the number of vectors returned by a query (e.g. top-k)

The operations performed in a vector db, such as insert, update, search and delete, can be recorded using the existing db.operation.name attribute.

Regarding the similarity search we can use the db.query attributes, such as db.query.parameter.<key>.
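
As a minimal sketch of how these attributes might fit together on one similarity-search span: the helper name `vector_search_attributes` and the sample values are hypothetical, and the attributes are modelled as a plain dictionary rather than through any specific SDK call.

```python
def vector_search_attributes(metric, field_name, dimension_count, limit):
    """Build the attribute set for one similarity-search span (illustrative only)."""
    return {
        # Existing attribute, reused for vector operations.
        "db.operation.name": "search",
        # Attributes proposed in this PR.
        "db.search.similarity_metric": metric,
        "db.vector.field_name": field_name,
        "db.vector.dimension_count": dimension_count,
        "db.query.limit": limit,
    }


# Hypothetical sample values: a 1536-dimensional embedding field, top 10 results.
attrs = vector_search_attributes("cosine", "embedding", 1536, 10)
```

db.record.id is left out here because it describes a single record rather than a search.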

linux-foundation-easycla · 2024-07-10T15:15:03Z

The committers listed above are authorized under a signed CLA.

✅ login: lmolkova / name: Liudmila Molkova (fd0f2e7, 6db7ec5, d99ec10)
✅ login: dependabot[bot] (d996cd9, a5f8661, a10e75f)
✅ login: ezimuel / name: Enrico Zimuel (828bacc, ff03da1, fa8ee30, da6649b, 81dca47, 3b61784, e5ff387, 9feb74d, 7068720, 2357766, 523bcb9, 5e12a86, a3330ff, 3d50dc1, 210ecb9, 4def570, ccd11f7, 949b198, bc2ddb1, 24cc812, a88ac32, fc90f3f, 53d82d4, 94c2ca1, fd891f6, 06a67d4, 765a4a8, df19d76)
✅ login: maryliag / name: Marylia Gutierrez (1c6bd00, ceae2ca, 03b67bf)
✅ login: MadVikingGod / name: Aaron Clawson (ae0e066)
✅ login: ChrsMark / name: Christos Markou (bc8a63c, 61b0f2c, e5e0d9d)
✅ login: jsuereth / name: Josh Suereth (daa0a14, f411554)
✅ login: MSNev / name: Nev (93d2cbe)

karthikscale3 · 2024-07-17T16:19:10Z

Thanks for creating this PR. A few additional attributes which we instrument today with our SDK that we have found useful are the following:

db.index
db.namespace
db.collection.name
db.top_k
db.query

Thoughts on the ones listed above? cc @lmolkova

ezimuel · 2024-07-17T18:58:39Z

@karthikscale3 regarding the attributes that you proposed, some already exist:

db.index can be represented using db.operation.name = index;
db.namespace already exists;
db.collection.name already exists;
db.query can be represented using db.query.text and also db.query.parameter;

I think the top_k proposal is a good idea:

db.vector.query.top_k to represent the k most similar vectors returned by a similarity search;

Moreover, I found the OpenLLMetry project's proposal very interesting, especially the attributes for vector db, here:

 # Vector DB
 VECTOR_DB_VENDOR = "db.system"
 VECTOR_DB_OPERATION = "db.operation"
 VECTOR_DB_QUERY_TOP_K = "db.vector.query.top_k"

ezimuel · 2024-07-17T19:13:37Z

@lmolkova I applied all the feedback, thanks for the review. @karthikscale3 I added the top_k attribute, thanks.

Summary of the changes:

removed vector in db.system;
removed the db.vector.embeddings;
renamed db.vector.dimension to db.vector.dimension_count;
added the db.vector.query.top_k as suggested by @karthikscale3;
removed the allow_custom_values: true in db.yaml, see last commit;

karthikscale3 · 2024-07-17T19:15:49Z

Yea that sounds good! And yes, my intention was to reuse the existing ones. I wasn't sure if we needed them redefined for the sake of vector dbs or not, but it sounds like it's unnecessary.

karthikscale3 · 2024-07-17T19:17:02Z

Thank you! From my side, everything looks good. We discussed this PR in today's working group call and @nirga wanted to take a deeper look at it once again.

ezimuel · 2024-07-18T06:20:22Z

I fixed the merge issues. Thanks @trask

nirga

Thanks, that's a great start! I wonder if we want to add specific spans that use these attributes in this PR as well?

ezimuel · 2024-07-20T16:04:24Z

@nirga can you give me an example of a specific span? FYI, I'm going offline and will be back on August 4 for further discussion.

nirga · 2024-07-20T16:15:55Z

Sorry, nvm I think this is already covered as part of the DB semconv

maryliag · 2024-07-25T13:23:49Z

docs/attributes-registry/db.md

@@ -199,6 +200,28 @@ This group defines attributes for Elasticsearch.

 **[8]:** Many Elasticsearch url paths allow dynamic values. These SHOULD be recorded in span attributes in the format `db.elasticsearch.path_parts.<key>`, where `<key>` is the url path part name. The implementation SHOULD reference the [elasticsearch schema](https://raw.githubusercontent.com/elastic/elasticsearch-specification/main/output/schema/schema.json) in order to map the path part values to their names.

+## Db Vector Attributes


nit: Vector Database Attributes

maryliag · 2024-07-25T13:32:53Z

@ezimuel looks like you missed some of the changes you marked as resolved:

rename db.vector.dimension to db.vector.dimension_count: it is still showing db.vector.dimension
rename db.vector.similarity to db.vector.search.similarity_metric: it is still showing db.vector.similarity

lmolkova · 2024-07-26T03:28:17Z

model/registry/db.yaml

+        brief: >
+          The model used for the embedding.
+        examples: 'text-embedding-3-small'
+      - id: query.top_k


I think we should come up with a more common attribute not specific to vector dbs.
Many databases allow limiting the number of returned rows:

JDBC has Statement.setMaxRows,

Mongo allows to set a limit

Suggesting db.query.max_returned_items. The actual returned count could be even better - db.query.item_count could mean items inserted or returned depending on the operation.

@lmolkova I see the similarity here, but I think db.vector.query.top_k is more specific from a semantic point of view and more closely tied to vectors, since it specifies the top k results in order, starting from the most similar. In semantic search every result carries a similarity value, which we don't have in a standard database. The SQL LIMIT parameter returns the first k results, but their order depends on how you build the query (e.g. using ORDER BY).
I personally think we should keep db.vector.query.top_k and potentially add a db.query.limit (or db.query.max_returned_items as you suggested) in a separate PR.

Out of the few dbs I checked, they use limit in vector search

pgvector uses traditional limit - same with cosmos and other sql databases

mongo uses limit

qdrant uses limit

So we're saying that DB instrumentations will need to detect whether a query is related to vector search and, depending on this, populate top_k or limit. That's difficult or impossible, but most importantly it is inconsistent and depends on instrumentation capabilities.

I.e. instrumentations that don't have vector-db specifics and those that do will use different attributes for the same thing.

So, I'd still prefer db.query.limit or something similar (and it should be under the same condition as db.query.text - we cannot require instrumentations to do query parsing)

@lmolkova do you agree that top-k and limit are two different concepts, based on my previous comment? If they are I think we cannot use a single attribute (e.g. db.query.limit) to manage both.

@lmolkova just a reminder for this, thanks.

@ezimuel I see that databases use both terms to describe the same thing (see my comment above).

Let's say you have a postgres query like SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5; - the general-purpose DB instrumentation can report limit. If you make it understand vector search syntax, it may be able to use top_k instead, but that would be inconsistent and unfamiliar for those who use vector search in postgres.
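
A minimal sketch of this point, assuming a hypothetical helper `limit_attributes` that receives the limit as an explicit argument (so no query parsing is involved): a general-purpose SQL instrumentation and a native vector-search client would then report the same attribute. `db.query.limit` is the name under discussion here, not an agreed attribute.

```python
from typing import Optional


def limit_attributes(limit: int, query_text: Optional[str] = None) -> dict:
    """Record the requested result limit the same way for any kind of query."""
    attrs = {"db.query.limit": limit}
    # db.query.text is only recorded when the client actually has a statement.
    if query_text is not None:
        attrs["db.query.text"] = query_text
    return attrs


# pgvector-style search: the limit is part of a SQL statement.
pgvector_attrs = limit_attributes(
    5, "SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;"
)

# Native vector client: the limit is a call parameter and there is no statement.
native_attrs = limit_attributes(5)
```

Both dictionaries carry the same `db.query.limit` value, so a backend can compare them without knowing whether the query was a vector search.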

@lmolkova I see your point but I think top-k has a different meaning from limit. If you are using a relational database as a vector db, limit is fine since you are building an SQL statement and you specify the order. But if you are using a native vector database (e.g. Qdrant), top-k is more relevant since "top" implies an order based on a similarity metric.

I think we should add both:

db.query.limit

db.vector.query.top_k

@lmolkova I found that SELECT ... LIMIT is not part of the SQL standard. The statement compliant with the SQL standard is FETCH FIRST. Moreover, I discovered that we already have db.operation.parameter.<key> in the semantic convention. This means we can have db.operation.parameter.fetch-first. I don't think we need to add a limit or fetch-first in db.query. What do you think?

At the same time, I believe we should include db.vector.query.top_k. As noted earlier, it differs in meaning from fetch-first and, additionally, this parameter is much easier to retrieve in a vector database because it's not part of any statement.

lmolkova · 2024-07-26T03:33:36Z

We need to reference new attributes in the database spans conventions (see https://github.com/open-telemetry/semantic-conventions/blob/main/model/trace/database.yaml), specifically on the conventions for the databases we have there which support vector search.

We should describe how new attributes apply to them.

lmolkova · 2024-10-25T17:09:50Z

@ezimuel I'm sorry I did not reply earlier. I'm swamped with some work at the moment and, unfortunately, it might take me some time to reply.

If you look into https://github.com/open-telemetry/semantic-conventions/tree/main/docs/database you'd see that we have some docs for individual database systems - we reference attributes from the registry and we explain how these attributes apply to this system (or don't apply).

This is powered by the yaml in https://github.com/open-telemetry/semantic-conventions/tree/main/model/database.

Please look at the existing database conventions and update them to include new attributes you're adding.

ezimuel · 2024-11-05T13:13:53Z

@lmolkova thanks for the information and sorry also on my side for the late reply, very busy period.
I'm looking into the examples that you shared and I'll send a commit soon.

ezimuel · 2024-12-02T14:53:14Z

@lmolkova and @AlexanderWert I finally found some time to work on this PR. I've added the span definitions and updated the documentation. I hope everything looks good. Looking forward to your review. Thanks!

lmolkova · 2024-12-04T03:22:45Z

model/database/spans.yaml

@@ -764,3 +764,81 @@ groups:
      - ref: db.cosmosdb.regions_contacted
        requirement_level:
          conditionally_required: If available.
+
+  - id: span.db.vector.client


postgres, cosmos, mongodb, elasticsearch and others are vector databases too and we need to update their conventions to say how to record vector search capabilities for them

agree, but how about doing it in a follow up issue / PR, as it is additive information?

I don't think we should introduce span.db.vector.client span - it's not clear what/when/how will report it.
Also, referencing attributes in the specific conventions is a good test if these attributes are applicable and properly named/described to fit to specific db implementations.

E.g. db.vector.query.top_k debate will become obvious on postgres - you can't have an attribute like this there and would have to call it db.query.limit or similar.

It's ok to limit the scope to several DBs and not cover all possible details.

I don't think we should introduce span.db.vector.client span

Sorry, I don't get that. We also have a generic span.db.client (which then in addition has technology specific overrides). Wouldn't it be the same thing here, so having a generic description for vector db client spans + in addition having technology-specific overrides for the DBs listed above.

it's not clear what/when/how will report it

agree, in the readme above we need more explicit / clearer guidance on when to use span.db.vector.client vs. the more general span.db.client, but we should do it in a technology-independent way first.

Also, referencing attributes in the specific conventions is a good test if these attributes are applicable and properly named/described to fit to specific db implementations.
...
It's ok to limit the scope to several DBs and not cover all possible details.

Makes sense.

I agree with @AlexanderWert . I think we should have a generic span.db.vector.client for vector databases that do not have technology-specific conventions, like the generic sql clients.
@lmolkova I see the issue with db.vector.query.top_k and db.query.limit, I think I can remove the top-k even if they are semantically different.

Sorry, I don't get that. We also have a generic span.db.client (which then in addition has technology specific overrides). Wouldn't it be the same thing here, so having a generic description for vector db client spans + in addition having technology-specific overrides for the DBs listed above.

The problem is that none of the databases we have in the semconv would be considered a vector database then. The span.db.vector.client is an abstract thing nothing is going to use. MongoDB, CosmosDB, elastic, postgres - all of them are used as vector databases - will they extend span.db.vector.client? Will generic instrumentations for them cover vector-related operations?

Also we're trying to introduce attributes without attempting to apply them to existing databases - this is the major blocker I see in this PR.

ezimuel · 2024-12-05T08:27:58Z

Thanks @AlexanderWert for the feedback.

TaoChenOSU · 2024-12-11T00:04:54Z

docs/attributes-registry/db.md

+
+| Value  | Description | Stability |
+|---|---|---|
+| `cosine` | The cosine metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


nit:

There are cosine similarity and cosine distance. Better to distinguish the two in the convention.

There are a few common ones that are not listed here: squared Euclidean and Hamming.
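
To make the distinction concrete, here is a small sketch of the metrics mentioned above using their standard definitions. The function names are illustrative only and are not proposed attribute values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 2 for opposite direction."""
    return 1.0 - cosine_similarity(a, b)


def squared_euclidean(a, b):
    """Euclidean distance without the final square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)
```

A higher cosine similarity means a closer match, while a higher cosine distance means a worse one, which is why a single `cosine` value can be ambiguous when ranking results.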

TaoChenOSU · 2024-12-11T00:06:32Z

docs/attributes-registry/db.md


-**[11] `db.elasticsearch.path_parts`:** Many Elasticsearch url paths allow dynamic values. These SHOULD be recorded in span attributes in the format `db.elasticsearch.path_parts.<key>`, where `<key>` is the url path part name. The implementation SHOULD reference the [elasticsearch schema](https://raw.githubusercontent.com/elastic/elasticsearch-specification/main/output/schema/schema.json) in order to map the path part values to their names.
+`db.search.similarity_metric` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.


The similarity metric doesn't need to be defined per search request. It can also be defined when the collection is created.
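
A rough sketch of what that could look like for an instrumentation, assuming a hypothetical `CollectionConfig` that is captured when the collection is created: the search call then only supplies the limit, and the metric is read from the stored configuration. Class, function, and sample values are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionConfig:
    """Settings fixed at collection creation time (hypothetical shape)."""
    name: str
    similarity_metric: str
    dimension_count: int


def search_attributes(config: CollectionConfig, limit: int) -> dict:
    """Fill search-span attributes from the collection-level configuration."""
    return {
        "db.operation.name": "search",
        "db.collection.name": config.name,
        "db.search.similarity_metric": config.similarity_metric,
        "db.vector.dimension_count": config.dimension_count,
        "db.query.limit": limit,
    }


config = CollectionConfig(name="articles", similarity_metric="cosine", dimension_count=3)
attrs = search_attributes(config, limit=5)
```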

TaoChenOSU · 2024-12-11T00:07:41Z

docs/attributes-registry/db.md

+
+| Attribute | Type | Description | Examples | Stability |
+|---|---|---|---|---|
+| <a id="db-vector-dimension-count" href="#db-vector-dimension-count">`db.vector.dimension_count`</a> | int | The dimension of the vector. | `3` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


nit: db-vector-dimension-count -> db-vector-dimension.

lmolkova · 2024-12-14T02:29:25Z

docs/database/vector.md

+| Attribute  | Type | Description  | Examples  | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
+|---|---|---|---|---|---|
+| [`db.operation.name`](/docs/attributes-registry/db.md) | string | The operation  to be performed on the vector database (e.g. build an index/collection) [1] | `build`; `insert`; `search`; `delete` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
+| [`http.request.method`](/docs/attributes-registry/http.md) | string | HTTP request method. [2] | `GET`; `POST`; `HEAD` | `Required` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |


http.request.method should not be used on a generic database span

Proposal for vector db semantic convention

fa8ee30

ezimuel requested review from a team July 10, 2024 15:14

github-actions bot assigned arminru Jul 10, 2024

ezimuel mentioned this pull request Jul 10, 2024

VectorDB Semantic Convention #936

Open

gregkalapos reviewed Jul 15, 2024

lmolkova reviewed Jul 15, 2024

Merge + applied feedbacks open-telemetry#1231

7068720

Removed allow_custom_values: true in db.yaml

5e12a86

karthikscale3 approved these changes Jul 17, 2024

trask reviewed Jul 17, 2024

Fixed merge

3b61784

Merge branch 'main' into vector-db

828bacc

nirga reviewed Jul 20, 2024

maryliag reviewed Jul 25, 2024

lmolkova reviewed Jul 26, 2024

karthikscale3 mentioned this pull request Aug 4, 2024

REQUEST: New membership for karthikscale3 open-telemetry/community#2256

Closed

6 tasks

ezimuel added 2 commits August 5, 2024 09:25

Merge remote-tracking branch 'upstream/main' into vector-db

53d82d4

Updated dimension_count and similarity_metric

a3330ff

github-actions bot closed this Oct 24, 2024

lmolkova added the never stale label and removed the Stale label Oct 25, 2024

lmolkova reopened this Oct 25, 2024

ezimuel added 3 commits November 5, 2024 14:30

Merge remote-tracking branch 'upstream/main' into vector-db

fd891f6

Merge remote-tracking branch 'upstream/main' into vector-db

bc2ddb1

Merge branch 'vector-db' of github.com:ezimuel/semantic-conventions into vector-db

fc90f3f

nirga added the area:gen-ai label Nov 6, 2024

ezimuel added 3 commits November 18, 2024 10:07

Merge remote-tracking branch 'upstream/main' into vector-db

24cc812

Merge remote-tracking branch 'upstream/main' into vector-db

3d50dc1

Added docs + spans

06a67d4

ezimuel requested a review from a team as a code owner December 2, 2024 14:50

Merge remote-tracking branch 'upstream/main' into vector-db

949b198

lmolkova reviewed Dec 4, 2024

AlexanderWert reviewed Dec 4, 2024

ezimuel and others added 5 commits December 5, 2024 09:26

Update model/database/spans.yaml

a88ac32

Co-authored-by: Alexander Wert <[email protected]>

Update model/database/spans.yaml

df19d76

Co-authored-by: Alexander Wert <[email protected]>

Update model/database/spans.yaml

94c2ca1

Co-authored-by: Alexander Wert <[email protected]>

Update model/database/spans.yaml

210ecb9

Co-authored-by: Alexander Wert <[email protected]>

Update model/database/spans.yaml

765a4a8

Co-authored-by: Alexander Wert <[email protected]>

ezimuel added 2 commits December 5, 2024 09:32

Merge remote-tracking branch 'upstream/main' into vector-db

ccd11f7

Merge branch 'vector-db' of github.com:ezimuel/semantic-conventions into vector-db

4def570

TaoChenOSU reviewed Dec 11, 2024

lmolkova reviewed Dec 14, 2024
