Proposal for vector db semantic convention #1231

Open
wants to merge 44 commits into
base: main
Changes from 3 commits
Commits
fa8ee30
Proposal for vector db semantic convention
ezimuel Jul 10, 2024
7068720
Merge + applied feedbacks #1231
ezimuel Jul 17, 2024
5e12a86
Removed allow_custom_values: true in db.yaml
ezimuel Jul 17, 2024
3b61784
Fixed merge
ezimuel Jul 18, 2024
828bacc
Merge branch 'main' into vector-db
ezimuel Jul 20, 2024
53d82d4
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Aug 5, 2024
a3330ff
Updated dimension_count and similarity_metric
ezimuel Aug 5, 2024
e5ff387
Merge remote-tracking branch 'origin/vector-db' into vector-db
ezimuel Aug 5, 2024
da6649b
Merge branch 'main' into vector-db
ezimuel Aug 7, 2024
d99ec10
Fix array attribute examples (#1325)
lmolkova Aug 8, 2024
61b0f2c
Add k8s.{pod,node}.cpu.{time,usage} metrics (#1320)
ChrsMark Aug 11, 2024
ceae2ca
Db metrics pending requests (#1290)
maryliag Aug 12, 2024
6db7ec5
Fix `process.args_count` attribute (#1331)
lmolkova Aug 12, 2024
e5e0d9d
Add k8s.volume.{name,type} attributes (#1251)
ChrsMark Aug 14, 2024
ae0e066
Add tests for rego policies (#1334)
MadVikingGod Aug 14, 2024
03b67bf
add `nodejs.eventloop.time` metric (#1259)
maryliag Aug 15, 2024
93d2cbe
chore: Remove support for the event `fields` referencing/inheriting d…
MSNev Aug 18, 2024
f411554
Attempt to optimise attribute name collision checks. (#1328)
jsuereth Aug 19, 2024
daa0a14
(chore) Add dependabot config to keep tooling up to date. (#1346)
jsuereth Aug 19, 2024
bc8a63c
Fix broken docker link (#1332)
ChrsMark Aug 19, 2024
a5f8661
Bump markdownlint-cli from 0.31.0 to 0.41.0 (#1349)
dependabot[bot] Aug 19, 2024
d996cd9
Bump go.opentelemetry.io/build-tools/chloggen from 0.12.0 to 0.14.0 i…
dependabot[bot] Aug 19, 2024
a10e75f
Bump gulp from 4.0.2 to 5.0.0 (#1348)
dependabot[bot] Aug 19, 2024
fd0f2e7
Fix link anchors (#1354)
lmolkova Aug 19, 2024
1c6bd00
chore: update ids (#1352)
maryliag Aug 20, 2024
9feb74d
Removed db.vector.id and added db.record.id, renamed db.vector.field_…
ezimuel Aug 20, 2024
2357766
Merge branch 'main' into vector-db
ezimuel Aug 20, 2024
81dca47
Merge from upstream/main
ezimuel Sep 25, 2024
ff03da1
Removed db.vector.model and moved db.vector.search.similarity_metric …
ezimuel Sep 25, 2024
523bcb9
Merge branch 'main' into vector-db
ezimuel Sep 30, 2024
fd891f6
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Nov 5, 2024
bc2ddb1
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Nov 5, 2024
fc90f3f
Merge branch 'vector-db' of github.com:ezimuel/semantic-conventions i…
ezimuel Nov 5, 2024
24cc812
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Nov 18, 2024
3d50dc1
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Dec 2, 2024
06a67d4
Added docs + spans
ezimuel Dec 2, 2024
949b198
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Dec 3, 2024
a88ac32
Update model/database/spans.yaml
ezimuel Dec 5, 2024
df19d76
Update model/database/spans.yaml
ezimuel Dec 5, 2024
94c2ca1
Update model/database/spans.yaml
ezimuel Dec 5, 2024
210ecb9
Update model/database/spans.yaml
ezimuel Dec 5, 2024
765a4a8
Update model/database/spans.yaml
ezimuel Dec 5, 2024
ccd11f7
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Dec 5, 2024
4def570
Merge branch 'vector-db' of github.com:ezimuel/semantic-conventions i…
ezimuel Dec 5, 2024
33 changes: 33 additions & 0 deletions docs/attributes-registry/db.md
@@ -6,12 +6,22 @@

# Db

<<<<<<< HEAD
- [Db](#db-attributes)
- [Db Cassandra](#db-cassandra-attributes)
- [Db Cosmosdb](#db-cosmosdb-attributes)
- [Db Deprecated](#db-deprecated-attributes)
- [Db Elasticsearch](#db-elasticsearch-attributes)
- [Db Metrics Deprecated](#db-metrics-deprecated-attributes)
- [Db Vector](#db-vector-attributes)
=======
- [General Database Attributes](#general-database-attributes)
- [Cassandra Attributes](#cassandra-attributes)
- [Azure Cosmos DB Attributes](#azure-cosmos-db-attributes)
- [Elasticsearch Attributes](#elasticsearch-attributes)
- [Deprecated Database Attributes](#deprecated-database-attributes)
- [Deprecated Database Metrics](#deprecated-database-metrics)
>>>>>>> upstream/main

## General Database Attributes

@@ -116,6 +126,7 @@ Even though parameterized query text can potentially have sensitive data, by usi
| `sybase` | Sybase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `teradata` | Teradata | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `trino` | Trino | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `vector` | vector | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `vertica` | Vertica | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

## Cassandra Attributes
@@ -244,3 +255,25 @@ This group defines attributes for Elasticsearch.
| ------ | ----------- | ---------------------------------------------------------------- |
| `idle` | idle | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `used` | used | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

## Db Vector Attributes

This group defines attributes for vector databases.

| Attribute | Type | Description | Examples | Stability |
| ---------------------- | -------- | ---------------------------------------------------- | -------------------------------------- | ---------------------------------------------------------------- |
| `db.vector.dimension` | int | The dimension of the vector. | `3` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.embeddings` | double[] | The values of the vector, the array of numbers. | `[0.9, 0.1, 0.1]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.id` | string | The ID of vector. | `5c56c793-69f3-4fbf-87e6-c4bf54c28c26` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.model` | string | The model used for the embedding. | `text-embedding-3-small` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.name` | string | The name field as of the vector (e.g. a field name). | `vector` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.similarity` | string | The metric used in similarity search. | `cosine` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
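
As a rough illustration of how an instrumentation might populate the attributes in the table above, here is a minimal Python sketch. The helper name `vector_span_attributes` and its parameters are hypothetical and not part of the proposal; it also assumes that `db.vector.dimension` equals the length of the embeddings array, which matches the examples in the table but is not stated explicitly.

```python
from typing import Dict, Sequence, Union

AttributeValue = Union[str, int, Sequence[float]]


def vector_span_attributes(
    embeddings: Sequence[float],
    vector_id: str,
    field_name: str,
    model: str,
    similarity: str,
) -> Dict[str, AttributeValue]:
    """Hypothetical helper: map one vector operation onto the db.vector.* keys
    listed in this revision of the table."""
    return {
        # Assumption: the dimension is simply the number of values in the vector.
        "db.vector.dimension": len(embeddings),
        "db.vector.embeddings": list(embeddings),
        "db.vector.id": vector_id,
        "db.vector.model": model,
        "db.vector.name": field_name,
        "db.vector.similarity": similarity,
    }


# Values taken from the "Examples" column of the table.
attrs = vector_span_attributes(
    embeddings=[0.9, 0.1, 0.1],
    vector_id="5c56c793-69f3-4fbf-87e6-c4bf54c28c26",
    field_name="vector",
    model="text-embedding-3-small",
    similarity="cosine",
)
```

The resulting dictionary could then be attached to a span by whatever tracing API the instrumentation uses; that step is left out here on purpose.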

`db.vector.similarity` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| ----------- | ------------------------------ | ---------------------------------------------------------------- |
| `cosine` | The cosine metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `dot` | The dot product metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `euclidean` | The euclidean distance metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `manhattan` | The Manhattan distance metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
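
For readers less familiar with the four well-known values, the following Python sketch shows the textbook definition behind each name. It is only an illustration: the function names and the `METRICS` lookup table are assumptions made for this example, and individual vector databases may normalize or sign these quantities differently.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity: dot product divided by the product of the norms.
    Assumes both vectors are non-zero."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Vector, b: Vector) -> float:
    """Manhattan (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Keys mirror the well-known values of `db.vector.similarity`.
METRICS: Dict[str, Callable[[Vector, Vector], float]] = {
    "cosine": cosine,
    "dot": dot,
    "euclidean": euclidean,
    "manhattan": manhattan,
}
```

Note that `cosine` and `dot` grow with similarity, while `euclidean` and `manhattan` are distances that shrink as vectors get closer.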
83 changes: 83 additions & 0 deletions docs/database/dynamodb.md
@@ -9,7 +9,90 @@ linkTitle: AWS DynamoDB
The Semantic Conventions for [AWS DynamoDB](https://aws.amazon.com/dynamodb/) extend and override the general
[AWS SDK Semantic Conventions](/docs/cloud-providers/aws-sdk.md) and [Database Semantic Conventions](database-spans.md).

<<<<<<< HEAD
## Common Attributes

These attributes are filled in for all DynamoDB request types.

<!-- semconv dynamodb.all(full) -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`db.system`](/docs/attributes-registry/db.md) | string | The value `dynamodb`. [1] | `dynamodb` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** The actual DBMS may differ from the one identified by the client. For example, when using PostgreSQL client libraries to connect to a CockroachDB, the `db.system` is set to `postgresql` based on the instrumentation's best knowledge.



`db.system` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `adabas` | Adabas (Adaptable Database System) | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `cassandra` | Apache Cassandra | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `clickhouse` | ClickHouse | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `cockroachdb` | CockroachDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `cosmosdb` | Microsoft Azure Cosmos DB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `couchbase` | Couchbase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `couchdb` | CouchDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db2` | IBM Db2 | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `derby` | Apache Derby | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `dynamodb` | Amazon DynamoDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `edb` | EnterpriseDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `elasticsearch` | Elasticsearch | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `filemaker` | FileMaker | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `firebird` | Firebird | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `geode` | Apache Geode | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `h2` | H2 | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `hanadb` | SAP HANA | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `hbase` | Apache HBase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `hive` | Apache Hive | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `hsqldb` | HyperSQL DataBase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `influxdb` | InfluxDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `informix` | Informix | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `ingres` | Ingres | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `instantdb` | InstantDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `interbase` | InterBase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `intersystems_cache` | InterSystems Caché | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `mariadb` | MariaDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `maxdb` | SAP MaxDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `memcached` | Memcached | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `mongodb` | MongoDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `mssql` | Microsoft SQL Server | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `mysql` | MySQL | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `neo4j` | Neo4j | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `netezza` | Netezza | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `opensearch` | OpenSearch | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `oracle` | Oracle Database | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `other_sql` | Some other SQL database. Fallback only. See notes. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `pervasive` | Pervasive PSQL | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `pointbase` | PointBase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `postgresql` | PostgreSQL | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `progress` | Progress Database | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `redis` | Redis | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `redshift` | Amazon Redshift | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `spanner` | Cloud Spanner | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `sqlite` | SQLite | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `sybase` | Sybase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `teradata` | Teradata | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `trino` | Trino | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `vector` | vector | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `vertica` | Vertica | ![Experimental](https://img.shields.io/badge/-experimental-blue) |



<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->
=======
`db.system` MUST be set to `"dynamodb"` and SHOULD be provided **at span creation time**.
>>>>>>> upstream/main

## DynamoDB.BatchGetItem

66 changes: 63 additions & 3 deletions model/registry/db.yaml
@@ -92,7 +92,6 @@
For example, when using PostgreSQL client libraries to connect to a CockroachDB, the `db.system`
is set to `postgresql` based on the instrumentation's best knowledge.
type:
allow_custom_values: true
members:
- id: other_sql
value: 'other_sql'
@@ -319,7 +318,6 @@
- id: client.connection.state
stability: experimental
type:
allow_custom_values: true
members:
- id: idle
value: 'idle'
@@ -441,7 +439,6 @@
brief: Cosmos client connection mode.
- id: cosmosdb.operation_type
type:
allow_custom_values: true
members:
- id: invalid
value: 'Invalid'
@@ -533,3 +530,66 @@
reference the [elasticsearch schema](https://raw.githubusercontent.com/elastic/elasticsearch-specification/main/output/schema/schema.json)
in order to map the path part values to their names.
examples: ['db.elasticsearch.path_parts.index=test-index', 'db.elasticsearch.path_parts.doc_id=123']
- id: registry.db.vector
prefix: db.vector
type: attribute_group
brief: >
This group defines attributes for vector databases.
attributes:
- id: similarity
type:
members:
- id: cosine
value: 'cosine'
brief: >
The cosine metric.
stability: experimental
- id: dot
value: 'dot'
brief: >
The dot product metric.
stability: experimental
- id: euclidean
value: 'euclidean'
brief: >
The euclidean distance metric.
stability: experimental

Check failure on line 556 in model/registry/db.yaml (GitHub Actions / yamllint): [trailing-spaces] trailing spaces
- id: manhattan
value: 'manhattan'
brief: >
The Manhattan distance metric.
stability: experimental
stability: experimental
brief: >
The metric used in similarity search.
examples: 'cosine'
- id: id
type: string
stability: experimental
brief: >
The ID of vector.
examples: '5c56c793-69f3-4fbf-87e6-c4bf54c28c26'
- id: name
type: string
stability: experimental
brief: >
The name field as of the vector (e.g. a field name).
examples: 'vector'
- id: dimension
type: int
stability: experimental
brief: >
The dimension of the vector.
examples: [3]
- id: model
type: string
stability: experimental
brief: >
The model used for the embedding.
examples: 'text-embedding-3-small'
- id: query.top_k
@lmolkova (Contributor), Jul 26, 2024:

I think we should come up with a more general attribute that is not specific to vector DBs.
Many databases allow limiting the number of returned rows:

Suggesting db.query.max_returned_items. The actual returned count could be even better - db.query.item_count could mean items inserted or returned depending on the operation.

Author:

@lmolkova I see the similarity here, but I think db.vector.query.top-k is more specific from a semantic point of view and more closely tied to vectors, since it specifies the top k results in order, starting from the most similar. In semantic search, every result carries a similarity value, which a standard database does not have. The SQL LIMIT parameter returns the first k results, but not in any inherent order; the order depends on how you build the query (e.g. using ORDER BY).
I personally think we should keep db.vector.query.top-k and potentially add a db.query.limit (or db.query.max_returned_items as you suggested) in a separate PR.

Contributor:

Out of the few dbs I checked, they use limit in vector search

So we're saying that DB instrumentations will need to detect whether a query is related to vector search and, depending on this, populate top-k or limit. That's difficult or impossible but, most importantly, inconsistent, and it depends on instrumentation capabilities.

I.e. instrumentations that don't have vector-db specifics and those that do will use different attributes for the same thing.

So, I'd still prefer db.query.limit or something similar (and it should be under the same condition as db.query.text - we cannot require instrumentations to do query parsing)

Author:

@lmolkova do you agree that top-k and limit are two different concepts, based on my previous comment? If they are I think we cannot use a single attribute (e.g. db.query.limit) to manage both.

Author:

@lmolkova just a reminder for this, thanks.

Contributor:

@ezimuel I see that databases use both terms to describe the same thing (see my comment above).

Let's say you have a postgres query like SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5; - the general-purpose DB instrumentation can report limit. If you make it understand vector search syntax, it may be able to use top_k instead, but that would be inconsistent and unfamiliar for those who use vector search in postgres.

Author:

@lmolkova I see your point, but I think top-k has a different meaning from limit. If you are using a relational database as a vector DB, limit is fine since you are building an SQL statement and you specify the order. But if you are using a native vector database (e.g. Qdrant), top-k is more relevant, since "top" implies an order based on a similarity metric.

I think we should add both:

  • db.query.limit
  • db.vector.query.top-k

Author:

@lmolkova I found that SELECT LIMIT is not part of the SQL standard. The standard-compliant form is FETCH FIRST. Moreover, I discovered that we already have db.operation.parameter.<key> in the semantic conventions. This means we can have db.operation.parameter.fetch-first, so I don't think we need to add a limit or fetch-first under db.query. What do you think?

At the same time, I believe we should include db.vector.query.top-k. As noted earlier, it differs in meaning from fetch-first and, additionally, this parameter is much easier to retrieve in a vector database because it's not part of any statement.

type: int
stability: experimental
brief: >
The top-k most similar vectors returned by a query.
examples: [5]
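
The distinction argued in the review thread, between a top-k similarity search and a plain row limit, can be made concrete with a small Python sketch. Everything here is hypothetical and for illustration only: the in-memory `records` store, the `top_k` and `limit` helpers, and the choice of cosine similarity as the ranking metric are assumptions, not part of the proposal or of any database client.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Vector, records: Dict[str, Vector], k: int) -> List[str]:
    """Return the ids of the k records most similar to `query`,
    ordered from most to least similar."""
    scored = [(cosine(query, vec), rid) for rid, vec in records.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rid for _, rid in scored[:k]]


def limit(records: Dict[str, Vector], k: int) -> List[str]:
    """Return the first k record ids in storage order, with no ranking."""
    return list(records)[:k]


# Hypothetical store: record id -> embedding.
records = {
    "a": [0.0, 1.0],
    "b": [1.0, 0.0],
    "c": [0.9, 0.1],
}

nearest = top_k([1.0, 0.0], records, k=2)   # ranked by similarity to the query
first_two = limit(records, k=2)             # whatever happens to come first
```

With the same `k`, the two calls select different records: `top_k` ranks by similarity to the query, while `limit` only truncates. The PostgreSQL query quoted in the thread gets the ranked behavior by combining `ORDER BY` with `LIMIT`, which is why both positions in the discussion are defensible.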

Check failure on line 595 in model/registry/db.yaml (GitHub Actions / yamllint): [new-line-at-end-of-file] no new line character at the end of file