Add client semantic conventions for socket connections #756

Closed
9 changes: 9 additions & 0 deletions .chloggen/756.yaml
@@ -0,0 +1,9 @@
change_type: enhancement

component: connection

note: Add semantic conventions for client connections

issues: [454, 756]

subtext:
20 changes: 20 additions & 0 deletions docs/attributes-registry/connection.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: Connection
--->

# Connection

These attributes may be used to describe the socket connection.

<!-- semconv connection(omit_requirement_level) -->
| Attribute | Type | Description | Examples |
|---|---|---|---|
| `connection.state` | string | State of the connection in the connection pool. | `active` |

`connection.state` has the following list of well-known values. If one of them applies, then the respective value MUST be used, otherwise a custom value MAY be used.

| Value | Description |
|---|---|
| `active` | Connection is being used. |
| `idle` | Connection is idle. |
<!-- endsemconv -->
19 changes: 19 additions & 0 deletions docs/connection/README.md
@@ -0,0 +1,19 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: Socket connection
path_base_for_github_subdir:
from: tmp/semconv/docs/connection/_index.md
to: connection/README.md
--->

# Semantic Conventions for Socket Connections

**Status**: [Experimental][DocumentStatus]

This document defines semantic conventions for socket connections.

Semantic conventions for socket connections are defined for the following signals:

- [Connection Spans](connection-spans.md): Semantic Conventions for modeling connections as _spans_.
- [Connection Metrics](connection-metrics.md): Semantic Conventions for recording connection metrics.

[DocumentStatus]: https://github.com/open-telemetry/opentelemetry-specification/tree/v1.26.0/specification/document-status.md
107 changes: 107 additions & 0 deletions docs/connection/connection-metrics.md
@@ -0,0 +1,107 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: Connection Metrics
--->

# Semantic Conventions for Connection Metrics

This document defines semantic conventions to apply when instrumenting the client side of socket connections with metrics.

**Status**: [Experimental][DocumentStatus]

<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->

<!-- toc -->

- [Common attributes](#common-attributes)
- [Metric: `connection.client.connect_duration`](#metric-connectionclientconnect_duration)
- [Metric: `connection.client.duration`](#metric-connectionclientduration)
- [Metric: `connection.client.open_connections`](#metric-connectionclientopen_connections)

<!-- tocstop -->

## Common attributes

All connection metrics share the same set of attributes:

<!-- semconv metric_attributes.connection.client(full) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| [`error.type`](../attributes-registry/error.md) | string | Describes a class of error the operation ended with. [1] | `econnreset`; `econnrefused`; `address_family_not_supported`; `java.net.SocketException` | Conditionally Required: [2] |
| [`network.peer.address`](../attributes-registry/network.md) | string | Peer address of the network connection - IP address or Unix domain socket name. [3] | `10.1.2.80`; `/tmp/my.sock` | Recommended: see the note below |
| [`network.peer.port`](../attributes-registry/network.md) | int | Peer port number of the network connection. | `65123` | Recommended: if `network.peer.address` is set. |
| [`network.transport`](../attributes-registry/network.md) | string | [OSI transport layer](https://osi-model.com/transport-layer/) or [inter-process communication method](https://wikipedia.org/wiki/Inter-process_communication). [4] | `tcp`; `udp` | Recommended |
| [`network.type`](../attributes-registry/network.md) | string | [OSI network layer](https://osi-model.com/network-layer/) or non-OSI equivalent. [5] | `ipv4`; `ipv6` | Recommended |
| [`server.address`](../attributes-registry/server.md) | string | Server domain name if available without reverse DNS lookup; otherwise, IP address or Unix domain socket name. [6] | `example.com`; `10.1.2.80`; `/tmp/my.sock` | Conditionally Required: if available without reverse DNS lookup |

**[1]:** It's REQUIRED to document error types instrumentation produces. It's RECOMMENDED to use error codes provided by the socket library, runtime, or the OS (such as `connect` method error codes on [Linux or other POSIX systems](https://man7.org/linux/man-pages/man2/connect.2.html#ERRORS) or [Windows](https://docs.microsoft.com/windows/win32/api/winsock2/nf-winsock2-connect#return-value)).

**[2]:** If and only if a connection (attempt) ended with an error.

**[3]:** The `network.peer.address` attribute could have high cardinality. In practice, however, its cardinality is limited to the number of distinct IP addresses for the given domain name, which is small when the destination service is behind a load balancer or NAT.
Connection instrumentations MAY set `network.peer.address` by default or let users opt in to collecting it. If instrumentation collects `network.peer.address` by default, it MUST allow users to opt out of `network.peer.address` collection or disable collection of all connection metrics that set the attribute.

**[4]:** The value SHOULD be normalized to lowercase.

Consider always setting the transport when setting a port number, since
a port number is ambiguous without knowing the transport. For example
different processes could be listening on TCP port 12345 and UDP port 12345.

**[5]:** The value SHOULD be normalized to lowercase.

**[6]:** When observed from the client side, and when communicating through an intermediary, `server.address` SHOULD represent the server address behind any intermediaries, for example proxies, if it's available.

`error.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used, otherwise a custom value MAY be used.

| Value | Description |
|---|---|
| `_OTHER` | A fallback error value to be used when the instrumentation doesn't define a custom value. |

`network.transport` has the following list of well-known values. If one of them applies, then the respective value MUST be used, otherwise a custom value MAY be used.

| Value | Description |
|---|---|
| `tcp` | TCP |
| `udp` | UDP |
| `pipe` | Named or anonymous pipe. |
| `unix` | Unix domain socket |

`network.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used, otherwise a custom value MAY be used.

| Value | Description |
|---|---|
| `ipv4` | IPv4 |
| `ipv6` | IPv6 |
<!-- endsemconv -->
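
As an illustration only, the requirement levels in the table above could be applied as in the following Python sketch. The function name `connection_metric_attributes` and the `collect_peer_address` opt-out switch are hypothetical and are not defined by this convention; the attribute names and rules are taken from the table and its notes.

```python
from typing import Optional


def connection_metric_attributes(
    peer_address: Optional[str] = None,
    peer_port: Optional[int] = None,
    transport: Optional[str] = None,
    network_type: Optional[str] = None,
    server_address: Optional[str] = None,
    error_type: Optional[str] = None,
    collect_peer_address: bool = True,  # hypothetical opt-out switch, see note [3]
) -> dict:
    """Assemble the attribute set shared by the connection metrics."""
    attrs = {}
    if collect_peer_address and peer_address is not None:
        attrs["network.peer.address"] = peer_address
        # network.peer.port is recommended only if network.peer.address is set.
        if peer_port is not None:
            attrs["network.peer.port"] = peer_port
    if transport is not None:
        attrs["network.transport"] = transport.lower()  # note [4]: lowercase
    if network_type is not None:
        attrs["network.type"] = network_type.lower()  # note [5]: lowercase
    if server_address is not None:
        attrs["server.address"] = server_address
    # error.type is set if and only if the connection (attempt) ended with an error.
    if error_type is not None:
        attrs["error.type"] = error_type
    return attrs
```

With `collect_peer_address=False`, both `network.peer.address` and `network.peer.port` are dropped, which is one possible way to honor the opt-out requirement in note [3].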

## Metric: `connection.client.connect_duration`

This metric is [recommended][MetricRequirementLevel].

<!-- semconv metric.connection.client.connect_duration(metric_table) -->
| Name | Instrument Type | Unit (UCUM) | Description |
| -------- | --------------- | ----------- | -------------- |
| `connection.client.connect_duration` | Histogram | `s` | The duration of the attempt to establish connection. |
<!-- endsemconv -->

## Metric: `connection.client.duration`

This metric is [recommended][MetricRequirementLevel].

<!-- semconv metric.connection.client.duration(metric_table) -->
| Name | Instrument Type | Unit (UCUM) | Description |
| -------- | --------------- | ----------- | -------------- |
| `connection.client.duration` | Histogram | `s` | The duration of the successfully established outbound connection. |
<!-- endsemconv -->

## Metric: `connection.client.open_connections`

This metric is [recommended][MetricRequirementLevel].

<!-- semconv metric.connection.client.open_connections(metric_table) -->
| Name | Instrument Type | Unit (UCUM) | Description |
| -------- | --------------- | ----------- | -------------- |
| `connection.client.open_connections` | UpDownCounter | `{connection}` | Number of outbound connections that are currently open. |
<!-- endsemconv -->
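
The three metrics above could be recorded as in the following minimal sketch. It uses plain in-memory structures as stand-ins for real Histogram and UpDownCounter instruments; the class `InMemoryConnectionMetrics`, its methods, and the injectable `clock` are hypothetical names chosen for illustration. The sketch also assumes that `connection.client.connect_duration` is recorded for failed attempts as well as successful ones.

```python
import time
from collections import defaultdict


class InMemoryConnectionMetrics:
    """Toy recorder standing in for real Histogram/UpDownCounter instruments."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.histograms = defaultdict(list)  # metric name -> recorded seconds
        self.open_connections = 0  # connection.client.open_connections

    def connect(self, do_connect):
        """Time `do_connect()` and record connection.client.connect_duration."""
        start = self._clock()
        try:
            conn = do_connect()
        finally:
            # Recorded on failure too; a real instrumentation would add
            # error.type to the attributes in that case.
            self.histograms["connection.client.connect_duration"].append(
                self._clock() - start
            )
        self.open_connections += 1
        return conn, self._clock()

    def close(self, opened_at):
        """Record connection.client.duration for an established connection."""
        self.histograms["connection.client.duration"].append(
            self._clock() - opened_at
        )
        self.open_connections -= 1
```

A caller would pass a callable that performs the actual socket `connect`, keep the returned timestamp, and hand it back to `close` when the connection terminates.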

[MetricRequirementLevel]: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.26.0/specification/metrics/metric-requirement-level.md
[DocumentStatus]: https://github.com/open-telemetry/opentelemetry-specification/tree/v1.26.0/specification/document-status.md
180 changes: 180 additions & 0 deletions docs/connection/connection-spans.md
@@ -0,0 +1,180 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: Connection Spans
--->

# Semantic Conventions for Connection Spans

This document defines semantic conventions to apply when instrumenting the client side of socket connections with spans.

With HTTP/3 over QUIC, the connection is virtual and may span multiple UDP packets. The client IP/network may even change during the lifetime of the connection, for example switching between Wi-Fi and cellular when a mobile client moves out of range.
Rather than tying this directly to a socket, the type can be tracked by an additional type property. This same concept can then be used for database, http and a range of scenarios, but with optional attributes based on the scenario.

Contributor Author

@lmolkova lmolkova Apr 18, 2024

HTTP/3 still operates on top of UDP sockets.
I'm not an expert, but I believe that, from the socket perspective, different connections are still established when QUIC connection migration happens. The only thing it saves is the TLS handshake, which won't happen again during migration.

It's a good question how to represent a QUIC logical connection, but given it's such a long-lived thing, I don't see why we can't have a span for it and spans for all the underlying real socket connections it creates.


**Status**: [Experimental][DocumentStatus]

<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->

<!-- toc -->

* [Span name](#span-name)
* [Attributes](#attributes)
* [Examples](#examples)
  * [Successful connection](#successful-connection)
  * [Successful connect, but connection terminates with an error](#successful-connect-but-connection-terminates-with-an-error)
  * [Attempt to establish connection ends with `econnrefused` error](#attempt-to-establish-connection-ends-with-econnrefused-error)
  * [Relationship with application protocols such as HTTP](#relationship-with-application-protocols-such-as-http)
  * [Connection retry example](#connection-retry-example)

<!-- tocstop -->

This convention defines two types of spans:

- `connect` span: describes the process of establishing a connection. It corresponds to the `connect` function ([Linux or other POSIX systems](https://man7.org/linux/man-pages/man2/connect.2.html) /
[Windows](https://docs.microsoft.com/windows/win32/api/winsock2/nf-winsock2-connect)).
- `connection` span: describes the connection lifetime. It starts right after the connection is successfully established and ends when the connection terminates.

Not sure if we need both, connect and connection, or if all the data can be represented in the connection.
In an HTTP case, the equivalent of connect is a wire-request - as the typical http request span tracks a logical operation rather than what happens on the wire. If there is auto-redirection for example as part of the http library, then the http request may actually result in multiple wire-requests as it retrieves the redirect and then makes a subsequent call for the data.

Contributor Author

It's important to know how long it takes to establish a connection, and important to know whether a connection was ever established and then terminated.

We could potentially have one span for the connection and indicate when the connection was established with an event, but I'd still argue that we need two separate metrics.


If the `connect` span ends with an error (the connection cannot be established), the `connection` span SHOULD NOT be created.

If a connection can be reused in multiple independent operations, instrumentation SHOULD create the `connection` span as a root span in a new trace. The `connection` span should link to the `connect` span. This avoids associating a long-lived connection span with the trace that coincidentally started it.

Both spans SHOULD be of a `CLIENT` kind.
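
The lifecycle rules above can be sketched with a toy span model. This is a minimal illustration, not an OpenTelemetry API: the `Span` dataclass and the helpers `connect_span` and `connection_span` are hypothetical names, and the sketch assumes the reusable-connection case, where the `connection` span becomes a root span in a new trace.

```python
import itertools
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

_ids = itertools.count(1)  # toy id generator, for illustration only


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    kind: str = "CLIENT"  # both span types are CLIENT spans
    parent_span_id: Optional[str] = None
    links: List[Tuple[str, str]] = field(default_factory=list)  # (trace_id, span_id)
    attributes: dict = field(default_factory=dict)


def connect_span(trace_id, parent_span_id, error_type=None):
    """Create a `connect` span inside the trace that initiated the connection."""
    span = Span("connect", trace_id, f"s{next(_ids)}", parent_span_id=parent_span_id)
    if error_type is not None:
        span.attributes["error.type"] = error_type
    return span


def connection_span(connect):
    """Return a root `connection` span linked to `connect`, or None if connect failed."""
    if "error.type" in connect.attributes:
        return None  # connection was never established: no `connection` span
    return Span(
        "connection",
        trace_id=f"t{next(_ids)}",  # new trace, so the span has no parent
        span_id=f"s{next(_ids)}",
        links=[(connect.trace_id, connect.span_id)],
    )
```

The link, rather than a parent-child relationship, is what keeps the long-lived `connection` span out of the trace that happened to open it.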

## Span name

The **span names** SHOULD match `connect` or `connection` depending on the span type.

## Attributes

The `connect` and `connection` spans share the same list of attributes:

<!-- semconv span_attributes.connection.client(full) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| [`error.type`](../attributes-registry/error.md) | string | Describes a class of error the operation ended with. [1] | `econnreset`; `econnrefused`; `address_family_not_supported`; `java.net.SocketException` | Conditionally Required: [2] |
| [`network.local.port`](../attributes-registry/network.md) | int | Local port number of the network connection. | `65123` | Recommended |
| [`network.peer.address`](../attributes-registry/network.md) | string | Peer address of the network connection - IP address or Unix domain socket name. | `10.1.2.80`; `/tmp/my.sock` | Required |
| [`network.peer.port`](../attributes-registry/network.md) | int | Peer port number of the network connection. | `65123` | Conditionally Required: when applicable |
| [`network.transport`](../attributes-registry/network.md) | string | [OSI transport layer](https://osi-model.com/transport-layer/) or [inter-process communication method](https://wikipedia.org/wiki/Inter-process_communication). [3] | `tcp`; `udp` | Recommended |
| [`network.type`](../attributes-registry/network.md) | string | [OSI network layer](https://osi-model.com/network-layer/) or non-OSI equivalent. [4] | `ipv4`; `ipv6` | Recommended |
| [`server.address`](../attributes-registry/server.md) | string | Server domain name if available without reverse DNS lookup; otherwise, IP address or Unix domain socket name. [5] | `example.com`; `10.1.2.80`; `/tmp/my.sock` | Conditionally Required: if available without reverse DNS lookup |

Is this where we would add network.protocol.name, network.protocol.version, tls.version?

Contributor Author

Do you think they'd be useful on connection spans/metrics?

network.protocol.* describe application-level protocol, not a transport-level thing.
You can send AMQP or HTTP over the socket connection - the connection does not care and does not need to know.

For TLS and DNS we'll need new spans that are not described in this PR.


**[1]:** It's REQUIRED to document error types instrumentation produces. It's RECOMMENDED to use error codes provided by the socket library, runtime, or the OS (such as `connect` method error codes on [Linux or other POSIX systems](https://man7.org/linux/man-pages/man2/connect.2.html#ERRORS) or [Windows](https://docs.microsoft.com/windows/win32/api/winsock2/nf-winsock2-connect#return-value)).

**[2]:** If and only if a connection (attempt) ended with an error.

**[3]:** The value SHOULD be normalized to lowercase.

Consider always setting the transport when setting a port number, since
a port number is ambiguous without knowing the transport. For example
different processes could be listening on TCP port 12345 and UDP port 12345.

**[4]:** The value SHOULD be normalized to lowercase.

**[5]:** When observed from the client side, and when communicating through an intermediary, `server.address` SHOULD represent the server address behind any intermediaries, for example proxies, if it's available.

`error.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used, otherwise a custom value MAY be used.

| Value | Description |
|---|---|
| `_OTHER` | A fallback error value to be used when the instrumentation doesn't define a custom value. |

`network.transport` has the following list of well-known values. If one of them applies, then the respective value MUST be used, otherwise a custom value MAY be used.

| Value | Description |
|---|---|
| `tcp` | TCP |
| `udp` | UDP |
| `pipe` | Named or anonymous pipe. |
| `unix` | Unix domain socket |

`network.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used, otherwise a custom value MAY be used.

| Value | Description |
|---|---|
| `ipv4` | IPv4 |
| `ipv6` | IPv6 |
<!-- endsemconv -->
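
As a rough sketch, `network.transport` and `network.type` could be derived from the socket's address family and socket type. The helper name `network_attributes` is hypothetical. The sketch assumes that a stream socket over IP uses TCP and a datagram socket over IP uses UDP, which holds for the default protocol but not for every possible protocol argument.

```python
import socket


def network_attributes(family, sock_type):
    """Derive `network.transport` and `network.type` for a socket."""
    attrs = {}
    # AF_UNIX is not available on every platform, hence the getattr.
    if family == getattr(socket, "AF_UNIX", None):
        attrs["network.transport"] = "unix"
        return attrs  # network.type does not apply to Unix domain sockets
    if family == socket.AF_INET:
        attrs["network.type"] = "ipv4"
    elif family == socket.AF_INET6:
        attrs["network.type"] = "ipv6"
    # Assumption: default protocol for the socket type (TCP or UDP).
    if sock_type == socket.SOCK_STREAM:
        attrs["network.transport"] = "tcp"
    elif sock_type == socket.SOCK_DGRAM:
        attrs["network.transport"] = "udp"
    return attrs
```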

## Examples

### Successful connection

A successful connection attempt to `"/tmp/my.sock"` results in the following span:

| Attribute name | Value |
| :--------------------- | :-------------------|
| name | `"connect"` |
| `network.peer.address` | `"/tmp/my.sock"` |
| `network.transport` | `"unix"` |

Once the corresponding connection is gracefully closed, another span is reported:

| Attribute name | Value |
| :--------------------- | :-------------------|
| name | `"connection"` |
| `network.peer.address` | `"/tmp/my.sock"` |
| `network.transport` | `"unix"` |

### Successful connect, but connection terminates with an error

A successful connection attempt to `example.com` results in the following span:
> Note: DNS lookup is outside of the scope of this semantic convention

While I agree that we shouldn't be trying to include all the dns info in the connection - having an event/marker for timing that indicates when it was complete would be helpful.
If dns is being tracked with its own spans - having a convention for linking from the connection span to dns would make sense.


| Attribute name | Value |
| :--------------------- | :-------------------|
| name | `"connect"` |
| `server.address` | `"example.com"` |
| `network.peer.address` | `"93.184.216.34"` |
| `network.peer.port` | `443` |
| `network.transport` | `"tcp"` |
| `network.type` | `"ipv4"` |

But then after some packet exchange, the connection is reset:

| Attribute name | Value |
| :--------------------- | :-------------------|
| name | `"connection"` |
| `server.address` | `"example.com"` |
| `network.peer.address` | `"93.184.216.34"` |
| `network.peer.port` | `443` |
| `network.transport` | `"tcp"` |
| `network.type` | `"ipv4"` |
| `error.type` | `econnreset` |

### Attempt to establish connection ends with `econnrefused` error

An attempt to establish a connection to `127.0.0.1:8080` without any application
listening on this port results in the following span:

| Attribute name | Value |
| :--------------------- | :-------------------|
| name | `"connect"` |
| `network.peer.address` | `"127.0.0.1"` |
| `network.peer.port` | `8080` |
| `network.transport` | `"tcp"` |
| `network.type` | `"ipv4"` |
| `error.type` | `econnrefused` |
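
One possible way to produce `error.type` values such as `econnrefused` is to map OS error codes to their symbolic names. The sketch below uses Python's standard `errno` module; the helper name `error_type_from_exception` is hypothetical, and falling back to `_OTHER` for everything without a known error code is an assumption, since an instrumentation may define its own custom values instead. Symbolic names can differ between platforms.

```python
import errno


def error_type_from_exception(exc: BaseException) -> str:
    """Map an exception raised by a socket call to a low-cardinality `error.type`."""
    code = getattr(exc, "errno", None)
    if isinstance(exc, OSError) and code in errno.errorcode:
        # For example, ECONNREFUSED becomes "econnrefused".
        return errno.errorcode[code].lower()
    # Fallback when no OS error code is available.
    return "_OTHER"
```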

### Relationship with application protocols such as HTTP

HTTP becomes interesting with connection pooling and the ability to do either sequential (HTTP/1.1) or parallel (HTTP/2, HTTP/3) requests over the same connection.
We would probably want some form of event for when the wire-requests are put onto a specific connection. Similarly, DNS lookup and TLS handshake are events that occur as part of connection establishment and are important to collect in some way for deeper diagnostics. Should they be events as defined by tracing, or log messages tagged with the same traceid/spanid?

Contributor Author

@lmolkova lmolkova Apr 18, 2024

Why events? We have links to correlate a request to a connection, and we can put attributes on them if any are necessary.

Do you want to capture the moment in time when the request is associated with the connection? We don't capture it on links yet, but we can start. Recording both a link and an event is overkill.

I wonder if DNS and TLS should be spans or events. Since they involve the network and have non-zero duration, spans would work better (but will be slightly less performant).
Given that connections live much longer than requests, the volume of such spans would be low and perf should not be a big deal.

I am thinking that this needs to be a dial that ops can turn based on how much data that they want to collect. Using the scenario of an HttpClient call (outgoing):

  • You have HttpClient spans (implemented today) - these are really tracking a logical request rather than what is physically happening on the wire.
  • The next level would be a chain of physical requests - in the case of redirection, it could be multiple before the final request, or it could do a continue to resume collection of data.
  • The connections themselves are longer lived beyond a single request. They have DNS and TLS as part of the initialization.
  • DNS lookup for a connection
  • TLS negotiation

In most cases you probably don't want to collect all of these all the time. However I can see ops turning them on when needed to collect more specific diagnostics data. So can we make it adaptive, and be able to correlate the data when applicable?

I am wondering if an "event" + optional link approach would be best. When a request is put on the wire for a connection - you'd get an event - that way you kind of know what the delay was before your request was processed. If the connections are being tracked, then that "event" would have a link to the connection span, so you could correlate them together.
Similarly a connection would have an "event" for when the dns resolution has occurred, and TLS negotiation is complete. If either are being tracked, that event would link to their respective spans - although those could probably be parented to the connection.

I use "event" in quotes as I am told the future of events on spans is unclear - it could be done with a log message instead.


It may be impossible to record any relationships between HTTP spans and connection-level spans when connections are pooled and reused.

The following picture demonstrates an ideal case in which recording such relationships (via span links) is possible.

![connection-spans-and-application-protocols.png](connection-spans-and-application-protocols.png)

### Connection retry example

Example of retries when attempting to connect:

```
HTTP request attempt 1 (trace=t1, span=s1)
|
-- domain name resolution (not covered here)
|
-- connect(127.0.0.1:8080) - timeout (trace=t1, span=s2, error.type=timeout)
|
HTTP request attempt 2 (trace=t1, span=s3)
|
-- connect(127.0.0.1:8080) - (trace=t1, span=s4)

connection(127.0.0.1:8080) - (trace=t2, span=s5, link=t1:s4)
```

[DocumentStatus]: https://github.com/open-telemetry/opentelemetry-specification/tree/v1.26.0/specification/document-status.md
22 changes: 22 additions & 0 deletions model/connection.yaml
@@ -0,0 +1,22 @@
groups:
  - id: common_attributes.connection.client
    type: attribute_group
    brief: >
      Describes common client connection attributes
    attributes:
      - ref: network.peer.address
      - ref: network.peer.port
      - ref: server.address
        requirement_level:
          conditionally_required: if available without reverse DNS lookup
      - ref: error.type
        requirement_level:
          conditionally_required: If and only if a connection (attempt) ended with an error.
        note: >
          It's REQUIRED to document error types instrumentation produces.
          It's RECOMMENDED to use error codes provided by the socket library, runtime, or the OS
          (such as `connect` method error codes on [Linux or other POSIX systems](https://man7.org/linux/man-pages/man2/connect.2.html#ERRORS) or
          [Windows](https://docs.microsoft.com/windows/win32/api/winsock2/nf-winsock2-connect#return-value)).
        examples: ["econnreset", "econnrefused", "address_family_not_supported", "java.net.SocketException"]
      - ref: network.transport
      - ref: network.type