Emit OpenTelemetry metrics #2211

Open
roji opened this issue Nov 7, 2023 · 11 comments
Labels
💡 Enhancement Issues that are feature requests for the drivers we maintain.

Comments

roji commented Nov 7, 2023

OpenTelemetry has become the de facto standard for distributed, cross-platform tracing; it is being rapidly adopted both inside and outside the .NET ecosystem. OTel defines semantic conventions for how a database client emits metrics, which are aggregated, numeric counters describing the state and performance of the driver (e.g. number of commands executed, idle/busy connections in the pool, etc.). These specifications are currently in an experimental state, but stabilization is likely to start soon.

Note that unlike with tracing, AFAIK there's no way to get data like this out of SqlClient at the moment. This sort of data can be invaluable in diagnosing performance problems (e.g. by inspecting real-time connection pool state) and tracking what's going on.
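To make the ask concrete, here's a minimal sketch of the consuming side once such metrics exist, using the OpenTelemetry .NET SDK. The meter name is purely hypothetical, since SqlClient doesn't register one today:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Sketch only: "Microsoft.Data.SqlClient" is a hypothetical meter name; whatever
// meter SqlClient ends up registering would be listed here instead.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("Microsoft.Data.SqlClient")
    // Swap in AddOtlpExporter() (or any other exporter) to ship to a real backend.
    .AddConsoleExporter()
    .Build();
```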

JRahnama commented Nov 7, 2023

Is this a duplicate of #2210 ?

Forget my comment. I just noticed the difference.

@gfoidl

This comment was marked as outdated.

roji commented Nov 8, 2023

Thanks @gfoidl, corrected the link.

edwardneal commented

SqlClient actually already emits some metrics, but their reporting is a little spotty. On .NET Framework on Windows, every process which uses the library has a set of performance counters (listed here). On .NET Core and .NET Standard 2.1 and above, Microsoft.Data.SqlClient.EventSource has a number of event counters (listed here). Although the counter names are slightly different, their meanings are the same:

| Performance counter | Event counter | Unit | Description |
| --- | --- | --- | --- |
| (none) | active-hard-connections | count | Actual active connections currently made to servers |
| HardConnectsPerSecond | hard-connects | count / sec | The number of actual connections per second that are being made to servers |
| HardDisconnectsPerSecond | hard-disconnects | count / sec | The number of actual disconnects per second that are being made to servers |
| (none) | active-soft-connects | count | Active connections retrieved from the connection pool |
| SoftConnectsPerSecond | soft-connects | count / sec | The number of connections we get from the pool per second |
| SoftDisconnectsPerSecond | soft-disconnects | count / sec | The number of connections we return to the pool per second |
| NumberOfNonPooledConnections | number-of-non-pooled-connections | count | The number of connections that are not using connection pooling |
| NumberOfPooledConnections | number-of-pooled-connections | count | The number of connections that are managed by the connection pooler |
| NumberOfActiveConnectionPoolGroups | number-of-active-connection-pool-groups | count | The number of unique connection strings |
| NumberOfInactiveConnectionPoolGroups | number-of-inactive-connection-pool-groups | count | The number of unique connection strings waiting for pruning |
| NumberOfActiveConnectionPools | number-of-active-connection-pools | count | The number of active connection pools |
| NumberOfInactiveConnectionPools | number-of-inactive-connection-pools | count | The number of inactive connection pools |
| NumberOfActiveConnections | number-of-active-connections | count | The number of connections currently in use |
| NumberOfFreeConnections | number-of-free-connections | count | The number of connections currently available for use |
| NumberOfStasisConnections | number-of-stasis-connections | count | The number of connections currently waiting to be made ready for use |
| NumberOfReclaimedConnections | number-of-reclaimed-connections | count | The number of connections we reclaim from GC'd external connections |

The SoftConnectsPerSecond, SoftDisconnectsPerSecond, NumberOfActiveConnections and NumberOfFreeConnections performance counters are only created if the ConnectionPoolPerformanceCounterDetail trace switch is set to Verbose.

Having a consistent interface to increment/decrement the metrics would be helpful for code merging, but the real value would probably come from making System.Diagnostics.Metrics the primary metrics source. I think the right thing to do would be to also add an AppContext switch to control the legacy metrics sources - but I'm not sure whether it'd be better to have 6.0 disable the legacy metrics sources by default and drop them in 7.0, or to have 6.0 implement the new metrics source, 7.0 disable the legacy ones by default, and 8.0 drop them.
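For illustration, such a switch could be read with the standard AppContext pattern. The switch name below is made up and only shows the shape of the opt-out; it's a sketch, not SqlClient's actual switch handling:

```csharp
using System;

internal static class LegacyMetricsSwitches
{
    // Hypothetical switch name, used only to illustrate the AppContext pattern.
    private const string DisableLegacyMetricsSwitchName =
        "Switch.Microsoft.Data.SqlClient.DisableLegacyMetrics";

    // True when the application has opted out of the legacy performance/event counters;
    // the existing counter updates would be guarded by this check.
    internal static bool LegacyMetricsDisabled =>
        AppContext.TryGetSwitch(DisableLegacyMetricsSwitchName, out bool disabled) && disabled;
}
```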

For the current metrics collection, perhaps OpenTelemetry could listen to these event counters or performance counters?
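I believe that's already possible today with the OpenTelemetry.Instrumentation.EventCounters contrib package; a rough sketch, assuming that package's current (pre-release) API:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Sketch: re-export the existing SqlClient event counters through OpenTelemetry,
// assuming the OpenTelemetry.Instrumentation.EventCounters contrib package.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddEventCountersInstrumentation(options =>
        options.AddEventSources("Microsoft.Data.SqlClient.EventSource"))
    .AddConsoleExporter()
    .Build();
```

That's a workaround rather than a substitute for native System.Diagnostics.Metrics support, since the counter names and shapes wouldn't match the OTel semantic conventions.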

roji commented Mar 1, 2024

@edwardneal there's generally no specific need to drop the old metrics or introduce an app context switch for them; as long as they're not listened to, their perf overhead is negligible. So this really is about emitting OpenTelemetry metrics via System.Diagnostics.Metrics.

edwardneal commented

That makes sense, thanks. There's only one mandatory counter (db.client.connections.usage), which maps neatly to the existing counters of number-of-active-connections and number-of-free-connections for state attributes of "used" and "idle" respectively. If the specification allows us to add extra state attribute values, I'd add "stasis" and "reclaimed" for the number-of-stasis-connections and number-of-reclaimed-connections counters. The concept of a connection which is currently being brought back into an issuable state might make sense for the upstream specification.
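As a sketch of how that mapping could look with System.Diagnostics.Metrics - the meter name, the registration point and the pool accessors are all assumptions, not existing SqlClient APIs:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

internal static class SqlClientMetrics
{
    // Illustrative meter name only; the real name would be decided for SqlClient.
    private static readonly Meter s_meter = new("Microsoft.Data.SqlClient");

    // db.client.connections.usage as one observable up-down counter, with the existing
    // NumberOfActiveConnections / NumberOfFreeConnections values surfaced through
    // hypothetical callbacks and tagged with state=used / state=idle.
    internal static ObservableUpDownCounter<int> CreateUsageCounter(
        string poolName, Func<int> activeCount, Func<int> freeCount) =>
        s_meter.CreateObservableUpDownCounter<int>(
            "db.client.connections.usage",
            () => new[]
            {
                new Measurement<int>(activeCount(),
                    new KeyValuePair<string, object?>("state", "used"),
                    new KeyValuePair<string, object?>("pool.name", poolName)),
                new Measurement<int>(freeCount(),
                    new KeyValuePair<string, object?>("state", "idle"),
                    new KeyValuePair<string, object?>("pool.name", poolName)),
            },
            unit: "{connection}");
}
```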

Besides the mandatory counter, the other standard metrics are:

  • db.client.connections.idle.max: I read this to be a configuration parameter: the maximum number of idle connections allowed, rather than the maximum number of idle connections which the pool has historically possessed. I think this'll always be the value of DbConnectionPool.MaxPoolSize.
  • db.client.connections.idle.min: Similar principle as above, but it'd be DbConnectionPool.MinPoolSize.
  • db.client.connections.max: DbConnectionPool has a maximum capacity, and any connection in its backing store can be idle or used, so I'd say that this should be the same as db.client.connections.idle.max. I'm a little unsure on whether to include this though - the .NET connection pool has minimum + maximum sizes, while I'm a bit unclear on what the semantics expect. This comment seems to suggest that clients should expect .max (or .limit) to be the sum of .usage. That won't be true in our implementation.
  • db.client.connections.pending_requests, db.client.connections.timeouts, db.client.connections.create_time: Not currently instrumented, but probably should be. The integration points for them exist.
  • db.client.connections.wait_time: As I understand it, this duration will essentially start at the same time as db.client.connections.create_time and end when SqlConnection.Open() completes.
  • db.client.connections.use_time: We already instrument something similar to this and make it available via SqlConnection.RetrieveStatistics().

Outside the standard metrics, the rest of the existing counters should also be mapped. I'd add these as:

  • dotnet.db.client.connection_pools.usage, with a state attribute of "used" and "idle" and the same meaning of pool.name as db.client.connections.usage: number-of-active-connection-pools and number-of-inactive-connection-pools.
  • dotnet.db.client.connection_pool_groups.usage, with a state attribute of "used" and "idle": number-of-active-connection-pool-groups and number-of-inactive-connection-pool-groups
  • dotnet.db.client.connections.hard.usage, with a type attribute of "pooled" and "unpooled": number-of-pooled-connections and number-of-non-pooled-connections.
  • dotnet.db.client.connections.hard.total: active-hard-connections. This is tentative - at a glance, I think this works out as a sum of the metric above, but I'm not absolutely certain.
  • dotnet.db.client.connection.operations.connect, with a type attribute of "hard" and "soft": hard-connects and soft-connects. This is a behaviour change: it'll contain a counter, rather than a rate.
  • dotnet.db.client.connection.operations.disconnect, with a type attribute of "hard" and "soft": hard-disconnects and soft-disconnects. Same behaviour change as above.

Finally, there are some interesting fields in SqlConnection.RetrieveStatistics():

  • dotnet.db.client.connections.command_time: This comes from the ExecutionTime key. It describes the total time spent executing server-side commands. Such a command is usually a user SQL statement, but it could also be SqlConnection.ChangeDatabase, SqlCommand.Prepare, any kind of transaction, or getting a field's value (whether or not that value is part of a packet which has already been received).
  • dotnet.db.client.connections.network_wait_time: This comes from the NetworkServerTime key. It describes the amount of time spent waiting for the TDS response from the remote server.
  • dotnet.db.client.connections.io, with a direction attribute of "receive" and "transmit": BytesReceived and BytesSent.
  • dotnet.db.client.connections.buffers, with a direction attribute of "receive" and "transmit": BuffersReceived and BuffersSent.
  • dotnet.db.client.commands.count, with a type attribute of "insert", "delete", "update", "merge" and "select": CursorOpens, IduCount and SelectCount. I've pulled these apart quite a bit, largely because I don't think "IDU" is particularly useful to clients.
  • dotnet.db.client.commands.rows, with the type attribute above: IduRows and SelectRows.

I started looking to see what other metrics might be available or tracked within the library for exposure, and this response ballooned. If the metrics in the OTel specification can be checked for correctness, there's at least a place to start.

roji commented Mar 1, 2024

> db.client.connections.max: [...] open-telemetry/semantic-conventions#211 (comment) seems to suggest that clients should expect .max (or .limit) to be the sum of .usage. That won't be true in our implementation.

The definition in the specs seems pretty clear and unambiguous to me: "The maximum number of open connections allowed". Max/min connections is a really standard thing across DB connection pools; I certainly implemented it as MaxPoolSize in the Npgsql driver.

In general, metrics on commands are currently missing from the specs. I went ahead and implemented some in Npgsql (code) - if implementing something similar, I'd call these out as experimental and subject to change, since I'm assuming some command-related metrics will make it into the final specs.
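For SqlClient, the equivalent experimental instruments could look roughly like the sketch below. The meter name and the RecordExecution hook are hypothetical, and the instrument names just mirror the Npgsql ones:

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

internal static class CommandMetrics
{
    // Illustrative meter and instrument names, mirroring Npgsql's experimental ones;
    // all of them would be flagged as subject to change until the conventions stabilize.
    private static readonly Meter s_meter = new("Microsoft.Data.SqlClient");

    private static readonly UpDownCounter<int> s_executing =
        s_meter.CreateUpDownCounter<int>("db.client.commands.executing", unit: "{command}");
    private static readonly Counter<long> s_failed =
        s_meter.CreateCounter<long>("db.client.commands.failed", unit: "{command}");
    private static readonly Histogram<double> s_duration =
        s_meter.CreateHistogram<double>("db.client.commands.duration", unit: "s");

    // Hypothetical hook the driver would wrap around command execution.
    internal static void RecordExecution(Action execute)
    {
        long start = Stopwatch.GetTimestamp();
        s_executing.Add(1);
        try
        {
            execute();
        }
        catch
        {
            s_failed.Add(1);
            throw;
        }
        finally
        {
            s_executing.Add(-1);
            s_duration.Record((Stopwatch.GetTimestamp() - start) / (double)Stopwatch.Frequency);
        }
    }
}
```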

edwardneal commented

I was thinking of a situation where a connection could have been allocated by the client and added to the connection pool (to meet MinPoolSize's constraint) but not yet opened, because it either hasn't been used yet or is still in the process of opening. In that situation, it would be neither idle nor used, but its existence in the pool would mean that MaxPoolSize wasn't really the maximum number of open connections, because this closed (or still-opening) connection would reduce the number of slots available for fully opened connections.

From re-checking the logic which creates a connection, it looks like a background thread synchronously opens the connection and adds it to the pool pre-opened (rather than creating it closed and opening it lazily or asynchronously), so my confusion's cleared up now: MaxPoolSize is the correct approach. If that creation methodology changes in the future, though, I think the ambiguity would return.

I agree on the point around command-related metrics. It'd be useful to be able to assign some kind of tag (or dictionary of tags) to a SqlCommand and aggregate at that tag level, but that's probably a better fit for DbCommand. I've prefixed the existing metrics with dotnet because it seems to fit the pattern here, but this'll no doubt need to change when the specs are finalised.

roji commented Mar 2, 2024

> I was thinking of a situation where a connection could have been allocated by the client and added to the connection pool (to meet MinPoolSize's constraint) but not yet opened, because it either hasn't been used yet or is still in the process of opening. In that situation, it would be neither idle nor used [...]

I'm not aware of such a state - I'm also not sure what it would mean for a connection to be "added but not idle". If a physical connection has been opened (which it must have been) and is in the pool in order to meet MinPoolSize, then it seems to me that it's just idle. Also, if there were such a special opened-for-MinPoolSize state, and then it's rented out by the application and then returned, does it then become just regular-idle? That would seem to be very weird, since the pool is exactly in the same state as it originally was.

In any case, I don't think any legitimate usage scenario out there would care about (or want to deal with the added complexity of) an additional third state... The important things to track with metrics are how many total physical connections the application is using, and a breakdown of how many of those are actually in use vs. not in use (to track application efficiency with regards to connection management etc.).

edwardneal commented

No, I fully agree - a background thread synchronously opens the connections, pushing them onto the pool. By the time the pool knows about them, the connections are in exactly the state you're describing.

For what it's worth though, I was thinking about two scenarios:

  1. Large-scale expansion of the connection pool when the connections are slow to open. In that scenario, a two-second connection time would mean it takes 40s to add twenty connections. My first thought there would have been to add the still-opening connection to the pool so that the pool expansion could continue apace, and the first client to actually use the connection would pay any remaining time required to open it. The three states we'd care about in the pool would be "active", "idle" and "starting".
  2. Throttling connections to the SQL Server. In database-constrained scenarios such as Azure SQL Edge, we might want to allocate the memory for a connection on the client end, but not want to consume resources on the SQL Server. There's no ambiguity or background state transitions going on there: the connections added to the pool would simply be closed. The three states here would be "active", "idle" and "closed".

Neither of these scenarios is supported by the pool at the moment, so the point's academic. I can imagine situations where someone might want them, though, so it's worth noting that the metrics might need some adjustment if they're implemented in the future.


I think we could match most of the command metrics in Npgsql in advance of the specs. By way of comparison, I think the full list of metrics should be:

| Proposed metric | Npgsql metric | Notes |
| --- | --- | --- |
| **OpenTelemetry specifications** | | |
| db.client.connections.usage | db.client.connections.usage | Already reported as a counter |
| db.client.connections.idle.max | db.client.connections.idle.max | Report DbConnectionPool's MaxPoolSize |
| db.client.connections.idle.min | db.client.connections.idle.min | Report DbConnectionPool's MinPoolSize |
| db.client.connections.idle.max | db.client.connections.max | Report DbConnectionPool's MaxPoolSize |
| db.client.connections.pending_requests | db.client.connections.pending_requests | To be instrumented |
| db.client.connections.timeouts | db.client.connections.timeouts | To be instrumented |
| db.client.connections.create_time | db.client.connections.create_time | To be instrumented, with units in milliseconds to match spec. Npgsql reports seconds - is this deliberate? |
| db.client.connections.wait_time | | To be instrumented |
| db.client.connections.use_time | | Report based on SqlStatistics' ConnectionTime key |
| **Existing counters** | | |
| dotnet.db.client.connection_pools.usage | | Reported as counters |
| dotnet.db.client.connection_pool_groups.usage | | Reported as counters |
| dotnet.db.client.connections.hard.usage | | Reported as counters |
| dotnet.db.client.connections.connects | | Reported as counters |
| dotnet.db.client.connections.disconnects | | Reported as counters |
| **SqlStatistics keys** | | |
| dotnet.db.client.connections.server_command_time | | Report ExecutionTime key, and document that this is more than just the time to execute SqlCommands directly |
| dotnet.db.client.connections.network_wait_time | | Reported as NetworkServerTime key |
| dotnet.db.client.connections.io | db.client.commands.bytes_written, db.client.commands.bytes_read | BytesReceived and BytesSent. I include the bytes required to set up the connection, which is why it's in the .connections namespace. I've picked the .io name to match OTel's metric guidelines |
| dotnet.db.client.connections.buffers | | Report BuffersReceived and BuffersSent |
| dotnet.db.client.commands.count | | Expand CursorOpens, IduCount and SelectCount |
| dotnet.db.client.commands.rows | | Expand IduRows and SelectRows |
| **Npgsql metrics** | | |
| dotnet.db.client.commands.pending_requests | db.client.commands.executing | To be instrumented. Synced name to connection metric names |
| dotnet.db.client.commands.failed | db.client.commands.failed | To be instrumented |
| dotnet.db.client.commands.use_time | db.client.commands.duration | To be instrumented. Ambivalent on the name, but have tried to sync it to the connection metric names |
| dotnet.db.client.commands.prepared_ratio | db.client.commands.prepared_ratio | Reported as PreparedExecs, UnpreparedExecs |
| **Additional mop-up metric** | | |
| dotnet.db.client.commands.total | | To be instrumented - total commands executed |
| **New transaction metrics** | | |
| dotnet.db.client.transactions.committed | | To be instrumented |
| dotnet.db.client.transactions.active | | To be instrumented |
| dotnet.db.client.transactions.rolled_back | | To be instrumented |
| dotnet.db.client.transactions.total | | To be instrumented |
| dotnet.db.client.transactions.commit_time | | To be instrumented |
| dotnet.db.client.transactions.rollback_time | | To be instrumented |
| dotnet.db.client.transactions.active_time | | To be instrumented |

Most of these are pretty generic across different implementations of DbCommand, so might be a useful model for other providers to follow. There's nothing stopping them from being implemented in SqlCommand and being moved into core later. They might also be informative to the rest of the specification's development.

A few points remain:

  • We've currently got commands.[total|failed|pending_requests] as three metrics. I'm thinking about whether it'd be better semantics to have commands.count{status=failed|pending|succeeded} and commands.total as rollups.
  • It'd be good to have a PoolName property or similar in the connection string, so pool.name doesn't need to just be the redacted connection string. This is definitely limited to the wish list.
  • A way to group specific commands. This might be a list of KeyValuePairs on the SqlCommand, so that commands which we expect to have a different performance profile can have their metrics tracked independently.

If you and someone from the SqlClient team are happy with the approach, I can start work on the basic infrastructure, implement the first mandatory metric and then start trickling the rest of the metrics in piecemeal.
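For what it's worth, the first metric could be validated in tests with the in-box MeterListener rather than the full OpenTelemetry SDK. A rough sketch, with the meter name again hypothetical, matching the sketches above:

```csharp
using System;
using System.Diagnostics.Metrics;

// Sketch: observe whatever instruments the driver publishes, without the OTel SDK.
using var listener = new MeterListener
{
    InstrumentPublished = (instrument, l) =>
    {
        // Hypothetical meter name.
        if (instrument.Meter.Name == "Microsoft.Data.SqlClient")
            l.EnableMeasurementEvents(instrument);
    }
};

listener.SetMeasurementEventCallback<int>((instrument, value, tags, state) =>
    Console.WriteLine($"{instrument.Name} = {value}"));
listener.Start();

// ... open/close some SqlConnections here ...

// Observable instruments (e.g. db.client.connections.usage) only report when polled.
listener.RecordObservableInstruments();
```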

roji commented Mar 2, 2024

FWIW I'd prefix the non-standard metrics with sqlclient rather than dotnet - I think the idea is for that to represent the specific component emitting the non-standard metric (so you know where it's coming from). In this case it's SqlClient rather than the .NET runtime etc.

Other than that, I'd recommend concentrating on the basic support as a first step and iterating from there, rather than thinking too much up-front about e.g. a new PoolName in the connection string or command groupings. There's definitely a lot that's possible with OpenTelemetry instrumentation, but it's probably a good idea to design the advanced stuff separately, as it becomes relevant.
