-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
otel: actually instrument core signals #2194
Comments
Implement `ftl.pubsub.published` and `ftl.pubsub.subscriber.called` ``` ScopeMetrics #0 ScopeMetrics SchemaURL: InstrumentationScope ftl.pubsub Metric #0 Descriptor: -> Name: ftl.pubsub.published -> Description: the number of times that an event is published to a topic -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.module.name: Str(echo) -> ftl.pubsub.topic.name: Str(echotopic) StartTimestamp: 2024-07-30 00:53:26.173902 +0000 UTC Timestamp: 2024-07-30 00:53:41.175052 +0000 UTC Value: 1 Metric #1 Descriptor: -> Name: ftl.pubsub.subscriber.called -> Description: the number of times that a pubsub event has been enqueued to asynchronously send to a subscriber -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.module.name: Str(echo) -> ftl.pubsub.subscriber.sink.ref: Str(echo.echoSinkOne) -> ftl.pubsub.subscription.ref: Str(echo.sub) -> ftl.pubsub.topic.name: Str(echotopic) StartTimestamp: 2024-07-30 00:53:26.173906 +0000 UTC Timestamp: 2024-07-30 00:53:41.175069 +0000 UTC Value: 1 ``` PubSub design doc for reference: https://hackmd.io/@ftl/HyDTmzMdp Issue: #2194 --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
- Adds `ftl.status.succeeded` attr to `ftl.pubsub.published` metric - Adds separate subscription and subscriber `module.name` attrs to all metrics except `.published` - Adds `ftl.pubsub.propagation.failed` metric Success case: ``` ScopeMetrics #0 ScopeMetrics SchemaURL: InstrumentationScope ftl.pubsub Metric #0 Descriptor: -> Name: ftl.pubsub.published -> Description: the number of times that an event is published to a topic -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.module.name: Str(echo) -> ftl.pubsub.topic.name: Str(echotopic) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-07-30 20:42:41.054792 +0000 UTC Timestamp: 2024-07-30 20:43:01.055944 +0000 UTC Value: 1 Metric #1 Descriptor: -> Name: ftl.pubsub.sink.called -> Description: the number of times that a pubsub event has been enqueued to asynchronously send to a subscriber -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.pubsub.sink.module.name: Str(echo) -> ftl.pubsub.sink.ref: Str(echo.echoSinkOne) -> ftl.pubsub.subscription.module.name: Str(echo) -> ftl.pubsub.subscription.ref: Str(echo.sub) -> ftl.pubsub.topic.name: Str(echotopic) StartTimestamp: 2024-07-30 20:42:41.054797 +0000 UTC Timestamp: 2024-07-30 20:43:01.055965 +0000 UTC Value: 1 ``` Publish error case: ``` Metric #0 Descriptor: -> Name: ftl.pubsub.published -> Description: the number of times that an event is published to a topic -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.module.name: Str(echo) -> ftl.pubsub.publish.error.message: Str(test error) -> ftl.pubsub.topic.name: Str(echotopic) -> ftl.status.succeeded: Bool(false) StartTimestamp: 2024-07-30 20:49:19.294421 +0000 UTC Timestamp: 2024-07-30 20:49:44.294924 +0000 UTC Value: 1 ``` Progressing subscribers error case with subscriber defined: ``` Metric #1 Descriptor: -> Name: ftl.pubsub.propagation.failed -> Description: the number of times that subscriptions fail to progress -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.pubsub.propagation.failed_operation: Str(BeginConsumingTopicEvent) -> ftl.pubsub.sink.module.name: Str(echo) -> ftl.pubsub.sink.ref: Str(echo.echoSinkOne) -> ftl.pubsub.subscription.module.name: Str(echo) -> ftl.pubsub.subscription.ref: Str(echo.sub) -> ftl.pubsub.topic.name: Str(echotopic) StartTimestamp: 2024-07-30 20:46:45.475043 +0000 UTC Timestamp: 2024-07-30 20:47:00.476169 +0000 UTC Value: 4 ``` Progressing subscribers error case without subscriber: ``` Metric #1 Descriptor: -> Name: ftl.pubsub.propagation.failed -> Description: the number of times that subscriptions fail to progress -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.pubsub.propagation.failed_operation: Str(GetNextEventForSubscription) -> ftl.pubsub.subscription.module.name: Str(echo) -> ftl.pubsub.subscription.ref: Str(echo.sub) -> ftl.pubsub.topic.name: Str(echotopic) StartTimestamp: 2024-07-30 19:09:10.084914 +0000 UTC Timestamp: 2024-07-30 19:09:25.086338 +0000 UTC Value: 4 ``` Issue: #2194 --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Issue: #2194 Remove error messages from attribute logging to avoid leaking private info into metrics
These are just the basic metrics. Queue depth up-down counters should come next ``` ScopeMetrics #3 ScopeMetrics SchemaURL: InstrumentationScope ftl.async_call Metric #0 Descriptor: -> Name: ftl.async_call.acquired -> Description: the number of times that the controller tries acquiring an async call -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.async_call.origin: Str(sub:echo.sub) -> ftl.async_call.time_since_scheduled_at_ms: Int(24) -> ftl.async_call.verb.ref: Str(echo.echoSinkOne) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-07-31 00:33:53.897947 +0000 UTC Timestamp: 2024-07-31 00:34:08.898478 +0000 UTC Value: 1 Metric #1 Descriptor: -> Name: ftl.async_call.executed -> Description: the number of times that the controller tries executing an async call -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.async_call.origin: Str(sub:echo.sub) -> ftl.async_call.time_since_scheduled_at_ms: Int(41) -> ftl.async_call.verb.ref: Str(echo.echoSinkOne) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-07-31 00:33:53.897952 +0000 UTC Timestamp: 2024-07-31 00:34:08.898485 +0000 UTC Value: 1 Metric #2 Descriptor: -> Name: ftl.async_call.completed -> Description: the number of times that the controller tries completing an async call -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.async_call.origin: Str(sub:echo.sub) -> ftl.async_call.time_since_scheduled_at_ms: Int(46) -> ftl.async_call.verb.ref: Str(echo.echoSinkOne) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-07-31 00:33:53.897956 +0000 UTC Timestamp: 2024-07-31 00:34:08.898487 +0000 UTC Value: 1 ``` When a call fails, it looks like: ``` Metric #1 Descriptor: -> Name: ftl.async_call.executed -> Description: the number of times that the controller tries executing an async call -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.async_call.execution.failure_mode: Str(async call failed) -> ftl.async_call.origin: Str(sub:echo.sub) -> ftl.async_call.time_since_scheduled_at_ms: Int(16) -> ftl.async_call.verb.ref: Str(echo.echoSinkOne) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(false) StartTimestamp: 2024-07-31 00:41:26.013892 +0000 UTC Timestamp: 2024-07-31 00:41:41.014946 +0000 UTC Value: 1 ``` Issue: #2194 --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…ributes (#2219) Add `ftl.pubsub.publish.caller.verb.name` attribute to `ftl.pubsub.published` metric. Only the verb name, not module, is needed because the module will be the same as the topic's module, which is already logged (`ftl.module.name`) ``` Metric #0 Descriptor: -> Name: ftl.pubsub.published -> Description: the number of times that an event is published to a topic -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.module.name: Str(echo) -> ftl.pubsub.publish.caller.verb.name: Str(Echo) -> ftl.pubsub.topic.name: Str(echotopic) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-07-31 17:28:47.451585 +0000 UTC Timestamp: 2024-07-31 17:29:12.44603 +0000 UTC Value: 1 ``` Issue: #2194
…_to_complete` (#2221) Printout showing the exact latency value is the same on the counter attr and on the hist metric itself: ``` Metric #2 Descriptor: -> Name: ftl.async_call.completed -> Description: the number of times that the controller tries completing an async call -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.async_call.origin: Str(sub:echo.sub) -> ftl.async_call.time_since_scheduled_at_ms: Int(39) -> ftl.async_call.verb.ref: Str(echo.echoSinkOne) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-07-31 18:54:35.526397 +0000 UTC Timestamp: 2024-07-31 18:55:10.527278 +0000 UTC Value: 1 Metric #3 Descriptor: -> Name: ftl.async_call.ms_to_complete -> Description: duration in ms to complete an async call, from the earliest time it was scheduled to execute -> Unit: ms -> DataType: Histogram -> AggregationTemporality: Cumulative HistogramDataPoints #0 Data point attributes: -> ftl.async_call.origin: Str(sub:echo.sub) -> ftl.async_call.verb.ref: Str(echo.echoSinkOne) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-07-31 18:54:35.526402 +0000 UTC Timestamp: 2024-07-31 18:55:10.527279 +0000 UTC Count: 1 Sum: 39.000000 Min: 39.000000 Max: 39.000000 ``` Issue: #2194
Previously, only `ftl.pubsub.published` had the caller attribute. This PR adds caller to the rest of the pubsub metrics and also slightly restructures the topic name/ref logging to be more consistent with the rest of this file ``` Metric #0 Descriptor: -> Name: ftl.pubsub.published -> Description: the number of times that an event is published to a topic -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.module.name: Str(echo) -> ftl.pubsub.publish.caller.verb.ref: Str(echo.Echo) -> ftl.pubsub.topic.ref: Str(echo.echotopic) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-08-01 01:19:43.556864 +0000 UTC Timestamp: 2024-08-01 01:20:08.558236 +0000 UTC Value: 1 Metric #1 Descriptor: -> Name: ftl.pubsub.sink.called -> Description: the number of times that a pubsub event has been enqueued to asynchronously send to a subscriber -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.pubsub.publish.caller.verb.ref: Str(echo.Echo) -> ftl.pubsub.sink.module.name: Str(echo) -> ftl.pubsub.sink.ref: Str(echo.echoSinkOne) -> ftl.pubsub.subscription.module.name: Str(echo) -> ftl.pubsub.subscription.ref: Str(echo.sub) -> ftl.pubsub.topic.module.name: Str(echo) -> ftl.pubsub.topic.ref: Str(echo.echotopic) ``` Issue: #2194
…etrics (#2257) Adds 2 metrics: `ftl.async_call.queue_depth` is a gauge, so not all recorded values will appear in the stream. Locally, enqueued async calls complete so quickly that the gauge gets reset to 0 before the incremented value (1) can appear in the stream. I confirmed the enqueue `Record()` does work by temporarily removing the `Record()` call upon completion. ``` Metric #5 Descriptor: -> Name: ftl.async_call.queue_depth -> Description: number of async calls queued up -> Unit: 1 -> DataType: Gauge NumberDataPoints #0 StartTimestamp: 2024-08-05 18:13:41.233336 +0000 UTC Timestamp: 2024-08-05 18:14:01.232978 +0000 UTC Value: 0 ``` `ftl.async_call.created` is a simple counter: ``` Metric #0 Descriptor: -> Name: ftl.async_call.created -> Description: the number of times that an async call was created -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.async_call.origin: Str(sub:echo.sub) -> ftl.async_call.remaining_attempts: Int(0) -> ftl.async_call.verb.ref: Str(echo.echoSinkOne) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-08-05 18:13:41.233319 +0000 UTC Timestamp: 2024-08-05 18:14:06.233054 +0000 UTC Value: 1 ``` Issue: #2194
I'm not super optimistic I can handle attending the team meeting in a bit so I'll leave my updates here:
|
This PR bucketizes the latency _attribute_ (NOT the latency _metric_! That would be very bad!) For better filtering capabilities in datadog, we need to bucket our high-cardinality numeric attributes. This uses log base8 because: 1. That seems to give the most useful results for current production numbers we're seeing in DD - good functional granularity for analysis purposes without being so granular that it's annoying to sort through the buckets. Current values range from 62 to 20279, concentrated in the 100-1000 range. 2. Base 8 bucket labels are all powers of 2, so it's relatively easy for our brains to skim Sample output: ``` -> ftl.async_call.time_since_scheduled_at_ms.bucket: Str([8,64)) ``` Issue: #2194
) Replaces the original toy `requests` metric with `ftl.call.requests` and adds a new `ftl.async_call.ms_to_complete` histogram metric. Latency attribute is bucketed by powers of 2, rather than 8 as async calls are, because (1) direct verb calls should be faster and (2) this is a new metric, so while we know less about realistic production values, it's better to err on the side of more granularity. ``` InstrumentationScope ftl.call Metric #0 Descriptor: -> Name: ftl.call.requests -> Description: the number of times that the FTL controller receives a verb call request -> Unit: 1 -> DataType: Sum -> IsMonotonic: true -> AggregationTemporality: Cumulative NumberDataPoints #0 Data point attributes: -> ftl.call.run_time_ms.bucket: Str([8,16)) -> ftl.call.verb.ref: Str(time.time) -> ftl.module.name: Str(time) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-08-06 19:50:37.269674 +0000 UTC Timestamp: 2024-08-06 19:51:17.271013 +0000 UTC Value: 1 NumberDataPoints #1 Data point attributes: -> ftl.call.run_time_ms.bucket: Str([32,64)) -> ftl.call.verb.ref: Str(echo.echo) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-08-06 19:50:37.269674 +0000 UTC Timestamp: 2024-08-06 19:51:17.271013 +0000 UTC Value: 1 Metric #1 Descriptor: -> Name: ftl.async_call.ms_to_complete -> Description: duration in ms to complete a verb call -> Unit: ms -> DataType: Histogram -> AggregationTemporality: Cumulative HistogramDataPoints #0 Data point attributes: -> ftl.call.verb.ref: Str(echo.echo) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-08-06 19:50:37.269677 +0000 UTC Timestamp: 2024-08-06 19:51:17.271014 +0000 UTC Count: 1 Sum: 32.000000 Min: 32.000000 Max: 32.000000 ``` Sample output when a verb call fails: ``` -> ftl.call.failure_mode: Str(invalid request: missing verb) -> ftl.call.run_time_ms.bucket: Str(<1) -> ftl.call.verb.ref: Str(echo.echo) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(false) ``` Issue: #2194 --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
whoops ``` Metric #1 Descriptor: -> Name: ftl.call.ms_to_complete -> Description: duration in ms to complete a verb call -> Unit: ms -> DataType: Histogram -> AggregationTemporality: Cumulative HistogramDataPoints #0 Data point attributes: -> ftl.call.verb.ref: Str(echo.echo) -> ftl.module.name: Str(echo) -> ftl.status.succeeded: Bool(true) StartTimestamp: 2024-08-06 21:59:38.317129 +0000 UTC Timestamp: 2024-08-06 22:00:33.317447 +0000 UTC Count: 1 Sum: 17.000000 Min: 17.000000 Max: 17.000000 ``` Issue: #2194
The last remaining piece here is database observability. Immediate NeedsFor metrics in DD, the well-lit path appears to be DD's official Postgres integration. That comes with not only all the core metrics we want, but also a nice dashboard. For immediate use cases, this is definitely the best solution. Next step on that is talking to @bradleydwyer If we cannot do that^, then we'll need to instrument custom metrics right now, which is possible but not ideal, both because of the implementation time and also because of the added complexity to the sql code (much of which is generated by sqlc, so we can't just do whatever we want to make the interface nice/intuitive). Long TermThat said, FTL is not tied to DD, so in the long run, we’ll likely want some integration for this outside of the DD world as well. The first reasonable avenue for this was otel's zero code instrumentation for Go. Unfortunately, that integration only supports traces right now, not metrics. The whole project is marked as a work in progress, but there doesn't appear to be any immediate plans to add metric support. The urgency for supporting non-DD database metrics is fairly low. Our current customer team and next customer team are both internal, so they should both use DD. Most likely, the next one after that will be as well. We will want to lift this instrumentation out of the DD-only world before significant amounts of open source development, but perhaps otel will also provide better built-ins for instrumenting database metrics by then. This seems like something we can afford to wait for. |
As soon as #2256 goes in and we have an answer on databases, we can consider the metric side of this issue (all the P1 work) done! Only non-metric work remaining is the pubsub trace propagation, which is already underway (thanks @safeer !) There are some P2 cleanup+improvement items that I'll create separate issues for. |
JK, brain farted and forgot pubsub tracing is covered in #2258 :) Also, I'm going to pull the database discussion into a separate issue. So this issue is now DONE!! |
No description provided.
The text was updated successfully, but these errors were encountered: