Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

otel: actually instrument core signals #2194

Closed
deniseli opened this issue Jul 29, 2024 · 4 comments
Closed

otel: actually instrument core signals #2194

deniseli opened this issue Jul 29, 2024 · 4 comments
Assignees
Labels

Comments

@deniseli
Copy link
Contributor

No description provided.

@github-actions github-actions bot added the triage Issue needs triaging label Jul 29, 2024
@deniseli deniseli mentioned this issue Jul 29, 2024
15 tasks
@ftl-robot ftl-robot mentioned this issue Jul 29, 2024
@gak gak added P1 next Work that will be be picked up next and removed triage Issue needs triaging labels Jul 29, 2024
@gak gak assigned gak and unassigned gak Jul 29, 2024
@github-actions github-actions bot removed the next Work that will be be picked up next label Jul 29, 2024
deniseli added a commit that referenced this issue Jul 30, 2024
Implement `ftl.pubsub.published` and `ftl.pubsub.subscriber.called`

```
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope ftl.pubsub 

Metric #0
Descriptor:
     -> Name: ftl.pubsub.published
     -> Description: the number of times that an event is published to a topic
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.module.name: Str(echo)
     -> ftl.pubsub.topic.name: Str(echotopic)
StartTimestamp: 2024-07-30 00:53:26.173902 +0000 UTC
Timestamp: 2024-07-30 00:53:41.175052 +0000 UTC
Value: 1

Metric #1
Descriptor:
     -> Name: ftl.pubsub.subscriber.called
     -> Description: the number of times that a pubsub event has been enqueued to asynchronously send to a subscriber
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.module.name: Str(echo)
     -> ftl.pubsub.subscriber.sink.ref: Str(echo.echoSinkOne)
     -> ftl.pubsub.subscription.ref: Str(echo.sub)
     -> ftl.pubsub.topic.name: Str(echotopic)
StartTimestamp: 2024-07-30 00:53:26.173906 +0000 UTC
Timestamp: 2024-07-30 00:53:41.175069 +0000 UTC
Value: 1
```

PubSub design doc for reference: https://hackmd.io/@ftl/HyDTmzMdp
Issue: #2194

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
deniseli added a commit that referenced this issue Jul 30, 2024
- Adds `ftl.status.succeeded` attr to `ftl.pubsub.published` metric
- Adds separate subscription and subscriber `module.name` attrs to all
metrics except `.published`
- Adds `ftl.pubsub.propagation.failed` metric

Success case:
```
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope ftl.pubsub 

Metric #0
Descriptor:
     -> Name: ftl.pubsub.published
     -> Description: the number of times that an event is published to a topic
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.module.name: Str(echo)
     -> ftl.pubsub.topic.name: Str(echotopic)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-07-30 20:42:41.054792 +0000 UTC
Timestamp: 2024-07-30 20:43:01.055944 +0000 UTC
Value: 1

Metric #1
Descriptor:
     -> Name: ftl.pubsub.sink.called
     -> Description: the number of times that a pubsub event has been enqueued to asynchronously send to a subscriber
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.pubsub.sink.module.name: Str(echo)
     -> ftl.pubsub.sink.ref: Str(echo.echoSinkOne)
     -> ftl.pubsub.subscription.module.name: Str(echo)
     -> ftl.pubsub.subscription.ref: Str(echo.sub)
     -> ftl.pubsub.topic.name: Str(echotopic)
StartTimestamp: 2024-07-30 20:42:41.054797 +0000 UTC
Timestamp: 2024-07-30 20:43:01.055965 +0000 UTC
Value: 1
```

Publish error case:
```
Metric #0
Descriptor:
     -> Name: ftl.pubsub.published
     -> Description: the number of times that an event is published to a topic
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.module.name: Str(echo)
     -> ftl.pubsub.publish.error.message: Str(test error)
     -> ftl.pubsub.topic.name: Str(echotopic)
     -> ftl.status.succeeded: Bool(false)
StartTimestamp: 2024-07-30 20:49:19.294421 +0000 UTC
Timestamp: 2024-07-30 20:49:44.294924 +0000 UTC
Value: 1
```

Progressing subscribers error case with subscriber defined:
```
Metric #1
Descriptor:
     -> Name: ftl.pubsub.propagation.failed
     -> Description: the number of times that subscriptions fail to progress
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.pubsub.propagation.failed_operation: Str(BeginConsumingTopicEvent)
     -> ftl.pubsub.sink.module.name: Str(echo)
     -> ftl.pubsub.sink.ref: Str(echo.echoSinkOne)
     -> ftl.pubsub.subscription.module.name: Str(echo)
     -> ftl.pubsub.subscription.ref: Str(echo.sub)
     -> ftl.pubsub.topic.name: Str(echotopic)
StartTimestamp: 2024-07-30 20:46:45.475043 +0000 UTC
Timestamp: 2024-07-30 20:47:00.476169 +0000 UTC
Value: 4
```

Progressing subscribers error case without subscriber:
```
Metric #1
Descriptor:
     -> Name: ftl.pubsub.propagation.failed
     -> Description: the number of times that subscriptions fail to progress
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.pubsub.propagation.failed_operation: Str(GetNextEventForSubscription)
     -> ftl.pubsub.subscription.module.name: Str(echo)
     -> ftl.pubsub.subscription.ref: Str(echo.sub)
     -> ftl.pubsub.topic.name: Str(echotopic)
StartTimestamp: 2024-07-30 19:09:10.084914 +0000 UTC
Timestamp: 2024-07-30 19:09:25.086338 +0000 UTC
Value: 4
```

Issue: #2194

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
deniseli added a commit that referenced this issue Jul 30, 2024
Issue: #2194

Remove error messages from attribute logging to avoid leaking private
info into metrics
deniseli added a commit that referenced this issue Jul 31, 2024
These are just the basic metrics. Queue depth up-down counters should
come next

```
ScopeMetrics #3
ScopeMetrics SchemaURL: 
InstrumentationScope ftl.async_call 

Metric #0
Descriptor:
     -> Name: ftl.async_call.acquired
     -> Description: the number of times that the controller tries acquiring an async call
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.async_call.origin: Str(sub:echo.sub)
     -> ftl.async_call.time_since_scheduled_at_ms: Int(24)
     -> ftl.async_call.verb.ref: Str(echo.echoSinkOne)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-07-31 00:33:53.897947 +0000 UTC
Timestamp: 2024-07-31 00:34:08.898478 +0000 UTC
Value: 1

Metric #1
Descriptor:
     -> Name: ftl.async_call.executed
     -> Description: the number of times that the controller tries executing an async call
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.async_call.origin: Str(sub:echo.sub)
     -> ftl.async_call.time_since_scheduled_at_ms: Int(41)
     -> ftl.async_call.verb.ref: Str(echo.echoSinkOne)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-07-31 00:33:53.897952 +0000 UTC
Timestamp: 2024-07-31 00:34:08.898485 +0000 UTC
Value: 1

Metric #2
Descriptor:
     -> Name: ftl.async_call.completed
     -> Description: the number of times that the controller tries completing an async call
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.async_call.origin: Str(sub:echo.sub)
     -> ftl.async_call.time_since_scheduled_at_ms: Int(46)
     -> ftl.async_call.verb.ref: Str(echo.echoSinkOne)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-07-31 00:33:53.897956 +0000 UTC
Timestamp: 2024-07-31 00:34:08.898487 +0000 UTC
Value: 1
```

When a call fails, it looks like:
```
Metric #1
Descriptor:
     -> Name: ftl.async_call.executed
     -> Description: the number of times that the controller tries executing an async call
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.async_call.execution.failure_mode: Str(async call failed)
     -> ftl.async_call.origin: Str(sub:echo.sub)
     -> ftl.async_call.time_since_scheduled_at_ms: Int(16)
     -> ftl.async_call.verb.ref: Str(echo.echoSinkOne)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(false)
StartTimestamp: 2024-07-31 00:41:26.013892 +0000 UTC
Timestamp: 2024-07-31 00:41:41.014946 +0000 UTC
Value: 1
```

Issue: #2194

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
deniseli added a commit that referenced this issue Jul 31, 2024
…ributes (#2219)

Add `ftl.pubsub.publish.caller.verb.name` attribute to
`ftl.pubsub.published` metric. Only the verb name, not module, is needed
because the module will be the same as the topic's module, which is
already logged (`ftl.module.name`)

```
Metric #0
Descriptor:
     -> Name: ftl.pubsub.published
     -> Description: the number of times that an event is published to a topic
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.module.name: Str(echo)
     -> ftl.pubsub.publish.caller.verb.name: Str(Echo)
     -> ftl.pubsub.topic.name: Str(echotopic)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-07-31 17:28:47.451585 +0000 UTC
Timestamp: 2024-07-31 17:29:12.44603 +0000 UTC
Value: 1
```

Issue: #2194
deniseli added a commit that referenced this issue Jul 31, 2024
…_to_complete` (#2221)

Printout showing the exact latency value is the same on the counter attr
and on the hist metric itself:
```
Metric #2
Descriptor:
     -> Name: ftl.async_call.completed
     -> Description: the number of times that the controller tries completing an async call
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.async_call.origin: Str(sub:echo.sub)
     -> ftl.async_call.time_since_scheduled_at_ms: Int(39)
     -> ftl.async_call.verb.ref: Str(echo.echoSinkOne)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-07-31 18:54:35.526397 +0000 UTC
Timestamp: 2024-07-31 18:55:10.527278 +0000 UTC
Value: 1

Metric #3
Descriptor:
     -> Name: ftl.async_call.ms_to_complete
     -> Description: duration in ms to complete an async call, from the earliest time it was scheduled to execute
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative

HistogramDataPoints #0
Data point attributes:
     -> ftl.async_call.origin: Str(sub:echo.sub)
     -> ftl.async_call.verb.ref: Str(echo.echoSinkOne)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-07-31 18:54:35.526402 +0000 UTC
Timestamp: 2024-07-31 18:55:10.527279 +0000 UTC
Count: 1
Sum: 39.000000
Min: 39.000000
Max: 39.000000
```

Issue: #2194
deniseli added a commit that referenced this issue Aug 1, 2024
Previously, only `ftl.pubsub.published` had the caller attribute. This
PR adds caller to the rest of the pubsub metrics and also slightly
restructures the topic name/ref logging to be more consistent with the
rest of this file

```
Metric #0
Descriptor:
     -> Name: ftl.pubsub.published
     -> Description: the number of times that an event is published to a topic
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.module.name: Str(echo)
     -> ftl.pubsub.publish.caller.verb.ref: Str(echo.Echo)
     -> ftl.pubsub.topic.ref: Str(echo.echotopic)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-08-01 01:19:43.556864 +0000 UTC
Timestamp: 2024-08-01 01:20:08.558236 +0000 UTC
Value: 1

Metric #1
Descriptor:
     -> Name: ftl.pubsub.sink.called
     -> Description: the number of times that a pubsub event has been enqueued to asynchronously send to a subscriber
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.pubsub.publish.caller.verb.ref: Str(echo.Echo)
     -> ftl.pubsub.sink.module.name: Str(echo)
     -> ftl.pubsub.sink.ref: Str(echo.echoSinkOne)
     -> ftl.pubsub.subscription.module.name: Str(echo)
     -> ftl.pubsub.subscription.ref: Str(echo.sub)
     -> ftl.pubsub.topic.module.name: Str(echo)
     -> ftl.pubsub.topic.ref: Str(echo.echotopic)
```

Issue: #2194
deniseli added a commit that referenced this issue Aug 5, 2024
…etrics (#2257)

Adds 2 metrics:

`ftl.async_call.queue_depth` is a gauge, so not all recorded values will
appear in the stream. Locally, enqueued async calls complete so quickly
that the gauge gets reset to 0 before the incremented value (1) can
appear in the stream. I confirmed the enqueue `Record()` does work by
temporarily removing the `Record()` call upon completion.
```
Metric #5
Descriptor:
     -> Name: ftl.async_call.queue_depth
     -> Description: number of async calls queued up
     -> Unit: 1
     -> DataType: Gauge

NumberDataPoints #0
StartTimestamp: 2024-08-05 18:13:41.233336 +0000 UTC
Timestamp: 2024-08-05 18:14:01.232978 +0000 UTC
Value: 0
```

`ftl.async_call.created` is a simple counter:
```
Metric #0
Descriptor:
     -> Name: ftl.async_call.created
     -> Description: the number of times that an async call was created
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.async_call.origin: Str(sub:echo.sub)
     -> ftl.async_call.remaining_attempts: Int(0)
     -> ftl.async_call.verb.ref: Str(echo.echoSinkOne)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-08-05 18:13:41.233319 +0000 UTC
Timestamp: 2024-08-05 18:14:06.233054 +0000 UTC
Value: 1
```

Issue: #2194
@deniseli
Copy link
Contributor Author

deniseli commented Aug 5, 2024

I'm not super optimistic I can handle attending the team meeting in a bit so I'll leave my updates here:

  • Async call queue depth is in!
  • Metrics remaining (after Jon's in-flight cron metrics): I think is just database + call. We do have 1 call metric but it's just the single toy one I added at the beginning of all this, so we need to flesh that out into an actual proper suite of call metrics.
  • Only other metrics we might want to add: queue depth is instrumented without any attributes due to the nature of the async call interface, but on a higher-load project, there could be quite a few pending calls stemming from either FSMs or pubsub. So we might want to add separate queue depth gauges on a per-topic level for pubsub (and similar for FSM) to give visibility into exactly which topics/whatever are behind.

deniseli added a commit that referenced this issue Aug 6, 2024
This PR bucketizes the latency _attribute_ (NOT the latency _metric_!
That would be very bad!)

For better filtering capabilities in datadog, we need to bucket our
high-cardinality numeric attributes. This uses log base8 because:

1. That seems to give the most useful results for current production
numbers we're seeing in DD - good functional granularity for analysis
purposes without being so granular that it's annoying to sort through
the buckets. Current values range from 62 to 20279, concentrated in the
100-1000 range.
2. Base 8 bucket labels are all powers of 2, so it's relatively easy for
our brains to skim

Sample output:
```
-> ftl.async_call.time_since_scheduled_at_ms.bucket: Str([8,64))
```

Issue: #2194
deniseli added a commit that referenced this issue Aug 6, 2024
)

Replaces the original toy `requests` metric with `ftl.call.requests` and
adds a new `ftl.async_call.ms_to_complete` histogram metric. Latency
attribute is bucketed by powers of 2, rather than 8 as async calls are,
because (1) direct verb calls should be faster and (2) this is a new
metric, so while we know less about realistic production values, it's
better to err on the side of more granularity.

```
InstrumentationScope ftl.call 

Metric #0
Descriptor:
     -> Name: ftl.call.requests
     -> Description: the number of times that the FTL controller receives a verb call request
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative

NumberDataPoints #0
Data point attributes:
     -> ftl.call.run_time_ms.bucket: Str([8,16))
     -> ftl.call.verb.ref: Str(time.time)
     -> ftl.module.name: Str(time)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-08-06 19:50:37.269674 +0000 UTC
Timestamp: 2024-08-06 19:51:17.271013 +0000 UTC
Value: 1

NumberDataPoints #1
Data point attributes:
     -> ftl.call.run_time_ms.bucket: Str([32,64))
     -> ftl.call.verb.ref: Str(echo.echo)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-08-06 19:50:37.269674 +0000 UTC
Timestamp: 2024-08-06 19:51:17.271013 +0000 UTC
Value: 1

Metric #1
Descriptor:
     -> Name: ftl.async_call.ms_to_complete
     -> Description: duration in ms to complete a verb call
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative

HistogramDataPoints #0
Data point attributes:
     -> ftl.call.verb.ref: Str(echo.echo)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-08-06 19:50:37.269677 +0000 UTC
Timestamp: 2024-08-06 19:51:17.271014 +0000 UTC
Count: 1
Sum: 32.000000
Min: 32.000000
Max: 32.000000
```

Sample output when a verb call fails:
```
     -> ftl.call.failure_mode: Str(invalid request: missing verb)
     -> ftl.call.run_time_ms.bucket: Str(<1)
     -> ftl.call.verb.ref: Str(echo.echo)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(false)
```

Issue: #2194

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
deniseli added a commit that referenced this issue Aug 6, 2024
whoops

```
Metric #1
Descriptor:
     -> Name: ftl.call.ms_to_complete
     -> Description: duration in ms to complete a verb call
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative

HistogramDataPoints #0
Data point attributes:
     -> ftl.call.verb.ref: Str(echo.echo)
     -> ftl.module.name: Str(echo)
     -> ftl.status.succeeded: Bool(true)
StartTimestamp: 2024-08-06 21:59:38.317129 +0000 UTC
Timestamp: 2024-08-06 22:00:33.317447 +0000 UTC
Count: 1
Sum: 17.000000
Min: 17.000000
Max: 17.000000
```

Issue: #2194
@deniseli
Copy link
Contributor Author

deniseli commented Aug 7, 2024

The last remaining piece here is database observability.

Immediate Needs

For metrics in DD, the well-lit path appears to be DD's official Postgres integration. That comes with not only all the core metrics we want, but also a nice dashboard. For immediate use cases, this is definitely the best solution. Next step on that is talking to @bradleydwyer

If we cannot do that^, then we'll need to instrument custom metrics right now, which is possible but not ideal, both because of the implementation time and also because of the added complexity to the sql code (much of which is generated by sqlc, so we can't just do whatever we want to make the interface nice/intuitive).

Long Term

That said, FTL is not tied to DD, so in the long run, we’ll likely want some integration for this outside of the DD world as well. The first reasonable avenue for this was otel's zero code instrumentation for Go. Unfortunately, that integration only supports traces right now, not metrics. The whole project is marked as a work in progress, but there doesn't appear to be any immediate plans to add metric support.

The urgency for supporting non-DD database metrics is fairly low. Our current customer team and next customer team are both internal, so they should both use DD. Most likely, the next one after that will be as well. We will want to lift this instrumentation out of the DD-only world before significant amounts of open source development, but perhaps otel will also provide better built-ins for instrumenting database metrics by then. This seems like something we can afford to wait for.

@deniseli deniseli changed the title otel: actually instrument signals otel: actually instrument core signals Aug 7, 2024
@deniseli
Copy link
Contributor Author

deniseli commented Aug 7, 2024

As soon as #2256 goes in and we have an answer on databases, we can consider the metric side of this issue (all the P1 work) done! Only non-metric work remaining is the pubsub trace propagation, which is already underway (thanks @safeer !)

There are some P2 cleanup+improvement items that I'll create separate issues for.

@deniseli
Copy link
Contributor Author

deniseli commented Aug 7, 2024

JK, brain farted and forgot pubsub tracing is covered in #2258 :)

Also, I'm going to pull the database discussion into a separate issue. So this issue is now DONE!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants