-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: otel traces + interceptor #2144
Conversation
internal/rpc/otel_interceptor.go
Outdated
responseSize metric.Int64Histogram | ||
} | ||
|
||
func createInstruments(meter metric.Meter) (instruments, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we didn't lose all the instruments whenever one of their instantiations fails. What do you think about refactoring the code here and above to look something like below? This would also have the helpful maintainability effect of parceling the computation of each metric value into its own function, so it's easier to figure out what logic is used for what. WrapUnary
is somewhat long at the moment, and it's easy to imagine how it could grow longer.
func (i *otelInterceptor) WrapUnary(next connect.UnaryFunc) connect.UnaryFunc {
return func(ctx context.Context, request connect.AnyRequest) (connect.AnyResponse, error) {
...
recordDuration(ctx, attributes, requestStartTime)
recordRequestSize(...)
recordResponseSize(...)
return response, err
}
}
func recordDuration(ctx context.Context, attributes []attribute.KeyValue, requestStartTime time.Time) {
recordHist(ctx, attributes, histDetails{
name: "duration",
unit: "ms",
desc: "...",
val: time.Since(requestStartTime).Milliseconds(),
})
}
type histDetails struct {
name string
unit string
desc string
val int64
}
func recordHist(ctx context.Context, attributes []attribute.KeyValue, details histDetails) {
logger := log.FromContext(ctx) // just log the error instead of exiting the function early
hist, err := meter.Int64Histogram(details.name, metric.WithUnit(details.unit), metric.WithDescription(details.desc))
if err != nil {
logger.Errorf(...)
}
hist.Record(ctx, details.val, metric.WithAttributes(attributes...))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We ended up deciding to panic on metric instantiation error after all to make sure we don't miss any errors in development, but the rest of this still holds :)
internal/rpc/otel_interceptor.go
Outdated
const ( | ||
otelFtlRequestKey = "ftl.requestKey" | ||
otelFtlVerbRef = "ftl.verb.ref" | ||
otelFtlVerbModule = "ftl.verb.module" | ||
otelFtlEventName = "ftl.message" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Niiiice! For reference, Jon's PR from yesterday added a few attributes wrapped in functions. I really like your way of doing const keys here since these attributes all have type string, and this is simpler/shorter than doing functions here, but we should clearly define the pattern at some point so it's clear when to follow/use each. Possibly: for interceptor trace attributes, which are kind of their own thing, we follow this pattern (const keys), but for metrics scattered across many directories, we'll use the functions from the linked PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I didn't see Jon's go in. I don't mind using that function pattern at all, happy to switch it, the consistency might be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We ended up changing his file back to consts so what you have here is perfect :D
d3665aa
to
e30c0dc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome :D
internal/rpc/otel_interceptor.go
Outdated
otelFtlRequestKey = "ftl.requestKey" | ||
otelFtlVerbRef = "ftl.verb.ref" | ||
otelFtlVerbModule = "ftl.verb.module" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nittiest of nits: Jon appended Attribute
to each of these variable names so I started doing that too. Kinda wondered about just doing Attr
instead of brevity but either works
internal/rpc/otel_interceptor.go
Outdated
} | ||
requestKey, err := RequestKeyFromContext(ctx) | ||
if err != nil { | ||
logger.Debugf("failed to get request key: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since request key is so valuable, do you think we should raise this log level to Warn or even Error? I'm not sure personally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The err case is only really hit if there is a request key that we're unable to parse. It is indicative of an issue though, so good call bumping the sev
internal/rpc/otel_interceptor.go
Outdated
trace.WithAttributes( | ||
requestSpan, | ||
otelMessageIDKey.Int(1), | ||
otelMessageSizeKey.Int(requestSize), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work when requestSize is undefined? I assume message size attr just won't show up, but figured I'd ask just in case it's wonky :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, good catch on defining requestSize :)
internal/rpc/otel_interceptor.go
Outdated
span.AddEvent(otelMessageEvent, | ||
trace.WithAttributes( | ||
requestSpan, | ||
otelMessageIDKey.Int(1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's message ID for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing this for consistency across the rpc interceptors. The individual messages are all shown in the span event, and their ID's must be unique, in the unary case there is one request and one response
if err != nil { | ||
panic(fmt.Errorf("failed to create response size metric: %w", err)) | ||
} | ||
requestsPerRPC, err := meter.Int64Histogram("requests_per_rpc", metric.WithUnit("1")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you find any docs on how unit is supposed to be set for counts? I wasn't sure if we should just leave it out. "1" makes sense intuitively but I'm mostly wondering if this is how we're supposed to do it for all the count metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, docs say that for counts the unit should be "1"
🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's so interesting. Seems like some sort of shorthand, but it is what they recommend :)
internal/rpc/otel_interceptor.go
Outdated
ctx, span := tracer.Start(ctx, name, traceOpts...) | ||
defer span.End() | ||
|
||
var requestSize int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this default to 0 if there's no request object? If not, we should probably mention in the description of this metric that it's only recorded for RPCs with a non-null request.
} | ||
|
||
func statusCodeAttribute(protocol string, err error) attribute.KeyValue { | ||
statusCodeKey := fmt.Sprintf("rpc.%s.status_code", protocol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to lowercase protocol, or is it always guaranteed to be consistent case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the raw value from the Connection and Spec, so for consistency I'm not lowering here, but in our use it's always just "grpc"
anyway
52f5688
to
a9f5a54
Compare
backend/controller/controller.go
Outdated
@@ -1022,6 +1022,7 @@ func (s *Service) callWithRequest( | |||
} | |||
} | |||
|
|||
ctx = rpc.WithRequestName(ctx, requestKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should this be rpc.WithRequestKey
?
|
||
span := trace.SpanFromContext(ctx) | ||
span.AddEvent(otelMessageEvent, trace.WithAttributes(attrs...)) | ||
return err // nolint:wrapcheck |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no we want to nolint
all these err
or should we wrap them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question; the interceptor just logs errors as they pass through, and it didn't feel quite right to attach this context to the trace; all errors that flow through rpc would then have some otel error:
string attached
if err != nil { | ||
panic(fmt.Errorf("failed to create response size metric: %w", err)) | ||
} | ||
requestsPerRPC, err := meter.Int64Histogram("requests_per_rpc", metric.WithUnit("1")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's so interesting. Seems like some sort of shorthand, but it is what they recommend :)
Linking issue: #2194 |
42e4ef7
to
ce5d526
Compare
bac631b
to
1e8872f
Compare
otel interceptor that propagates spans and logs FTL attributes