feat: otel traces + interceptor #2144

Merged: 11 commits merged into main on Jul 29, 2024

Conversation

@safeer (Contributor) commented Jul 24, 2024

otel interceptor that propagates spans and logs FTL attributes

@ftl-robot mentioned this pull request on Jul 24, 2024
@safeer changed the title from "feat: otel traces" to "feat: otel traces + interceptor" on Jul 24, 2024
responseSize metric.Int64Histogram
}

func createInstruments(meter metric.Meter) (instruments, error) {
Contributor commented:

It would be nice if we didn't lose all the instruments whenever one of their instantiations fails. What do you think about refactoring the code here and above to look something like below? This would also have the helpful maintainability effect of parceling the computation of each metric value into its own function, so it's easier to figure out what logic is used for what. WrapUnary is somewhat long at the moment, and it's easy to imagine how it could grow longer.

func (i *otelInterceptor) WrapUnary(next connect.UnaryFunc) connect.UnaryFunc {
	return func(ctx context.Context, request connect.AnyRequest) (connect.AnyResponse, error) {
		...
		recordDuration(ctx, attributes, requestStartTime)
		recordRequestSize(...)
		recordResponseSize(...)
		return response, err
	}
}

func recordDuration(ctx context.Context, attributes []attribute.KeyValue, requestStartTime time.Time) {
	recordHist(ctx, attributes, histDetails{
		name: "duration",
		unit: "ms",
		desc: "...",
		val: time.Since(requestStartTime).Milliseconds(),
	})
}

type histDetails struct {
	name string
	unit string
	desc string
	val  int64
}

func recordHist(ctx context.Context, attributes []attribute.KeyValue, details histDetails) {
	logger := log.FromContext(ctx) // just log the error instead of exiting the function early
	hist, err := meter.Int64Histogram(details.name, metric.WithUnit(details.unit), metric.WithDescription(details.desc))
	if err != nil {
		logger.Errorf(...)
	}
	hist.Record(ctx, details.val, metric.WithAttributes(attributes...))
}

@safeer (Author) replied:

Good idea!

Contributor replied:

We ended up deciding to panic on metric instantiation error after all to make sure we don't miss any errors in development, but the rest of this still holds :)
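(For illustration, a minimal sketch of how the combined approach could look once instruments panic at creation time, so the per-metric helpers can assume valid instruments. The package, meter name, and identifiers below are illustrative, not the actual code in this PR.)

package rpc

import (
	"context"
	"fmt"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("ftl.rpc")

// mustInt64Histogram panics on instantiation error so misconfigured
// instruments surface immediately in development.
func mustInt64Histogram(name, unit, desc string) metric.Int64Histogram {
	hist, err := meter.Int64Histogram(name, metric.WithUnit(unit), metric.WithDescription(desc))
	if err != nil {
		panic(fmt.Errorf("failed to create %s metric: %w", name, err))
	}
	return hist
}

var durationHist = mustInt64Histogram("duration", "ms", "duration of the RPC in milliseconds")

// recordDuration stays a small, single-purpose helper as suggested above.
func recordDuration(ctx context.Context, attributes []attribute.KeyValue, requestStartTime time.Time) {
	durationHist.Record(ctx, time.Since(requestStartTime).Milliseconds(), metric.WithAttributes(attributes...))
}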

Comment on lines 19 to 36
const (
	otelFtlRequestKey = "ftl.requestKey"
	otelFtlVerbRef    = "ftl.verb.ref"
	otelFtlVerbModule = "ftl.verb.module"
	otelFtlEventName  = "ftl.message"
)
Contributor commented:

Niiiice! For reference, Jon's PR from yesterday added a few attributes wrapped in functions. I really like your way of doing const keys here since these attributes all have type string, and this is simpler/shorter than doing functions here, but we should clearly define the pattern at some point so it's clear when to follow/use each. Possibly: for interceptor trace attributes, which are kind of their own thing, we follow this pattern (const keys), but for metrics scattered across many directories, we'll use the functions from the linked PR?
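(For reference, a minimal sketch of the two patterns being compared; the package and identifier names are illustrative, not the actual names from either PR.)

package observability

import "go.opentelemetry.io/otel/attribute"

// Pattern 1: plain const keys, as in this PR's interceptor. Simple and short
// when every attribute value is a string.
const otelFtlVerbModuleKey = "ftl.verb.module"

func moduleAttrFromConst(module string) attribute.KeyValue {
	return attribute.String(otelFtlVerbModuleKey, module)
}

// Pattern 2: one helper function per attribute, as in the metrics PR
// mentioned above. More verbose, but each attribute owns its key and type.
func verbModuleAttribute(module string) attribute.KeyValue {
	return attribute.String("ftl.verb.module", module)
}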

@safeer (Author) replied:

Ahh, I didn't see Jon's go in. I don't mind using that function pattern at all; happy to switch it, since the consistency might be nice.

Contributor replied:

We ended up changing his file back to consts so what you have here is perfect :D

@safeer force-pushed the saf/otel-traceid-2 branch from d3665aa to e30c0dc on July 29, 2024 19:41
@safeer requested a review from wesbillman on July 29, 2024 19:50
@safeer marked this pull request as ready for review on July 29, 2024 19:50
@safeer requested a review from alecthomas as a code owner on July 29, 2024 19:50
@safeer requested review from a team and matt2e and removed the request for a team on July 29, 2024 19:50
@deniseli (Contributor) left a comment:

This is awesome :D

Comment on lines 27 to 29
otelFtlRequestKey = "ftl.requestKey"
otelFtlVerbRef = "ftl.verb.ref"
otelFtlVerbModule = "ftl.verb.module"
Contributor commented:

nittiest of nits: Jon appended Attribute to each of these variable names, so I started doing that too. Kinda wondered about just doing Attr instead, for brevity, but either works.

	}
	requestKey, err := RequestKeyFromContext(ctx)
	if err != nil {
		logger.Debugf("failed to get request key: %s", err)
Contributor commented:

Since request key is so valuable, do you think we should raise this log level to Warn or even Error? I'm not sure personally

@safeer (Author) replied:

The err case is only really hit if there is a request key that we're unable to parse. It is indicative of an issue though, so good call bumping the sev

		trace.WithAttributes(
			requestSpan,
			otelMessageIDKey.Int(1),
			otelMessageSizeKey.Int(requestSize),
Contributor commented:

How does this work when requestSize is undefined? I assume message size attr just won't show up, but figured I'd ask just in case it's wonky :)

@safeer (Author) replied:

Fixed, good catch on defining requestSize :)

	span.AddEvent(otelMessageEvent,
		trace.WithAttributes(
			requestSpan,
			otelMessageIDKey.Int(1),
Contributor commented:

What's message ID for?

@safeer (Author) replied:

Doing this for consistency across the rpc interceptors. The individual messages are all shown in the span event, and their IDs must be unique; in the unary case there is one request and one response.
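(As a hedged illustration of the point about unique IDs: the key strings below follow the OTel RPC message-event conventions and are assumptions, not necessarily the values used in this PR.)

package rpc

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

const otelMessageEvent = "message"

var (
	otelMessageIDKey   = attribute.Key("message.id")
	otelMessageSizeKey = attribute.Key("message.uncompressed_size")
)

// addMessageEvent records one message on the span. The ID is a per-direction
// counter starting at 1, so in a unary RPC the single request and the single
// response each carry ID 1; a streaming interceptor would increment it per
// message.
func addMessageEvent(span trace.Span, direction attribute.KeyValue, id, size int) {
	span.AddEvent(otelMessageEvent, trace.WithAttributes(
		direction,
		otelMessageIDKey.Int(id),
		otelMessageSizeKey.Int(size),
	))
}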

	if err != nil {
		panic(fmt.Errorf("failed to create response size metric: %w", err))
	}
	requestsPerRPC, err := meter.Int64Histogram("requests_per_rpc", metric.WithUnit("1"))
Contributor commented:

Did you find any docs on how unit is supposed to be set for counts? I wasn't sure if we should just leave it out. "1" makes sense intuitively but I'm mostly wondering if this is how we're supposed to do it for all the count metrics

@safeer (Author) replied:

Yup, docs say that for counts the unit should be "1" 🤷

Collaborator commented:

That's so interesting. Seems like some sort of shorthand, but it is what they recommend :)

ctx, span := tracer.Start(ctx, name, traceOpts...)
defer span.End()

var requestSize int
Contributor commented:

Should this default to 0 if there's no request object? If not, we should probably mention in the description of this metric that it's only recorded for RPCs with a non-null request.
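(A minimal sketch of what defaulting to 0 could look like; the connect and proto types are assumptions about what the interceptor receives, not the actual code in this PR.)

package rpc

import (
	"connectrpc.com/connect"
	"google.golang.org/protobuf/proto"
)

// requestSizeOf returns 0 when there is no request payload, so the size
// metric is still recorded (as zero) rather than skipped.
func requestSizeOf(request connect.AnyRequest) int {
	if request == nil {
		return 0
	}
	if msg, ok := request.Any().(proto.Message); ok {
		return proto.Size(msg)
	}
	return 0
}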

}

func statusCodeAttribute(protocol string, err error) attribute.KeyValue {
	statusCodeKey := fmt.Sprintf("rpc.%s.status_code", protocol)
Contributor commented:

Do we need to lowercase protocol, or is it always guaranteed to be consistent case?

@safeer (Author) replied:

I'm using the raw value from the Connection and Spec, so for consistency I'm not lowercasing here; in our use it's always just "grpc" anyway.

@safeer force-pushed the saf/otel-traceid-2 branch from 52f5688 to a9f5a54 on July 29, 2024 21:49
@@ -1022,6 +1022,7 @@ func (s *Service) callWithRequest(
}
}

ctx = rpc.WithRequestName(ctx, requestKey)
Collaborator commented:

nit: Should this be rpc.WithRequestKey?


span := trace.SpanFromContext(ctx)
span.AddEvent(otelMessageEvent, trace.WithAttributes(attrs...))
return err // nolint:wrapcheck
Collaborator commented:

Do we want to nolint all these errs, or should we wrap them?

@safeer (Author) replied:

Good question; the interceptor just logs errors as they pass through, and it didn't feel quite right to attach this context to the errors themselves; every error that flows through rpc would then have some "otel error:" string attached.

@deniseli (Contributor) commented:

Linking issue: #2194

@safeer force-pushed the saf/otel-traceid-2 branch from 42e4ef7 to ce5d526 on July 29, 2024 22:46
@safeer force-pushed the saf/otel-traceid-2 branch from bac631b to 1e8872f on July 29, 2024 22:51
@safeer merged commit f31dd1f into main on Jul 29, 2024
61 checks passed
@safeer deleted the saf/otel-traceid-2 branch on July 29, 2024 23:17