Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add tracing APIs and instrumentation #534

Merged
merged 3 commits into from
Sep 3, 2024

Conversation

lucix-aws
Copy link
Contributor

Tracing component of #470

  • misc
    • adds Java-esque abstract Middleware codegen classes
    • adds the most common stdlib functions we use directly into Writer's context so they can be referenced without having to inject a symbol (e.g. fmt.Sprintf)
  • Adds smithy-go/tracing APIs
  • New module: github.com/aws/smithy-go/tracing/smithy-otel-tracing adapts an OTEL SDK TracerProvider for use with Smithy clients
  • Instrument clients operations with the following span hierarchy:
    • <Service.Operation>
      • Initialize
      • BuildRequest
        • OperationSerializer
      • ResolveAuthScheme
      • GetIdentity
      • ResolveEndpoint
      • ComputePayloadSHA256
      • (note - everything from here down is part of the retry loop, but it is not instrumented that way here because all of the retry logic is SDK-side right now)
        • SignRequest
        • DoHTTPRequest
        • OperationDeserializer
  • Adds the following Span attributes
    • rpc.method - name of the operation
    • rpc.service - sdkID of the service (or shape name if not present)
    • rpc.system - hardcoded to aws-api
      • Theoretically this would vary by client scope (e.g. non-AWS SDKs would set something different here). I've elected to defer allowing configuration of it for now since it's really just a string, if you disagree let's discuss
    • error.go.type - literal go type of an operation's error e.g. &smithy.GenericAPIError
    • error.go.error - results of Error() on an operation's error
    • error.api.fault - if error is a smithy API error, the fault type (client/server/unknown)
    • error.api.code - if error ", the error code
    • error.api.message - if error ", the error message
    • error - true if operation returned an error
    • http.proto - HTTP protocol version (as read from the response)
    • http.request.content_length - content length of serialized request body
    • http.request.method - HTTP method of serialized request
    • http.response.status_code - HTTP status code of raw response
    • http.response.content_length - content length of raw response body
    • operation.resolved_endpoint - the exact URL returned from endpoint resolution (including the scheme and any base path)
    • operation.auth.resolved_scheme_id - the auth scheme ID selected during resolution e.g. aws.auth#sigv4

Copy link
Contributor

@Madrigal Madrigal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment for the spans:

  1. Any reason why we picked the middlewares that we picked for when to start a child span? Retry makes a lot of sense, the other ones I suspect are the ones that we know can take longer but curious to hear your thoughts
  2. Thanks a lot for actually commenting out the error codes and tracing we emit. We'll likely need to put this on a doc somewhere later, so this would make it a lot easier

func toOTELKeyValues(props smithy.Properties) []otelattribute.KeyValue {
var kvs []otelattribute.KeyValue
for k, v := range props.Values() {
if kv, ok := toOTELKeyValue(k, v); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you mention that all of these values will be silently ignored, which can make this very hard to debug if things go wrong. Should we at least log this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually yes, couldn't it just be a span event?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, it's probably better to just map to a string representation as default. Pushed.


private String getTracingServiceId() {
return service.hasTrait(ServiceTrait.class)
? service.expectTrait(ServiceTrait.class).getSdkId().replaceAll("\\s", "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, do we really have services with spaces on the sdk id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many do -- here's a grep snippet. It's really almost the rule rather than the exception

iot-wireless.json                                                                                                                                                [13/1970]
16254:                    "sdkId": "IoT Wireless",                                                                                                                        
                                                                                                                                                                          
connect.json                                                                                                                                                              
1640:                    "sdkId": "Connect",                                                                                                                              
                                                                                                                                                                          
guardduty.json                                                                                                                                                            
6989:                    "sdkId": "GuardDuty",                                                                                                                            
                                                                                                                                                                          
lex-runtime-v2.json                                                                                                                                                       
57:                    "sdkId": "Lex Runtime V2",                                                                                                                         
                                                                                                                                                                          
route-53-domains.json                                                                                                                                                     
4974:                    "sdkId": "Route 53 Domains",                                                                                                                     

mediastore.json
1135:                    "sdkId": "MediaStore",

resource-explorer-2.json
1597:                    "sdkId": "Resource Explorer 2",

elasticache.json
453:                    "sdkId": "ElastiCache",

athena.json
275:                    "sdkId": "Athena",

migration-hub-refactor-spaces.json
3534:                    "sdkId": "Migration Hub Refactor Spaces",

kafkaconnect.json
1856:                    "sdkId": "KafkaConnect",

backup-gateway.json
128:                    "sdkId": "Backup Gateway",

@lucix-aws
Copy link
Contributor Author

General comment for the spans:

  1. Any reason why we picked the middlewares that we picked for when to start a child span? Retry makes a lot of sense, the other ones I suspect are the ones that we know can take longer but curious to hear your thoughts

Most of them are just the principal "phases" of the operation lifecycle, e.g. resolving endpoint, serializing. There's no real hard list of this anywhere, the closest thing we have to that might be the lifecycle breakdown in the (unpublished) client reference architecture.

The more niche ones were giving Initialize a span, and the outer BuildRequest (basically everything from serialize to build), I wanted to generally contain those since

  1. customizations go there
  2. I wanted the top level of root spans to roughly add up to the overall duration. Without the BuildRequest one, you'd basically have OperationSerializer in an oasis of un-spanned (well root spanned but yes) work.

@lucix-aws lucix-aws merged commit 868fb89 into feat-observability Sep 3, 2024
1 check passed
@lucix-aws lucix-aws deleted the feat-observability-tracing branch September 3, 2024 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants