
[SVLS-5034] Create trace context from Step Function execution details #27988

Merged

Conversation

@agocs agocs commented Jul 26, 2024

What does this PR do?

Deterministically create a trace context from Step Function execution ARN, state name, and state entered time for Universal Instrumentation runtimes.


A Step Function has no way of generating a Datadog trace context. We generate the trace context for a Step Function execution after the fact in the Logs to Traces reducer.

Trace ID: sha256(execution ARN)
Span ID: sha256(execution ARN + state name + state entered time)

When a Step Function invokes a Lambda Function, it can be made to include its execution context in the invocation. This change allows the extension to inspect a Step Function invocation event, extract the Step Function execution context, and use the same math to generate a Trace ID and Parent ID that match the Trace ID and Span ID that will be generated in Logs to Traces.

Of Note

The Trace ID generated in Logs to Traces is a 128-bit Trace ID. The common practice (Python, Node) for supporting 128-bit trace IDs is to split them in half: the lower 64 bits are expressed as the uint64 Trace ID, and the upper 64 bits are encoded as a hexadecimal string and written to the root span's _meta["_dd.p.tid"] tag.
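Below is a minimal, self-contained Go sketch of that derivation, for illustration only; the exact byte selection, masking, and string concatenation used in carriers.go may differ.

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// deriveStepFunctionTraceContext hashes the execution details and splits the
// resulting bits into a uint64 lower half (the Trace ID), a hex-encoded upper
// half (the _dd.p.tid value), and a uint64 span ID.
func deriveStepFunctionTraceContext(executionARN, stateName, stateEnteredTime string) (traceIDLower uint64, traceIDUpper64Hex string, spanID uint64) {
	traceHash := sha256.Sum256([]byte(executionARN))
	spanHash := sha256.Sum256([]byte(executionARN + stateName + stateEnteredTime))

	// Lower 64 bits become the uint64 Trace ID; the upper 64 bits are kept as
	// hex. Clearing the top bit keeps the IDs positive, which most tracers
	// expect; the real implementation's masking may differ.
	traceIDLower = binary.BigEndian.Uint64(traceHash[8:16]) &^ (1 << 63)
	traceIDUpper64Hex = hex.EncodeToString(traceHash[0:8])
	spanID = binary.BigEndian.Uint64(spanHash[8:16]) &^ (1 << 63)
	return
}

func main() {
	lower, upperHex, span := deriveStepFunctionTraceContext(
		"arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44",
		"agocsTest1",
		"2024-07-30T19:55:53.018Z",
	)
	fmt.Printf("trace id (lower 64): %d\n_dd.p.tid (upper 64, hex): %s\nparent span id: %d\n", lower, upperHex, span)
}

Because the inputs are stable for a given state execution, both the extension and the Logs to Traces reducer arrive at the same IDs without ever exchanging them.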

How does it work?

I defined a Step Function event type in events.go. Then, in extractor.go, I added a case to the type switch for the StepFunctionEvent type. That case calls createTraceContextFromStepFunctionInput in carriers.go, which extracts the execution ID, the state name, and the state entered time, and uses them to create a trace context with a 128-bit trace ID.

To support the 128-bit trace ID, I had to modify the TraceContext struct: I added a string field, TraceIDUpper64Hex. Because it's a string, its zero value is "".

Anywho, the case statement I added in extractor.go returns the created trace context. lp.Extractor.Extract() is called in the startExecutionSpan function in trace.go. The only change I made there was to check whether traceContext.TraceIDUpper64Hex is set and, if so, set a meta tag on the lifecycleProcessor.requestHandler: _dd.p.tid:{TraceIDUpper64Hex}.
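A condensed, runnable sketch of how those pieces fit together; the type and function names follow the PR, but the bodies are stand-ins (the derivation is stubbed with placeholder values — see the sha256 sketch above), not the actual extension code.

package main

import "fmt"

// Illustrative mirrors of the shapes described above; the real definitions
// live in events.go, extractor.go, carriers.go, and trace.go.
type StepFunctionEvent struct {
	Payload StepFunctionPayload
}

type StepFunctionPayload struct {
	Execution struct{ ID string }
	State     struct{ Name, EnteredTime string }
}

// TraceContext carries the upper 64 bits as a hex string; "" means "no upper
// 64 bits", so existing 64-bit code paths are unaffected.
type TraceContext struct {
	TraceID           uint64
	ParentID          uint64
	TraceIDUpper64Hex string
}

func extract(event interface{}) (*TraceContext, error) {
	switch ev := event.(type) {
	case StepFunctionEvent:
		return createTraceContextFromStepFunctionInput(ev)
	default:
		return nil, fmt.Errorf("unsupported event type %T", event)
	}
}

// Stubbed: the real function hashes the execution ARN, state name, and state
// entered time as sketched earlier. The values below are placeholders.
func createTraceContextFromStepFunctionInput(ev StepFunctionEvent) (*TraceContext, error) {
	return &TraceContext{TraceID: 1234567890, ParentID: 987654321, TraceIDUpper64Hex: "1914fe7789eb32be"}, nil
}

func main() {
	var ev StepFunctionEvent
	ev.Payload.Execution.ID = "arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44"
	tc, _ := extract(ev)
	if tc.TraceIDUpper64Hex != "" {
		// In trace.go this becomes a _dd.p.tid meta tag on the request handler.
		fmt.Printf("_dd.p.tid=%s trace_id=%d parent_id=%d\n", tc.TraceIDUpper64Hex, tc.TraceID, tc.ParentID)
	}
}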

Possible Drawbacks / Trade-offs

This change adds crypto/sha256 as a dependency in carriers.go, which might increase the size of the Extension binary if that library is not already linked somewhere else.

Describe how to test/QA your changes

  • Extensive unit tests
  • Figured out how to build the extension and tested it manually.

Screenshot of a Step Function invoking a .NET and a Java Lambda function:
[screenshot]

@pr-commenter pr-commenter bot commented Jul 26, 2024

Regression Detector Results

Run ID: 1f0988c4-8ca3-4ffc-8487-2233241d0098 Metrics dashboard Target profiles

Baseline: a0be446
Comparison: c223273

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

experiment                  goal                Δ mean %  Δ mean % CI      trials  links
pycheck_lots_of_tags        % cpu utilization   +0.65     [-1.92, +3.22]   1       Logs
uds_dogstatsd_to_api_cpu    % cpu utilization   +0.54     [-0.23, +1.31]   1       Logs
uds_dogstatsd_to_api        ingress throughput  +0.00     [-0.00, +0.00]   1       Logs
tcp_dd_logs_filter_exclude  ingress throughput  -0.00     [-0.01, +0.01]   1       Logs
file_tree                   memory utilization  -0.24     [-0.36, -0.12]   1       Logs
basic_py_check              % cpu utilization   -0.28     [-3.16, +2.61]   1       Logs
idle                        memory utilization  -0.34     [-0.38, -0.30]   1       Logs
otel_to_otel_logs           ingress throughput  -0.42     [-1.25, +0.41]   1       Logs
tcp_syslog_to_blackhole     ingress throughput  -0.72     [-0.77, -0.67]   1       Logs

Bounds Checks

experiment  bounds_check_name  replicates_passed
idle        memory_usage       10/10

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

@pr-commenter pr-commenter bot commented Jul 30, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=44720602 --os-family=ubuntu

Note: This applies to commit 1c5f028

@agocs agocs marked this pull request as ready for review August 1, 2024 17:50
@agocs agocs requested review from a team as code owners August 1, 2024 17:50
@@ -112,6 +113,11 @@ func (e Extractor) extract(event interface{}) (*TraceContext, error) {
carrier, err = headersCarrier(ev.Headers)
case events.LambdaFunctionURLRequest:
carrier, err = headersCarrier(ev.Headers)
case events.StepFunctionEvent:
Contributor

I am not a Go expert. Does this mean that if the event includes these three Step Functions-specific metadata fields, it matches the StepFunctionEvent type?

ParentID uint64
SamplingPriority sampler.SamplingPriority
TraceID uint64
TraceIDUpper64Hex string
Contributor

So, Java, .NET, and Go tracers will automatically recognize this new attribute in the trace context, and add it to the span metadata?

Contributor Author

That's a good question. My understanding is that the extension starts the trace and adds the TraceID plus the upper 64 bits. The Java, .NET, or Go tracer creates spans and then sends them to the extension. The extension adds them to the trace it created. I'll need to verify that the extension is doing enough with respect to the upper 64 bits (i.e., do they need to be added to the _meta._dd.p.tid field for every span? Or just the root span?)

@@ -189,3 +189,7 @@ func (lp *LifecycleProcessor) initFromLambdaFunctionURLEvent(event events.Lambda
lp.addTag(tagFunctionTriggerEventSourceArn, fmt.Sprintf("arn:aws:lambda:%v:%v:url:%v", region, accountID, functionName))
lp.addTags(trigger.GetTagsFromLambdaFunctionURLRequest(event))
}

func (lp *LifecycleProcessor) initFromStepFunctionEvent(event events.StepFunctionEvent) {
lp.requestHandler.event = event.Payload
Contributor

In every other method in this file, we set

lp.requestHandler.event = event

Why are we setting it to event.Payload in this case?

Contributor Author

That was the consequence of a hack to get around the fact that the trace context was wrapped in a Payload field, per the instructions here: https://docs.datadoghq.com/serverless/step_functions/installation/?tab=custom#:~:text=The%20JsonMerge%20intrinsic%20function%20merges%20the%20Step%20Functions%20context%20object%20(%24%24)%20with%20the%20original%20Lambda%E2%80%99s%20input%20payload%20(%24).%20Fields%20of%20the%20original%20payload%20overwrite%20the%20Step%20Functions%20context%20object%20if%20their%20keys%20are%20the%20same.

The actual event is a definitionless JSON object, so deserializing it and storing it is going to be slightly tricky. I'll think about it a little more.

Contributor

Personally, I think it's best to store the event in its original form. This way, all the way through the call stack, we can assume that the event is the actual event, and not just a piece of it.

As for json deserialization, you've already made that a bit easier by adding definitions to pkg/serverless/trigger/events/events.go. I created that file so that during deserialization, we only need to store the keys we're actually going to use. It really cuts down the number of allocs we make.

Contributor Author

👍

Contributor

I recommend you save the entire event, not just the payload. In addition to the reasons listed above, it makes your logic easier below.

Contributor

Or maybe I'm misunderstanding the order of operations here? Please correct me if I am wrong.

@@ -63,6 +63,9 @@ func (lp *LifecycleProcessor) startExecutionSpan(event interface{}, rawPayload [
inferredSpan.Span.TraceID = traceContext.TraceID
inferredSpan.Span.ParentID = traceContext.ParentID
}
if traceContext.TraceIDUpper64Hex != "" {
lp.requestHandler.SetMetaTag(ddTraceIDUpper64BitsHeader, traceContext.TraceIDUpper64Hex)
Contributor

Am I understanding this correctly, the upper 64 bits are set as a span tag? They are not included in any in/outbound trace context headers?

Contributor

Okay, this is an actual concern that I believe could become an actual problem.

Since the requestHandler is not reset on each invocation, it's feasible that one invocation has the upper 64 and then one later on does not. The later one would then still get the meta tag from the previous invocation. That's a problem.

Contributor Author

Hm. That is a concern. A lambda function could be invoked by a Step Function and then a non-Step-Function-thing. That would be rare, but not impossible. Would it be reasonable to remove that Meta tag in OnInvokeEnd?

Contributor Author

I've removed the meta tag here; instead I'm passing the upper 64 in the response that goes back to the tracer.
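For context, a rough, hypothetical sketch of what "passing the upper 64 in the response" can look like on the wire, following the Datadog header convention quoted in the dd-trace-dotnet commits further down (lower 64 bits in x-datadog-trace-id, upper 64 bits as the _dd.p.tid tag inside x-datadog-tags). The helper name and exact header set here are illustrative, not the extension's actual /lambda/start-invocation handler.

package main

import (
	"fmt"
	"net/http"
	"strconv"
)

// writeStartInvocationHeaders is a hypothetical helper showing the header
// convention for handing a 128-bit trace context back to the tracer.
func writeStartInvocationHeaders(h http.Header, traceIDLower uint64, traceIDUpper64Hex string, samplingPriority int) {
	h.Set("x-datadog-trace-id", strconv.FormatUint(traceIDLower, 10))
	h.Set("x-datadog-sampling-priority", strconv.Itoa(samplingPriority))
	if traceIDUpper64Hex != "" {
		// The upper 64 bits travel as the _dd.p.tid tag inside x-datadog-tags.
		h.Set("x-datadog-tags", "_dd.p.tid="+traceIDUpper64Hex)
	}
}

func main() {
	h := http.Header{}
	// Placeholder values for illustration.
	writeStartInvocationHeaders(h, 1234567890, "1914fe7789eb32be", 1)
	fmt.Println(h)
}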

tc, err := createTraceContextFromStepFunctionInput(ev)
if err == nil {
return tc, nil
}
Contributor

When I first looked at this, I thought there was a better way to do it. After closer inspection, there is another way to do it, but I'm not so sure it's better.

My biggest concern is consistency. We want to defer as much work to dd-trace-go as possible. That's what all these carriers are for -- so we can pass the set of found headers to the dd-trace-go propagator for extraction.

I threw together what this could look like, here 8ed2a33. While this version makes the code more consistent, I'm not convinced this is any better, and in fact I think it is worse because it means creating the headers map, extracting from that map, just to later recreate it. Dumb.

So, in the end, this PR is good. I'm sad we can't be more consistent about how we are extracting trace context from step functions. Ultimately though, they are different enough to get their own way of doing things.

}
sfe := events.StepFunctionEvent{Payload: eventPayload}
ev = eventPayload
lp.initFromStepFunctionEvent(sfe)
Contributor

Nice. I like this. This way, the event passed out of this method for both of these step function types will be the same.

@@ -63,6 +64,9 @@ func (lp *LifecycleProcessor) startExecutionSpan(event interface{}, rawPayload [
inferredSpan.Span.TraceID = traceContext.TraceID
inferredSpan.Span.ParentID = traceContext.ParentID
}
if traceContext.TraceIDUpper64Hex != "" {
executionContext.TraceIDUpper64Hex = traceContext.TraceIDUpper64Hex
Contributor

nit.

I think you do not need the conditional here. You can just always set the upper 64 from the context.

On each invocation, the execution context is reset.

When execution reaches this point, executionContext.TraceIDUpper64Hex will therefore always be "".

Therefore, there is no need to avoid setting executionContext.TraceIDUpper64Hex to "".

Contributor Author

What's more expensive, if traceContext.TraceIDUpper64Hex != "" or executionContext.TraceIDUpper64Hex = traceContext.TraceIDUpper64Hex?

Contributor

Not sure actually. But I just realized that we shouldn't set the Upper64BitsTag tag if the upper 64 is empty. So we should leave the conditional, at least for that portion.

Contributor

In fact 💡, instead of deleting the Upper64BitsTag key at end invocation, you could do it here. That would keep all logic that mutates the requestHandler's meta tags in a single location.

Contributor Author

hm, clever

@purple4reina (Contributor)

Okay, correct me if I am wrong, but I think there are a couple of things that we've missed. The extension creates several spans of its own (the aws.lambda span and the inferred spans). These must also get the upper 64 bits.

@agocs agocs commented Sep 16, 2024

> Okay, correct me if I am wrong, but I think there are a couple of things that we've missed. The extension creates several spans of its own (the aws.lambda span and the inferred spans). These must also get the upper 64 bits.

Do you think we want to set meta._dd.p.tid in the extension in onInvokeStart and unset it in onInvokeEnd?

@agocs agocs commented Sep 18, 2024

@purple4reina I went ahead and added the upper 64 bits to the meta tags, and then removed them after endExecutionSpan.

log.Debugf("Failed to unmarshal %s event: %s", stepFunction, err)
break
}
sfe := events.StepFunctionEvent{Payload: eventPayload}
Contributor

If you set the event to the full event and not just event.Payload above, you wouldn't need to do this here.

Contributor Author

Because Legacy and non-Legacy step function lambda invocations unmarshal into different types, we have to do a little awkward juggling regardless.

Contributor

Ahhh, I think you mentioned this before. Can you point me again to what those two JSON blobs look like? I'm now wondering if there's a way we could unify them into a single event type, even if they look a bit different from one another.

Contributor Author

Legacy:

{
  "Payload": {
    "Execution": {
      "Id": "arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44",
      "Input": {},
      "StartTime": "2024-07-30T19:55:52.976Z",
      "Name": "bc9f281c-3daa-4e5a-9a60-471a3810bf44",
      "RoleArn": "arn:aws:iam::425362996713:role/test-serverless-stepfunctions-dev-AgocsTestSFRole-tRkeFXScjyk4",
      "RedriveCount": 0
    },
    "StateMachine": {
      "Id": "arn:aws:states:us-east-1:425362996713:stateMachine:agocsTestSF",
      "Name": "agocsTestSF"
    },
    "State": {
      "Name": "agocsTest1",
      "EnteredTime": "2024-07-30T19:55:53.018Z",
      "RetryCount": 0
    }
  }
}

Non-legacy

{
  "Execution": {
    "Id": "arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44",
    "Input": {},
    "StartTime": "2024-07-30T19:55:52.976Z",
    "Name": "bc9f281c-3daa-4e5a-9a60-471a3810bf44",
    "RoleArn": "arn:aws:iam::425362996713:role/test-serverless-stepfunctions-dev-AgocsTestSFRole-tRkeFXScjyk4",
    "RedriveCount": 0
  },
  "StateMachine": {
    "Id": "arn:aws:states:us-east-1:425362996713:stateMachine:agocsTestSF",
    "Name": "agocsTestSF"
  },
  "State": {
    "Name": "agocsTest1",
    "EnteredTime": "2024-07-30T19:55:53.018Z",
    "RetryCount": 0
  }
}
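To make the difference concrete, here is a small, self-contained sketch of how both shapes could be unmarshalled into one payload type. The struct definitions are illustrative stand-ins (the real, trimmed-down types live in pkg/serverless/trigger/events/events.go); only the keys used for trace derivation are declared, which is also what keeps allocations down since encoding/json ignores the rest.

package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types: field names mirror the JSON above, but the real
// definitions keep only the keys the extension actually uses.
type stepFunctionPayload struct {
	Execution struct {
		ID string `json:"Id"`
	} `json:"Execution"`
	State struct {
		Name        string `json:"Name"`
		EnteredTime string `json:"EnteredTime"`
	} `json:"State"`
}

type legacyStepFunctionEvent struct {
	Payload stepFunctionPayload `json:"Payload"`
}

// parseStepFunctionEvent tries the non-legacy shape first and falls back to
// the legacy (Payload-wrapped) shape, returning one unified payload.
func parseStepFunctionEvent(raw []byte) (stepFunctionPayload, error) {
	var direct stepFunctionPayload
	if err := json.Unmarshal(raw, &direct); err == nil && direct.Execution.ID != "" {
		return direct, nil
	}
	var legacy legacyStepFunctionEvent
	if err := json.Unmarshal(raw, &legacy); err != nil {
		return stepFunctionPayload{}, err
	}
	return legacy.Payload, nil
}

func main() {
	nonLegacy := []byte(`{"Execution":{"Id":"arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44"},"State":{"Name":"agocsTest1","EnteredTime":"2024-07-30T19:55:53.018Z"}}`)
	legacy := []byte(`{"Payload":` + string(nonLegacy) + `}`)

	for _, raw := range [][]byte{nonLegacy, legacy} {
		p, err := parseStepFunctionEvent(raw)
		fmt.Println(p.Execution.ID, p.State.Name, p.State.EnteredTime, err)
	}
}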

Contributor

Serverless Benchmark Results

BenchmarkStartEndInvocation comparison between 0df804c and fdbc451.

tl;dr

Use these benchmarks as an insight tool during development.

  1. Skim down the vs base column in each chart. If there is a ~, then there was no statistically significant change to the benchmark. Otherwise, ensure the estimated percent change is either negative or very small.

  2. The last row of each chart is the geomean. Ensure this percentage is either negative or very small.

What is this benchmarking?

The BenchmarkStartEndInvocation compares the amount of time it takes to call the start-invocation and end-invocation endpoints. For universal instrumentation languages (Dotnet, Golang, Java, Ruby), this represents the majority of the duration overhead added by our tracing layer.

The benchmark is run using a large variety of lambda request payloads. In the charts below, there is one row for each event payload type.

How do I interpret these charts?

The charts below come from benchstat. They represent the statistical change in duration (sec/op), memory overhead (B/op), and allocations (allocs/op).

The benchstat docs explain how to interpret these charts.

Before the comparison table, we see common file-level configuration. If there are benchmarks with different configuration (for example, from different packages), benchstat will print separate tables for each configuration.

The table then compares the two input files for each benchmark. It shows the median and 95% confidence interval summaries for each benchmark before and after the change, and an A/B comparison under "vs base". ... The p-value measures how likely it is that any differences were due to random chance (i.e., noise). The "~" means benchstat did not detect a statistically significant difference between the two inputs. ...

Note that "statistically significant" is not the same as "large": with enough low-noise data, even very small changes can be distinguished from noise and considered statistically significant. It is, of course, generally easier to distinguish large changes from noise.

Finally, the last row of the table shows the geometric mean of each column, giving an overall picture of how the benchmarks changed. Proportional changes in the geomean reflect proportional changes in the benchmarks. For example, given n benchmarks, if sec/op for one of them increases by a factor of 2, then the sec/op geomean will increase by a factor of ⁿ√2.

I need more help

First off, do not worry if the benchmarks are failing. They are not tests. The intention is for them to be a tool for you to use during development.

If you would like a hand interpreting the results come chat with us in #serverless-agent in the internal DataDog slack or in #serverless in the public DataDog slack. We're happy to help!

Benchmark stats
goos: linux
goarch: amd64
pkg: github.com/DataDog/datadog-agent/pkg/serverless/daemon
cpu: AMD EPYC 7763 64-Core Processor                
                                      │ baseline/benchmark.log │        current/benchmark.log        │
                                      │         sec/op         │    sec/op     vs base               │
api-gateway-appsec.json                           84.50µ ± 15%    88.52µ ± 2%       ~ (p=0.143 n=10)
api-gateway-kong-appsec.json                      67.07µ ±  2%    71.43µ ± 4%  +6.50% (p=0.000 n=10)
api-gateway-kong.json                             65.19µ ±  1%    70.15µ ± 4%  +7.61% (p=0.000 n=10)
api-gateway-non-proxy-async.json                  104.9µ ±  2%    106.9µ ± 4%  +1.89% (p=0.023 n=10)
api-gateway-non-proxy.json                        106.0µ ±  1%    109.0µ ± 2%  +2.82% (p=0.000 n=10)
api-gateway-websocket-connect.json                70.23µ ±  2%    73.41µ ± 1%  +4.53% (p=0.000 n=10)
api-gateway-websocket-default.json                64.15µ ±  1%    65.90µ ± 1%  +2.72% (p=0.000 n=10)
api-gateway-websocket-disconnect.json             64.32µ ±  1%    65.41µ ± 2%  +1.68% (p=0.004 n=10)
api-gateway.json                                  117.9µ ±  1%    118.0µ ± 1%       ~ (p=0.912 n=10)
application-load-balancer.json                    64.10µ ±  1%    65.49µ ± 1%  +2.17% (p=0.002 n=10)
cloudfront.json                                   47.89µ ±  3%    50.27µ ± 2%  +4.97% (p=0.000 n=10)
cloudwatch-events.json                            38.81µ ±  2%    39.62µ ± 2%  +2.08% (p=0.001 n=10)
cloudwatch-logs.json                              66.63µ ±  1%    68.60µ ± 2%  +2.95% (p=0.002 n=10)
custom.json                                       31.49µ ±  2%    32.72µ ± 2%  +3.93% (p=0.000 n=10)
dynamodb.json                                     93.86µ ±  1%    97.45µ ± 2%  +3.83% (p=0.000 n=10)
empty.json                                        29.16µ ±  2%    31.12µ ± 4%  +6.71% (p=0.000 n=10)
eventbridge-custom.json                           42.30µ ±  2%    44.03µ ± 3%  +4.08% (p=0.001 n=10)
http-api.json                                     72.73µ ±  1%    74.62µ ± 2%  +2.60% (p=0.000 n=10)
kinesis-batch.json                                71.26µ ±  2%    72.74µ ± 1%  +2.07% (p=0.000 n=10)
kinesis.json                                      54.16µ ±  1%    56.33µ ± 2%  +3.99% (p=0.000 n=10)
s3.json                                           59.69µ ±  3%    61.42µ ± 2%  +2.88% (p=0.002 n=10)
sns-batch.json                                    89.64µ ±  1%    92.60µ ± 2%  +3.31% (p=0.000 n=10)
sns.json                                          64.56µ ±  2%    67.45µ ± 1%  +4.47% (p=0.000 n=10)
snssqs.json                                       109.8µ ±  1%    113.7µ ± 3%  +3.55% (p=0.000 n=10)
snssqs_no_dd_context.json                         99.00µ ±  2%   104.00µ ± 3%  +5.04% (p=0.002 n=10)
sqs-aws-header.json                               55.80µ ±  2%    57.97µ ± 2%  +3.90% (p=0.000 n=10)
sqs-batch.json                                    95.36µ ±  2%    97.64µ ± 2%  +2.39% (p=0.003 n=10)
sqs.json                                          69.30µ ±  2%    71.20µ ± 2%  +2.74% (p=0.002 n=10)
sqs_no_dd_context.json                            62.31µ ±  3%    63.44µ ± 1%  +1.81% (p=0.043 n=10)
stepfunction.json                                                 49.71µ ± 5%
geomean                                           67.20µ          68.78µ       +3.51%

                                      │ baseline/benchmark.log │        current/benchmark.log        │
                                      │          B/op          │     B/op      vs base               │
api-gateway-appsec.json                           37.26Ki ± 0%   37.27Ki ± 0%  +0.05% (p=0.001 n=10)
api-gateway-kong-appsec.json                      26.92Ki ± 0%   26.93Ki ± 0%  +0.07% (p=0.000 n=10)
api-gateway-kong.json                             24.41Ki ± 0%   24.44Ki ± 0%  +0.09% (p=0.000 n=10)
api-gateway-non-proxy-async.json                  48.03Ki ± 0%   48.08Ki ± 0%  +0.10% (p=0.007 n=10)
api-gateway-non-proxy.json                        47.26Ki ± 0%   47.31Ki ± 0%  +0.09% (p=0.000 n=10)
api-gateway-websocket-connect.json                25.47Ki ± 0%   25.50Ki ± 0%  +0.12% (p=0.000 n=10)
api-gateway-websocket-default.json                21.38Ki ± 0%   21.42Ki ± 0%  +0.19% (p=0.000 n=10)
api-gateway-websocket-disconnect.json             21.16Ki ± 0%   21.20Ki ± 0%  +0.21% (p=0.000 n=10)
api-gateway.json                                  49.56Ki ± 0%   49.60Ki ± 0%  +0.08% (p=0.000 n=10)
application-load-balancer.json                    22.35Ki ± 0%   22.38Ki ± 0%  +0.16% (p=0.000 n=10)
cloudfront.json                                   17.65Ki ± 0%   17.68Ki ± 0%  +0.13% (p=0.000 n=10)
cloudwatch-events.json                            11.69Ki ± 0%   11.72Ki ± 0%  +0.22% (p=0.001 n=10)
cloudwatch-logs.json                              53.37Ki ± 0%   53.39Ki ± 0%  +0.03% (p=0.022 n=10)
custom.json                                       9.736Ki ± 0%   9.760Ki ± 0%  +0.24% (p=0.005 n=10)
dynamodb.json                                     40.78Ki ± 0%   40.80Ki ± 0%       ~ (p=0.593 n=10)
empty.json                                        9.304Ki ± 0%   9.333Ki ± 0%  +0.31% (p=0.022 n=10)
eventbridge-custom.json                           13.44Ki ± 0%   13.44Ki ± 0%       ~ (p=0.255 n=10)
http-api.json                                     23.73Ki ± 0%   23.76Ki ± 0%  +0.12% (p=0.001 n=10)
kinesis-batch.json                                27.01Ki ± 0%   27.05Ki ± 0%  +0.13% (p=0.025 n=10)
kinesis.json                                      17.81Ki ± 0%   17.83Ki ± 0%  +0.08% (p=0.027 n=10)
s3.json                                           20.36Ki ± 0%   20.35Ki ± 0%       ~ (p=0.753 n=10)
sns-batch.json                                    38.66Ki ± 0%   38.74Ki ± 0%  +0.21% (p=0.001 n=10)
sns.json                                          24.01Ki ± 0%   24.08Ki ± 0%  +0.30% (p=0.005 n=10)
snssqs.json                                       50.55Ki ± 0%   50.60Ki ± 0%       ~ (p=0.105 n=10)
snssqs_no_dd_context.json                         44.81Ki ± 0%   44.87Ki ± 0%  +0.12% (p=0.008 n=10)
sqs-aws-header.json                               18.84Ki ± 0%   18.94Ki ± 1%  +0.49% (p=0.019 n=10)
sqs-batch.json                                    41.65Ki ± 0%   41.76Ki ± 0%  +0.27% (p=0.000 n=10)
sqs.json                                          25.63Ki ± 0%   25.64Ki ± 0%       ~ (p=0.671 n=10)
sqs_no_dd_context.json                            20.66Ki ± 0%   20.69Ki ± 1%       ~ (p=0.631 n=10)
stepfunction.json                                                14.17Ki ± 1%
geomean                                           25.72Ki        25.25Ki       +0.14%

                                      │ baseline/benchmark.log │        current/benchmark.log        │
                                      │       allocs/op        │ allocs/op   vs base                 │
api-gateway-appsec.json                             630.0 ± 0%   629.5 ± 0%       ~ (p=1.000 n=10)
api-gateway-kong-appsec.json                        488.0 ± 0%   488.0 ± 0%       ~ (p=1.000 n=10) ¹
api-gateway-kong.json                               466.0 ± 0%   466.0 ± 0%       ~ (p=1.000 n=10) ¹
api-gateway-non-proxy-async.json                    726.0 ± 0%   726.0 ± 0%       ~ (p=1.000 n=10)
api-gateway-non-proxy.json                          716.0 ± 0%   716.0 ± 0%       ~ (p=0.582 n=10)
api-gateway-websocket-connect.json                  453.0 ± 0%   453.0 ± 0%       ~ (p=1.000 n=10)
api-gateway-websocket-default.json                  379.0 ± 0%   379.0 ± 0%       ~ (p=1.000 n=10)
api-gateway-websocket-disconnect.json               370.0 ± 0%   370.0 ± 0%       ~ (p=1.000 n=10)
api-gateway.json                                    791.0 ± 0%   791.0 ± 0%       ~ (p=1.000 n=10)
application-load-balancer.json                      352.0 ± 0%   352.0 ± 0%       ~ (p=1.000 n=10) ¹
cloudfront.json                                     284.0 ± 0%   284.0 ± 0%       ~ (p=1.000 n=10) ¹
cloudwatch-events.json                              220.0 ± 0%   220.5 ± 0%       ~ (p=0.650 n=10)
cloudwatch-logs.json                                216.0 ± 0%   216.0 ± 0%       ~ (p=0.628 n=10)
custom.json                                         169.0 ± 1%   169.0 ± 1%       ~ (p=0.628 n=10)
dynamodb.json                                       589.0 ± 0%   589.0 ± 0%       ~ (p=0.697 n=10)
empty.json                                          160.0 ± 0%   160.0 ± 1%       ~ (p=0.211 n=10)
eventbridge-custom.json                             254.0 ± 0%   254.0 ± 0%       ~ (p=1.000 n=10)
http-api.json                                       433.0 ± 0%   432.0 ± 0%       ~ (p=0.370 n=10)
kinesis-batch.json                                  390.5 ± 0%   391.0 ± 0%       ~ (p=0.650 n=10)
kinesis.json                                        285.0 ± 0%   285.0 ± 0%       ~ (p=1.000 n=10)
s3.json                                             358.0 ± 0%   357.5 ± 0%       ~ (p=0.253 n=10)
sns-batch.json                                      454.5 ± 0%   455.0 ± 0%       ~ (p=0.059 n=10)
sns.json                                            323.5 ± 0%   324.0 ± 0%       ~ (p=0.068 n=10)
snssqs.json                                         438.0 ± 0%   438.0 ± 0%       ~ (p=0.810 n=10)
snssqs_no_dd_context.json                           400.0 ± 0%   400.0 ± 0%       ~ (p=0.230 n=10)
sqs-aws-header.json                                 274.0 ± 0%   275.0 ± 1%  +0.36% (p=0.047 n=10)
sqs-batch.json                                      504.0 ± 0%   505.0 ± 0%  +0.20% (p=0.012 n=10)
sqs.json                                            352.0 ± 0%   352.0 ± 1%       ~ (p=0.310 n=10)
sqs_no_dd_context.json                              324.0 ± 1%   325.0 ± 1%       ~ (p=0.806 n=10)
stepfunction.json                                                236.0 ± 1%
geomean                                             376.6        370.9       +0.04%
¹ all samples are equal

executionContext.TraceIDUpper64Hex = traceContext.TraceIDUpper64Hex
lp.requestHandler.SetMetaTag(Upper64BitsTag, traceContext.TraceIDUpper64Hex)
} else {
delete(lp.requestHandler.triggerTags, Upper64BitsTag)
Contributor

cool cool, and I just confirmed that attempting to delete a key that does not exist does not panic. Nor does it panic if the map is actually nil. 👍🏽
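For anyone curious, a tiny self-contained check of that property of Go's built-in delete (the map key and value here are placeholders):

package main

import "fmt"

func main() {
	m := map[string]string{"_dd.p.tid": "example-upper-64"}
	delete(m, "not-present") // deleting a missing key is a no-op

	var nilMap map[string]string
	delete(nilMap, "_dd.p.tid") // deleting from a nil map is also a no-op, no panic

	fmt.Println(m, nilMap == nil)
}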

Contributor Author

Oh good lookin' out!

@agocs agocs commented Sep 19, 2024

/merge

@dd-devflow dd-devflow bot commented Sep 19, 2024

🚂 MergeQueue: waiting for PR to be ready

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

Use /merge -c to cancel this operation!

@dd-devflow dd-devflow bot commented Sep 19, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 22m.

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit a4534bc into main Sep 19, 2024
232 checks passed
@dd-mergequeue dd-mergequeue bot deleted the chris.agocs/create_trace_context_from_stepfunction_input branch September 19, 2024 20:03
@github-actions github-actions bot added this to the 7.59.0 milestone Sep 19, 2024
bouwkast added a commit to DataDog/dd-trace-dotnet that referenced this pull request Oct 22, 2024
## Summary of changes

I've updated the Lambda extension so it is capable of returning a 128
bit trace ID when a tracer calls the `/lambda/start-invocation` endpoint
[in this PR](DataDog/datadog-agent#27988)

As per [the
RFC](https://datadoghq.atlassian.net/wiki/spaces/RUMP/pages/3545630931/RFC+Support+128+bit+trace+IDs+in+RUM+SDKs#:~:text=For%20Datadog%20headers%2C%20the%20128%20bit%20trace%20id%20is%20sent%20in%20two%20parts%2C%20lower%2064%20bits%20in%20x%2Ddatadog%2Dtrace%2Did%20(decimal)%20and%20the%20higher%2064%20bits%20in%20x%2Ddatadog%2Dtags%20header%20under%20_dd.p.tid%20(hex)%20tag),
the

> lower 64 bits in `x-datadog-trace-id` (decimal) and the higher 64 bits
in `x-datadog-tags` header under `_dd.p.tid` (hex) tag.

This change modifies the function that calls `/lambda/start-invocation`,
allowing it to pick out the upper 64 bits of the trace ID and set the
resulting 128-bit trace ID in the extracted context.


## Reason for change

The Lambda Extension may now return a 128 bit trace ID when a Step
Function invokes a Lambda Function.

## Implementation details

I rewrote LambdaCommon's `CreatePlaceholderScope` so it uses
`SpanContextPropagator.Instance.Extract` rather than extracting trace
context elements one by one.

## Test coverage

Added a unit test for 128 bit trace IDs. Fixed existing unit tests so
they pass a dictionary of headers to CreatePlaceholderScope. Removed a
unit test that only passes SamplingPriority, since a distributed trace
with only a sampling priority is hardly a distributed trace at all.

## Other details
Backported to 2.x in (TODO)

---------

Co-authored-by: Lucas Pimentel <[email protected]>
Co-authored-by: Andrew Lock <[email protected]>
Co-authored-by: Daniel Romano <[email protected]>
Co-authored-by: Steven Bouwkamp <[email protected]>
Co-authored-by: Anna <[email protected]>
Co-authored-by: NachoEchevarria <[email protected]>
Co-authored-by: William Conti <[email protected]>
Co-authored-by: Kevin Gosse <[email protected]>
Co-authored-by: Tony Redondo <[email protected]>
Co-authored-by: Gregory LEOCADIE <[email protected]>
agocs added a commit to DataDog/dd-trace-dotnet that referenced this pull request Oct 22, 2024
(Same commit message as the commit above.)
bouwkast added a commit to DataDog/dd-trace-dotnet that referenced this pull request Oct 22, 2024
…6181)

(Backport to 2.x of #6041; same commit message as the commits above.)