
[SVLS-5034] Create trace context from Step Function execution details #27988

Merged

Conversation

@agocs agocs commented Jul 26, 2024

What does this PR do?

Deterministically create a trace context from Step Function execution ARN, state name, and state entered time for Universal Instrumentation runtimes.


A Step Function has no way of generating a Datadog trace context. We generate the trace context for a Step Function execution after the fact in the Logs to Traces reducer.

Trace ID: sha256(execution ARN)
Span ID: sha256(execution ARN + state name + state entered time)

When a Step Function invokes a Lambda Function, it can be made to include its execution context in the invocation. This change allows the extension to inspect a Step Function invocation event, extract the Step Function execution context, and use the same math to generate a Trace ID and Parent ID that match the Trace ID and Span ID that will be generated in Logs to Traces.

Of Note

The Trace ID generated in Logs to Traces is a 128-bit Trace ID. The common practice (Python, Node) for supporting 128-bit trace IDs is to split them in half: the lower 64 bits are expressed as the uint64 Trace ID, and the upper 64 bits are encoded as a hexadecimal string and written to the root span's _meta["_dd.p.tid"] tag.
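Below is a minimal, self-contained Go sketch of that derivation, for illustration only; the exact byte selection, masking, and string concatenation used in carriers.go may differ.

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// deriveStepFunctionTraceContext hashes the execution details and splits the
// resulting bits into a uint64 lower half (the Trace ID), a hex-encoded upper
// half (the _dd.p.tid value), and a uint64 span ID.
func deriveStepFunctionTraceContext(executionARN, stateName, stateEnteredTime string) (traceIDLower uint64, traceIDUpper64Hex string, spanID uint64) {
	traceHash := sha256.Sum256([]byte(executionARN))
	spanHash := sha256.Sum256([]byte(executionARN + stateName + stateEnteredTime))

	// Lower 64 bits become the uint64 Trace ID; the upper 64 bits are kept as
	// hex. Clearing the top bit keeps the IDs positive, which most tracers
	// expect; the real implementation's masking may differ.
	traceIDLower = binary.BigEndian.Uint64(traceHash[8:16]) &^ (1 << 63)
	traceIDUpper64Hex = hex.EncodeToString(traceHash[0:8])
	spanID = binary.BigEndian.Uint64(spanHash[8:16]) &^ (1 << 63)
	return
}

func main() {
	lower, upperHex, span := deriveStepFunctionTraceContext(
		"arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44",
		"agocsTest1",
		"2024-07-30T19:55:53.018Z",
	)
	fmt.Printf("trace id (lower 64): %d\n_dd.p.tid (upper 64, hex): %s\nparent span id: %d\n", lower, upperHex, span)
}

Because the inputs are stable for a given state execution, both the extension and the Logs to Traces reducer arrive at the same IDs without ever exchanging them.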

How does it work?

I defined a Step Function event type in events.go. Then, in extractor.go, I added a case to the type switch for the StepFunctionEvent type. That case calls createTraceContextFromStepFunctionInput in carriers.go, which extracts the execution ID, the state name, and the state entered time, and uses them to create a trace context with a 128-bit trace ID.

To support the 128-bit trace ID, I had to modify the TraceContext struct: I added a string field, TraceIDUpper64Hex. Because it's a string, its zero value is "".

Anywho, the case statement I added in extractor.go returns the created trace context. lp.Extractor.Extract() is called in the startExecutionSpan function in trace.go. The only change I made there was to check whether traceContext.TraceIDUpper64Hex is set and, if so, set a meta tag on the lifecycleProcessor.requestHandler: _dd.p.tid:{TraceIDUpper64Hex}.
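A condensed, runnable sketch of how those pieces fit together; the type and function names follow the PR, but the bodies are stand-ins (the derivation is stubbed with placeholder values — see the sha256 sketch above), not the actual extension code.

package main

import "fmt"

// Illustrative mirrors of the shapes described above; the real definitions
// live in events.go, extractor.go, carriers.go, and trace.go.
type StepFunctionEvent struct {
	Payload StepFunctionPayload
}

type StepFunctionPayload struct {
	Execution struct{ ID string }
	State     struct{ Name, EnteredTime string }
}

// TraceContext carries the upper 64 bits as a hex string; "" means "no upper
// 64 bits", so existing 64-bit code paths are unaffected.
type TraceContext struct {
	TraceID           uint64
	ParentID          uint64
	TraceIDUpper64Hex string
}

func extract(event interface{}) (*TraceContext, error) {
	switch ev := event.(type) {
	case StepFunctionEvent:
		return createTraceContextFromStepFunctionInput(ev)
	default:
		return nil, fmt.Errorf("unsupported event type %T", event)
	}
}

// Stubbed: the real function hashes the execution ARN, state name, and state
// entered time as sketched earlier. The values below are placeholders.
func createTraceContextFromStepFunctionInput(ev StepFunctionEvent) (*TraceContext, error) {
	return &TraceContext{TraceID: 1234567890, ParentID: 987654321, TraceIDUpper64Hex: "1914fe7789eb32be"}, nil
}

func main() {
	var ev StepFunctionEvent
	ev.Payload.Execution.ID = "arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44"
	tc, _ := extract(ev)
	if tc.TraceIDUpper64Hex != "" {
		// In trace.go this becomes a _dd.p.tid meta tag on the request handler.
		fmt.Printf("_dd.p.tid=%s trace_id=%d parent_id=%d\n", tc.TraceIDUpper64Hex, tc.TraceID, tc.ParentID)
	}
}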

Possible Drawbacks / Trade-offs

This change adds crypto/sha256 as a dependency in carriers.go, which might increase the size of the Extension binary if that library is not already linked somewhere else.

Describe how to test/QA your changes

  • Extensive unit tests
  • Figured out how to build the extension and tested it manually.

Screenshot of a Step Function invoking a .NET and a Java Lambda function:
[screenshot]

@pr-commenter pr-commenter bot commented Jul 26, 2024

Regression Detector Results

Run ID: 1f0988c4-8ca3-4ffc-8487-2233241d0098 Metrics dashboard Target profiles

Baseline: a0be446
Comparison: c223273

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

experiment                  goal                Δ mean %  Δ mean % CI      trials  links
pycheck_lots_of_tags        % cpu utilization   +0.65     [-1.92, +3.22]   1       Logs
uds_dogstatsd_to_api_cpu    % cpu utilization   +0.54     [-0.23, +1.31]   1       Logs
uds_dogstatsd_to_api        ingress throughput  +0.00     [-0.00, +0.00]   1       Logs
tcp_dd_logs_filter_exclude  ingress throughput  -0.00     [-0.01, +0.01]   1       Logs
file_tree                   memory utilization  -0.24     [-0.36, -0.12]   1       Logs
basic_py_check              % cpu utilization   -0.28     [-3.16, +2.61]   1       Logs
idle                        memory utilization  -0.34     [-0.38, -0.30]   1       Logs
otel_to_otel_logs           ingress throughput  -0.42     [-1.25, +0.41]   1       Logs
tcp_syslog_to_blackhole     ingress throughput  -0.72     [-0.77, -0.67]   1       Logs

Bounds Checks

experiment  bounds_check_name  replicates_passed
idle        memory_usage       10/10

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

@pr-commenter pr-commenter bot commented Jul 30, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=44720602 --os-family=ubuntu

Note: This applies to commit 1c5f028

@agocs agocs marked this pull request as ready for review August 1, 2024 17:50
@agocs agocs requested review from a team as code owners August 1, 2024 17:50
@@ -112,6 +113,11 @@ func (e Extractor) extract(event interface{}) (*TraceContext, error) {
carrier, err = headersCarrier(ev.Headers)
case events.LambdaFunctionURLRequest:
carrier, err = headersCarrier(ev.Headers)
case events.StepFunctionEvent:
Contributor

I am not a Go expert. Does this mean that if the event includes these three Step Functions-specific metadata fields, it matches the StepFunctionEvent type?

ParentID uint64
SamplingPriority sampler.SamplingPriority
TraceID uint64
TraceIDUpper64Hex string
Contributor

So, Java, .NET, and Go tracers will automatically recognize this new attribute in the trace context, and add it to the span metadata?

Contributor Author

That's a good question. My understanding is that the extension starts the trace and adds the TraceID plus the upper 64 bits. The Java, .NET, or Go tracer creates spans and then sends them to the extension. The extension adds them to the trace it created. I'll need to verify that the extension is doing enough with respect to the upper 64 bits (i.e., do they need to be added to the _meta._dd.p.tid field for every span? Or just the root span?)

@@ -189,3 +189,7 @@ func (lp *LifecycleProcessor) initFromLambdaFunctionURLEvent(event events.Lambda
lp.addTag(tagFunctionTriggerEventSourceArn, fmt.Sprintf("arn:aws:lambda:%v:%v:url:%v", region, accountID, functionName))
lp.addTags(trigger.GetTagsFromLambdaFunctionURLRequest(event))
}

func (lp *LifecycleProcessor) initFromStepFunctionEvent(event events.StepFunctionEvent) {
lp.requestHandler.event = event.Payload
Contributor

In every other method in this file, we set

lp.requestHandler.event = event

Why are we setting it to event.Payload in this case?

Contributor Author

That was the consequence of a hack to get around the fact that the trace context was wrapped in a Payload field, per the instructions here: https://docs.datadoghq.com/serverless/step_functions/installation/?tab=custom#:~:text=The%20JsonMerge%20intrinsic%20function%20merges%20the%20Step%20Functions%20context%20object%20(%24%24)%20with%20the%20original%20Lambda%E2%80%99s%20input%20payload%20(%24).%20Fields%20of%20the%20original%20payload%20overwrite%20the%20Step%20Functions%20context%20object%20if%20their%20keys%20are%20the%20same.

The actual event is a definitionless JSON object, so deserializing it and storing it is going to be slightly tricky. I'll think about it a little more.

Contributor

Personally, I think it's best to store the event in its original form. This way, all the way through the call stack, we can assume that the event is the actual event, and not just a piece of it.

As for json deserialization, you've already made that a bit easier by adding definitions to pkg/serverless/trigger/events/events.go. I created that file so that during deserialization, we only need to store the keys we're actually going to use. It really cuts down the number of allocs we make.

Contributor Author

👍

Contributor

I recommend you save the entire event, not just the payload. In addition to the reasons listed above, it makes your logic easier below.

Contributor

Or maybe I'm misunderstanding the order of operations here? Please correct me if I am wrong.

@@ -63,6 +63,9 @@ func (lp *LifecycleProcessor) startExecutionSpan(event interface{}, rawPayload [
inferredSpan.Span.TraceID = traceContext.TraceID
inferredSpan.Span.ParentID = traceContext.ParentID
}
if traceContext.TraceIDUpper64Hex != "" {
lp.requestHandler.SetMetaTag(ddTraceIDUpper64BitsHeader, traceContext.TraceIDUpper64Hex)
Contributor

Am I understanding this correctly, the upper 64 bits are set as a span tag? They are not included in any in/outbound trace context headers?

Contributor

Okay, this is an actual concern that I believe could become an actual problem.

Since the requestHandler is not reset on each invocation, it's feasible that one invocation has the upper 64 and then one later on does not. The later one would then still get the meta tag from the previous invocation. That's a problem.

Contributor Author

Hm. That is a concern. A lambda function could be invoked by a Step Function and then a non-Step-Function-thing. That would be rare, but not impossible. Would it be reasonable to remove that Meta tag in OnInvokeEnd?

Contributor Author

I've removed the meta tag here; instead I'm passing the upper 64 in the response that goes back to the tracer.
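For context, a rough, hypothetical sketch of what "passing the upper 64 in the response" can look like on the wire, following the Datadog header convention quoted in the dd-trace-dotnet commits further down (lower 64 bits in x-datadog-trace-id, upper 64 bits as the _dd.p.tid tag inside x-datadog-tags). The helper name and exact header set here are illustrative, not the extension's actual /lambda/start-invocation handler.

package main

import (
	"fmt"
	"net/http"
	"strconv"
)

// writeStartInvocationHeaders is a hypothetical helper showing the header
// convention for handing a 128-bit trace context back to the tracer.
func writeStartInvocationHeaders(h http.Header, traceIDLower uint64, traceIDUpper64Hex string, samplingPriority int) {
	h.Set("x-datadog-trace-id", strconv.FormatUint(traceIDLower, 10))
	h.Set("x-datadog-sampling-priority", strconv.Itoa(samplingPriority))
	if traceIDUpper64Hex != "" {
		// The upper 64 bits travel as the _dd.p.tid tag inside x-datadog-tags.
		h.Set("x-datadog-tags", "_dd.p.tid="+traceIDUpper64Hex)
	}
}

func main() {
	h := http.Header{}
	// Placeholder values for illustration.
	writeStartInvocationHeaders(h, 1234567890, "1914fe7789eb32be", 1)
	fmt.Println(h)
}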

tc, err := createTraceContextFromStepFunctionInput(ev)
if err == nil {
return tc, nil
}
Contributor

When I first looked at this, I thought there was a better way to do it. After closer inspection, there is another way to do it, but I'm not so sure it's better.

My biggest concern is consistency. We want to defer as much work to dd-trace-go as possible. That's what all these carriers are for -- so we can pass the set of found headers to the dd-trace-go propagator for extraction.

I threw together what this could look like, here 8ed2a33. While this version makes the code more consistent, I'm not convinced this is any better, and in fact I think it is worse because it means creating the headers map, extracting from that map, just to later recreate it. Dumb.

So, in the end, this PR is good. I'm sad we can't be more consistent about how we are extracting trace context from step functions. Ultimately though, they are different enough to get their own way of doing things.

}
sfe := events.StepFunctionEvent{Payload: eventPayload}
ev = eventPayload
lp.initFromStepFunctionEvent(sfe)
Contributor

Nice. I like this. This way, the event passed out of this method for both of these step function types will be the same.

@@ -63,6 +64,9 @@ func (lp *LifecycleProcessor) startExecutionSpan(event interface{}, rawPayload [
inferredSpan.Span.TraceID = traceContext.TraceID
inferredSpan.Span.ParentID = traceContext.ParentID
}
if traceContext.TraceIDUpper64Hex != "" {
executionContext.TraceIDUpper64Hex = traceContext.TraceIDUpper64Hex
Contributor

nit.

I think you do not need the conditional here. You can just always set the upper 64 from the context.

On each invocation, the execution context is reset.

When execution reaches this point, executionContext.TraceIDUpper64Hex will therefore always be "".

Therefore, there is no need to avoid setting executionContext.TraceIDUpper64Hex to "".

Contributor Author

What's more expensive, if traceContext.TraceIDUpper64Hex != "" or executionContext.TraceIDUpper64Hex = traceContext.TraceIDUpper64Hex?

Contributor

Not sure actually. But I just realized that we shouldn't set the Upper64BitsTag tag if the upper 64 is empty. So we should leave the conditional, at least for that portion.

Contributor

In fact 💡, instead of deleting the Upper64BitsTag key at end invocation, you could do it here. That would keep all logic that mutates the requestHandler's meta tags in a single location.

Contributor Author

hm, clever

@purple4reina (Contributor)

Okay, correct me if I am wrong, but I think there are a couple of things that we've missed. The extension creates several spans of its own (the aws.lambda span and the inferred spans). These must also get the upper 64 bits.

@agocs agocs commented Sep 16, 2024

> Okay, correct me if I am wrong, but I think there are a couple of things that we've missed. The extension creates several spans of its own (the aws.lambda span and the inferred spans). These must also get the upper 64 bits.

Do you think we want to set meta._dd.p.tid in the extension in onInvokeStart and unset it in onInvokeEnd?

@agocs agocs commented Sep 18, 2024

@purple4reina I went ahead and added the upper 64 bits to the meta tags, and then removed them after endExecutionSpan.

log.Debugf("Failed to unmarshal %s event: %s", stepFunction, err)
break
}
sfe := events.StepFunctionEvent{Payload: eventPayload}
Contributor

If you set the event to the full event and not just event.Payload above, you wouldn't need to do this here.

Contributor Author

Because Legacy and non-Legacy step function lambda invocations unmarshal into different types, we have to do a little awkward juggling regardless.

Contributor

Ahhh, I think you mentioned this before. Can you point me again to what those two JSON blobs look like? I'm now wondering if there's a way we could unify them into a single event type, even if they look a bit different from one another.

Contributor Author

Legacy:

{
  "Payload": {
    "Execution": {
      "Id": "arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44",
      "Input": {},
      "StartTime": "2024-07-30T19:55:52.976Z",
      "Name": "bc9f281c-3daa-4e5a-9a60-471a3810bf44",
      "RoleArn": "arn:aws:iam::425362996713:role/test-serverless-stepfunctions-dev-AgocsTestSFRole-tRkeFXScjyk4",
      "RedriveCount": 0
    },
    "StateMachine": {
      "Id": "arn:aws:states:us-east-1:425362996713:stateMachine:agocsTestSF",
      "Name": "agocsTestSF"
    },
    "State": {
      "Name": "agocsTest1",
      "EnteredTime": "2024-07-30T19:55:53.018Z",
      "RetryCount": 0
    }
  }
}

Non-legacy

{
  "Execution": {
    "Id": "arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44",
    "Input": {},
    "StartTime": "2024-07-30T19:55:52.976Z",
    "Name": "bc9f281c-3daa-4e5a-9a60-471a3810bf44",
    "RoleArn": "arn:aws:iam::425362996713:role/test-serverless-stepfunctions-dev-AgocsTestSFRole-tRkeFXScjyk4",
    "RedriveCount": 0
  },
  "StateMachine": {
    "Id": "arn:aws:states:us-east-1:425362996713:stateMachine:agocsTestSF",
    "Name": "agocsTestSF"
  },
  "State": {
    "Name": "agocsTest1",
    "EnteredTime": "2024-07-30T19:55:53.018Z",
    "RetryCount": 0
  }
}
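To make the difference concrete, here is a small, self-contained sketch of how both shapes could be unmarshalled into one payload type. The struct definitions are illustrative stand-ins (the real, trimmed-down types live in pkg/serverless/trigger/events/events.go); only the keys used for trace derivation are declared, which is also what keeps allocations down since encoding/json ignores the rest.

package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types: field names mirror the JSON above, but the real
// definitions keep only the keys the extension actually uses.
type stepFunctionPayload struct {
	Execution struct {
		ID string `json:"Id"`
	} `json:"Execution"`
	State struct {
		Name        string `json:"Name"`
		EnteredTime string `json:"EnteredTime"`
	} `json:"State"`
}

type legacyStepFunctionEvent struct {
	Payload stepFunctionPayload `json:"Payload"`
}

// parseStepFunctionEvent tries the non-legacy shape first and falls back to
// the legacy (Payload-wrapped) shape, returning one unified payload.
func parseStepFunctionEvent(raw []byte) (stepFunctionPayload, error) {
	var direct stepFunctionPayload
	if err := json.Unmarshal(raw, &direct); err == nil && direct.Execution.ID != "" {
		return direct, nil
	}
	var legacy legacyStepFunctionEvent
	if err := json.Unmarshal(raw, &legacy); err != nil {
		return stepFunctionPayload{}, err
	}
	return legacy.Payload, nil
}

func main() {
	nonLegacy := []byte(`{"Execution":{"Id":"arn:aws:states:us-east-1:425362996713:execution:agocsTestSF:bc9f281c-3daa-4e5a-9a60-471a3810bf44"},"State":{"Name":"agocsTest1","EnteredTime":"2024-07-30T19:55:53.018Z"}}`)
	legacy := []byte(`{"Payload":` + string(nonLegacy) + `}`)

	for _, raw := range [][]byte{nonLegacy, legacy} {
		p, err := parseStepFunctionEvent(raw)
		fmt.Println(p.Execution.ID, p.State.Name, p.State.EnteredTime, err)
	}
}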

Contributor

Serverless Benchmark Results

BenchmarkStartEndInvocation comparison between 0df804c and fdbc451.

tl;dr

Use these benchmarks as an insight tool during development.

  1. Skim down the vs base column in each chart. If there is a ~, then there was no statistically significant change to the benchmark. Otherwise, ensure the estimated percent change is either negative or very small.

  2. The last row of each chart is the geomean. Ensure this percentage is either negative or very small.

What is this benchmarking?

The BenchmarkStartEndInvocation compares the amount of time it takes to call the start-invocation and end-invocation endpoints. For universal instrumentation languages (Dotnet, Golang, Java, Ruby), this represents the majority of the duration overhead added by our tracing layer.

The benchmark is run using a large variety of lambda request payloads. In the charts below, there is one row for each event payload type.

How do I interpret these charts?

The charts below come from benchstat. They represent the statistical change in duration (sec/op), memory overhead (B/op), and allocations (allocs/op).

The benchstat docs explain how to interpret these charts.

Before the comparison table, we see common file-level configuration. If there are benchmarks with different configuration (for example, from different packages), benchstat will print separate tables for each configuration.

The table then compares the two input files for each benchmark. It shows the median and 95% confidence interval summaries for each benchmark before and after the change, and an A/B comparison under "vs base". ... The p-value measures how likely it is that any differences were due to random chance (i.e., noise). The "~" means benchstat did not detect a statistically significant difference between the two inputs. ...

Note that "statistically significant" is not the same as "large": with enough low-noise data, even very small changes can be distinguished from noise and considered statistically significant. It is, of course, generally easier to distinguish large changes from noise.

Finally, the last row of the table shows the geometric mean of each column, giving an overall picture of how the benchmarks changed. Proportional changes in the geomean reflect proportional changes in the benchmarks. For example, given n benchmarks, if sec/op for one of them increases by a factor of 2, then the sec/op geomean will increase by a factor of ⁿ√2.

I need more help

First off, do not worry if the benchmarks are failing. They are not tests. The intention is for them to be a tool for you to use during development.

If you would like a hand interpreting the results come chat with us in #serverless-agent in the internal DataDog slack or in #serverless in the public DataDog slack. We're happy to help!

Benchmark stats
goos: linux
goarch: amd64
pkg: github.com/DataDog/datadog-agent/pkg/serverless/daemon
cpu: AMD EPYC 7763 64-Core Processor                
                                      │ baseline/benchmark.log │        current/benchmark.log        │
                                      │         sec/op         │    sec/op     vs base               │
api-gateway-appsec.json                           84.50µ ± 15%    88.52µ ± 2%       ~ (p=0.143 n=10)
api-gateway-kong-appsec.json                      67.07µ ±  2%    71.43µ ± 4%  +6.50% (p=0.000 n=10)
api-gateway-kong.json                             65.19µ ±  1%    70.15µ ± 4%  +7.61% (p=0.000 n=10)
api-gateway-non-proxy-async.json                  104.9µ ±  2%    106.9µ ± 4%  +1.89% (p=0.023 n=10)
api-gateway-non-proxy.json                        106.0µ ±  1%    109.0µ ± 2%  +2.82% (p=0.000 n=10)
api-gateway-websocket-connect.json                70.23µ ±  2%    73.41µ ± 1%  +4.53% (p=0.000 n=10)
api-gateway-websocket-default.json                64.15µ ±  1%    65.90µ ± 1%  +2.72% (p=0.000 n=10)
api-gateway-websocket-disconnect.json             64.32µ ±  1%    65.41µ ± 2%  +1.68% (p=0.004 n=10)
api-gateway.json                                  117.9µ ±  1%    118.0µ ± 1%       ~ (p=0.912 n=10)
application-load-balancer.json                    64.10µ ±  1%    65.49µ ± 1%  +2.17% (p=0.002 n=10)
cloudfront.json                                   47.89µ ±  3%    50.27µ ± 2%  +4.97% (p=0.000 n=10)
cloudwatch-events.json                            38.81µ ±  2%    39.62µ ± 2%  +2.08% (p=0.001 n=10)
cloudwatch-logs.json                              66.63µ ±  1%    68.60µ ± 2%  +2.95% (p=0.002 n=10)
custom.json                                       31.49µ ±  2%    32.72µ ± 2%  +3.93% (p=0.000 n=10)
dynamodb.json                                     93.86µ ±  1%    97.45µ ± 2%  +3.83% (p=0.000 n=10)
empty.json                                        29.16µ ±  2%    31.12µ ± 4%  +6.71% (p=0.000 n=10)
eventbridge-custom.json                           42.30µ ±  2%    44.03µ ± 3%  +4.08% (p=0.001 n=10)
http-api.json                                     72.73µ ±  1%    74.62µ ± 2%  +2.60% (p=0.000 n=10)
kinesis-batch.json                                71.26µ ±  2%    72.74µ ± 1%  +2.07% (p=0.000 n=10)
kinesis.json                                      54.16µ ±  1%    56.33µ ± 2%  +3.99% (p=0.000 n=10)
s3.json                                           59.69µ ±  3%    61.42µ ± 2%  +2.88% (p=0.002 n=10)
sns-batch.json                                    89.64µ ±  1%    92.60µ ± 2%  +3.31% (p=0.000 n=10)
sns.json                                          64.56µ ±  2%    67.45µ ± 1%  +4.47% (p=0.000 n=10)
snssqs.json                                       109.8µ ±  1%    113.7µ ± 3%  +3.55% (p=0.000 n=10)
snssqs_no_dd_context.json                         99.00µ ±  2%   104.00µ ± 3%  +5.04% (p=0.002 n=10)
sqs-aws-header.json                               55.80µ ±  2%    57.97µ ± 2%  +3.90% (p=0.000 n=10)
sqs-batch.json                                    95.36µ ±  2%    97.64µ ± 2%  +2.39% (p=0.003 n=10)
sqs.json                                          69.30µ ±  2%    71.20µ ± 2%  +2.74% (p=0.002 n=10)
sqs_no_dd_context.json                            62.31µ ±  3%    63.44µ ± 1%  +1.81% (p=0.043 n=10)
stepfunction.json                                                 49.71µ ± 5%
geomean                                           67.20µ          68.78µ       +3.51%

                                      │ baseline/benchmark.log │        current/benchmark.log        │
                                      │          B/op          │     B/op      vs base               │
api-gateway-appsec.json                           37.26Ki ± 0%   37.27Ki ± 0%  +0.05% (p=0.001 n=10)
api-gateway-kong-appsec.json                      26.92Ki ± 0%   26.93Ki ± 0%  +0.07% (p=0.000 n=10)
api-gateway-kong.json                             24.41Ki ± 0%   24.44Ki ± 0%  +0.09% (p=0.000 n=10)
api-gateway-non-proxy-async.json                  48.03Ki ± 0%   48.08Ki ± 0%  +0.10% (p=0.007 n=10)
api-gateway-non-proxy.json                        47.26Ki ± 0%   47.31Ki ± 0%  +0.09% (p=0.000 n=10)
api-gateway-websocket-connect.json                25.47Ki ± 0%   25.50Ki ± 0%  +0.12% (p=0.000 n=10)
api-gateway-websocket-default.json                21.38Ki ± 0%   21.42Ki ± 0%  +0.19% (p=0.000 n=10)
api-gateway-websocket-disconnect.json             21.16Ki ± 0%   21.20Ki ± 0%  +0.21% (p=0.000 n=10)
api-gateway.json                                  49.56Ki ± 0%   49.60Ki ± 0%  +0.08% (p=0.000 n=10)
application-load-balancer.json                    22.35Ki ± 0%   22.38Ki ± 0%  +0.16% (p=0.000 n=10)
cloudfront.json                                   17.65Ki ± 0%   17.68Ki ± 0%  +0.13% (p=0.000 n=10)
cloudwatch-events.json                            11.69Ki ± 0%   11.72Ki ± 0%  +0.22% (p=0.001 n=10)
cloudwatch-logs.json                              53.37Ki ± 0%   53.39Ki ± 0%  +0.03% (p=0.022 n=10)
custom.json                                       9.736Ki ± 0%   9.760Ki ± 0%  +0.24% (p=0.005 n=10)
dynamodb.json                                     40.78Ki ± 0%   40.80Ki ± 0%       ~ (p=0.593 n=10)
empty.json                                        9.304Ki ± 0%   9.333Ki ± 0%  +0.31% (p=0.022 n=10)
eventbridge-custom.json                           13.44Ki ± 0%   13.44Ki ± 0%       ~ (p=0.255 n=10)
http-api.json                                     23.73Ki ± 0%   23.76Ki ± 0%  +0.12% (p=0.001 n=10)
kinesis-batch.json                                27.01Ki ± 0%   27.05Ki ± 0%  +0.13% (p=0.025 n=10)
kinesis.json                                      17.81Ki ± 0%   17.83Ki ± 0%  +0.08% (p=0.027 n=10)
s3.json                                           20.36Ki ± 0%   20.35Ki ± 0%       ~ (p=0.753 n=10)
sns-batch.json                                    38.66Ki ± 0%   38.74Ki ± 0%  +0.21% (p=0.001 n=10)
sns.json                                          24.01Ki ± 0%   24.08Ki ± 0%  +0.30% (p=0.005 n=10)
snssqs.json                                       50.55Ki ± 0%   50.60Ki ± 0%       ~ (p=0.105 n=10)
snssqs_no_dd_context.json                         44.81Ki ± 0%   44.87Ki ± 0%  +0.12% (p=0.008 n=10)
sqs-aws-header.json                               18.84Ki ± 0%   18.94Ki ± 1%  +0.49% (p=0.019 n=10)
sqs-batch.json                                    41.65Ki ± 0%   41.76Ki ± 0%  +0.27% (p=0.000 n=10)
sqs.json                                          25.63Ki ± 0%   25.64Ki ± 0%       ~ (p=0.671 n=10)
sqs_no_dd_context.json                            20.66Ki ± 0%   20.69Ki ± 1%       ~ (p=0.631 n=10)
stepfunction.json                                                14.17Ki ± 1%
geomean                                           25.72Ki        25.25Ki       +0.14%

                                      │ baseline/benchmark.log │        current/benchmark.log        │
                                      │       allocs/op        │ allocs/op   vs base                 │
api-gateway-appsec.json                             630.0 ± 0%   629.5 ± 0%       ~ (p=1.000 n=10)
api-gateway-kong-appsec.json                        488.0 ± 0%   488.0 ± 0%       ~ (p=1.000 n=10) ¹
api-gateway-kong.json                               466.0 ± 0%   466.0 ± 0%       ~ (p=1.000 n=10) ¹
api-gateway-non-proxy-async.json                    726.0 ± 0%   726.0 ± 0%       ~ (p=1.000 n=10)
api-gateway-non-proxy.json                          716.0 ± 0%   716.0 ± 0%       ~ (p=0.582 n=10)
api-gateway-websocket-connect.json                  453.0 ± 0%   453.0 ± 0%       ~ (p=1.000 n=10)
api-gateway-websocket-default.json                  379.0 ± 0%   379.0 ± 0%       ~ (p=1.000 n=10)
api-gateway-websocket-disconnect.json               370.0 ± 0%   370.0 ± 0%       ~ (p=1.000 n=10)
api-gateway.json                                    791.0 ± 0%   791.0 ± 0%       ~ (p=1.000 n=10)
application-load-balancer.json                      352.0 ± 0%   352.0 ± 0%       ~ (p=1.000 n=10) ¹
cloudfront.json                                     284.0 ± 0%   284.0 ± 0%       ~ (p=1.000 n=10) ¹
cloudwatch-events.json                              220.0 ± 0%   220.5 ± 0%       ~ (p=0.650 n=10)
cloudwatch-logs.json                                216.0 ± 0%   216.0 ± 0%       ~ (p=0.628 n=10)
custom.json                                         169.0 ± 1%   169.0 ± 1%       ~ (p=0.628 n=10)
dynamodb.json                                       589.0 ± 0%   589.0 ± 0%       ~ (p=0.697 n=10)
empty.json                                          160.0 ± 0%   160.0 ± 1%       ~ (p=0.211 n=10)
eventbridge-custom.json                             254.0 ± 0%   254.0 ± 0%       ~ (p=1.000 n=10)
http-api.json                                       433.0 ± 0%   432.0 ± 0%       ~ (p=0.370 n=10)
kinesis-batch.json                                  390.5 ± 0%   391.0 ± 0%       ~ (p=0.650 n=10)
kinesis.json                                        285.0 ± 0%   285.0 ± 0%       ~ (p=1.000 n=10)
s3.json                                             358.0 ± 0%   357.5 ± 0%       ~ (p=0.253 n=10)
sns-batch.json                                      454.5 ± 0%   455.0 ± 0%       ~ (p=0.059 n=10)
sns.json                                            323.5 ± 0%   324.0 ± 0%       ~ (p=0.068 n=10)
snssqs.json                                         438.0 ± 0%   438.0 ± 0%       ~ (p=0.810 n=10)
snssqs_no_dd_context.json                           400.0 ± 0%   400.0 ± 0%       ~ (p=0.230 n=10)
sqs-aws-header.json                                 274.0 ± 0%   275.0 ± 1%  +0.36% (p=0.047 n=10)
sqs-batch.json                                      504.0 ± 0%   505.0 ± 0%  +0.20% (p=0.012 n=10)
sqs.json                                            352.0 ± 0%   352.0 ± 1%       ~ (p=0.310 n=10)
sqs_no_dd_context.json                              324.0 ± 1%   325.0 ± 1%       ~ (p=0.806 n=10)
stepfunction.json                                                236.0 ± 1%
geomean                                             376.6        370.9       +0.04%
¹ all samples are equal

executionContext.TraceIDUpper64Hex = traceContext.TraceIDUpper64Hex
lp.requestHandler.SetMetaTag(Upper64BitsTag, traceContext.TraceIDUpper64Hex)
} else {
delete(lp.requestHandler.triggerTags, Upper64BitsTag)
Contributor

cool cool, and I just confirmed that attempting to delete a key that does not exist does not panic. Nor does it panic if the map is actually nil. 👍🏽
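For anyone curious, a tiny self-contained check of that property of Go's built-in delete (the map key and value here are placeholders):

package main

import "fmt"

func main() {
	m := map[string]string{"_dd.p.tid": "example-upper-64"}
	delete(m, "not-present") // deleting a missing key is a no-op

	var nilMap map[string]string
	delete(nilMap, "_dd.p.tid") // deleting from a nil map is also a no-op, no panic

	fmt.Println(m, nilMap == nil)
}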

Contributor Author

Oh good lookin' out!

@agocs agocs commented Sep 19, 2024

/merge

@dd-devflow dd-devflow bot commented Sep 19, 2024

🚂 MergeQueue: waiting for PR to be ready

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

Use /merge -c to cancel this operation!

@dd-devflow dd-devflow bot commented Sep 19, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 22m.

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit a4534bc into main Sep 19, 2024
232 checks passed
@dd-mergequeue dd-mergequeue bot deleted the chris.agocs/create_trace_context_from_stepfunction_input branch September 19, 2024 20:03
@github-actions github-actions bot added this to the 7.59.0 milestone Sep 19, 2024
bouwkast added a commit to DataDog/dd-trace-dotnet that referenced this pull request Oct 22, 2024
## Summary of changes

I've updated the Lambda extension so it is capable of returning a 128
bit trace ID when a tracer calls the `/lambda/start-invocation` endpoint
[in this PR](DataDog/datadog-agent#27988)

As per [the
RFC](https://datadoghq.atlassian.net/wiki/spaces/RUMP/pages/3545630931/RFC+Support+128+bit+trace+IDs+in+RUM+SDKs#:~:text=For%20Datadog%20headers%2C%20the%20128%20bit%20trace%20id%20is%20sent%20in%20two%20parts%2C%20lower%2064%20bits%20in%20x%2Ddatadog%2Dtrace%2Did%20(decimal)%20and%20the%20higher%2064%20bits%20in%20x%2Ddatadog%2Dtags%20header%20under%20_dd.p.tid%20(hex)%20tag),
the

> lower 64 bits in `x-datadog-trace-id` (decimal) and the higher 64 bits
in `x-datadog-tags` header under `_dd.p.tid` (hex) tag.

This change modifies the function that calls `/lambda/start-invocation`,
allowing it to pick out the upper 64 bits of the trace ID and set the
resulting 128-bit trace ID in the extracted context.


## Reason for change

The Lambda Extension may now return a 128 bit trace ID when a Step
Function invokes a Lambda Function.

## Implementation details

I rewrote LambdaCommon's `CreatePlaceholderScope` so it uses
`SpanContextPropagator.Instance.Extract` rather than extracting trace
context elements one by one.

## Test coverage

Added a unit test for 128 bit trace IDs. Fixed existing unit tests so
they pass a dictionary of headers to CreatePlaceholderScope. Removed a
unit test that only passes SamplingPriority, since a distributed trace
with only a sampling priority is hardly a distributed trace at all.

## Other details
Backported to 2.x in (TODO)

---------

Co-authored-by: Lucas Pimentel <[email protected]>
Co-authored-by: Andrew Lock <[email protected]>
Co-authored-by: Daniel Romano <[email protected]>
Co-authored-by: Steven Bouwkamp <[email protected]>
Co-authored-by: Anna <[email protected]>
Co-authored-by: NachoEchevarria <[email protected]>
Co-authored-by: William Conti <[email protected]>
Co-authored-by: Kevin Gosse <[email protected]>
Co-authored-by: Tony Redondo <[email protected]>
Co-authored-by: Gregory LEOCADIE <[email protected]>
agocs added a commit to DataDog/dd-trace-dotnet that referenced this pull request Oct 22, 2024
(Same commit message as the commit above.)
bouwkast added a commit to DataDog/dd-trace-dotnet that referenced this pull request Oct 22, 2024
…6181)

(Backport to 2.x of #6041; same commit message as the commits above.)