[processor/spanmetrics] Clamp negative duration times to 0 bucket. #9891
Conversation
What is the implication of clamping the data point to the 0 bucket? Does it significantly skew the data (is this issue happening a lot)? Would it be better to drop the data point?
It doesn't happen a lot, but when it does it crashes the collector.
@@ -381,7 +381,13 @@ func (p *processorImp) aggregateMetricsForServiceSpans(rspans ptrace.ResourceSpa
 }

 func (p *processorImp) aggregateMetricsForSpan(serviceName string, span ptrace.Span, resourceAttr pcommon.Map) {
-	latencyInMilliseconds := float64(span.EndTimestamp()-span.StartTimestamp()) / float64(time.Millisecond.Nanoseconds())
+	// Protect against end timestamps before start timestamps. Assume 0 duration.
+	latencyInMilliseconds := float64(0)
It seems like doing a math.Max with 0 and the calculated value would be more straightforward & easier to understand for future maintainers.
@dehaansa Part of the issue is that the timestamps are uint64, so the result of subtraction is always going to be a positive integer. math.Max will not solve the issue in that case.
Makes sense!
Part of the issue is that the timestamps are uint64, so the result of subtraction is always going to be a positive integer. math.Max will not solve the issue in that case.

This doesn't make sense. This calculation:

latencyInMilliseconds := float64(span.EndTimestamp()-span.StartTimestamp()) / float64(time.Millisecond.Nanoseconds())

will result in a float64 that is >0 if the end time is after the start time and <0 if it is before the start time. So, in that case, can't we follow up with math.Max(), since we're now in float64 land?

latencyInMilliseconds = math.Max(0, latencyInMilliseconds)
We could alternatively rewrite this:

max := func(a, b int) int {
	if a > b {
		return a
	}
	return b
}
intDelta := span.EndTimestamp() - span.StartTimestamp()
clampedDelta := max(0, intDelta)
latencyInMs := float64(clampedDelta) / float64(time.Millisecond)
// time.Millisecond is already an integer number of nanoseconds,
// same as clampedDelta
@Aneurysm9 Let me clarify: if the subtraction would produce a negative result, the uint64 arithmetic wraps around, giving a very large positive number instead of a negative one. By the time you attempt the clamp, the value is already positive.
You can see this with: https://go.dev/play/p/mHQ3_zjFzlq
This is why we need to validate that the end time is after the start time instead.
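For anyone skimming the thread, here is a minimal self-contained sketch of that wraparound behavior (not the exact playground code, just the same idea):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Simulated span timestamps in nanoseconds, with the end *before* the start.
	var start uint64 = 2_000_000_000
	var end uint64 = 1_000_000_000

	// Unsigned subtraction wraps around instead of going negative.
	delta := end - start
	fmt.Println(delta) // 18446744072709551616, not -1000000000

	// The resulting float64 latency is therefore also huge and positive,
	// so math.Max(0, latencyInMilliseconds) would never clamp anything.
	latencyInMilliseconds := float64(delta) / float64(time.Millisecond.Nanoseconds())
	fmt.Println(latencyInMilliseconds)
}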
@Aneurysm9 @dehaansa @bogdandrutu I would really like to get this in; it's been over a month and a half since I provided this solution. The current implementation crashes the collector any time problematic spans are provided.
Treating timestamps as uint64 is a mistake and can't be assumed to be honored by all languages. The spec anticipates that timestamps will be represented using language-preferred mechanisms, which in Go is time.Time, aka int64 nanoseconds from the UNIX epoch. The proto uses fixed64, which is presumably why pdata uses uint64, but even pdata assumes that values can be cast to int64 sanely.

If you use span.EndTimestamp().AsTime() you'll get a time.Time value which will work correctly.
I agree that treating the timestamps as a uint64 is a mistake in general; however, this is Go, and the current underlying implementation for this data is uint64. The calculation lives only in this specific scope, geared towards incrementing a specific bucket, and doesn't bleed out of this function.

Both methods produce the same effect. Converting to time.Time and using the Sub function has additional overhead that a plain subtraction and an if statement don't.

I'm not against doing the switch, it's pretty simple, but is it worth the additional overhead just so a throwaway calculation uses the "correct" data type?
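If the overhead question ever needs to be settled with numbers rather than intuition, a rough benchmark along these lines could compare the two (package name, setup, and shapes are assumptions, not code from this PR):

package spanmetricsprocessor // package name assumed for illustration

import (
	"testing"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

var sink float64 // prevents the compiler from optimizing the loop bodies away

// BenchmarkGuardedSubtraction measures the uint64 comparison + subtraction approach.
func BenchmarkGuardedSubtraction(b *testing.B) {
	start := pcommon.NewTimestampFromTime(time.Now())
	end := pcommon.NewTimestampFromTime(time.Now().Add(5 * time.Millisecond))
	for i := 0; i < b.N; i++ {
		latency := float64(0)
		if end > start {
			latency = float64(end-start) / float64(time.Millisecond.Nanoseconds())
		}
		sink = latency
	}
}

// BenchmarkAsTimeSub measures the time.Time / Sub approach.
func BenchmarkAsTimeSub(b *testing.B) {
	start := pcommon.NewTimestampFromTime(time.Now())
	end := pcommon.NewTimestampFromTime(time.Now().Add(5 * time.Millisecond))
	for i := 0; i < b.N; i++ {
		d := end.AsTime().Sub(start.AsTime())
		if d < 0 {
			d = 0
		}
		sink = float64(d) / float64(time.Millisecond)
	}
}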
pinging @Aneurysm9
If I'm understanding things correctly:
- This PR resolves a user-facing correctness problem.
- Whether or not pdata.Timestamp should be uint64 is a question of code quality / quality of life, and it applies to a larger context than this PR.

If so, I suggest this PR should be merged now, and a separate proposal can be made to discuss the uint64 question.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This is not stale.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
As a note, it looks like accepting negative-duration spans is semi-intentional upstream: they warn about it but don't discard the span. See https://github.com/open-telemetry/opentelemetry-js/blob/6eca6d4e4c3cf63a2b80ab0b95e4292f916d0437/packages/opentelemetry-sdk-trace-base/src/Span.ts#L183
@crobertson-conga please add a new changelog entry following these new instructions.
Pinging @albertteoh as code owner.
@Aneurysm9 @TylerHelmuth @albertteoh I would like to get this merged. Please let me know if the change to using time.Time is insisted on, given the other considerations around additional overhead.
LGTM
@Aneurysm9 @TylerHelmuth still looking to get this merged.
if endTime > startTime {
	latencyInMilliseconds = float64(endTime-startTime) / float64(time.Millisecond.Nanoseconds())
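Read on its own, the guarded calculation amounts to the following sketch (assuming span is the ptrace.Span being aggregated and the time package is imported; not the exact merged code):

// startTime and endTime are the span's raw pcommon.Timestamp values (uint64 nanoseconds).
startTime := span.StartTimestamp()
endTime := span.EndTimestamp()

// Default to 0 so a malformed (end-before-start) span falls into the lowest latency
// bucket instead of wrapping around to an enormous unsigned value.
latencyInMilliseconds := float64(0)
if endTime > startTime {
	latencyInMilliseconds = float64(endTime-startTime) / float64(time.Millisecond.Nanoseconds())
}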
Is the correct behavior to treat a malformed span (end time before start time) as a latency of 0? I think we should actually skip these spans.
I'm conflicted here. I think the count of those spans is useful. Ideally this gets fixed upstream of it happening though.
🎉 🎉
Description:
Modifies the spanmetrics processor to handle negative durations by clamping them to a 0 duration. The original ticket described the issue as being caused by a huge time delay, but the duration would have to be over 290 years (roughly the maximum int64 nanosecond duration), which doesn't make sense. The bug is consistently reproduced when the start time is after the end time. While ideally this would be fixed at the source, this change adds protection in case the problem is missed upstream. See open-telemetry/opentelemetry-js-contrib#1013 for the related upstream issue.
Link to tracking Issue: #7250
Testing: Added a new span case where the start and end timestamps were swapped (a rough sketch of that kind of test follows below).
Documentation: None.
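A rough sketch of the swapped-timestamp test described above (package and helper names are assumptions; the PR's actual test table differs):

package spanmetricsprocessor // package name assumed for illustration

import (
	"testing"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

func TestNegativeDurationSpanIsClampedToZero(t *testing.T) {
	span := ptrace.NewSpan()
	now := time.Now()
	// Swap start and end so the span has a "negative" duration.
	span.SetStartTimestamp(pcommon.NewTimestampFromTime(now))
	span.SetEndTimestamp(pcommon.NewTimestampFromTime(now.Add(-10 * time.Millisecond)))

	// latencyMillis is the illustrative helper sketched earlier in the thread;
	// the real test exercises the processor's aggregation path instead.
	if got := latencyMillis(span); got != 0 {
		t.Fatalf("expected clamped latency of 0, got %v", got)
	}
}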