[processor/spanmetrics] Clamp negative duration times to 0 bucket. #9891
Conversation
What is the implication of clamping the data point to the 0 bucket? Does it significantly skew the data (is this issue happening a lot)? Would it be better to drop the data point?
It doesn't happen a lot, but when it does it crashes the collector.
@@ -381,7 +381,13 @@ func (p *processorImp) aggregateMetricsForServiceSpans(rspans ptrace.ResourceSpa
 }

 func (p *processorImp) aggregateMetricsForSpan(serviceName string, span ptrace.Span, resourceAttr pcommon.Map) {
-	latencyInMilliseconds := float64(span.EndTimestamp()-span.StartTimestamp()) / float64(time.Millisecond.Nanoseconds())
+	// Protect against end timestamps before start timestamps. Assume 0 duration.
+	latencyInMilliseconds := float64(0)
It seems like doing a math.Max with 0 and the calculated value would be more straightforward & easier to understand for future maintainers.
@dehaansa Part of the issue is that the timestamps are uint64, so the result of subtraction is always going to be a positive integer. math.Max will not solve the issue in that case.
Makes sense!
Part of the issue is that the timestamps are uint64, so the result of subtraction is always going to be a positive integer. math.Max will not solve the issue in that case.

This doesn't make sense. This calculation:

latencyInMilliseconds := float64(span.EndTimestamp()-span.StartTimestamp()) / float64(time.Millisecond.Nanoseconds())

will result in a float64 that is >0 if the end time is after the start time and <0 if it is before the start time. So, in that case, can't we follow up with math.Max(), since we're now in float64 land?

latencyInMilliseconds = math.Max(0, latencyInMilliseconds)
We could alternatively rewrite this:

max := func(a, b int) int {
	if a > b {
		return a
	}
	return b
}
intDelta := span.EndTimestamp() - span.StartTimestamp()
clampedDelta := max(0, intDelta)
latencyInMs := float64(clampedDelta) / float64(time.Millisecond)
// time.Millisecond is already an integer number of nanoseconds,
// same as clampedDelta
@Aneurysm9 Let me clarify: if the subtraction would produce a negative result, the uint64 arithmetic wraps around, giving a very large positive number instead of a negative one. By the time you attempt the clamp, the value is already positive.
You can see this with: https://go.dev/play/p/mHQ3_zjFzlq
This is why we need to validate that the end time is after the start time instead.
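For anyone skimming the thread, here is a minimal self-contained sketch of that wraparound behavior (not the exact playground code, just the same idea):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Simulated span timestamps in nanoseconds, with the end *before* the start.
	var start uint64 = 2_000_000_000
	var end uint64 = 1_000_000_000

	// Unsigned subtraction wraps around instead of going negative.
	delta := end - start
	fmt.Println(delta) // 18446744072709551616, not -1000000000

	// The resulting float64 latency is therefore also huge and positive,
	// so math.Max(0, latencyInMilliseconds) would never clamp anything.
	latencyInMilliseconds := float64(delta) / float64(time.Millisecond.Nanoseconds())
	fmt.Println(latencyInMilliseconds)
}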
@Aneurysm9 @dehaansa @bogdandrutu I would really like to get this in; it's been over a month and a half since I provided this solution. The current implementation crashes the collector any time problematic spans are provided.
Treating timestamps as uint64 is a mistake and can't be assumed to be honored by all languages. The spec anticipates that timestamps will be represented using language-preferred mechanisms, which in Go is time.Time, aka int64 nanoseconds from the UNIX epoch. The proto uses fixed64, which is presumably why pdata uses uint64, but even pdata assumes that values can be cast to int64 sanely.

If you use span.EndTimestamp().AsTime() you'll get a time.Time value which will work correctly.
I agree that treating the timestamps as a uint64 is a mistake in general; however, this is Go, and the current underlying implementation for this data is uint64. The calculation lives only in this specific scope, geared towards incrementing a specific bucket, and doesn't bleed out of this function.

Both methods produce the same effect. Converting to time.Time and using the Sub function has additional overhead that a plain subtraction and an if statement don't.

I'm not against doing the switch, it's pretty simple, but is it worth the additional overhead just so a throwaway calculation uses the "correct" data type?
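If the overhead question ever needs to be settled with numbers rather than intuition, a rough benchmark along these lines could compare the two (package name, setup, and shapes are assumptions, not code from this PR):

package spanmetricsprocessor // package name assumed for illustration

import (
	"testing"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

var sink float64 // prevents the compiler from optimizing the loop bodies away

// BenchmarkGuardedSubtraction measures the uint64 comparison + subtraction approach.
func BenchmarkGuardedSubtraction(b *testing.B) {
	start := pcommon.NewTimestampFromTime(time.Now())
	end := pcommon.NewTimestampFromTime(time.Now().Add(5 * time.Millisecond))
	for i := 0; i < b.N; i++ {
		latency := float64(0)
		if end > start {
			latency = float64(end-start) / float64(time.Millisecond.Nanoseconds())
		}
		sink = latency
	}
}

// BenchmarkAsTimeSub measures the time.Time / Sub approach.
func BenchmarkAsTimeSub(b *testing.B) {
	start := pcommon.NewTimestampFromTime(time.Now())
	end := pcommon.NewTimestampFromTime(time.Now().Add(5 * time.Millisecond))
	for i := 0; i < b.N; i++ {
		d := end.AsTime().Sub(start.AsTime())
		if d < 0 {
			d = 0
		}
		sink = float64(d) / float64(time.Millisecond)
	}
}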
pinging @Aneurysm9
If I'm understanding things correctly:
- This PR resolves a user-facing correctness problem.
- Whether or not pdata.Timestamp should be uint64 is a question of code quality / quality of life, and it applies to a larger context than this PR.

If so, I suggest this PR should be merged now, and a separate proposal can be made to discuss the uint64 question.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This is not stale.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
As a note, it looks like accepting negative-duration spans is semi-intentional upstream: they warn about it but don't discard the span. See https://github.com/open-telemetry/opentelemetry-js/blob/6eca6d4e4c3cf63a2b80ab0b95e4292f916d0437/packages/opentelemetry-sdk-trace-base/src/Span.ts#L183
@crobertson-conga please add a new changelog entry following these new instructions.
Pinging @albertteoh as code owner.
@Aneurysm9 @TylerHelmuth @albertteoh I would like to get this merged. Please let me know if the change to using time.Time is insisted on, given the other considerations around additional overhead.
LGTM
@Aneurysm9 @TylerHelmuth still looking to get this merged.
if endTime > startTime {
	latencyInMilliseconds = float64(endTime-startTime) / float64(time.Millisecond.Nanoseconds())
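Read on its own, the guarded calculation amounts to the following sketch (assuming span is the ptrace.Span being aggregated and the time package is imported; not the exact merged code):

// startTime and endTime are the span's raw pcommon.Timestamp values (uint64 nanoseconds).
startTime := span.StartTimestamp()
endTime := span.EndTimestamp()

// Default to 0 so a malformed (end-before-start) span falls into the lowest latency
// bucket instead of wrapping around to an enormous unsigned value.
latencyInMilliseconds := float64(0)
if endTime > startTime {
	latencyInMilliseconds = float64(endTime-startTime) / float64(time.Millisecond.Nanoseconds())
}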
Is the correct behavior to treat a malformed span (end time before start time) as a latency of 0? I think we should actually skip these spans.
I'm conflicted here. I think the count of those spans is useful. Ideally this gets fixed upstream of it happening though.
🎉 🎉
Description:
Modifies the spanmetrics processor to handle negative durations by clamping them to a 0 duration. The original ticket described the issue as being caused by a huge time delay, but the duration would have to be over 290 years (roughly the maximum int64 nanosecond duration), which doesn't make sense. The bug is consistently reproduced when the start time is after the end time. While ideally this would be fixed at the source, this change adds protection in case the problem is missed upstream. See open-telemetry/opentelemetry-js-contrib#1013 for the related upstream issue.
Link to tracking Issue: #7250
Testing: Added a new span case where the start and end timestamps were swapped (a rough sketch of that kind of test follows below).
Documentation: None.
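A rough sketch of the swapped-timestamp test described above (package and helper names are assumptions; the PR's actual test table differs):

package spanmetricsprocessor // package name assumed for illustration

import (
	"testing"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

func TestNegativeDurationSpanIsClampedToZero(t *testing.T) {
	span := ptrace.NewSpan()
	now := time.Now()
	// Swap start and end so the span has a "negative" duration.
	span.SetStartTimestamp(pcommon.NewTimestampFromTime(now))
	span.SetEndTimestamp(pcommon.NewTimestampFromTime(now.Add(-10 * time.Millisecond)))

	// latencyMillis is the illustrative helper sketched earlier in the thread;
	// the real test exercises the processor's aggregation path instead.
	if got := latencyMillis(span); got != 0 {
		t.Fatalf("expected clamped latency of 0, got %v", got)
	}
}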