Fix invalid timings in span events #4486

Merged

Conversation

Abinet18
Contributor

@Abinet18 Abinet18 commented Feb 15, 2024

Which problem is this PR solving?

Fix invalid timings in span events

Short description of the changes

Add a reference time to check whether the value reported for the event is sane (use the max of the event time and the reference time).

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Checklist:

  • Followed the style guidelines of this project
  • Unit tests have been added
  • Documentation has been updated

@Abinet18 Abinet18 requested a review from a team February 15, 2024 21:00

codecov bot commented Feb 15, 2024

Codecov Report

Merging #4486 (98c1496) into main (5231aa2) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4486      +/-   ##
==========================================
+ Coverage   92.80%   92.82%   +0.01%     
==========================================
  Files         328      328              
  Lines        9512     9519       +7     
  Branches     2047     2050       +3     
==========================================
+ Hits         8828     8836       +8     
+ Misses        684      683       -1     
Files Coverage Δ
packages/opentelemetry-sdk-trace-web/src/utils.ts 93.92% <100.00%> (+0.24%) ⬆️

... and 1 file with indirect coverage changes

@Abinet18
Contributor Author

@MSNev, @martinkuba, @dyladan, can you review this?

span.addEvent(performanceName, entries[performanceName]);
let perfTime = entries[performanceName];
const refName = refPerfName || PTN.FETCH_START;
// Use a reference time, which is the earliest possible value, so that performance timings that are earlier can be corrected to this reference time
Contributor

This might come down to opinion, but I think just dropping (ignoring) timings that are earlier than the expected value is better and matches the intention of 0 values (= value not available), whereas this converts them to a value that makes sense but is not real.

E.g. consider 2 easily known cases of a 0 value:

  • secureConnectionStart is 0 when loading an insecure (http) resource because you don't have a secure connection there
  • most of the timings when a cross-origin request is done and there's no Timing-Allow-Origin header

This would also make it easier to avoid buggy processing of data. E.g. if someone processes incoming data to get the connecting time per URL by connectEnd - connectStart, they would always get 0 ms to connect (which is a valid value when re-using an existing connection!) for cross-origin requests without a TAO header (as this would fix both values to equal fetchStart), while if the value is missing there'd be an easy way to know that you shouldn't calculate a value from this span.
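
To make that trade-off concrete, here is a minimal sketch of a downstream consumer computing connect time from span events. The `SpanEvents` type and the `connectDuration` helper are hypothetical and not part of OpenTelemetry, and the numbers are made up for illustration.

```typescript
// Hypothetical shape: event name -> timestamp in ms; a missing key means "not reported".
type SpanEvents = Partial<Record<'fetchStart' | 'connectStart' | 'connectEnd', number>>;

// Returns the connect duration, or undefined when either event is missing.
function connectDuration(events: SpanEvents): number | undefined {
  const { connectStart, connectEnd } = events;
  if (connectStart === undefined || connectEnd === undefined) {
    return undefined;
  }
  return connectEnd - connectStart;
}

// Cross-origin request without Timing-Allow-Origin: the browser reports 0 for both values.
const fetchStart = 120;

// Clamping each 0 value to fetchStart (max of value and reference) gives a
// plausible-looking but unreal 0 ms connect time.
const clamped: SpanEvents = {
  fetchStart,
  connectStart: Math.max(0, fetchStart),
  connectEnd: Math.max(0, fetchStart),
};

// Dropping the events keeps "unknown" distinguishable from "connection was reused".
const dropped: SpanEvents = { fetchStart };

console.log(connectDuration(clamped)); // 0
console.log(connectDuration(dropped)); // undefined
```

With clamping, the consumer cannot tell a reused connection from an unavailable timing; with dropping, the missing keys signal that no duration should be computed.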

Contributor Author

Yes, we can discuss which option to take: either not reporting the invalid timing, or correcting it as done here. Currently, all timings are reported whether they are valid or not.

Member

I would also lean toward conditionally adding the timing only if it's valid, and if it's not valid, don't add it. This helps prevent skewing of results.

Contributor Author

@t2t2 , @JamieDanielson I have made the changes to conditionally add the timings
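
For readers following along, the conditional approach can be sketched roughly as below. This is not the merged code: `SpanLike`, `Timings`, and `addNetworkEventIfValid` are hypothetical names, the default reference of `fetchStart` follows the `refPerfName || PTN.FETCH_START` line quoted earlier in this thread, and treating a missing reference time as "do not report" is an assumption.

```typescript
// Minimal stand-in for a span; only the method this sketch needs.
interface SpanLike {
  addEvent(name: string, time: number): void;
}

type Timings = Record<string, number | undefined>;

// Adds the event only when both the timing and the reference are numbers
// and the timing is not earlier than the reference. Returns whether it was added.
function addNetworkEventIfValid(
  span: SpanLike,
  name: string,
  entries: Timings,
  refName: string = 'fetchStart' // assumed default reference
): boolean {
  const perfTime = entries[name];
  const refTime = entries[refName];
  if (typeof perfTime === 'number' && typeof refTime === 'number' && perfTime >= refTime) {
    span.addEvent(name, perfTime);
    return true;
  }
  return false;
}

// A fake span that records the names of the events it receives.
const recorded: string[] = [];
const span: SpanLike = { addEvent: (name) => { recorded.push(name); } };

const entries: Timings = { fetchStart: 120, secureConnectionStart: 0, responseStart: 180 };
const keptSecure = addNetworkEventIfValid(span, 'secureConnectionStart', entries);
const keptResponse = addNetworkEventIfValid(span, 'responseStart', entries);
// keptSecure is false, keptResponse is true, recorded is ['responseStart'].
```

The 0-valued `secureConnectionStart` is earlier than `fetchStart`, so it is skipped instead of being rewritten to a made-up value.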

CHANGELOG.md Outdated Show resolved Hide resolved
Member

@JamieDanielson JamieDanielson left a comment

Thanks for fixing this!

@pichlermarc pichlermarc merged commit e01f493 into open-telemetry:main Apr 3, 2024
20 checks passed
trentm added a commit to trentm/opentelemetry-js-contrib that referenced this pull request Apr 24, 2024
The change in open-telemetry/opentelemetry-js#4486
means that an addSpanNetworkEvent() call in v1.24.0 and
later might drop the event if its time value is before the fetchStart time.
Typically this happens if the event time value is 0.
trentm added a commit to open-telemetry/opentelemetry-js-contrib that referenced this pull request Apr 25, 2024
…nts (#2145)

The change in open-telemetry/opentelemetry-js#4486
means that an addSpanNetworkEvent() call in v1.24.0 and
later might drop the event if its time value is before the fetchStart time.
Typically this happens if the event time value is 0.
Zirak pushed a commit to Zirak/opentelemetry-js that referenced this pull request Sep 14, 2024
* fix: use reference value to avoid invalid timings in span

* add comment, add to changelog

* donot report invalid timing in span

* lint fix

* fix failing test

* fix failing test 2

* Add tests

* add removed test

* suggested changes

---------

Co-authored-by: Marc Pichler <[email protected]>
chancancode added a commit to tildeio/opentelemetry-js that referenced this pull request Jan 11, 2025
Background:

1. For historical reasons, the perf/resource timing spec uses 0 as a
   special value to denote when some timing information is either
   not applicable (e.g. no redirect occurred) or not available (when
   producing an opaque resource timing object for CORS resources).

2. However, in some limited cases, 0 can also be a legitimate value
   for these timing events. Note that this is rare in real life –
   these fields are high-res performance timestamps relative to the
   performance time-origin, which is typically the navigation event
   for the initial page load.

   For a _resource_ timing to have a 0 timestamp, it would have to
   be initiated simultaneously with page load; it is unclear if this
   can actually happen IRL. Reportedly (open-telemetry#1769), at one point this
   was observed in some browsers during automated tests where things
   happen very fast and the browser artificially suppresses the timing
   resolution. It was unclear if the report was about the navigation
   timing entry or resource timing entries.

   It is also unclear if these utilities are intended for anything
   other than the internal fetch/XHR instrumentation, but they are
   public API, so if someone uses these functions on the initial
   page navigation event, then it is possible for the function to
   see legitimate 0-value inputs.

3. When creating span events, we do not use the timeOrigin-relative
   perf timestamps as-is. Rather, we convert them back to the UNIX
   epoch timestamps. When this conversion is applied to those 0
   timestamps, it creates nonsensical/misleading events that are
   quite difficult to distinguish for downstream consumers.
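
As a rough illustration of point 3, the sketch below converts a
time-origin-relative timestamp to an epoch timestamp by simple addition.
`toEpochMillis` is a hypothetical helper (the SDK's own conversion is not
shown here) and the values are made up.

```typescript
// Converts a time-origin-relative timestamp (ms) to a UNIX epoch timestamp (ms).
function toEpochMillis(timeOriginMs: number, relativeMs: number): number {
  return timeOriginMs + relativeMs;
}

const timeOrigin = 1700000000000; // hypothetical page-load time (epoch ms)
const fetchStart = 5000;          // request began 5 s after page load
const redirectStart = 0;          // "not applicable": no redirect happened

const fetchStartEpoch = toEpochMillis(timeOrigin, fetchStart);
const redirectStartEpoch = toEpochMillis(timeOrigin, redirectStart);

// After conversion the placeholder is indistinguishable from a real event
// at page load, 5 s before the fetch it supposedly belongs to.
console.log(redirectStartEpoch === timeOrigin);     // true
console.log(fetchStartEpoch - redirectStartEpoch);  // 5000
```

Once converted, nothing in the value itself marks it as a placeholder; a
consumer would need to know the time origin to recognize it.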

It would be nice if the W3C specs would have left the N/A values as
`undefined`, but that's not the world we live in and so we have to
work with what we've got.

History:

1. Initially, the code ignored 0-value timestamps.
2. open-telemetry#1769 identified cases of valid 0-value timestamps and removed
   the check.
3. This caused the other category of bugs where we created the
   nonsensical events (open-telemetry#2457, open-telemetry#3848, open-telemetry#4478).
4. open-telemetry#3897 added a special case for `secureConnectionStart` by way
   of tapping into auxiliary metadata.
5. That approach cannot be generalized for the other cases, so
   open-telemetry#4486 added some rather convoluted checks.
6. As part of refactoring the tests to use service workers, a new
   bug open-telemetry#5314 was identified.

Presumably, the patch in open-telemetry#4486 was written that way (as opposed to
just checking for 0) to avoid breaking open-telemetry#1769, but I suspect it ends
up breaking some of those use cases anyway.

Options:

1. Keep the patch from open-telemetry#4486 largely as-is but change the reference
   time from `fetchStart` to `startTime`.
2. Revert to the very original intent of the code and drop everything
   with 0-values.
3. This commit attempts a compromise position: when `startTime === 0`
   (which probably means we are called with the initial navigation
   event), 0-values are plausibly valid so we include them by default,
   but in all other cases (`startTime > 0`), legitimate 0-values
   should not be possible so we drop them by default.
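
A minimal sketch of the rule in option 3, under stated assumptions:
`shouldReportTiming` and its `ignoreZeros` parameter are hypothetical names
used only to illustrate "by default", and do not claim to match the actual
signature in this commit.

```typescript
// Decides whether a timing value should be reported.
// By default, 0-values are dropped unless the entry itself starts at 0
// (which probably means it is the initial navigation entry).
function shouldReportTiming(
  startTime: number,
  value: number,
  ignoreZeros: boolean = startTime > 0 // hypothetical override
): boolean {
  if (value === 0 && ignoreZeros) {
    return false;
  }
  return true;
}

console.log(shouldReportTiming(0, 0));          // true: plausibly a legitimate 0
console.log(shouldReportTiming(250, 0));        // false: treated as "not available"
console.log(shouldReportTiming(250, 260));      // true: ordinary value
console.log(shouldReportTiming(250, 0, false)); // true: caller opted in to zeros
```

The override keeps the public utility usable for callers who do expect
legitimate 0-values on entries with a non-zero start time.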

Part of the issue here is that it's unclear how these utilities are
being used since they are public APIs. In core, these functions are
only called by the fetch/xhr instrumentation with resource timing
entries, where legitimate 0-values seem extremely unlikely.

In my opinion, the rarity of legitimate 0-value timing values in the
real world (recall that it doesn't just mean something happened very
fast, but something happened very fast _at the same instant that the
page loaded_) should yield to the conflicting interest (to avoid
nonsensical events that are difficult to process downstream) here,
especially when it appears that the only observed cases are from
automated testing.

Personally I would be in favor of the stronger position (2), but
attempted to strike a balance here to keep things useful in the other
cases.

In open-telemetry#2457, @johnbley argued:

> I think that for this piece of tech, our telemetry should report
> whatever the browser says with minimal processing, and let
> downstream/backend/easier-to-modify components deal with
> normalization/truncation/processing if desired (e.g., throwing
> away absurd times like "that page took 6 years or -5 hours to
> load", or declaring that "no redirects happened so the redirect
> timing is not 0 but non-existent"). Does that not work in your
> world for some reason?

I could see this perspective if the "minimal processing" means that
we send the 0-values to the backends as 0s. It's still unfortunate
but at least it would be easy to write a query/filter to process
them away.

However, given that we actively normalize the 0-values into the
absolute epoch timestamp of `performance.timeOrigin`, I think this
becomes actively hostile to backend/downstream processing, as it
turns a relatively clear signal (value=0) into something that would
require stateful machinery to reverse engineer (infer the timeOrigin
from a trace), or imperfect heuristics (ignore things that seem to
take way too long).

It also takes up bytesize/bandwidth to transmit in the vast majority
of cases where it would be actively unhelpful.

Instead, I'd argue it creates way less harm to drop the 0-values.
When a 0-value is expected but not sent, it may skew the distribution
on some aggregate metrics slightly (e.g. making "average time for
SSL handshake" appear large), but probably nothing would break. If
we are expecting backends to do work to normalize data anyway, then
IMO it would also be way easier for them to notice the missing items
and insert them back than the other way around.

Fixes open-telemetry#3199
Fixes open-telemetry#5314

See also open-telemetry#1769, open-telemetry#2457, open-telemetry#3848, open-telemetry#3879, open-telemetry#4478, open-telemetry#4486
chancancode added a commit to tildeio/opentelemetry-js that referenced this pull request Jan 11, 2025
chancancode added a commit to tildeio/opentelemetry-js that referenced this pull request Jan 14, 2025
Development

Successfully merging this pull request may close these issues.

Invalid timings sent by instrumentations (xhr,fetch,resource)
5 participants