Fix invalid timings in span events #4486

Abinet18 · 2024-02-15T21:00:46Z

Which problem is this PR solving?

Fix invalid timings in span events

Fixes Invalid timings sent by instrumentations (xhr,fetch,resource) #4478

Short description of the changes

Add a reference time to check whether the value to be reported for the event is sane (report the max of the event time and the reference time).

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Checklist:

Followed the style guidelines of this project
Unit tests have been added
Documentation has been updated

codecov · 2024-02-15T21:04:03Z

Codecov Report

Merging #4486 (98c1496) into main (5231aa2) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4486      +/-   ##
==========================================
+ Coverage   92.80%   92.82%   +0.01%     
==========================================
  Files         328      328              
  Lines        9512     9519       +7     
  Branches     2047     2050       +3     
==========================================
+ Hits         8828     8836       +8     
+ Misses        684      683       -1

Files	Coverage Δ
packages/opentelemetry-sdk-trace-web/src/utils.ts	`93.92% <100.00%> (+0.24%)`	⬆️

... and 1 file with indirect coverage changes

Abinet18 · 2024-02-16T17:12:58Z

@MSNev, @martinkuba, @dyladan, can you review this?

t2t2 · 2024-03-04T21:40:13Z

packages/opentelemetry-sdk-trace-web/src/utils.ts

-    span.addEvent(performanceName, entries[performanceName]);
+    let perfTime = entries[performanceName];
+    const refName = refPerfName || PTN.FETCH_START;
+    // Use a reference time, which is the earliest possible value, so that performance timings earlier than it can be corrected to this reference time


This might come down to opinion, but I think simply dropping (ignoring) timings that are earlier than the expected value is better and matches the intention of 0 values (= value not available), whereas this converts them to a value that makes sense but is not real.

E.g., consider 2 well-known cases of 0 values:

* secureConnectionStart is 0 when loading an insecure (http) resource, because there is no secure connection
* most of the timings are 0 when a cross-origin request is made and there is no Timing-Allow-Origin header

Dropping would also make it easier to avoid buggy processing of data. E.g., if someone processes incoming data to get the connecting time per URL via connectEnd - connectStart, they would always get 0 ms to connect (which is a valid value when re-using an existing connection!) for cross-origin requests without a TAO header (as this change would fix both values to equal fetchStart), while if the timings are missing there is an easy way to know that you shouldn't calculate a value from this span.
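To make the concern above concrete, here is a minimal sketch comparing the two approaches on a cross-origin entry without a Timing-Allow-Origin header. The `TimingEntry` shape and the helper names are hypothetical and are not taken from the SDK; the numbers are made up for illustration.

```typescript
// Hypothetical, simplified shape of a resource timing entry (times in ms
// relative to the performance time origin). Not the SDK's real types.
interface TimingEntry {
  fetchStart: number;
  connectStart: number;
  connectEnd: number;
}

// Approach A (clamping): never report a value earlier than the reference time.
function clampToReference(value: number, reference: number): number {
  return Math.max(value, reference);
}

// Approach B (dropping): treat values earlier than the reference time as
// "not available" and report nothing.
function dropIfBeforeReference(
  value: number,
  reference: number
): number | undefined {
  return value < reference ? undefined : value;
}

// Connect duration as a downstream consumer might compute it.
function connectDuration(
  start: number | undefined,
  end: number | undefined
): number | undefined {
  if (start === undefined || end === undefined) {
    return undefined; // signal: do not derive a value from this span
  }
  return end - start;
}

// Cross-origin request without Timing-Allow-Origin: connect timings are 0.
const opaque: TimingEntry = { fetchStart: 120.5, connectStart: 0, connectEnd: 0 };

// Clamping yields a plausible-looking but unreal duration of 0 ms.
const clamped = connectDuration(
  clampToReference(opaque.connectStart, opaque.fetchStart),
  clampToReference(opaque.connectEnd, opaque.fetchStart)
);

// Dropping yields undefined, so the consumer knows no value is available.
const dropped = connectDuration(
  dropIfBeforeReference(opaque.connectStart, opaque.fetchStart),
  dropIfBeforeReference(opaque.connectEnd, opaque.fetchStart)
);
```

With clamping, the opaque entry is indistinguishable from a request that re-used an existing connection; with dropping, the absence itself carries the information.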

Abinet18 · Mar 5, 2024

Yes, we can discuss which option to take: either not reporting the invalid timing, or correcting it as done here. As things stand, all timings are reported whether they are valid or invalid.

JamieDanielson · Mar 6, 2024

I would also lean toward adding the timing only if it's valid, and not adding it if it's invalid. This helps prevent skewing of results.

Abinet18 · Mar 7, 2024

@t2t2, @JamieDanielson I have made the changes to conditionally add the timings.

JamieDanielson

Thanks for fixing this!

The change in open-telemetry/opentelemetry-js#4486 means that a span event added via addSpanNetworkEvent() in v1.24.0 and later might be dropped if its time value is before the fetchStart time. Typically this happens when the event time value is 0.
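The behavior described above can be sketched roughly as follows. This is an illustration only, not the SDK's implementation: `EventSink`, `Timings`, and `addNetworkEventIfValid` are hypothetical names, and dropping the event when the reference timing itself is missing is an assumption of this sketch.

```typescript
// Hypothetical stand-ins for the SDK types; only what this sketch needs.
type Timings = Record<string, number | undefined>;

interface EventSink {
  addEvent(name: string, time: number): void;
}

// Adds the named timing as an event only when both it and the reference
// timing are numbers and the value is not earlier than the reference.
// The reference defaults to fetchStart. Returns whether an event was added.
function addNetworkEventIfValid(
  sink: EventSink,
  name: string,
  timings: Timings,
  referenceName: string = 'fetchStart'
): boolean {
  const value = timings[name];
  const reference = timings[referenceName];
  if (typeof value !== 'number' || typeof reference !== 'number') {
    return false; // assumption of this sketch: no reference means no event
  }
  if (value < reference) {
    return false; // e.g. a 0 value meaning "not available"
  }
  sink.addEvent(name, value);
  return true;
}

// A sink that only remembers which event names were recorded.
const recorded: string[] = [];
const sink: EventSink = {
  addEvent: (name) => {
    recorded.push(name);
  },
};

const timings: Timings = {
  fetchStart: 35.2,
  secureConnectionStart: 0, // plain http: no TLS handshake took place
  responseEnd: 80.9,
};

for (const name of ['fetchStart', 'secureConnectionStart', 'responseEnd']) {
  addNetworkEventIfValid(sink, name, timings);
}
```

Here `secureConnectionStart` is skipped because 0 is earlier than `fetchStart`, which is why tests that expect every network event to be present may need to tolerate missing ones.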

* fix: use reference value to avoid invalid timings in span
* add comment, add to changelog
* donot report invalid timing in span
* lint fix
* fix failing test
* fix failing test 2
* Add tests
* add removed test
* suggested changes

---------

Co-authored-by: Marc Pichler <[email protected]>

@johnbley

Background:

1. For historical reasons, the perf/resource timing spec uses 0 as a special value to denote when some timing information is either not applicable (e.g. no redirect occurred) or not available (when producing an opaque resource timing object for CORS resources).

2. However, in some limited cases, 0 can also be a legitimate value for these timing events. Note that this is rare in real life: these fields are high-res performance timestamps relative to the performance time-origin, which is typically the navigation event for the initial page load.

   For a _resource_ timing to have a 0 timestamp, it would have to be initiated simultaneously with page load; it is unclear if this can actually happen IRL. Reportedly (open-telemetry#1769), at one point this was observed in some browsers during automated tests, where things happen very fast and the browser artificially suppresses the timing resolution. It was unclear if the report was about the navigation timing entry or resource timing entries.

   It is also unclear if these utilities are intended for anything other than the internal fetch/XHR instrumentation, but they are public API, so if someone uses these functions on the initial page navigation event, then it is possible for the function to see legitimate 0-value inputs.

3. When creating span events, we do not use the timeOrigin-relative perf timestamps as-is. Rather, we convert them back to UNIX epoch timestamps. When this conversion is applied to those 0 timestamps, it creates nonsensical/misleading events that are quite difficult for downstream consumers to distinguish.

It would be nice if the W3C specs had left the N/A values as `undefined`, but that's not the world we live in, so we have to work with what we've got.

History:

1. Initially, the code ignored 0-value timestamps.
2. open-telemetry#1769 identified cases of valid 0-value timestamps and removed the check.
3. This caused the other category of bugs, where we created the nonsensical events (open-telemetry#2457, open-telemetry#3848, open-telemetry#4478).
4. open-telemetry#3897 added a special case for `secureConnectionStart` by way of tapping into auxiliary metadata.
5. That approach cannot be generalized to the other cases, so open-telemetry#4486 added some rather convoluted checks.
6. As part of refactoring the tests to use service workers, a new bug, open-telemetry#5314, was identified.

Presumably, the patch in open-telemetry#4486 was written that way (as opposed to just checking for 0) to avoid breaking open-telemetry#1769, but I suspect it ends up breaking some of those use cases anyway.

Options:

1. Keep the patch from open-telemetry#4486 largely as-is but change the reference time from `fetchStart` to `startTime`.
2. Revert to the very original intent of the code and drop everything with 0-values.
3. This commit attempts a compromise position: when `startTime === 0` (which probably means we are called with the initial navigation event), 0-values are plausibly valid so we include them by default, but in any other case (`startTime > 0`), legitimate 0-values should not be possible so we drop them by default.

Part of the issue here is that it's unclear how these utilities are being used, since they are public APIs. In core, these functions are only called by the fetch/xhr instrumentation with resource timing entries, where legitimate 0-values seem extremely unlikely.

In my opinion, the rarity of legitimate 0-value timings in the real world (recall that it doesn't just mean something happened very fast, but something happened very fast _at the same instant that the page loaded_) should yield to the conflicting interest here (avoiding nonsensical events that are difficult to process downstream), especially when it appears that the only observed cases are from automated testing.

Personally I would be in favor of the stronger position (2), but attempted to strike a balance here to keep things useful in the other cases.

In open-telemetry#2457, @johnbley argued:

> I think that for this piece of tech, our telemetry should report
> whatever the browser says with minimal processing, and let
> downstream/backend/easier-to-modify components deal with
> normalization/truncation/processing if desired (e.g., throwing
> away absurd times like "that page took 6 years or -5 hours to
> load", or declaring that "no redirects happened so the redirect
> timing is not 0 but non-existent"). Does that not work in your
> world for some reason?

I could see this perspective if "minimal processing" meant that we send the 0-values to the backends as 0s. It's still unfortunate, but at least it would be easy to write a query/filter to process them away.

However, given that we actively normalize the 0-values into the absolute epoch timestamp of `performance.timeOrigin`, I think this becomes actively hostile to backend/downstream processing, as it turns a relatively clear signal (value=0) into something that would require stateful machinery to reverse engineer (infer the timeOrigin from a trace), or imperfect heuristics (ignore things that seem to take way too long).

It also takes up bytesize/bandwidth to transmit in the vast majority of cases where it would be actively unhelpful.

Instead, I'd argue it creates way less harm to drop the 0-values. When a 0-value is expected but not sent, it may skew the distribution of some aggregate metrics slightly (e.g. making "average time for SSL handshake" appear large), but probably nothing would break. If we are expecting backends to do work to normalize data anyway, then IMO it would also be way easier for them to notice the missing items and insert them back than the other way around.

Fixes open-telemetry#3199
Fixes open-telemetry#5314
See also open-telemetry#1769, open-telemetry#2457, open-telemetry#3848, open-telemetry#3879, open-telemetry#4478, open-telemetry#4486
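Two points from the message above lend themselves to a small sketch: why a 0 timing turns into a misleading absolute timestamp once it is converted, and how the compromise in option 3 would decide what to keep. Everything here is hypothetical and simplified; `toEpochMillis` and `shouldReport` are illustrative names, the conversion is modeled as plain addition, and the "by default" configurability mentioned in option 3 is left out.

```typescript
// Simplified model of converting a time-origin-relative timestamp (ms) into
// an epoch timestamp (ms). The real conversion is more involved.
function toEpochMillis(relative: number, timeOrigin: number): number {
  return timeOrigin + relative;
}

// Option 3 as described: non-zero values are always kept; a 0 value is kept
// only when the entry itself starts at 0 (likely the initial navigation).
function shouldReport(value: number, startTime: number): boolean {
  if (value !== 0) {
    return true;
  }
  return startTime === 0;
}

const timeOrigin = 1700000000000; // assumed page-load time in epoch ms
const startTime = 5000; // resource requested 5 s after page load
const redirectStart = 0; // browser's way of saying "no redirect occurred"

// After conversion the "not applicable" marker looks like a real event that
// happened at page load, 5 s before the request even started.
const misleading = toEpochMillis(redirectStart, timeOrigin);
const requestStart = toEpochMillis(startTime, timeOrigin);
const apparentGapMs = requestStart - misleading;

// Under option 3 the 0 value is dropped for this resource entry, but would
// be kept for an entry whose own startTime is 0.
const keepForResource = shouldReport(redirectStart, startTime);
const keepForNavigation = shouldReport(0, 0);
```

A backend that only sees `misleading` has no direct way to tell it apart from a genuine timestamp, which is the "stateful machinery" problem raised above.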

fix: use reference value to avoid invalid timings in span

0efb5da

Abinet18 requested a review from a team February 15, 2024 21:00

Abinet18 and others added 2 commits February 22, 2024 12:39

add comment, add to changelog

03e84e3

Merge branch 'main' into fix-invalid-timings-in-spans

0e90f68

t2t2 reviewed Mar 4, 2024

Abinet18 and others added 8 commits March 6, 2024 11:04

donot report invalid timing in span

18fa4c0

lint fix

f7880ce

Merge branch 'main' into fix-invalid-timings-in-spans

4e87873

fix failing test

6dfef84

fix failing test 2

24c7f6c

Add tests

c773b8f

Merge branch 'main' into fix-invalid-timings-in-spans

b9bf4ea

add removed test

441a301

svetlanabrennan reviewed Mar 20, 2024

CHANGELOG.md (outdated, resolved)

pichlermarc reviewed Mar 20, 2024

packages/opentelemetry-sdk-trace-web/src/utils.ts (resolved)

Abinet18 and others added 4 commits March 20, 2024 09:02

suggested changes

4f98c14

Merge branch 'main' into fix-invalid-timings-in-spans

a3ccf1e

Merge branch 'main' into fix-invalid-timings-in-spans

6ac28d4

Merge branch 'main' into fix-invalid-timings-in-spans

cec2c3b

pichlermarc approved these changes Apr 2, 2024

JamieDanielson approved these changes Apr 2, 2024

Merge branch 'main' into fix-invalid-timings-in-spans

98c1496

pichlermarc merged commit e01f493 into open-telemetry:main Apr 3, 2024
20 checks passed

pichlermarc mentioned this pull request Apr 19, 2024

chore: prepare release 1.24.0/0.51.0 #4650

Merged

blumamir mentioned this pull request Apr 24, 2024

chore: experimental 0.51.0, remove instrumentations generic type to align with upstream open-telemetry/opentelemetry-js-contrib#2091

Merged

trentm mentioned this pull request Apr 24, 2024

test(instr-document-load): fix test to allow missing network span events open-telemetry/opentelemetry-js-contrib#2145

Merged

moishimwe mentioned this pull request Jun 30, 2024

[Snyk] Upgrade @opentelemetry/sdk-trace-base from 1.17.1 to 1.25.0 moishimwe/realworld-microservice-project#4

Open

marcelineepi mentioned this pull request Jun 30, 2024

[Snyk] Upgrade @opentelemetry/sdk-trace-base from 1.17.1 to 1.25.0 marcelineepi/realworld-microservice-project#4

Open

keemraheem mentioned this pull request Jun 30, 2024

[Snyk] Upgrade @opentelemetry/sdk-trace-base from 1.12.0 to 1.25.0 keemraheem/realworld-microservice-project#5

Open

fakure2020 mentioned this pull request Jul 3, 2024

[Snyk] Upgrade @opentelemetry/sdk-trace-base from 1.17.1 to 1.25.0 fakure2020/realworld-microservice-project01#4

Open

This was referenced Jul 17, 2024

[Snyk] Upgrade @opentelemetry/sdk-trace-base from 1.22.0 to 1.25.1 DevMockSec/microservices-demo#4

Open

[Snyk] Upgrade @opentelemetry/semantic-conventions from 1.22.0 to 1.25.1 DevMockSec/microservices-demo#5

Open

rathorevs mentioned this pull request Jul 18, 2024

[Snyk] Upgrade @opentelemetry/sdk-trace-base from 1.17.1 to 1.25.1 rathorevs/10-Tier-MicroService-Appliction#5

Open

harshakrithvikk mentioned this pull request Jul 24, 2024

[Snyk] Upgrade @opentelemetry/sdk-trace-base from 1.17.1 to 1.25.1 harykryshnan-Master/10-MicroService-Appliction#4

Open

krischarbonneau mentioned this pull request Jul 30, 2024

[Snyk] Upgrade @opentelemetry/sdk-metrics from 1.23.0 to 1.25.1 DTS-STN/senior-journey#677

Closed

chancancode mentioned this pull request Jan 10, 2025

[inst-xhr/fetch] Network events incorrectly dropped #5314

Open

chancancode mentioned this pull request Jan 11, 2025

fix(inst-fetch,inst-xhr) Ignore network events with zero-timing #5332

Open

8 tasks


Abinet18 commented Feb 15, 2024 •

edited by JamieDanielson


Abinet18 commented Feb 16, 2024

t2t2 Mar 4, 2024

Abinet18 Mar 5, 2024

JamieDanielson Mar 6, 2024

Abinet18 Mar 7, 2024

JamieDanielson left a comment
