Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Fetch and XHR instrumentation to use anchored clock #3320

Closed

Conversation

gregolsen
Copy link
Contributor

@gregolsen gregolsen commented Oct 12, 2022

Which problem is this PR solving?

Originally reported here

After introducing anchored clock in this PR XHR and Fetch instrumentation started producing spans with negative duration. hrTime() is used to set end time of the spans but it is not aligned with an instance of an anchored clock inside the spans.

Short description of the changes

Instead of using hrTime() a new method currentTime() introduced to a Span. It returns the time from the anchored clock ensuring consistency with the start time of the span. This new method is used in both Fetch and XHR instrumentations to set end time of the span.
The fix is inspired by this fix for a similar problem in express instrumentation https://github.com/open-telemetry/opentelemetry-js-contrib/pull/1210/files

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I've been testing this by producing spans with negative duration after putting computer to sleep for a few minutes.
In Honeycomb spans with negative duration are overflow and have long duration:
image

With the issue fixed, I couldn't produce spans with negative duration anymore. Notice how spans have clock_drift over 500 seconds. clock_drift was set on the client with Date.now - (performance.timeOrigin + performance.now()):
image

Steps to reproduce:

  • Open a browser tab with an app using OTEL XHR/Fetch instrumentation (1.7.0+)
  • Put computer into sleep for a few minutes
  • "Wake" computer up and reload the page in the same tab to produce new traces
    Expected: span duration for XHR/Fetch spans should be positive
    Actual: span duration for XHR/Fetch spans is negative

Checklist:

I'm not sure how to write a unit test for it.

  • Followed the style guidelines of this project
  • Unit tests have been added
  • Documentation has been updated

cc @dyladan as the one with context on the clock drift

@gregolsen gregolsen requested a review from a team October 12, 2022 08:42
@codecov
Copy link

codecov bot commented Oct 12, 2022

Codecov Report

Merging #3320 (3a4c1e5) into main (ef6fcdc) will decrease coverage by 0.05%.
The diff coverage is 0.00%.

❗ Current head 3a4c1e5 differs from pull request most recent head 19d855e. Consider uploading reports for the commit 19d855e to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3320      +/-   ##
==========================================
- Coverage   93.04%   92.99%   -0.06%     
==========================================
  Files         226      226              
  Lines        6517     6521       +4     
  Branches     1360     1360              
==========================================
  Hits         6064     6064              
- Misses        453      457       +4     
Impacted Files Coverage Δ
api/src/trace/NonRecordingSpan.ts 85.71% <0.00%> (-9.03%) ⬇️
packages/opentelemetry-sdk-trace-base/src/Span.ts 97.61% <0.00%> (-1.58%) ⬇️
...emetry-core/src/platform/node/RandomIdGenerator.ts 87.50% <0.00%> (-6.25%) ⬇️
...-trace-base/src/platform/node/RandomIdGenerator.ts 93.75% <0.00%> (+6.25%) ⬆️

@dyladan dyladan added the api:traces Issues and PRs related to the Traces API label Oct 13, 2022
@dyladan
Copy link
Member

dyladan commented Oct 13, 2022

@gregolsen thanks for the work. In the SIG meeting in recent weeks we've actually been talking about this and had decided to go a different way. We were going to switch to using Date instead of performance for calculating times.

Advantages:

  • Simple call to system clock leaves no room for miscalculations
  • Resistant to runtime pauses
  • Resistant to drift if NTP is set up
  • No more susceptible than performance to drift if NTP is not set up

Disadvantages:

  • If an NTP correction happens during a span it is possible to get inconsistent times. This is not expected to happen often.
  • If an NTP correction happens during a trace it is possible for a span to be shifted forward or back

You can see the discussion in #3279

@gregolsen
Copy link
Contributor Author

Thanks for context @dyladan I haven't seen that issue when making this PR!

Question: if we start using Date.now() to set span start/end timestamps how that will work with the instrumentation that uses browser performance API that returns the number of milliseconds from the origin?
e.g. document load instrumentation relies on browser performance API to get the entries and populate the trace with spans on different resources that were loaded.
I'm thinking about a case when document load is used alongside XHR/Fetch and some custom instrumentation. Ideally all instrumentations should be using the same way to set the timestamps for the traces to be consistent.
What you think? Am I missing something?

There's an interesting discussion in another PR I have: open-telemetry/opentelemetry-js-contrib#1092

@dyladan
Copy link
Member

dyladan commented Oct 14, 2022

Thanks for context @dyladan I haven't seen that issue when making this PR!

Yeah I figured. Hard to stay on top of all the things ongoing if you're not in it every day so i thought a link might be helpful. It's also not a settled question by any means so we're definitely interested in any input you have.

Question: if we start using Date.now() to set span start/end timestamps how that will work with the instrumentation that uses browser performance API that returns the number of milliseconds from the origin?

The performance.now time still needs to be converted to a timestamp (usually by adding to timeOrigin), so in the end you end up with 2 timestamps from different clocks. The performance clock is subject to pauses, but the system clock is not. This definitely is a problem if you're using the web performance API. It actually affects the anchored clock too in its current form as the times from the web timing api are not adjusted to the anchor.

I don't have a great solution except that maybe we need to consider two different methods of gathering time, one for web (or short lived environments) using the performance clock and one for node using Date. Another option would be one @Flarna proposed at one point which would be to have a global anchored clock which has its anchor updated periodically. That clock would also have to provide an adjust method or similar so that times from the performance timings api can be adjusted to the anchor. You still have problems when the anchor is changed by a significant amount, but hopefully that is rare enough.


One point of feedback on this PR: times will need to be consistent between traces, metrics, logs, and whatever else is added in the future. To that end, if we have a global anchored clock I feel it should be its own package which can be consumed by any API package. The api package is really just the trace API and in order to keep bundle sizes down there is a desire to keep unrelated API packages separated.

@github-actions
Copy link

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Dec 19, 2022
@github-actions
Copy link

github-actions bot commented Jan 9, 2023

This PR was closed because it has been stale for 14 days with no activity.

@github-actions github-actions bot closed this Jan 9, 2023
@legendecas legendecas reopened this Jan 10, 2023
@legendecas legendecas closed this Jan 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api:traces Issues and PRs related to the Traces API stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants