Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(tracing): make spans resilient to performance clock drift #3384

Closed
wants to merge 20 commits into from

Conversation

dyladan
Copy link
Member

@dyladan dyladan commented Nov 4, 2022

Relates to #3279 particularly #3279 (comment)

Fixes #3355

@dyladan dyladan requested a review from a team November 4, 2022 20:01
@codecov
Copy link

codecov bot commented Nov 10, 2022

Codecov Report

Merging #3384 (eb4c2ae) into main (bd56547) will decrease coverage by 0.70%.
The diff coverage is 91.07%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3384      +/-   ##
==========================================
- Coverage   93.27%   92.57%   -0.71%     
==========================================
  Files         247      132     -115     
  Lines        7348     3152    -4196     
  Branches     1512      656     -856     
==========================================
- Hits         6854     2918    -3936     
+ Misses        494      234     -260     
Impacted Files Coverage Δ
packages/opentelemetry-core/src/common/time.ts 95.45% <75.00%> (+0.06%) ⬆️
packages/opentelemetry-sdk-trace-base/src/Span.ts 97.87% <93.75%> (-2.13%) ⬇️
api/src/platform/node/globalThis.ts 0.00% <0.00%> (-100.00%) ⬇️
...s/opentelemetry-core/src/platform/node/sdk-info.ts 0.00% <0.00%> (-100.00%) ⬇️
...opentelemetry-core/src/platform/node/globalThis.ts 0.00% <0.00%> (-100.00%) ⬇️
...pentelemetry-core/src/platform/node/performance.ts 0.00% <0.00%> (-100.00%) ⬇️
.../packages/api-logs/src/platform/node/globalThis.ts 0.00% <0.00%> (-100.00%) ⬇️
...lemetry-resources/src/detectors/ProcessDetector.ts 31.81% <0.00%> (-68.19%) ⬇️
...perimental/packages/otlp-exporter-base/src/util.ts 78.94% <0.00%> (-15.79%) ⬇️
...ckages/opentelemetry-exporter-zipkin/src/zipkin.ts 84.48% <0.00%> (-15.52%) ⬇️
... and 119 more

@t2t2
Copy link
Contributor

t2t2 commented Nov 14, 2022

Made a sandbox with this patch, most frequently used otel web instrumentations (document-load, user-interaction, xhr and fetch) and an artificial time drift simulator:

https://codesandbox.io/s/modest-fermat-4eu09e?file=/src/otel-fixes.ts

It doesn't improve this test case a lot (user interaction's in right time scale, as it's auto + auto), pretty much all cases of performance timestamps are still affected by drift

  1. document load: performance time start + performance time end:

image

  1. fetch: auto start + performance time end (duration becomes negative -> set to 0)

image

@dyladan
Copy link
Member Author

dyladan commented Nov 14, 2022

It doesn't improve this test case a lot (user interaction's in right time scale, as it's auto + auto), pretty much all cases of performance timestamps are still affected by drift

These instrumentations all use auto start and manual end times.

duration becomes negative -> set to 0

If the duration is negative it should be getting the drift offset correction applied, not just set to 0. If you look at the diagram in the linked comment there is no situation where it should be set to 0, so there must be some bad code path.

@t2t2
Copy link
Contributor

t2t2 commented Nov 15, 2022

Updated codesandbox with pr changes as of now

This was the root of the 0-duration spans in your example too

Yep fixed on auto start + perf end, however span events (input: perf timestamp) also need correcting (now they end up outside the span w/ drift)

image

These instrumentations all use auto start and manual end times.

document-load spans don't, as they use manual (performance API) times for both start and end, and these still get drifted:

image

@dyladan
Copy link
Member Author

dyladan commented Nov 15, 2022

document-load spans don't, as they use manual (performance API) times for both start and end, and these still get drifted

I'm not sure what we can do there. If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

@t2t2
Copy link
Contributor

t2t2 commented Nov 15, 2022

If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

There is currently an assumption that any numeric value lower than timeOrigin is a performance timestamp:

// Must be a performance.now() if it's smaller than process start time.

So it's actually impossible to create spans in the past (w/ accounting for drift) anyway:

image

@Flarna
Copy link
Member

Flarna commented Nov 15, 2022

I'm not sure what we can do there. If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

We can document how we create timestamps to allow users to keep in sync. Maybe provide some helper APIs.
And use this mechanism at least in the instrumentations hosted here.

@dyladan
Copy link
Member Author

dyladan commented Nov 15, 2022

If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

There is currently an assumption that any numeric value lower than timeOrigin is a performance timestamp:

// Must be a performance.now() if it's smaller than process start time.

So it's actually impossible to create spans in the past (w/ accounting for drift) anyway:

Hmm so there is. That is disappointing and actually strikes me as a bug. The time code in core attempts to do way too much magic IMO and really limits us. That specific one can probably be fixed by changing it so that the heuristic is < performance.now() instead of < performance.timeOrigin. It would at least limit impossible times to epoch + time the page has been open.

I'm not sure what we can do there. If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

We can document how we create timestamps to allow users to keep in sync. Maybe provide some helper APIs. And use this mechanism at least in the instrumentations hosted here.

I agree here documentation is a bare minimum and providing a helper API of some kind to correct drift would be a good idea

@@ -148,21 +170,19 @@ export class Span implements api.Span, ReadableSpan {
api.diag.warn('Dropping extra events.');
this.events.shift();
}
if (isTimeInput(attributesOrStartTime)) {
if (typeof startTime === 'undefined') {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

undefined was actually impossible here

Comment on lines 103 to 113
const now = Date.now();
this._performanceStartTime = otperformance.now();
this._performanceOffset = now - (this._performanceStartTime + otperformance.timeOrigin);

// if startTime is a number smaller than the start of the process
// assume it is a performance API timestamp and apply correction as needed
if (typeof startTime === 'number' && startTime < otperformance.timeOrigin) {
startTime += this._performanceOffset;
}

this.startTime = timeInputToHrTime(startTime ?? now);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @t2t2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated codesandbox with these changes, looks like all are working properly now 👍

Oh it might be better to replace both references to otperformance.timeOrigin with getTimeOrigin from otel/core because it has additional backwards compat checks for older browsers (IE, safari <15)

function getTimeOrigin(): number {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. also had to export that function


// if startTime is a number smaller than the start of the process
// assume it is a performance API timestamp and apply correction as needed
if (typeof startTime === 'number' && startTime < getTimeOrigin()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if startTime is an HrTime? IIUC it should be compared to the time origin too.

Copy link
Member Author

@dyladan dyladan Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only 2 ways I can think of for the user to provide a HrTime

  1. They use our core.hrTime() function to get a time
  2. They use process.hrTime() in node

I think we should discourage (1) and maybe even mark it deprecated. We know it is going to be shifted in the past in many cases. (2) seems extremely unlikely.

The reason I push back is because I'm actually currently reworking the startTime to check for < performance.now() instead of < getTimeOrigin(). We know all performance entries should be in the past, and it is a much more restrictive test. It would also allow creation of spans in the past if the user provides a timestamp from the past (as long as it isn't too close to jan 1 1970).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL here #3279 (comment)

if (provided != null) {
const duration = hrTimeDuration(this.startTime, provided);
if (duration[0] < 0) {
return addHrTimes(provided, [0, this._performanceOffset * 1000000]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return addHrTimes(provided, [0, this._performanceOffset * 1000000]);
return addHrTimes(provided, [0, this._performanceOffset * 1e6]);

or

Suggested change
return addHrTimes(provided, [0, this._performanceOffset * 1000000]);
return addHrTimes(provided, [0, this._performanceOffset * MILLISECOND_TO_NANOSECONDS]);

}

const msDuration = otperformance.now() - this._performanceStartTime;
return addHrTimes(this.startTime, [0, msDuration * 1000000]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return addHrTimes(this.startTime, [0, msDuration * 1000000]);
return addHrTimes(this.startTime, [0, msDuration * 1e6]);

Or

Suggested change
return addHrTimes(this.startTime, [0, msDuration * 1000000]);
return addHrTimes(this.startTime, [0, msDuration * MILLISECOND_TO_NANOSECONDS]);

this.name = spanName;
this._spanContext = spanContext;
this.parentSpanId = parentSpanId;
this.kind = kind;
this.links = links;
this.startTime = timeInputToHrTime(startTime ?? clock.now());

const now = Date.now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think AnchoredClock can still be applied here to get a high-resolution start time by default.

@github-actions
Copy link

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Jan 23, 2023
@legendecas
Copy link
Member

Superseded by #3434.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Span start and end times in the distant past on chrome
6 participants