Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a supported way to get anchored clock times #3279

Closed
dyladan opened this issue Sep 28, 2022 · 52 comments
Closed

Provide a supported way to get anchored clock times #3279

dyladan opened this issue Sep 28, 2022 · 52 comments

Comments

@dyladan
Copy link
Member

dyladan commented Sep 28, 2022

Ref open-telemetry/opentelemetry-js-contrib#1193
Ref open-telemetry/opentelemetry-js-contrib#1209

I wanted to make an issue where we can discuss this separately from the contrib bugs and solve it in a more general way. Here is the relevant conversation so far:

@dyladan Any idea how we could expose the anchored clock to allow use cases like that one in express?
@Flarna

We could provide it on the span but that would be an API change. We could also provide an SDK function which takes a span and returns its clock but that would be quite confusing to end users who just want a current time.

We should see if we can solve this more generally as express may not be the only place this happens and it is quite a difficult problem to solve.
@dyladan

From an outside point of view, it seems indeed the most simple approach would be to let the SDK handle the clock internally and use span.end().

I am planning to use OpenTelemetry on a future project and experimenting different aspects of observability with a toy project, glad to see the community is reactive and things are sorted out quickly.
@alexd6631

We could provide it on the span but that would be an API change. We could also provide an SDK function which takes a span and returns its clock but that would be quite confusing to end users who just want a current time.

I think it should be on the API and quite independent of spans. Best would be something like a TimeProvider interface. SDK can plug in the implementation matching to the platform or even user might override it during SDK init.
@Flarna

I think it should be on the API and quite independent of spans. Best would be something like a TimeProvider interface. SDK can plug in the implementation matching to the platform or even user might override it during SDK init.

How do you make sure you're getting the clock that is the correct anchored clock? I guess the time provider can use whatever clock the currently active span is using?
@dyladan

@alexd6631
Copy link

Just to have more context/understanding on the background issue you are addressing, is it related that the wall clock would be non monotonic (for instance a NTP update, could adjust the clock negatively) and you could end up with negative spans ?

@dyladan
Copy link
Member Author

dyladan commented Sep 28, 2022

Just to have more context/understanding on the background issue you are addressing, is it related that the wall clock would not be monotonic (for instance a NTP update, could adjust the clock negatively) and you could end up with negative spans ?
@alexd6631

There are a couple issues at play here. Originally we used the monotonic clock in order to ensure spans always end after they start even if the system clock is modified during an operation. This can cause issues because the monotonic clock can be paused for various reasons including computer sleeping, lambda runtime pausing, and possibly others. This caused the clock to shift farther and farther into the past as the process lived longer. The monotonic clock also drifts naturally from the system clock as it is not NTP sync'd.

The solution we came up with was to use the system clock to determing the "real" time at the start of a trace, and use the monotonic clock for the rest of the trace. This ensures the whole trace is never drifted too far from real time, but also that times are monotonic within a trace.

@alexd6631
Copy link

Ok the solution makes a lot of sense, especially if you consider the spans to be short-lived (few seconds) compared to the process.

So I guess it would not be appropriate for the API to expose a single current time, as you have a current time per trace (absolute start time of trace + delta in monotonic time), it would be natural to scope it by trace in the API.

@alexd6631
Copy link

Also as a basic / first-time user of Otel I don't see any particular use case where I would want to mess with the underlying clock. Do you have any specific use cases in mind, or is it to let flexibility on the user API (or just to be able to switch the strategy easily at a later point internally) ?

The sole use case (I can think of) where I would set manually the start/end time of spans is if I am getting logs / events in a legacy db and I want to convert them to traces afterward, but the current trace time would not help here.

@Flarna
Copy link
Member

Flarna commented Sep 28, 2022

How do you make sure you're getting the clock that is the correct anchored clock? I guess the time provider can use whatever clock the currently active span is using?

Agree that is not possible as the anchor is on the span currently and propagated to children. So one anchor per trace part within a process,

We could move away from using one anchor per process local trace by using the global anchored clock always. if anchor moves between span start/end SDK should ensure that end >= start.
In the end it's an exchange of adjusting per spans vs adjusting per process local trace part.

If timestamps are used also in logs, metrics it's currently a bit confusing to have spans adjusted but the other parts not.

@Flarna
Copy link
Member

Flarna commented Sep 28, 2022

Ok the solution makes a lot of sense, especially if you consider the spans to be short-lived (few seconds) compared to the process.

It's not that unusual to have spans with duration of hours. e.g. some CI run consisting of several builds and tests.

@dyladan
Copy link
Member Author

dyladan commented Sep 28, 2022

We could move away from using one anchor per process local trace by using the global anchored clock always.

In that case why not just use Date?

@Flarna
Copy link
Member

Flarna commented Sep 28, 2022

As far as I remember high precision timing (better then ms) was one initial targets.
But not sure if this is still applicable - in special in browsers since they reduced the precision a while ago to offer protection against timing attacks and fingerprinting. So maybe using Date is the way to go.

Should we should add some hint in the API docs that specifying start/end/... times is likely a bad idea?

Should we also change span.end() to ensure that end>=start instead of only logging but keeping the invalid value?

@dyladan
Copy link
Member Author

dyladan commented Sep 28, 2022

As far as I remember high precision timing (better then ms) was one initial targets.
But not sure if this is still applicable - in special in browsers since they reduced the precision a while ago to offer protection against timing attacks and fingerprinting. So maybe using Date is the way to go.

That's what I remember too. Wouldn't be too hard to change all the existing clock methods to use Date behind the scenes. Possible to get bad span times during NTP syncs and other things like that but it shouldn't be that many bad spans/traces. We already have the problem where different hosts might have different times.

Should we also change span.end() to ensure that end>=start instead of only logging but keeping the invalid value?

I think we should at least do this. There is no case where an end<start is a good thing.

Should we should add some hint in the API docs that specifying start/end/... times is likely a bad idea?

We should at least document the possible problems if inconsistent times are somehow used.

@alexd6631
Copy link

Just a suggestion/thinking out loud (and probably not applicable due to the existing API ^^), but why not having a "hot" span API intended for live tracing where you cannot specify the times, and a "cold" span API where you are responsible to set both start and end dates.

@dyladan
Copy link
Member Author

dyladan commented Sep 28, 2022

Just a suggestion/thinking out loud (and probably not applicable due to the existing API ^^), but why not having a "hot" span API intended for live tracing where you cannot specify the times, and a "cold" span API where you are responsible to set both start and end dates.

Mostly because that is not what was in the API specification

@dyladan
Copy link
Member Author

dyladan commented Oct 13, 2022

@gregolsen provided an alternative suggested solution in #3320 where he makes the anchored clock available in the API

@dyladan
Copy link
Member Author

dyladan commented Oct 13, 2022

Should we also change span.end() to ensure that end>=start instead of only logging but keeping the invalid value?

I think we should at least do this. There is no case where an end<start is a good thing.

Created a PR for this #3327

@dyladan
Copy link
Member Author

dyladan commented Oct 14, 2022

Copying some conversation with @gregolsen in #3320 as it is relevant here:

Thanks for context @dyladan I haven't seen that issue when making this PR!

Question: if we start using Date.now() to set span start/end timestamps how that will work with the instrumentation that uses browser performance API that returns the number of milliseconds from the origin? e.g. document load instrumentation relies on browser performance API to get the entries and populate the trace with spans on different resources that were loaded. I'm thinking about a case when document load is used alongside XHR/Fetch and some custom instrumentation. Ideally all instrumentations should be using the same way to set the timestamps for the traces to be consistent. What you think? Am I missing something?

There's an interesting discussion in another PR I have: open-telemetry/opentelemetry-js-contrib#1092

@gregolsen

Thanks for context @dyladan I haven't seen that issue when making this PR!

Yeah I figured. Hard to stay on top of all the things ongoing if you're not in it every day so i thought a link might be helpful. It's also not a settled question by any means so we're definitely interested in any input you have.

Question: if we start using Date.now() to set span start/end timestamps how that will work with the instrumentation that uses browser performance API that returns the number of milliseconds from the origin?

The performance.now time still needs to be converted to a timestamp (usually by adding to timeOrigin), so in the end you end up with 2 timestamps from different clocks. The performance clock is subject to pauses, but the system clock is not. This definitely is a problem if you're using the web performance API. It actually affects the anchored clock too in its current form as the times from the web timing api are not adjusted to the anchor.

I don't have a great solution except that maybe we need to consider two different methods of gathering time, one for web (or short lived environments) using the performance clock and one for node using Date. Another option would be one @Flarna proposed at one point which would be to have a global anchored clock which has its anchor updated periodically. That clock would also have to provide an adjust method or similar so that times from the performance timings api can be adjusted to the anchor. You still have problems when the anchor is changed by a significant amount, but hopefully that is rare enough.

@dyladan

From the above conversation, I actually think it is worth considering an @opentelemetry/time package used by the SDK and instrumentation which provides a global anchored clock which is periodically updated.

@gregolsen
Copy link
Contributor

gregolsen commented Oct 14, 2022

Just a bit of interesting context on system clock drift.
Here was provided a little JS snippet to measure how system clock differs from the monotonic one. I opened it in Firefox (as it glitches in chrome sometimes) and let it run for about an hour occasionally running sudo sntp -sS pool.ntp.org to "fix" system clock – big spikes of delta returning back to 0 is that NTP adjustment.
image

I also ran an experiment in production – I added start/end timestamps based on system clocks to all the spans.
Unfortunately most of the spans we have are produced by XHR/Fetch instrumentations – they delay span end via setTimeout to capture useful data from performance API about the request – hence Date.now() on span end doesn't represent the actual end time of the span.
But we do have custom spans which are simply started with tracer.startSpan and ended span.end() w/o custom timestamps. Hence comparing their existing duration with the one computed based on the system clock timestamps is fair.

P999 (99.9 percentile) is 3.9 milliseconds however there are lots of outliers – 2 on logarithmic scale means 100 milliseconds difference between the original duration and the one computed based on the system clock.
image

Filtering out to only the spans with over 10ms difference in measured durations we can see that the relative mismatch is not as bad: with AVG duration of 900ms the AVG difference is 28ms ~3.1%.

image

I also found an interesting span (link is private, just storing for myself) which duration was recorded as 900ms (we are using OTE 1.6.0 at the moment) but, based on system clock, its duration is 0. I would trust system clock here as those cache lookups like are typically under a millisecond in duration. So, somehow, monotonic clock measured 900ms out of nowhere:

Screenshot 2022-10-14 at 17 58 17

@gregolsen
Copy link
Contributor

This trace from the comment above is actually really interesting – looks like there was a system clock adjustment by 915ms – it happened between the long span of 914.7ms and the previous span. But the fact that the long span is exactly 914.7ms means that the performance API monotonic duration also jumped by 915ms (hence the span duration).

How is that possible that both the system clock and values returned by performance.now() are jumped by the same 915ms forward? I thought performance.now() is not susceptible to these adjustments.

We also have performance.timeOrigin value at the time of the span creation stored in current_origin and it hasn't changed - so it must be the performance.now() value that suddenly was adjusted by 915ms:

Screenshot 2022-10-14 at 18 33 24

@dyladan
Copy link
Member Author

dyladan commented Oct 14, 2022

Thanks for doing the research. Great to have hard numbers.

Contingent on my correct understanding of the graph, here are the conclusions I think we can draw:

  • Your Date.now clock is quite slow and drifting a lot from "real" time which is interesting
  • Your performance.now clock is fast compared to "real" time
  • Your performance.now clock is more accurate than Date.now
  • Neither clock got more than about 100ms out of sync

My understanding of NTP under normal operation is that unless a manual adjustment is made like you did there, it doesn't make large jumps, but instead modifies the system clock to run slower or faster as needed. Obviously in a browser environment you have no idea what the user will do with their system clock.

I ran the script in safari for admittedly much less time but it seems to stay within about 1ms. I'll let it keep running for a while and do an ntp update in about 30 minutes to see if that changes things.

image

I know browser vendors play with clocks in order to prevent fingerprinting, so I'm not sure what effect that may have on either or both clocks. I also wouldn't put it past browsers to implement these sorts of things themselves instead of relying on the actual system clock, so this behavior could easily vary from browser to browser.

I thought performance.now() is not susceptible to these adjustments.

I was also under the impression that it was not subject to adjustments. This is interesting information. are you sure that span didn't actually take 914ms? That's quite a large jump. It also looks like the next few spans took less time than the spans leading up to it. Not sure what happened there.

@gregolsen
Copy link
Contributor

I, of course, can't be 100% sure but those spans are cache lookups where the cache is simply an in-memory lru-cache – they are extremely fast and 914ms seems unlikely. But also what are the odds of a system clock jumping for the exact same amount before the span has ended? If the _internal.system_clock_start was smaller then the _internal.system_clock_end by 914ms for that span - it would make sense. But in this case system clock start and end are exactly the same for that span – in other words, system clocked has already jumped at the time the span was started. So the only logical explanation is that hrTime() returned two values with 914.7ms difference.

@dyladan
Copy link
Member Author

dyladan commented Oct 14, 2022

After running quite a bit longer safari seems... consistent

image

@gregolsen
Copy link
Contributor

Oh, sorry - I forgot to mention, I'm not sure if safari is a good candidate for investigations as it doesn't seem like suffering from a difference between system clock and performance.timeOrigin + performance.now():
image

@gregolsen
Copy link
Contributor

We are measuring clock drift on every span by running Date.now() - (performance.timeOrigin + performance.now()). So as you can see safari pretty much doesn't have any difference at all. Max registered difference is 1ms (within margin of error) for safari:
image

@dyladan
Copy link
Member Author

dyladan commented Oct 14, 2022

I, of course, can't be 100% sure but those spans are cache lookups where the cache is simply an in-memory lru-cache – they are extremely fast and 914ms seems unlikely. But also what are the odds of a system clock jumping for the exact same amount before the span has ended? If the _internal.system_clock_start was smaller then the _internal.system_clock_end by 914ms for that span - it would make sense. But in this case system clock start and end are exactly the same for that span – in other words, system clocked has already jumped at the time the span was started. So the only logical explanation is that hrTime() returned two values with 914.7ms difference.

Yeah I'm really not sure what to make of that. I'm not very familiar with the particular performance characteristics of that module and even if I was I don't know much about your use of it.

What units are you measuring drift in? 24 million anythings for chrome seems like quite a gap from the system clock. Your graphs definitely seem to show chrome as the "problem child"

@dyladan
Copy link
Member Author

dyladan commented Oct 14, 2022

Also, it's not really clear which if either of the clocks is more correct. We only know the difference between them

@gregolsen
Copy link
Contributor

I just found a counter example to my idea that performance.now() has jumped – long cache lookup was sandwiched in between two network requests. I found logs for both of them and can confirm that there was a 1.6s gap between the requests. So, at least in this case, the 1.5s duration is not a clock jump but rather a valid gap in the trace. Whether cache lookup itself is measured correctly is still an open question as I can't believe it took that much.
Screenshot 2022-10-14 at 20 35 58

Chrome is indeed a problem child - those units are milliseconds, so 2,709,469,466 milliseconds is 2,709,469,466 / 1000 / 3600 / 24 = 31 day of a difference between Date.now() and performance.timeOrigin + performance.now().
It is hard to believe, however just today I asked my colleagues to measure the difference in their browsers – the maximum result I got was 7.5 days!

We have instrumented spans with many timestamps: we have the original timestamp set by OTEL (comes from hrTime()), proxy timestamp (we proxy all spans via our backend API) and also Honeycomb provides ingestion timestamp when it got the span into their system. Proxy timestamp, system clock and ingestion time are in alignment, while the timestamp set by OTEL is far in the past thanks to the computers sleeping.
At the moment we are correcting span timestamps on the backend – we measure the difference Date.now() - (performance.timeOrigin + performance.now()) at the time of the trace start. Then set it as a global attribute to every span of that trace - so that every span of a trace is shifted by the exact same amount.
We only shift the spans if the difference is positive, as in when Date.now() is larger then the time via performance API.
Negative values indicate system clock drift (the one we saw produced by that JS snippet).

Here are two heatmaps - for positive clock drift and negative. Negative clock drift is tiny while positive drift has gigantic outliers.

image

image

@dyladan
Copy link
Member Author

dyladan commented Oct 20, 2022

Chrome is indeed a problem child - those units are milliseconds, so 2,709,469,466 milliseconds is 2,709,469,466 / 1000 / 3600 / 24 = 31 day of a difference between Date.now() and performance.timeOrigin + performance.now(). It is hard to believe, however just today I asked my colleagues to measure the difference in their browsers – the maximum result I got was 7.5 days!

This is unfortunately not that uncommon from my (admittedly recently acquired) understanding. I think chrome's timeorigin might be calculated from the browser startup not from each tab start?

We have instrumented spans with many timestamps: we have the original timestamp set by OTEL (comes from hrTime()), proxy timestamp (we proxy all spans via our backend API) and also Honeycomb provides ingestion timestamp when it got the span into their system. Proxy timestamp, system clock and ingestion time are in alignment, while the timestamp set by OTEL is far in the past thanks to the computers sleeping. At the moment we are correcting span timestamps on the backend – we measure the difference Date.now() - (performance.timeOrigin + performance.now()) at the time of the trace start. Then set it as a global attribute to every span of that trace - so that every span of a trace is shifted by the exact same amount. We only shift the spans if the difference is positive, as in when Date.now() is larger then the time via performance API. Negative values indicate system clock drift (the one we saw produced by that JS snippet).

Here are two heatmaps - for positive clock drift and negative. Negative clock drift is tiny while positive drift has gigantic outliers.

image

image

Interesting. Even with the new anchored clock implementation you have drifts this bad? Or is this just direct hrTime measuring not the span time?

@dyladan
Copy link
Member Author

dyladan commented Oct 25, 2022

Created #3359 as a proposal to show how we can remove the anchored clock from the tracer while we work on a more full solution. I really don't want to remove the anchored clock as we will just reintroduce those old problems, but the new problems seem worse IMO

/cc @Flarna we had discussions which led to the original anchored clock. if you see a better way to resolve these issues without reverting I would appreciate it

@Flarna
Copy link
Member

Flarna commented Oct 25, 2022

Unfortunately I have no idea how to fix this easy. I was about proposing to revert but seems you were faster.

API allows to pass start/end time independent of each other using any timestamp source user wants. And actually this happens in the wild. I think this is a major design flow in API.
Seems all we can do as of now is to exchange one problem with another.

To my understanding there are use cases (at least in browser) where performance must be used - but we know it's drifting.

Maybe there is a solution for non browser environments by creating an API to get the current time from the one and only (configurable) time source and dissallow the use of any other time sources. But no idea how to disallow this (besides via doc).

@t2t2
Copy link
Contributor

t2t2 commented Oct 26, 2022

For browsers considering that

  1. Date.now is less likely to drift / has less drift compared to performance.now
  2. You can measure how much performance API has drifted from Date.now via Date.now() - (performance.timeOrigin + performance.now())

The easiest fix for data coming from performance API would be to add the drift to timeorigin that is used to convert relative performance timestamp to absolute timestamp

function getTimeOrigin(): number {

return timeOrigin + (Date.now() - (timeOrigin + performance.now()));
// which can be math'd into
return timeOrigin + Date.now() - timeOrigin - performance.now();
// which simplifies to
return Date.now() - performance.now();

This way also if user records timestamps manually somewhere

  1. Date.now() would match as it'd be the default
  2. performance.now() as is would get corrected
  3. performance.timeOrigin + performance.now() would be the only one having issues, but I think documentation pointing to performance API being subject to time drifting would be good enough

@Flarna
Copy link
Member

Flarna commented Oct 27, 2022

I'm not a browers expert so asking here. which category of the above 3 is used by instrumentation-fetch here?

@t2t2
Copy link
Contributor

t2t2 commented Oct 27, 2022

corsPreFlightRequest comes from performance API, I'll check my sleeping laptop tomorrow but even the values provided by browser should come from same (drifting) clock as performance.now = 2nd

@gregolsen
Copy link
Contributor

@t2t2 please correct me if I'm wrong but if the getTimeOrigin() modified to return Date.now() - performance.now() then it will return different values for different invocations. While performance.now() increases monotonically Date.now() is not as it is corrected via NTP. As a result, when Date.now() correction happens mid-trace, spans of that given trace will be arbitrarily "shifted" within a trace.
Unless I'm missing something, safer approach would be to ensure that for any given trace all span timings are adjusted by the same amount to make sure overall trace duration and composition is correct.

Some data to back this hypothesis up (unfortunately you won't be able to access the links provided, I'm adding them for myself to have original queries/traces in case more info needed):
All our spans are instrumented with a clock drift (Date.now() - performance.timeOrigin - performance.now()) which is measured during trace start (we use it to shift spans on the backend). We also record Date.now() on every span start too. Having both span timestamp (performance.timeOrigin + performance.now()) and Date.now() at span start we can calculate the current clock drift for every span.
We have a derived column _internal.system_clock_start_formatted equals to difference between the current clock drift and the original clock drift measured at trace start.
This chart shows spans where the difference between current clock drift and the original one is more than a second (query):
image

Here's a sample trace. Here you can see how _internal.clock_drift_drift is mostly close to 0 milliseconds at the beginning of the trace which means both Date.now() and performance.now() are increasing at the same rate.
Then towards the end it jumps to 96.545 seconds (Date.now() was adjusted) which is reflected in a 97 seconds mismatch between the span timestamp (measured as performance.timeOrigin + performance.now()) and the _internal.system_clock_start_formatted (Date.now() at the time of the span creation).

Screenshot 2022-10-27 at 23 54 47

@johnbley
Copy link
Member

A man with one clock knows what time it is.  A man with two clocks is 
never sure.

To summarize the facts of these two APIs as I understand them:

  • Date.now may change forwards or backwards because of NTP, but is
    otherwise much more reliably synched to "reality" than performance.now
  • performance.now is very unreliable for browser activity (more so than
    server-side timestmaps) because of the "laptop closed" issue. However, it
    gives more reliable deltas/durations than Date.now.
  • Additionally, performance.now is no longer high-resolution in browsers
    due to Spectre mitigation.

performance.now is sometimes the only source we have for a timestamp:

  • CORS preflight start and end times in opentelemetry-instrumentation-fetch
  • start and end timestamps as well as all sub-event times on document
    load, document fetch, and resource fetch in
    opentelemetry-instrumentation-document-load
  • CORS preflight start and end timestamps in opentelemetry-instrumentation-xml-http-request
  • start and end timestamps for long-task events in
    opentelemetry-instrumentation-long-task

While I generally like @gregolsen's ideas around keeping fields for all
the subtleties, we have to choose a single start and end time to align
with the otel spec. I would propose the following:

  • All calculations that are forced to involve performance.now timestamps
    should use a startTime based on eventPerformanceNowStart + performance.timeOrigin + (Date.now()-(performance.timeOrigin+performance.now())) which simplifies
    out to eventPerformanceNowStart + Date.now() - performance.now().

  • All "regular" spans with startSpan and end using the current
    time should set startTime to Date.now and
    duration based on the delta between calls to performance.now(). This
    will mean keeping two start times (though not necessarily exposing them
    via api) as an internal complication. I could likewise understand or
    enterain a desire to use the same idea but tie endTime (rather than
    startTime) to Date.now().

@dyladan
Copy link
Member Author

dyladan commented Oct 31, 2022

John's suggestion is one that has been brought up in the past. The biggest downside is that it is susceptible to ntp changes mid trace

@johnbley
Copy link
Member

John's suggestion is one that has been brought up in the past. The biggest downside is that it is susceptible to ntp changes mid trace

Yep. For the purposes of lining up with other data streams (e.g.,log timestamps) we need at least one timestamp as close to reality as we can get. Whether NTP resets backwards or forwards it'll make everything (not just traces) a bit weird to interpret. We could have systemTimeAtSpanStart, duration, and systemTimeAtSpanEnd as fields, with the caveat that sometimes end < start, but I don't think that would go over well.

@t2t2
Copy link
Contributor

t2t2 commented Oct 31, 2022

please correct me if I'm wrong but if the getTimeOrigin() modified to return Date.now() - performance.now() then it will return different values for different invocations. While performance.now() increases monotonically Date.now() is not as it is corrected via NTP. As a result, when Date.now() correction happens mid-trace, spans of that given trace will be arbitrarily "shifted" within a trace.

So this would likely not be an issue for resource timings as they all would be translated in one batch at the end of span (and tbh for fetch and xhr it'd be more accurate to have both span start and end time from resource timings), however there probably could be nanosecond level rounding errors between calls (since performance.now and Date.now have different precisions rounding could give different values, which could be solved by memoizing the value as long as it hasn't changed more than eg. ms) and accepting that the data accuracy is going to be around ms (which considering anything actionable - be it long blocking tasks or slow http calls - would have duration in 10 - 1000ms anyway, it'd be good enough considering the situation)

For custom spans it's definitely more suspectible to mid-span NTP updates, but here I think just the reality of the amount of garbage data now from performance API time drift is >>>> the amount of data garbaged by NTP, that personally it would go "yeah 0.0...01% of RUM data is just gonna end up being gabrage, but at least it has many zeroes"

eg one weekend later:

image

Code used to calculate above
function getCurrentTimestamps() {
  const dnow = Date.now()
  const pnow = performance.now()

  return [dnow, performance.timeOrigin + pnow, new Date(dnow), new Date(performance.timeOrigin + pnow)]
}

const url = 'https://httpbin.org/response-headers?timing-allow-origin=*'
const obs = new PerformanceObserver((list) => {
  list.getEntries().forEach(entry => {
    if (entry.name !== url) {
      return false
    }

    const table = {}
    Object.keys(entry.toJSON()).forEach(key => {
      table[key] = {}
      if (typeof entry[key] !== 'number' || ['duration', 'transferSize', 'encodedBodySize', 'decodedBodySize'].includes(key)) {
        table[key].value = entry[key]
      } else {
        table[key].value = entry[key]
        table[key].withOrigin = performance.timeOrigin + entry[key]
        table[key].withDrift = (Date.now() - performance.now()) + entry[key]
        table[key].withOriginHuman = new Date(performance.timeOrigin + entry[key])
        table[key].withDriftHuman = new Date((Date.now() - performance.now()) + entry[key])
      }
    })
    console.table(table)
  })
})
obs.observe({ entryTypes: ['resource'] })

console.log('drift check', ...getCurrentTimestamps())
fetch(url, {headers: {'x-custom': 'create-OPTIONS-req'}}).then(() => {
  console.log('res time', ...getCurrentTimestamps())
})

@dyladan
Copy link
Member Author

dyladan commented Oct 31, 2022

however there probably could be nanosecond level rounding errors between calls

I think this is not a problem since the timing API is intentionally jittered anyway in order to prevent timing attacks

@t2t2
Copy link
Contributor

t2t2 commented Oct 31, 2022

image

Code
const start = performance.now()
const microtasker = Promise.resolve()
function getDiff() {
    microtasker.then(() => {
        console.log('diff', Date.now() - performance.now())
        if (performance.now() - start <= 1) {
            getDiff()
        }
    })
}
getDiff()

I can see there being an issue with one call saying add 49.4 and next 50.3

@dyladan
Copy link
Member Author

dyladan commented Nov 3, 2022

Proposed timestamp resolution flow for span durations:

// onStart and onEnd provided only for purposes of defining the terms used in the flow chart
// they do not indicate new methods or callbacks on the span

onStart() {
  perf_start = performance.now()
  perf_offset = Date.now() - this.perf_start
}

onEnd() {
  perf_duration = performance.now() - perf_start
}
flowchart TD
	id1(start_time provided by user?)
	id2(end_time provided by user?)
	id3(duration = provided_end_time - provided_start_time)
	id4(duration = Date.now - provided_start_time \n see note 1 below)
	id5(start_time = Date.now)
	id6(end_time provided by user?)
	id7(duration = perf_duration\nsee note 2 below)
	id8(provided_end_time - start_time > 0)
	id9(end_time = provided_end_time)
	id10(end_time = provided_end_time + perf_offset\nsee note 3 below)
	id11(start_time = provided_start_time)
	id1-- yes -->id11
	id11 --> id2
	id2-- yes -->id3
	id2-- no --> id4
	id1 -- no --> id5
	id5 --> id6
	id6 -- no --> id7
	id6 -- yes --> id8
	id8 -- yes --> id9
	id8 -- no --> id10
Loading

Note 1: Since the user has provided a the start time but not the end time, we assume the operation was started somewhere out of process and we are just ending it. Date.now is used because it is likely to be the closest to the correct time and high-resolution times don't make sense in this context

Note 2: Neither start time or end time is provided. The duration uses the monotonic clock to guard against NTP shifts, but the start time is calculated from Date.now to guard against performance clock drift

Note 3: this time most likely comes from the browser performance timing API which has been shifted.

@johnbley
Copy link
Member

johnbley commented Nov 6, 2022

I mostly agree with @dyladan's proposal diagram above, with a few exceptions, both on the "we provide start time, user provides end time" path:

  • Note 3 (the provided_end_time < start_time implicitly assuming that the provided time came from performance and adjusting): we don't have any current code paths that do that in our own instrumentation (all performance-based code that I found used performance timestamps for both start and end, and will need to properly adjust to have a "reality-based" start time). So, this would only apply to instrumentation developed outside the opentelemetry organization. Imagine I'm an independent developer who mistakenly passes in some random other timestamp as my endTime (perhaps I've done some math wrong, or gotten two timestamps mixed up). Now I'm very surprised to find that what I passed in is not what shows up in my data stream. It's hard enough to debug my first mistake when I'm unaware of this data change. I think this behavior would be a mistake on our part. As an aside, a debug log warning of endTime < startTime might help.
  • If I can't convince us to drop the whole clause, then at least I would allow zero-width spans, so provided_end_time - start_time >= 0 (rather than >) to use the provided end time unaltered.

@dyladan
Copy link
Member Author

dyladan commented Nov 8, 2022

  • Note 3 (the provided_end_time < start_time implicitly assuming that the provided time came from performance and adjusting): we don't have any current code paths that do that in our own instrumentation

What do you mean by "our" instrumentation? The OTel contrib instrumentation does this for xhr, fetch, and user interactoin plugins

  • all performance-based code that I found used performance timestamps for both start and end, and will need to properly adjust to have a "reality-based" start time

Sorry if it wasn't clear but that's what the perf offset is. In that case, the start time comes from Date and the end time is shifted to be consistent with it

  • Imagine I'm an independent developer who mistakenly passes in some random other timestamp as my endTime (perhaps I've done some math wrong, or gotten two timestamps mixed up). Now I'm very surprised to find that what I passed in is not what shows up in my data stream. It's hard enough to debug my first mistake when I'm unaware of this data change. I think this behavior would be a mistake on our part. As an aside, a debug log warning of endTime < startTime might help.

This would only happen if you provided an end time which is earlier than the start time, which is very unlikely if you use any clock other than the performance clock. If you are describing some outside operation where you read timestamps from logs you would most likely provide a start AND end timestamp and not trigger this path.

  • If I can't convince us to drop the whole clause, then at least I would allow zero-width spans, so provided_end_time - start_time >= 0 (rather than >) to use the provided end time unaltered.

Seems reasonable. Exactly 0 seems almost impossible though if the start time is automatically generated and the end time is provided by the user.

@dyladan
Copy link
Member Author

dyladan commented Nov 14, 2022

@johnbley did that satisfy your concerns?

@dyladan
Copy link
Member Author

dyladan commented Nov 17, 2022

As an aside, a debug log warning of endTime < startTime might help.

'Inconsistent start and end time, startTime > endTime. Setting span duration to 0ms.',

@johnbley
Copy link
Member

@johnbley did that satisfy your concerns?

I think it's better than we had before, though I would still prefer simpler. For automatic timing I think it looks good.

@dyladan
Copy link
Member Author

dyladan commented Nov 18, 2022

What about this? It's quite a bit simpler IMO. Basically we assume any timestamp < performance.now() to be a performance entry and apply the offset. In all other cases the user provided a timestamp we use it directly. If user does not provide a timestamp we use Date.now to determine spam start time and performance timer to calculate duration (and derive end time from that).

const origin = getTimeOrigin();

start(time_input: TimeInput) {
  const now = Date.now();
  this.perf_start = performance.now()
  this.perf_offset = now - this.perf_start
  this.start_time_input = time_input;

  if (typeof time_input === 'number' && time_input < performance.now()) {
    // must be a performance entry. apply correction
    this.start_time = msToHrTime(time_input + origin + this.perf_offset);
  } else if (time_input != null) {
    // do not apply correction in any other cases
    this.start_time = convertToHrTime(time_input);
  } else {
    this.start_time = convertToHrTime(Date.now());
  }
}

end(time_input: TimeInput) {
  if (time_input != null) {
    if (typeof time_input === 'number' && time_input < performance.now()) {
      // must be a performance entry. apply correction
      this.end_time = msToHrTime(time_input + origin + this.perf_offset);
    } else {
      // do not apply correction in any other cases
      this.end_time = convertToHrTime(time_input);
    }
    this.duration = hrTimeDuration(this.start_time, this.end_time);
  } else {
    if (this.start_time_input != null) {
      // user provided a start time input but not an end time input
      // best we can do is use system time because we don't know if the span started when it was constructed or before
      this.end_time = msToHrTime(Date.now());
      this.duration = hrTimeDuration(this.start_time, this.end_time);
    } else {
      // user did not provide start or end time
      // calculate end time from start time using performance timer
      this.duration = hrTimeDuration(performance.now() - this.perf_start);
      this.end_time = addHrTimes(this.start_time, this.duration)
    }
  }

  // it is possible to have a negative duration if:
  //    user provides start time in the future and does not provide an end time
  //    user does not provide start time and provides an end time in the past (but not small enough to be considered a performance entry)
  //    user provides both start and end time and they are inconsistent
  if (hrTimeIsNegative(this.duration)) {
    this.duration = [0,0];
    this.end_time = this.start_time;
  }
}

CC @t2t2 @legendecas @Flarna this is slightly different from the PR I opened PTAL and let me know which you think is better

IMPORTANT: breaking change?

Should this be considered a breaking change? The current behavior if the user provides a timestamp before the process start (< timeOrigin) it is assumed to be a performance time and the origin is added to it potentially resulting in a far future time. I think this can be considered a bug since the resulting time doesn't make any sense anyway but it is a behavior change.

My perspective is that this should be considered non-breaking

@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Jan 23, 2023
@legendecas
Copy link
Member

The anchored clock is not used in the span anymore. Superseded by #3434.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants