-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clock Skew Adjuster considered harmful #1459
Comments
+1
As someone looking for the root cause of a problem, how would you like this info to be presented to you?
From what I remember, this is particularly useful for mobile clients, where the clock might be really off. Perhaps clients where the clock might be off should be expected to send a hint within the span? |
In Java for example you would have all kinds of problems. So i.e. having a span in red would be fine. This should be rare, which means that I wouldn't probably care about nanosecond differences, especially spans that are closed slightly too late. It's like having log entries (which usually have timestamps) and you see that B is happening before A, although impossible. Pointing this out is usually sufficient.
Yes, that would be “opt-in”. One could also add a “ping collector” command, which could give an idea about the clock difference and latency, assuming there is no jitter or other factors. I'm unsure if this is really worth the effort. Still auto-fixing might be strange too, since you'll always run into the “B before A” case. And the algorithm does not seem to be very successful on mobile clients, see issue #722. |
After jaegertracing/jaeger-operator#871, we now have a process tag So it might be beneficial to this improvement direction to make it possible to configure the tag used to identify "same host as parent". |
We keep getting comments like “I spent a few hours trying to figure out ...”. The point of this issue is that the default behaviour is unexpected, confusing and harmful. If anyone wants adjustments, fine. But this should be opt-in and off by default. “Better” unexpected and a little less confusing adjustments are still harmful… |
It seems like a controversial feature. We should consider making it configurable. Would it also help people if we were labeling the spans which are candidates for clock adjusting? |
I was definitely one of those "I spent a few hours" people. In my case, I did have Clock Skew, but lack of an I'd very-much like to see it turn off-able, but I don't feel strongly that it should be burned to the ground. And if it's not removed entirely, making it more useful for environments where And probably lower-hanging fruit in terms of getting a PR up and delivered for disabling the feature entirely, or changing it to off-by-default? Or maybe not, I haven't really looked at the Jaeger codebase. I also have a use-case for the Clock Skew Adjustor, since I have a down-the-track need to collect spans from web-based javascript pages, and even other random PC users. And experience suggests "random PC users" are awful at keeping their PC clocks accurate. |
Again, I do not vote for “burning to the ground”. It might help people, and thanks to the guys developing and improving it. The point i'm trying to make is that a (changing) heuristic how to improve faulty data can be helpful, but should be off by default and opt-in, nothing more. You have some machines with broken clocks which you can't fix - fine. The problem is that you don't know how broken they are. Instead of a simple skew it might be jumping - this happens. So, for people who are sure it is a simple skew, but are not able to fix that - turn it on, it might help. For everyone else: I might hide your problem or damage otherwise perfectly correct data. And this is what I'm seeing in most of the cases. Also, as mentioned above, it destroys data when it's just a programming error. the heuristic could be improved, but this experience should tell you that using it by default is not a very good idea - it is hard enough to understand a system as is, but this adjustment makes it even harder. |
I also want to add that we ran into this problem, even though all the services were running on the same host (in containers). We initially had the IP field empty and it was adjusted, but using the IP from inside the container also would not have helped. As a workaround we have for now hardcoded spans to all report the same IP. It seems that in our case the clock skew adjustment happened because:
|
How about the following?
This way it's a no-op for good spans, and bad ones get the choice, which either points at the issue in their tracing or at clock skew. |
I added this issue as a topic for the next bi-weekly. Perhaps we can move forward while discussing this with everyone in the same "room": https://docs.google.com/document/d/1ZuBAwTJvQN7xkWVvEFXj5WU9_JmS5TPiNbxCJSvPqX0/edit#heading=h.u5x27f1oevg9 |
We're currently limited by front end. A simple temporary solution, for people who are particularly annoyed by the adjustment, is to add a cli flag to permanently disable adjustment, until someone contributes a respective UI change to toggle it on and off. |
Given that this feature breaks most people's expectation, shouldn't it be the other way around, even though it means a change in the current behavior? |
Might be ok too. I think if we go with such change, we need to still run the adjustment logic but only record a warning saying "had the adjustment been enabled, it would've moved the timestamps by (Delta)". |
Upon more thought, we may want three-valued flag: |
Just in case it's not apparent to someone searching (like me earlier), PR #2119 added the ability to set --query.max-clock-skew-adjustment=0s which has the effect of disabling adjustments. |
I think we should really consider setting the default value for the clock-skew to 0s for Jaeger 1.21.x or 2.x. It's simply wrong with all async scenarios. Example, which looks very similar to the plentiful of other examples we've seen related to this issue: Clocks out of sync are a real problem, but I'm afraid we have more async spans out there than server clocks out of sync. |
fine with me, we just need to mark is as a breaking change in the log |
Edit: duplicate of #961.
Requirement - what kind of business use case are you trying to solve?
The current behavior of clock skew adjustments is surprising and, why well-meant, not beneficial.
Problem - what in Jaeger blocks you from solving the requirement?
Excuse the pun, I'm aware of the existing issues.
The clock skew adjuster currently makes the following assumptions:
Considering 1.:
The algorithm uses the not well documented
ip
tag of the process to distinguish between hosts and assumes a missing tag indicates a different host.People using OpenTracing or OpenCensus usually only know about the service name, so a missing
ip
tag is expected, meaning the algorithm compensates nearly all spans. SeeThen it compares durations, which makes only sense one assumes that spans roughly have the same size.
Most child spans would be shorter, so this is rarely the case. If a child span is a few nanoseconds longer I don't care, if it is really longer it should be flagged, but not shifted. It's wrong data and really hard to diagnose if it is processed afterwards.
Then the algorithm checks if the end time of the child span is after the end time of the parent. While technically wrong, it is pretty easy to close the child span a couple of nanoseconds too late - probably a non-issue considering tracing, and at most something that should be a warning instead of moving the span around.
Lastly it assumes that span duration somehow correlates to network latency.
Proposal - what do you suggest to solve the problem or improve the existing situation?
If clock skew is an issue, the span context should try to compensate.
Considering 2. and 3. it would IMHO be best to not mess with the data by default and offer beautification as an opt-in workaround, which is different than the currently solutions by turning it on by default and searching for a way to get rid of it.
Also, it might be useful to see clock skew by default, because it may affect other areas too.
Any open questions to address
I saw very few comments where this skew adjustent is actually wanted. Maybe some of the supporters can chime in?
The text was updated successfully, but these errors were encountered: