-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: fix inflated "overhead" in statement timings #53846
Conversation
Fixes cockroachdb#40675 Fixes cockroachdb#50108 Release justification: Low-risk, high-reward fix to existing functionality. Release note (admin ui change): In some cases, the Execution Stats page would show a confusingly high Overhead latency for a statement. This could happen due to multiple statements being parsed together or due to statement execution being retried. To avoid this, we stop considering the time between when parsing ends and execution begins when determining service latency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not completely in favour of changing the service latency definition to remove time spent retrying a statement -- often this can be big, and it seems wrong to swallow it here. Additionally, this change would make the overhead bucket always be zero (as overhead bucket = service latency - (parse latency + plan latency + execution latency) ), which also seems wrong.
Considering the end goal is to remove confusion around this overhead bucket, I propose:
- Let's keep the service latency as is.
- Let's start recording the retry latency for statements as well (which is the same as what we record for the implicit transaction comprising the statement).
- Change the overhead bucket calculation to be
service latency - (parse latency + plan latency + exec latency + retry latency
The only thing this doesn't consider is the case where there are multiple statements on the same line -- but we don't handle that properly anyway. Once we consider them to be part of the same transaction, the service timings of A;B;C
would be no different from BEGIN;A;B;C;COMMIT
, so I think it's fine to have slightly weird service timings for multi-line statements in the meantime.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
I was thinking about this a bit more, and this might be too late for this release, but I wonder if it makes sense to even track service latency at statement level. Service latency and overhead bucket should instead be tracked on a per-transaction basis. Instead of having a statements service latency, we should track the statement's processing latency ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think tracking retry latency really makes sense at the statement level, since we only retry at the transaction level. For example consider this (contrived) example:
BEGIN;
SELECT 1;
SELECT crdb_internal.force_retry('5s');
END;
Both statements are going to get retried for 5 seconds. But it's confusing to tell users that SELECT 1
has a high retry latency since it never actually fails.
Ultimately I don't think we should be reporting retries at the statement level. It would make more sense to report how many times the statement fails with a retryable error.
I also think it's fine to make Overhead near-zero. (It's not quite zero after this change but I think it should be close.) I'm pretty sure the original point of the Overhead bucket was just to alert us if there was latency which was not being caught by the other buckets. The intention is for it to be very small.
Reviewable status: complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with your second comment and I think that's basically what this commit does. :) I think the remaining "overhead" would be the time between when planning ends and execution begins.
Reviewable status: complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, ideally I would've liked service latency be called something different, so as to not confuse from the notion of servicing transactions. I'm not sure if that's a realistic ask, so if it isn't, this change is fine by me!
Also, do we need an overhead bucket for transaction metrics?
Reviewable status: complete! 0 of 0 LGTMs obtained
Yeah I'm definitely open to renaming. Any suggestions? I think it would be pretty easy but maybe I'll open a separate PR for it if it touches a lot of files. I think adding a concept of overhead to transactions would be more confusing than helpful right now. |
If we do decide to rename, it'll require changes to |
OK, merging this and let's revisit the name change. TFTR! bors r+ |
Build succeeded: |
Fixes #40675
Fixes #50108
Release justification: Low-risk, high-reward fix to existing
functionality.
Release note (admin ui change): In some cases, the Execution Stats page
would show a confusingly high Overhead latency for a statement. This
could happen due to multiple statements being parsed together or due to
statement execution being retried. To avoid this, we stop considering
the time between when parsing ends and execution begins when determining
service latency.