Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix inflated "overhead" in statement timings #53846

Merged
merged 1 commit into from
Sep 3, 2020

Conversation

solongordon
Copy link
Contributor

Fixes #40675
Fixes #50108

Release justification: Low-risk, high-reward fix to existing
functionality.

Release note (admin ui change): In some cases, the Execution Stats page
would show a confusingly high Overhead latency for a statement. This
could happen due to multiple statements being parsed together or due to
statement execution being retried. To avoid this, we stop considering
the time between when parsing ends and execution begins when determining
service latency.

Fixes cockroachdb#40675
Fixes cockroachdb#50108

Release justification: Low-risk, high-reward fix to existing
functionality.

Release note (admin ui change): In some cases, the Execution Stats page
would show a confusingly high Overhead latency for a statement. This
could happen due to multiple statements being parsed together or due to
statement execution being retried. To avoid this, we stop considering
the time between when parsing ends and execution begins when determining
service latency.
@solongordon solongordon requested review from arulajmani and a team September 2, 2020 18:37
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely in favour of changing the service latency definition to remove time spent retrying a statement -- often this can be big, and it seems wrong to swallow it here. Additionally, this change would make the overhead bucket always be zero (as overhead bucket = service latency - (parse latency + plan latency + execution latency) ), which also seems wrong.

Considering the end goal is to remove confusion around this overhead bucket, I propose:

  • Let's keep the service latency as is.
  • Let's start recording the retry latency for statements as well (which is the same as what we record for the implicit transaction comprising the statement).
  • Change the overhead bucket calculation to be service latency - (parse latency + plan latency + exec latency + retry latency

The only thing this doesn't consider is the case where there are multiple statements on the same line -- but we don't handle that properly anyway. Once we consider them to be part of the same transaction, the service timings of A;B;C would be no different from BEGIN;A;B;C;COMMIT, so I think it's fine to have slightly weird service timings for multi-line statements in the meantime.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)

@arulajmani
Copy link
Collaborator

I was thinking about this a bit more, and this might be too late for this release, but I wonder if it makes sense to even track service latency at statement level. Service latency and overhead bucket should instead be tracked on a per-transaction basis. Instead of having a statements service latency, we should track the statement's processing latency (parseLat + planLat + execLat).

Copy link
Contributor Author

@solongordon solongordon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think tracking retry latency really makes sense at the statement level, since we only retry at the transaction level. For example consider this (contrived) example:

BEGIN;
SELECT 1;
SELECT crdb_internal.force_retry('5s');
END;

Both statements are going to get retried for 5 seconds. But it's confusing to tell users that SELECT 1 has a high retry latency since it never actually fails.

Ultimately I don't think we should be reporting retries at the statement level. It would make more sense to report how many times the statement fails with a retryable error.

I also think it's fine to make Overhead near-zero. (It's not quite zero after this change but I think it should be close.) I'm pretty sure the original point of the Overhead bucket was just to alert us if there was latency which was not being caught by the other buckets. The intention is for it to be very small.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Contributor Author

@solongordon solongordon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with your second comment and I think that's basically what this commit does. :) I think the remaining "overhead" would be the time between when planning ends and execution begins.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, ideally I would've liked service latency be called something different, so as to not confuse from the notion of servicing transactions. I'm not sure if that's a realistic ask, so if it isn't, this change is fine by me!

Also, do we need an overhead bucket for transaction metrics?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@solongordon
Copy link
Contributor Author

Yeah I'm definitely open to renaming. Any suggestions? I think it would be pretty easy but maybe I'll open a separate PR for it if it touches a lot of files.

I think adding a concept of overhead to transactions would be more confusing than helpful right now.

@arulajmani
Copy link
Collaborator

If we do decide to rename, it'll require changes to crdb_internal.node_statement_statistics -- wasn't sure that was something we could do or not. As for suggestions, maybe something like runLatency? I think somewhere in the code we call this processingLatency as well, but I prefer runLatency more. Either way, +1 on the separate PR! :)

@solongordon
Copy link
Contributor Author

OK, merging this and let's revisit the name change. TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 3, 2020

Build succeeded:

@craig craig bot merged commit fcb30c7 into cockroachdb:master Sep 3, 2020
@solongordon solongordon deleted the overhead-fix branch September 14, 2020 12:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: internal retries cause wacky statement timings sql: statement timings in transactions are suspicious
3 participants