sql: fix inflated "overhead" in statement timings #65480
@maryliag I think the issue is solely with the portals - there we set …
I'm not sure this will play well with CLI timings, specifically the SHOW LAST QUERY STATISTICS thing that powers them. For some context, the CLI times the query on the client and issues a SHOW LAST QUERY STATISTICS query to get the server-side timings. Of particular interest is the service latency, which it then uses to calculate the "execution" and "network" portions of the timings displayed.
As this change removes the time spent retrying a statement from the service latency, the CLI will now (incorrectly) attribute that to the network latency.
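For illustration, here is a minimal sketch of the arithmetic described above; the names are hypothetical and this is not the actual CLI code, but it shows why moving retry time out of the service latency would surface as inflated network time:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Measured on the client: wall-clock time from sending the query to
	// receiving the result.
	clientElapsed := 120 * time.Millisecond

	// Reported by the server, e.g. via SHOW LAST QUERY STATISTICS.
	serviceLatency := 100 * time.Millisecond

	// Whatever the client observed beyond the server-reported service
	// latency is attributed to the network. If retry time is removed from
	// the service latency, it silently ends up in this bucket instead.
	network := clientElapsed - serviceLatency
	fmt.Printf("execution: %v, network: %v\n", serviceLatency, network)
}
```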
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @maryliag)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/executor_statement_metrics.go, line 80 at r1 (raw file):
// are submitted together, e.g. "SELECT 1; SELECT 2".
//
// If we're executing a portal, parsing end time will be zero, so
nit: can you add ... portal, both parsing start and end time will be zero, ...
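For readers following along, a minimal sketch of the behavior this comment describes; the helper name and signature below are hypothetical, not the actual executor code:

```go
package timings

import "time"

// parseLatency sketches the guard discussed above: when executing a
// portal there is no parsing phase in this execution, so both parsing
// timestamps are zero and the parse latency must be reported as zero
// rather than computed from them.
func parseLatency(parseStart, parseEnd time.Time) time.Duration {
	if parseEnd.IsZero() {
		// Portal execution: the statement was parsed earlier (at PREPARE
		// time), so no parsing time belongs to this execution.
		return 0
	}
	return parseEnd.Sub(parseStart)
}
```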
What if we keep this function like this and create one to calculate the retry times? And on the CLI those two values can be added back.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
I think we just need to have two different implementations: one for the metrics which excludes the retrying time and has …
Refactored, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/executor_statement_metrics.go, line 80 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
nit: can you add
... portal, both parsing start and end time will be zero, ...
Done.
This is 6371d37 plus the fix for when the query is actually a portal, in which case the parsing end time is left unset. This commit also incorporates the learnings of 056d5c5 so that the observer statements work well. This is done by splitting up the getter method into two since we want to have two ways of retrieving the query service latency.
Release note (ui change): In some cases, the Execution Stats page would show a confusingly high Overhead latency for a statement. This could happen due to multiple statements being parsed together or due to statement execution being retried. To avoid this, we stop considering the time between when parsing ends and execution begins when determining service latency.
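As a rough illustration of the split the commit message describes - the type and method names below are invented, not the actual code in pkg/sql - one getter could exclude the gap between parsing and execution for the per-statement metrics, while the other keeps the full wall-clock value for observer statements:

```go
package timings

import "time"

// phaseTimes mirrors the idea (not the actual fields) of the session
// phase times tracked by the executor.
type phaseTimes struct {
	received, parseStart, parseEnd, execBegin, execEnd time.Time
}

// serviceLatencyForMetrics skips the window between the end of parsing
// and the beginning of execution, which is where retries and other
// statements in the same batch would otherwise be counted against this
// statement. For portals both parsing timestamps are zero, so the parse
// term contributes nothing.
func (p phaseTimes) serviceLatencyForMetrics() time.Duration {
	return p.parseEnd.Sub(p.parseStart) + p.execEnd.Sub(p.execBegin)
}

// serviceLatencyTotal keeps the full wall-clock time since the query was
// received; observer statements like SHOW LAST QUERY STATISTICS (and
// thus the CLI timings) would keep using this one.
func (p phaseTimes) serviceLatencyTotal() time.Duration {
	return p.execEnd.Sub(p.received)
}
```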
The change, in terms of correctness, looks good to me now.
@yuzefovich, how do you feel about the approach 6371d37 took to solving this overhead bucket problem? I had some initial reservations on the original PR, and thinking about it a bit more today, I feel more strongly.
I think it's slightly dubious for us to pretend time spent retrying a statement doesn't exist. Instead, we should be increasing observability into time spent retrying a statement and recording it as a metric. Then, Overhead = Service - (Exec + Plan + Parse + Retry) becomes more palatable.
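A tiny sketch of that accounting, with hypothetical names: once retry time is tracked explicitly, whatever remains in the residual bucket is genuinely unaccounted-for time rather than hidden retries.

```go
package timings

import "time"

// overhead computes the residual service time once a dedicated retry
// bucket exists, per the formula above.
func overhead(service, exec, plan, parse, retry time.Duration) time.Duration {
	return service - (exec + plan + parse + retry)
}
```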
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
I think I'm in "Solon camp" on this issue. The way I see the situation is as follows. Currently quite a few things are broken about how we track the latency of a single statement
To me, the third bullet from above in isolation seems to have the highest priority as well as the easiest fix (which is what this PR delivers). I also think that this PR doesn't make bullets 1 and 2 worse since it is only a "definition" change for the UI. True, now the statements page will show reduced service latency if the stmt was retried; however, it will contain a red number of retries, which makes it kinda acceptable in my mind overall. I think your hesitation comes from the desire to address the second bullet, but IMO it should be a separate work item, and I imagine it'll be more involved than this PR.
(Simply fixed the commit message.)
You're partly right in that's where my hesitation comes from. The other part is this comment from @knz on the issue:
Correct (I wrote that code). It ensures that the sum of latencies balances out. It also attracts our attention to measurement bugs if the overhead ever gets too large compared to other things.
Specifically the part about attracting our attention to measurement bugs if the overhead bucket gets too large. This thing seems to have done its job; we just found one. But by not fixing the root cause and applying this bandaid instead, we're just making it harder to discover these issues in the future.
I agree it's a separate work item, and very likely will be more involved. I'm not asking you to make the change; I'm just asking whether it's worth not merging this one in favour of something that actually fixes the problem. Does my "harder to discover these issues" argument convince you?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
I see, Raphael and you are making a good point. I'll leave it up to the @cockroachdb/sql-observability team to decide whether it is worth merging this bandaid. My feeling is that it is still worth proceeding forward (and probably backporting to the 21.1 branch) because in the end this PR should make our users' lives easier. Think of it this way: we have a tool to find bugs, and it has already found one. The question is - is it worth disabling the tool now in order to hide the bug from the user, at the cost of not finding other bugs? I think it is, and fixing the bug properly should be evaluated against other items on the SQL Observability roadmap.
My feeling on this is that we should improve our docs about what the overhead bucket entails until we have a true fix, but I see your point as well. I don't want to hold this up, so if the Observability team doesn't share my point of view here, feel free to merge without me.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
I like the idea of documenting that "overhead" includes stmt retries. We could even possibly account for retries in a separate budget: at every stmt retry, deduct the time spent in the previous retry iteration from the current overhead sum, and add it to the new/separate retry budget. Hopefully this way we get a better breakdown.
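Sketching that idea with invented names, the bookkeeping could look roughly like this:

```go
package timings

import "time"

// latencyBuckets sketches the accounting suggested above: whenever a
// statement restarts, the time of the attempt that just failed is moved
// out of the overhead bucket and into a dedicated retry bucket, keeping
// the overall breakdown balanced.
type latencyBuckets struct {
	overhead time.Duration
	retry    time.Duration
}

func (b *latencyBuckets) onStatementRetry(previousAttempt time.Duration) {
	b.overhead -= previousAttempt
	b.retry += previousAttempt
}
```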
Are we able to differentiate the retry time just for the specific statement? Part of the issue was that the latency was including retry times not only for the current statement, but for all other statements in the same batch, and that is something we don't want.
If we can separate it as "this is the latency without any overhead and this is the retry time", I like that idea.
Also, should we be calling this "retry time", since this extra time could be more than just retry time? Do we have a way to get more details on this "extra" time?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
I think an investigation is needed to answer these questions.
Are we able to differentiate the retry time just for the specific statement? Part of the issue was that the latency was including retry times not only for the current statement, but for all other statements in the same batch, and that is something we don't want.
I'm not entirely convinced that this isn't something we want. Currently, we don't handle statement batches well, as described in #44803. In particular, multi-statement batches are treated as a single transaction in Postgres. CRDB treats them as individual implicit transactions. Hopefully we'll fix that, but considering this thing is already broken, I don't think we should be designing to cater to a bug.
So let's constrain ourselves to explicit transactions containing multiple statements. In the original patch, Solon gave a contrived example that looks something like:
BEGIN;
INSERT INTO t VALUES (1);
SELECT crdb_internal.force_retry('5s');
COMMIT;
The implication is that attributing retry latency to the first statement is confusing. I don't think I fully agree with this. We already increment the number of retries for the first statement today. I don't think it's confusing to attribute some latency in the retry bucket for the first statement in this scenario -- it did spend some time retrying.
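To make the point concrete, here is a hedged sketch (hypothetical names, not existing code) of how a per-statement retry bucket could be charged when the whole transaction retries:

```go
package timings

import "time"

// stmtStats is a hypothetical per-statement record: when an explicit
// transaction retries, every statement in it runs again, so bumping the
// first statement's retry count (and, under the proposal discussed here,
// its retry latency) is consistent with what the retry counter already
// does today.
type stmtStats struct {
	retries      int
	retryLatency time.Duration
}

// recordTxnRetry charges the duration of the failed attempt to every
// statement that participated in it.
func recordTxnRetry(stmts []*stmtStats, failedAttempt time.Duration) {
	for _, s := range stmts {
		s.retries++
		s.retryLatency += failedAttempt
	}
}
```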
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
The cases that I have in mind are:
- We have several statements in a transaction, but no retries: the time of statement 1 is counted toward the overhead of statement 2. I don't think we should be doing that, because I do believe it can be confusing for the user (if they check directly on the statement details page, they won't necessarily notice it is part of a transaction, so it just looks weird that a single statement took so long). We could have a total statement time and overhead per transaction; that is something I can get behind and we should do it (it won't be as simple as adding the overhead of all statements because we would be double counting, so it's a little more complex item).
- We have several statements and some have retries: the retry time of one statement is being counted on all the others, so similar to case 1, I think we shouldn't be doing that.
- We have several statements inline, but not in the same transaction (e.g. "select pg_sleep(2); select 1;"): currently the wait time of statement 1 is also being counted on statement 2, and that is just wrong.
This PR addresses 1 and 3, and part of 2. The disadvantage is that for item 2 it could hide the retry time of the statement itself, but as Yahor mentioned, we do have the retry count metric (in red, so users won't miss it).
I think the "loss of detail" on item 2 is worth it if we fix 1 and 3, since currently those cases are confusing users.
And we can create an issue to handle item 2, which doesn't seem like a simple task.
@jordanlewis thoughts?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
Will this patch reintroduce these bugs? If not, I think that this approach is superior to doing nothing and documenting that "overhead" is weird and could contain all sorts of things. We need to be taking a user-centric view of the world for things like this that are fundamentally output for users. If being user-centric in this case comes at the cost of precision of the different timings, so be it. The behavior we have today is just too confusing and inconsistent.
Looking at the 3 reported incidents, it also comes down to being consistent across the tools used to measure latency. It seems like customers (and our support team) are using the trace as the source of truth here and then seeing that the DB Console is far from this value. I know it may not be an apples-to-apples comparison for the customer using the trace and the DB Console, since the console aggregates multiple statement executions, but this does seem like an improvement that will help us get closer to more consistent times, since we'd be removing other statement executions and other statements' retry times, which are also not in the trace. I'll create a separate GitHub issue where we can surface statement-level retry time (and count, in the same view). The 'Other Execution Statistics' portion of the console might be a good place for this.
Per the messages above, I believe you can go ahead with the merge.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)
Thanks for the review and the discussion!
bors r+
Build succeeded:
This is 6371d37 plus the fix for when
the query is actually a portal in which case the parsing end time is
left unset.
This commit also incorporates the learnings of 056d5c5
so that the observer statements work well. This is done by splitting up
the getter method into two since we want to have two ways of retrieving
the query service latency.
Fixes #40675
Release note (ui change): In some cases, the Execution Stats page
would show a confusingly high Overhead latency for a statement. This
could happen due to multiple statements being parsed together or due to
statement execution being retried. To avoid this, we stop considering
the time between when parsing ends and execution begins when determining
service latency.