
sql: fix inflated "overhead" in statement timings #65480

Merged
merged 1 commit into cockroachdb:master from time-overhead on May 21, 2021

Conversation

yuzefovich
Member

@yuzefovich yuzefovich commented May 19, 2021

This is 6371d37 plus a fix for the case when the query is actually a
portal, in which case the parsing end time is left unset.

This commit also incorporates the lessons of 056d5c5 so that observer
statements work well. This is done by splitting the getter method in two,
since we want two ways of retrieving the query service latency.

Fixes #40675

Release note (ui change): In some cases, the Execution Stats page
would show a confusingly high Overhead latency for a statement. This
could happen due to multiple statements being parsed together or due to
statement execution being retried. To avoid this, we stop considering
the time between when parsing ends and execution begins when determining
service latency.

@yuzefovich yuzefovich requested review from maryliag, arulajmani and a team May 19, 2021 20:48

@yuzefovich
Member Author

@maryliag I think the issue is solely with the portals - there we set the sessionStartParse and sessionEndParse phases to time.Time{} (zero), but the sessionQueryReceived phase gets the actual time. I think this might be the only thing that was missing from Solon's original PR.
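
A tiny Go snippet showing what that looks like in practice (the variable names just echo the phase names above; this is not the actual pkg/sql code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// For a portal, the parse phases stay as zero values while the
	// query-received timestamp is real.
	var sessionStartParse, sessionEndParse time.Time // time.Time{} == zero
	sessionQueryReceived := time.Now()

	// Any latency computed as "parse end minus query received" without a
	// zero check produces a huge negative duration (Sub saturates at the
	// minimum time.Duration, roughly -292 years):
	fmt.Println(sessionEndParse.Sub(sessionQueryReceived))
	fmt.Println(sessionStartParse.IsZero(), sessionEndParse.IsZero()) // true true
}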

Collaborator

@arulajmani arulajmani left a comment


I'm not sure this will play well with CLI timings, specifically the SHOW LAST QUERY STATISTICS thing that powers them. For some context, the CLI times the query on the client and issues a SHOW LAST QUERY STATISTICS query to get the server side timings. Of particular interest is the service latency, which it then uses to calculate the "execution" and "network" portions of the timings displayed.

As this change removes the time spent retrying a statement from the service latency, the CLI will now (incorrectly) attribute that to the network latency.
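
A back-of-the-envelope Go sketch of that split (the two-way breakdown and the numbers are illustrative, not the actual cockroach CLI logic):

package main

import (
	"fmt"
	"time"
)

// cliBreakdown is a rough sketch of the split described above: the CLI
// measures total wall-clock time on the client and treats everything the
// server did not claim as service latency as "network".
func cliBreakdown(clientTotal, serviceLatency time.Duration) (execution, network time.Duration) {
	execution = serviceLatency
	network = clientTotal - serviceLatency
	return execution, network
}

func main() {
	clientTotal := 600 * time.Millisecond

	// With retry time included in the reported service latency:
	exec, net := cliBreakdown(clientTotal, 550*time.Millisecond)
	fmt.Println("retries counted: execution =", exec, "network =", net)

	// If 400ms of retries are dropped from the service latency, that same
	// 400ms now shows up as (bogus) network time:
	exec, net = cliBreakdown(clientTotal, 150*time.Millisecond)
	fmt.Println("retries excluded: execution =", exec, "network =", net)
}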


Contributor

@maryliag maryliag left a comment




pkg/sql/executor_statement_metrics.go, line 80 at r1 (raw file):

	// are submitted together, e.g. "SELECT 1; SELECT 2".
	//
	// If we're executing a portal, parsing end time will be zero, so

nit: can you add ... portal, both parsing start and end time will be zero, ...

Contributor

@maryliag maryliag left a comment


What if we keep this function as it is and create another one to calculate the retry time? Then, on the CLI, those two values can be added back together.


@yuzefovich
Member Author

I think we just need to have two different implementations: one for the metrics which excludes the retrying time and has sessionQueryServiced unset; another for the last query stats which includes everything and has sessionQueryServiced phase set.
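
Roughly, the split could look like the following Go sketch; the type and method names are hypothetical placeholders rather than the identifiers used in the actual change:

package main

import (
	"fmt"
	"time"
)

// sessionPhaseTimes and its two getters below are placeholders that mirror
// the split described above, not the real pkg/sql identifiers.
type sessionPhaseTimes struct {
	queryReceived time.Time // when the statement arrived
	parseEnd      time.Time // zero for portals
	execStart     time.Time
	execEnd       time.Time
	queryServiced time.Time // only set once the result has been sent back
}

// serviceLatencyForMetrics feeds the metrics/UI: it skips whatever time falls
// between the end of parsing and the start of the recorded execution (where
// retried attempts would otherwise land) and does not need queryServiced.
func (p sessionPhaseTimes) serviceLatencyForMetrics() time.Duration {
	parseEnd := p.parseEnd
	if parseEnd.IsZero() { // portal: no separate parse phase
		parseEnd = p.execStart
	}
	return parseEnd.Sub(p.queryReceived) + p.execEnd.Sub(p.execStart)
}

// serviceLatencyTotal feeds SHOW LAST QUERY STATISTICS: everything from
// receipt to completion, retries included, which is what the CLI needs in
// order to subtract out the network portion correctly.
func (p sessionPhaseTimes) serviceLatencyTotal() time.Duration {
	return p.queryServiced.Sub(p.queryReceived)
}

func main() {
	now := time.Now()
	p := sessionPhaseTimes{
		queryReceived: now,
		parseEnd:      now.Add(2 * time.Millisecond),
		execStart:     now.Add(300 * time.Millisecond), // after retries of earlier attempts
		execEnd:       now.Add(350 * time.Millisecond),
		queryServiced: now.Add(351 * time.Millisecond),
	}
	fmt.Println("metrics view:", p.serviceLatencyForMetrics()) // 52ms
	fmt.Println("total view:  ", p.serviceLatencyTotal())      // 351ms
}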

Member Author

@yuzefovich yuzefovich left a comment


Refactored, PTAL.



pkg/sql/executor_statement_metrics.go, line 80 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

nit: can you add ... portal, both parsing start and end time will be zero, ...

Done.

Collaborator

@arulajmani arulajmani left a comment


The change, in terms of correctness, looks good to me now.

@yuzefovich, how do you feel about the approach 6371d37 took to solving this overhead bucket problem? I had some initial reservations on the original PR, and thinking about it a bit more today, I feel more strongly.

I think it's slightly dubious for us to pretend time spent retrying a statement doesn't exist. Instead, we should be increasing observability into time spent retrying a statement and recording it as a metric. Then, Overhead = Service - (Exec + Plan + Parse + Retry) becomes more palatable.
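
With made-up numbers, the difference a dedicated Retry bucket makes to that formula looks something like:

package main

import (
	"fmt"
	"time"
)

// Illustrative numbers only: once retries get their own bucket, the leftover
// "overhead" shrinks back to genuinely unaccounted-for time instead of
// absorbing every failed attempt.
func main() {
	service := 500 * time.Millisecond
	parse := 5 * time.Millisecond
	plan := 20 * time.Millisecond
	exec := 50 * time.Millisecond
	retry := 400 * time.Millisecond

	fmt.Println("overhead, no retry bucket:  ", service-(parse+plan+exec))       // 425ms
	fmt.Println("overhead, with retry bucket:", service-(parse+plan+exec+retry)) // 25ms
}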


@yuzefovich
Member Author

I think I'm in "Solon camp" on this issue. The way I see the situation is as follows.

Currently, quite a few things are broken about how we track the latency of a single statement:

  • multiple statements parsed at once are really messed up
  • if a statement is retried, then all unsuccessful attempts at the stmt's execution will end up in the "overhead" bucket
  • that overhead bucket might grow sufficiently large to cause lots of confusion for our users
  • (possibly something else I missed).

To me, the third bullet above, in isolation, seems to have the highest priority as well as the easiest fix (which is what this PR delivers). I also think that this PR doesn't make bullets 1 and 2 worse, since it is only a "definition" change for the UI. True, the statements page will now show a reduced service latency if the stmt was retried; however, it will also contain a red number of retries, which makes it kinda acceptable in my mind overall.

I think your hesitation comes from the desire to address the second bullet, but IMO it should be a separate work item, and I imagine it'll be more involved than this PR.

@yuzefovich
Member Author

(Simply fixed the commit message.)

Collaborator

@arulajmani arulajmani left a comment


You're partly right that that's where my hesitation comes from. The other part is this comment from @knz on the issue:

Correct (I wrote that code). It ensures that the sum of latencies balances out. It also attracts our
attention to measurement bugs if the overhead ever gets too large compared to other things.

Specifically, the part about attracting our attention to measurement bugs if the overhead bucket gets too large. That mechanism seems to have done its job; we just found one. But by not fixing the root cause and applying this bandaid instead, we're making it harder to discover these issues in the future.

I agree it's a separate work item, and very likely will be more involved. I'm not asking you to make the change, I'm just asking you if it's worth not merging this one in favour of something that actually fixes the problem. Does my "harder to discover these issues" argument convince you?


@yuzefovich
Member Author

I see; you and Raphael are making a good point. I'll leave it up to the @cockroachdb/sql-observability team to decide whether it is worth merging this bandaid.

My feeling is that it is still worth proceeding (and probably backporting to the 21.1 branch) because, in the end, this PR should make our users' lives easier. Think of it this way: we have a tool to find bugs, and it has already found one; the question is whether it is worth disabling the tool now to hide the bug from the user, at the cost of not finding other bugs. I think it is, and a proper fix for the bug should be evaluated against other items on the SQL Observability roadmap.

Collaborator

@arulajmani arulajmani left a comment


My feeling on this is that we should improve our docs about what the overhead bucket entails until we have a true fix, but I see your point as well. I don't want to hold this up, so if the Observability team doesn't share my point of view here, feel free to merge without me.


@knz
Contributor

knz commented May 20, 2021

I like the idea of documenting that "overhead" includes stmt retries.

We could even possibly account for retries in a separate budget: at every stmt retry, deduct the time spent in the previous retry iteration from the current overhead sum, and add it to the new/separate retry budget. Hopefully this way we get a better breakdown.
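
A rough Go sketch of that bookkeeping, with a hypothetical pair of running buckets (not actual CockroachDB code):

package main

import (
	"fmt"
	"time"
)

// budget sketches the idea above: whenever a statement restarts, the previous
// attempt's duration is moved out of the overhead bucket and into a dedicated
// retry bucket. Names and structure are illustrative only.
type budget struct {
	overhead time.Duration
	retry    time.Duration
}

// onRetry moves the just-failed attempt's elapsed time from overhead to retry.
func (b *budget) onRetry(previousAttempt time.Duration) {
	b.overhead -= previousAttempt
	b.retry += previousAttempt
}

func main() {
	// Suppose the raw overhead bucket currently holds 425ms, of which two
	// failed attempts account for 150ms and 250ms.
	b := budget{overhead: 425 * time.Millisecond}
	b.onRetry(150 * time.Millisecond)
	b.onRetry(250 * time.Millisecond)
	fmt.Println("overhead:", b.overhead, "retry:", b.retry) // 25ms and 400ms
}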

Contributor

@maryliag maryliag left a comment


Are we able to differentiate the retry time just for the specific statement? Part of the issue was that the latency was including retry times not only for the current statement but for all the other statements in the same batch, and that is something we don't want.
If we can separate it into "this is the latency without any overhead, and this is the retry time", I like that idea.
Also, should we be calling this "retry time", given that this extra time could be more than just retry time? Do we have a way to get more details on this "extra" time?


@knz
Contributor

knz commented May 20, 2021

I think an investigation is needed to answer these questions.

Collaborator

@arulajmani arulajmani left a comment


Are we able to differentiate the retry time just for the specific statement? Part of the issue was that the latency was including retry times not only for the current statement but for all the other statements in the same batch, and that is something we don't want.

I'm not entirely convinced that this isn't something we want. Currently, we don't handle statement batches well, as described in #44803. In particular, multi-statement batches are treated as a single transaction in Postgres. CRDB treats them as individual implicit transactions. Hopefully we'll fix that, but considering this thing is already broken, I don't think we should be designing to cater to a bug.

So let's constrain ourselves to explicit transactions containing multiple statements. In the original patch, Solon gave a contrived example that looks something like:

BEGIN; 
INSERT INTO t VALUES (1);
SELECT crdb_internal.force_retry('5s'); 
COMMIT;

The implication is that attributing retry latency to the first statement is confusing. I don't think I fully agree with this. We already increment the number of retries for the first statement today. I don't think it's confusing to attribute some latency in the retry bucket for the first statement in this scenario -- it did spend some time retrying.


Contributor

@maryliag maryliag left a comment


The cases that I have in my mind are:

  1. We have several statements in a transaction, but no retries: the time of statement 1 is counted toward the overhead of statement 2. I don't think we should be doing that, because I do believe it can be confusing for the user (if they look directly at the statement details page, they won't necessarily notice it is part of a transaction, so it just looks weird that a single statement took so long). We could have a total statement time and overhead per transaction; that is something I can get behind and we should do it (it won't be as simple as adding up the overhead of all statements, because we would be double counting, so it's a slightly more complex item).
  2. We have several statements and some have retries: the retry time of one statement is being counted against all the others, so, similar to case 1, I think we shouldn't be doing that.
  3. We have several statements inline, but not in the same transaction (e.g. > select pg_sleep(2); select 1;): currently the wait time of statement 1 is also being counted against statement 2, and that is just wrong.

This PR addresses 1 and 3, and part of 2. The disadvantage is that for item 2 it could hide the retry time of the statement itself, but as Yahor mentioned, we do have the retry count metric (in red, so users won't miss it).
I think the "loss of details" on item 2 is worth it if we fix 1 and 3, since those cases are currently confusing users.
And we can create an issue to handle item 2, which doesn't seem like a simple task.

@jordanlewis thoughts?


@jordanlewis
Member

Will this patch reintroduce these bugs?

#45011
#54087

If not, I think this approach is superior to doing nothing and documenting that "overhead" is weird and could contain all sorts of things. We need to take a user-centric view of the world for things like this that are fundamentally output for users.

If being user-centric in this case comes at the cost of precision of the different timings, so be it. The behavior we have today is just too confusing and inconsistent.

@yuzefovich
Member Author

Will this patch reintroduce these bugs? #45011 #54087

No, I believe it will not. I was able to reproduce the negative values for latencies mentioned in those issues and tracked them down to portals not having the parse end time set. This PR accounts for that scenario.

@kevin-v-ngo

Looking at the 3 reported incidents, it also comes down to being consistent across the tools used to measure latency. It seems like customers (and our support team) are using the trace as the source of truth here and then seeing that the DB Console is far from that value. I know it may not be an apples-to-apples comparison between the trace and the DB Console, since the console aggregates multiple statement executions, but this does seem like an improvement that will help us get closer to consistent times, since we'd be removing other statement executions and other statements' retry times that are also not in the trace.

I'll create a separate GitHub issue where we can surface statement-level retry time (and the count in the same view). The 'Other Execution Statistics' section of the console might be a good place for it.

Contributor

@maryliag maryliag left a comment


Per the messages above, I believe you can go ahead with the merge.

:lgtm:


@yuzefovich
Member Author

Thanks for the review and the discussion!

bors r+

@craig
Contributor

craig bot commented May 21, 2021

Build succeeded:

@craig craig bot merged commit 6469dab into cockroachdb:master May 21, 2021
@yuzefovich yuzefovich deleted the time-overhead branch May 21, 2021 17:15