sql: statement timings in transactions are suspicious #40675
@piyush-singh thanks for writing this up! It's missing one crucial measurement that I was trying to explain in person: you included the timings as reported by the SQL shell (which simply measures the time until the server finishes responding) for the implicit txn, but not for the explicit txn. Can you include those timings? In particular, the crucial measurement is whether the select within the transaction returns quickly or slowly, and whether the commit statement returns quickly or slowly. Thank you!
No problem - so when running the explicit transaction, I only get a single timing if I don't have results returned for each statement:
If I execute the statements one at a time, I see:
One thing I wonder about this second set of timings is whether entering the statements slowly by hand affects the timings compared with executing them rapidly and programmatically (without the delay of typing them out in between). Perhaps if replication can take place while I'm typing, it might affect the reported latencies?
What is "overhead" in the latency breakdown referring to?
Overhead is a catch-all bucket as I understand it. Essentially we subtract the parse, plan, and run latencies from the overall latency and the remainder is just stuck under the overhead label. |
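As described above, the bucket is computed by subtraction. A minimal sketch of that arithmetic (function and field names here are hypothetical, not CockroachDB's actual code; latencies in milliseconds):

```python
# Illustrative sketch of the "overhead" catch-all bucket: whatever part of the
# overall service latency is not attributed to parse, plan, or run.

def overhead_latency(service_lat_ms, parse_lat_ms, plan_lat_ms, run_lat_ms):
    """Return the unattributed remainder of the overall service latency."""
    return service_lat_ms - (parse_lat_ms + plan_lat_ms + run_lat_ms)

# 800ms total service latency, of which 5ms parse, 400ms plan, 20ms run:
# the remaining 375ms ends up under the "overhead" label.
print(overhead_latency(800, 5, 400, 20))  # -> 375
```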
There is additional complexity in the SQL code with explicit transactions. We'd need some advanced investigation, but I would not be surprised if it was truly slower.
Correct (I wrote that code). It ensures that the sum of latencies balances out. It also draws our attention to measurement bugs if the overhead ever gets too large compared to the other buckets.
@piyush-singh Is it possible that the leaseholder for the table descriptors or namespaces was remote and that the latency was due to lookups in these metadata tables? That would be consistent with your finding that the latency was attributed to "planning". It would also explain why the first access of table
It would also explain this:
It's "expected" that the 1st planning time will be very large, because we have to fetch sql schema in order to plan. In fact, even subsequent planning times can be large, because you might hit different gateway nodes, and each node then has to fetch schema. |
There could be more than one issue at play here but it looks to me like our latency accounting is just plain wrong when multiple statements are submitted in a single request. The execution time of the earlier statements is counted as "Overhead" in the later statements. This affects both implicit and explicit transactions. SQL:
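The mis-attribution described above can be simulated: if each statement's service latency is anchored to the time the whole batch was received, the later statement's "overhead" absorbs the earlier statement's execution time. A toy sketch (not CockroachDB code; names and timings are illustrative):

```python
# Simulate the accounting bug: per-statement service latency measured from the
# batch arrival time, so earlier statements' run time leaks into later
# statements' "overhead" bucket.
import time

def run_batch(statements):
    """Execute (name, exec_seconds) pairs; return (name, service, overhead)."""
    batch_received = time.monotonic()
    report = []
    for name, exec_seconds in statements:
        exec_start = time.monotonic()
        time.sleep(exec_seconds)                 # stand-in for plan + run time
        exec_end = time.monotonic()
        service_lat = exec_end - batch_received  # buggy: anchored to batch start
        run_lat = exec_end - exec_start
        overhead = service_lat - run_lat         # earlier stmts leak in here
        report.append((name, round(service_lat, 2), round(overhead, 2)))
    return report

for row in run_batch([("INSERT", 0.3), ("SELECT", 0.1)]):
    print(row)
# The SELECT shows roughly 0.3s of "overhead" even though it ran for only 0.1s.
```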
I'll look into how to fix the accounting for this in conn_executor.
Merged PRs referencing this issue:

53800: server: defer gossip start to when absolutely needed (r=irfansharif, a=irfansharif). This was pulled out of #52526 to keep the diff there focused on the introduction of the RPC (and to see if that alone shook out any failures). That change lets us make this one, where we can defer gossip start until right before we start up Node. Release justification: low-risk, high-benefit change to existing functionality. Release note: None.

53846: sql: fix inflated "overhead" in statement timings (r=solongordon, a=solongordon). Fixes #40675. Fixes #50108. Release justification: low-risk, high-reward fix to existing functionality. Release note (admin ui change): In some cases, the Execution Stats page would show a confusingly high Overhead latency for a statement. This could happen due to multiple statements being parsed together or due to statement execution being retried. To avoid this, we stop considering the time between when parsing ends and execution begins when determining service latency.

53906: roachtest: Unskip disk-full roachtest (r=tbg, a=itsbilal). The disk-full roachtest no longer fails now that Pebble is the default storage engine. Remove the skipped marker. Fixes #35328. Release justification: roachtest-only change. Release note: None.

53907: geos: bump to include seg fault fix (r=sumeerbhola, a=otan). Resolves #53254. Release justification: bug fix to new functionality. Release note: None.

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Solon Gordon <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
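The release note for 53846 describes the fix as no longer counting the gap between the end of parsing and the start of execution toward service latency. A simplified sketch of that idea (hypothetical names, times in milliseconds; not the actual conn_executor code):

```python
# Sketch of the fixed accounting: service latency is the parse time plus the
# execution time, excluding any queueing gap in between. A statement that sat
# behind an earlier statement in the same batch no longer inflates "overhead".

def service_latency(parse_lat_ms, exec_start_ms, exec_end_ms):
    """Service latency = parse time + execution time, excluding the queue gap."""
    return parse_lat_ms + (exec_end_ms - exec_start_ms)

# Statement parsed in 2ms, then waited 300ms behind an earlier statement in the
# batch, then executed for 100ms: reported service latency is 102ms, not 402ms.
print(service_latency(2, 500, 600))  # -> 102
```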
@solongordon will sync up with the Observability team and move ownership over to them.
@jordanlewis can we get this on the next milestone?
From Solon's last comments, is the solution for this just to stop displaying the "overhead" value on the statement details page?
Describe the problem
Roko noted that we had a customer with unexpected latencies in their Admin UI. The customer's latencies were different when measured from their client compared to the latencies shown in the UI. Roko and I ran a simple experiment to determine whether wrapping statements in txns was affecting the timings reported for those statements. It seems that we are inaccurately reporting latencies in the situation below.
To Reproduce
Setup steps:
SQL:
When I check the UI, I see the read took 800+ ms despite being connected to the leaseholder as my gateway node:
This is unexpected. When executing this via an implicit txn, I see the statement executing much faster:
(side note: why is the second run so much faster?)
Subsequent executions yield some strange results in other ways. I see that latencies for the planning phase on the insert are extremely high (this is also true on the read, but closer to 100ms instead of 400+):
It is unclear why the timings for the reads are so different inside vs. outside of transactions. cc @rkruze and @jordanlewis as we spoke about this in person.
gz#8246