sql: record CPU seconds per statement #87213
Drew has a prototype here: #87407
Thanks Jordan, I wasn't aware of this issue. I'll mention that the prototype is meant more for RU estimation in testing environments for serverless users - it wouldn't really work in a production environment, and definitely not for passively monitoring query CPU usage. I think for that we do need to use the per-goroutine running time (the prototype just runs a simple experiment to see how running the query affects CPU usage for the process). I'll have to look into whether doing that is much more complicated than what I did for #87407.
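The process-wide experiment described in this comment can be sketched in a few lines of Go. This is an illustrative reconstruction, not the actual prototype from #87407: it snapshots the process's CPU usage via `getrusage(2)` before and after a workload and reports the delta, which (as noted above) only makes sense in a quiet testing environment.

```go
// Sketch of a process-wide CPU measurement experiment (illustrative
// only; not the code from the prototype PR). Linux/macOS.
package main

import (
	"fmt"
	"syscall"
	"time"
)

// processCPU returns the total user+system CPU time consumed by this
// process so far.
func processCPU() time.Duration {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		panic(err)
	}
	toDur := func(tv syscall.Timeval) time.Duration {
		return time.Duration(tv.Sec)*time.Second +
			time.Duration(tv.Usec)*time.Microsecond
	}
	return toDur(ru.Utime) + toDur(ru.Stime)
}

func main() {
	before := processCPU()
	// Stand-in for running the query under test.
	sum := 0
	for i := 0; i < 50_000_000; i++ {
		sum += i
	}
	delta := processCPU() - before
	// Process-wide deltas include CPU burned by every other goroutine
	// in the process, which is why this approach is unsuitable for
	// passive production monitoring.
	fmt.Printf("sum=%d cpu=%s\n", sum, delta)
}
```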
@DrewKimball from a quick glance at #87407, just try replacing your uses of wall time (…)
@irfansharif thanks for the quick response. Do you foresee any issues with backporting if we were to use (…)?
Ah, yea this is not backportable -- it's only going to be available in 22.2+.
I took a slightly different approach to this idea in a draft PR, just to prototype it: #89122. It's a bit confusing - there are a lot of possible quantities we could track, the same trouble as with calculating "statement phases". One wrinkle is that statement execution does cross between goroutines, so the measurement won't be quite the same as our "service latency" concept, which starts as soon as the statement is received. Drew's PR also takes a stab at calculating the CPU spent in async components, which the prototype I tried doesn't do. I think we'd want to ask the Queries team to take a look at doing that, because it's subtle to get right. Honestly, I think that starting with just the CPU time spent on the gateway would be a really good first stab.
Are there any plans to change this? / Do you have an idea of how difficult it would be to fix it? |
@rickystewart would know best. If you're asking primarily for your own ease of development on Macs, I've used something like #93206 to point to my own mirror. |
Thanks. It's mostly for ease of development. I'll take a look at #93206 |
This commit adds tracking for CPU time spent during SQL execution. The CPU time is tracked at the operator granularity when statistics collection is enabled, similar to execution time. For now, the CPU time is only surfaced in the output of `EXPLAIN ANALYZE` variants. A future PR will add support for logging this value in the statement statistics.

Informs: cockroachdb#87213

Release note (sql change): CPU time spent during SQL execution is now visible in the output of queries run with `EXPLAIN ANALYZE`. This measure does not include CPU time spent while serving KV requests. This can be useful for diagnosing performance issues and optimizing SQL queries.
This commit adds tracking for CPU time spent during SQL execution. The CPU time is tracked at the operator granularity when statistics collection is enabled, similar to execution time. For now, the CPU time is only surfaced in the output of `EXPLAIN ANALYZE` variants. A future PR will add support for logging this value in the statement statistics.

Note that the main goroutines of the streamer, hash router, and outbox are not currently tracked by this work. However, it is expected that these should be relatively cheap, and shouldn't significantly impact the measurement.

Informs: cockroachdb#87213

Release note (sql change): CPU time spent during SQL execution is now visible in the output of queries run with `EXPLAIN ANALYZE`. This measure does not include CPU time spent while serving KV requests. This can be useful for diagnosing performance issues and optimizing SQL queries.
This commit adds tracking for CPU time spent during SQL execution. The CPU time is tracked at the operator granularity when statistics collection is enabled, similar to execution time. For now, the CPU time is only surfaced in the output of `EXPLAIN ANALYZE` variants. A future PR will add support for logging this value in the statement statistics.

Note that the main goroutines of the streamer, hash router, and outbox are not currently tracked by this work. However, it is expected that these should be relatively cheap, and shouldn't significantly impact the measurement.

Additionally, KV work is performed on a SQL goroutine in some cases (e.g. when there is a single-range request for a local range). This makes it necessary to track CPU time spent fulfilling KV requests on a SQL goroutine so it can be subtracted from the total measured CPU time. This logic is handled by the `kvBatchFetcherHelper` for the operators that only perform reads (e.g. lookup joins and scans). Because mutations do not record stats, they currently have no way to differentiate KV CPU time from SQL CPU time. For this reason, a plan that contains mutations will not output CPU time.

Informs: cockroachdb#87213

Release note (sql change): CPU time spent during SQL execution is now visible in the output of queries run with `EXPLAIN ANALYZE`. This measure does not include CPU time spent while serving KV requests, and CPU time is not shown for queries that perform mutations. This can be useful for diagnosing performance issues and optimizing SQL queries.
This commit adds tracking for CPU time spent during SQL execution. The CPU time is tracked at the operator granularity when statistics collection is enabled, similar to execution time. For now, the CPU time is only surfaced in the output of `EXPLAIN ANALYZE` variants. A future PR will add support for logging this value in the statement statistics.

Note that the main goroutines of the streamer, hash router, and outbox are not currently tracked by this work. However, it is expected that these should be relatively cheap, and shouldn't significantly impact the measurement.

Additionally, KV work is performed on a SQL goroutine in some cases (e.g. when there is a single-range request for a local range). This makes it necessary to track CPU time spent fulfilling KV requests on a SQL goroutine so it can be subtracted from the total measured CPU time. This logic is handled by the `cFetcher` and `rowFetcherStatCollector` for the operators that only perform reads. Finally, because mutations do not record stats, they currently have no way to differentiate KV CPU time from SQL CPU time. For this reason, a plan that contains mutations will not output CPU time.

Informs: cockroachdb#87213

Release note (sql change): CPU time spent during SQL execution is now visible in the output of queries run with `EXPLAIN ANALYZE`. This measure does not include CPU time spent while serving KV requests, and CPU time is not shown for queries that perform mutations. This can be useful for diagnosing performance issues and optimizing SQL queries.
Part Of cockroachdb#87213

Adds CPU Time column on Statement and Transaction Insights pages and their respective details pages.

Release note (ui change): Add CPU Time to Statement and Transaction Insights.
96204: sql: add cpu nanos to insights tables r=maryliag a=maryliag

Part Of #87213

Add `cpu_sql_nanos` to `crdb_internal.{cluster,nodes}_execution_insights` and `crdb_internal.{cluster,nodes}_txn_execution_insights`.

Release note: None

96229: sql: move RandString to util to break coldata->randgen->sem/tree dep r=cucaroach a=cucaroach

Epic: CRDB-18892
Informs: #91831
Release note: None

96234: server: properly initialize the cert manager in secondary tenants r=dhartunian a=knz

First commit from #96222.
Epic: CRDB-14537.
Fixes #96226. Fixes #94920.

Prior to this patch, the cert manager initialization was incomplete in servers for secondary tenants, in two ways:
- it did not properly register its metrics.
- it did not properly listen to the SIGHUP signal for cert rotation.

Additionally, there was a bug whereby a missing tenant client cert could cause a crash of the process. This is now averted by proper error checking. (A more general fix requires a larger refactor and will be performed in #96233.)

96242: ccl/sqlproxyccl: fix goroutine leak in TestDenylistUpdate due to pgconn r=darinpp a=jaylim-crl

Fixes #94450.

Previously, there was a possibility for TestDenylistUpdate to fail under stress due to a goroutine that was spun up by asyncClose in pgconn. Based on experiments, it appears that pgconn will invoke an asyncClose if the connection was closed abruptly by the server. In the proxy handler, this can only happen in one of four [cases](https://github.com/cockroachdb/cockroach/blob/10ef5d9d7a646831799faa0f0611834fe032b93b/pkg/ccl/sqlproxyccl/proxy_handler.go#L477-L493):

1. context cancellation
2. forwarder error
3. denylist triggers
4. stopper quiescing

We don't really test (1) and (4) specifically, as they are usually embedded into most of the tests, and this code path is only called during a proxy shutdown. Forwarder errors due to connections closing abruptly can happen during connection transfers, but the tests that do that do not use pgconn (so this leak isn't there to begin with). The only end-to-end test for (3) is TestDenylistUpdate, and that uses pgconn. When the tests terminate, pgconn appears to invoke asyncClose, spinning up an additional goroutine. A possible (and tested) fix would be to wait for the goroutine to terminate:

```go
select {
case <-conn.PgConn().CleanupDone():
case <-time.After(20 * time.Second):
	// 20 seconds was picked because pgconn has an internal deadline of
	// 15 seconds when performing the async close request.
	t.Fatalf("pgconn asyncClose did not clean up on time")
}
```

However, this commit takes a different approach by using database/sql instead of pgconn, so the problem won't be there in the first place. There is no specific requirement to use pgconn here, so we'll skip that.

Epic: none
Release note: None

Co-authored-by: maryliag <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Jay <[email protected]>
96279: ui: add CPU column on Insights r=maryliag a=maryliag

Part Of #87213

Adds CPU Time column on Statement and Transaction Insights pages and their respective details pages.

Transaction Overview
<img width="1398" alt="Screen Shot 2023-01-31 at 12 37 20 PM" src="https://user-images.githubusercontent.com/1017486/215839478-c5d999f3-4322-49bf-9467-8ee60d5ceb85.png">

Transaction Details
<img width="769" alt="Screen Shot 2023-01-31 at 12 37 52 PM" src="https://user-images.githubusercontent.com/1017486/215839559-c3893236-c22b-4b24-8186-098cc6e809af.png">

Statements Overview
<img width="757" alt="Screen Shot 2023-01-31 at 12 37 36 PM" src="https://user-images.githubusercontent.com/1017486/215839496-81b409e8-21c0-4d04-9bd2-8a1b94858304.png">

Statement Details
<img width="774" alt="Screen Shot 2023-01-31 at 12 37 43 PM" src="https://user-images.githubusercontent.com/1017486/215839518-0161df45-627c-4eed-9653-583de391c55e.png">

Release note (ui change): Add CPU Time to Statement and Transaction Insights.

96348: multiregionccl: add another missing log scope r=ajwerner a=ajwerner

The test output files for this package were huge because of this.

Epic: none
Release note: None

96384: changefeedccl: shorten random expressions test and remove npe r=HonoreDB a=HonoreDB

This test is mysteriously being marked failed occasionally, which I suspect is somehow a result of the needless null pointer exception being thrown if a changefeed errors on create. This fixes that, and also reduces the test size by 90% to ensure no timeouts--I'll bump that up again later.

Fixes #96052
Release note: None

Co-authored-by: maryliag <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Aaron Zinger <[email protected]>
@DrewKimball what work is remaining here?
The biggest remaining item will be to also record KV CPU time. There are also some other things left to do - e.g. lifting the current limitations on which queries we can collect CPU time for.
@DrewKimball will this require help from the KV team?
I think it's probably best that we do it, but KV should take a look at the PR(s).
Currently, it's not possible to understand what fraction of each query's latency was CPU time. This is frustrating for users, who want to see CPU information alongside the other things we already collect, like disk bytes, network, and memory. (See #60628, which I'm closing in favor of this issue now that we have more details on how it'll work.)
Now that @irfansharif's foundational work for #82625 has landed, it's possible to ask Go to return the number of CPU seconds that have been consumed by a goroutine, with this API:
(See `cockroach/pkg/util/grunning/grunning.go`, lines 19 to 22 at `da00c3a`.)
Much of SQL uses just a single goroutine. This changes when we run a DistSQL query or emit remote KV reads. I think we should start by recording CPU time on just the main SQL runner goroutine, and reporting that. Reporting CPU time spent across the cluster comes with some caveats, so we should think about it from a product perspective - how we will represent it in the UI.
Jira issue: CRDB-19217
gz#21392