
sql: record CPU seconds per statement #87213

Open
1 of 3 tasks
jordanlewis opened this issue Aug 31, 2022 · 13 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@jordanlewis
Member

jordanlewis commented Aug 31, 2022

Currently, it's not possible to understand what fraction of each query's latency was CPU time. This is frustrating for users, who want to see CPU information alongside the other things we already collect, like disk bytes, network, and memory. (See #60628, which I'm closing in favor of this now that we have more details on how it'll work.)

Now that @irfansharif's foundational work for #82625 has landed, it's possible to ask Go to return the number of CPU seconds that have been consumed by a goroutine, with this API:

// Time returns the time spent by the current goroutine in the running state.
func Time() time.Duration {
	return time.Duration(grunningnanos())
}

Much of SQL uses just a single goroutine. This changes when we run a DistSQL query or issue remote KV reads. I think we should start by recording CPU time on just the main SQL runner goroutine and reporting that. Reporting CPU time spent across the cluster comes with some caveats, so we should think about it from a product perspective - how will we represent it in the UI?

  • Record CPU seconds per statement, similar to how we record end to end latency per statement
  • Record CPU seconds, including remote KV reads
  • Record CPU seconds, including DistSQL CPU

Jira issue: CRDB-19217

gz#21392

@jordanlewis jordanlewis added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Aug 31, 2022
@jordanlewis
Member Author

Drew has a prototype here: #87407

@DrewKimball
Collaborator

Thanks Jordan, I wasn't aware of this issue. I'll mention that the prototype is more meant for RU estimation in testing environments for serverless users - it wouldn't really work in a production environment, and definitely not for passively monitoring query CPU usage. I think for that we do need to use the per-goroutine running time (the prototype just runs a simple experiment to see how running the query affects CPU usage for the process). I'll have to look into whether doing that is much more complicated than what I did for #87407.

@irfansharif
Contributor

@DrewKimball from a quick glance at #87407, just try replacing your uses of wall time (timeutil.Now(), time.Sub(), etc.) with the CPU time measurements Jordan linked above -- grunning.Time() or grunning.Difference(). Hopefully it's mostly drop-in. This functionality is disabled on M1 Macs (see build tags in that package) due to some code-signing issues, so you'd have to test on Linux VMs.

@DrewKimball
Collaborator

@irfansharif thanks for the quick response. Do you foresee any issues with backporting if we were to use grunning?

@irfansharif
Contributor

Ah, yeah, this is not backportable -- it's only going to be available in 22.2+.

@jordanlewis
Member Author

I took a slightly different approach to this idea in this draft PR just to prototype it: #89122

It's a bit confusing - there are a lot of possible quantities we could track, the same trouble we have with calculating "statement phases". One problem is that since the beginning of statement execution crosses between goroutines, it's not going to be quite the same as our "service latency" concept, which starts as soon as the statement is received.

Drew's PR also takes a stab at calculating the CPU spent in async components, and the prototype I tried out doesn't. I think we'd want to ask the Queries team to take a look at doing that because it's a bit subtle to get it right. Honestly I think that starting with just CPU time spent on the gateway would be a really good first stab.

@DrewKimball
Collaborator

This functionality is disabled on M1 Macs (see build tags in that package) due to some code-signing issues, so you'd have to test on linux VMs.

Are there any plans to change this? / Do you have an idea of how difficult it would be to fix it?

@irfansharif
Contributor

@rickystewart would know best. If you're asking primarily for your own ease of development on Macs, I've used something like #93206 to point to my own mirror.

@DrewKimball
Collaborator

Thanks. It's mostly for ease of development. I'll take a look at #93206.

DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 20, 2022
This commit adds tracking for CPU time spent during SQL execution. The CPU
time is tracked at the operator granularity when statistics collection is
enabled, similar to execution time.

For now, the CPU time is only surfaced in the output of `EXPLAIN ANALYZE`
variants. A future PR will add support for logging this value in the
statement statistics.

Informs: cockroachdb#87213

Release note (sql change): CPU time spent during SQL execution is now visible
in the output of queries run with `EXPLAIN ANALYZE`. This measure does not
include CPU time spent while serving KV requests. This can be useful for
diagnosing performance issues and optimizing SQL queries.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 20, 2022
This commit adds tracking for CPU time spent during SQL execution. The CPU
time is tracked at the operator granularity when statistics collection is
enabled, similar to execution time.

For now, the CPU time is only surfaced in the output of `EXPLAIN ANALYZE`
variants. A future PR will add support for logging this value in the
statement statistics.

Note that the main goroutines of the streamer, hash router, and outbox are
not currently tracked by this work. However, it is expected that these should
be relatively cheap, and shouldn't significantly impact the measurement.

Informs: cockroachdb#87213

Release note (sql change): CPU time spent during SQL execution is now visible
in the output of queries run with `EXPLAIN ANALYZE`. This measure does not
include CPU time spent while serving KV requests. This can be useful for
diagnosing performance issues and optimizing SQL queries.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jan 9, 2023
This commit adds tracking for CPU time spent during SQL execution. The CPU
time is tracked at the operator granularity when statistics collection is
enabled, similar to execution time.

For now, the CPU time is only surfaced in the output of `EXPLAIN ANALYZE`
variants. A future PR will add support for logging this value in the
statement statistics.

Note that the main goroutines of the streamer, hash router, and outbox are
not currently tracked by this work. However, it is expected that these should
be relatively cheap, and shouldn't significantly impact the measurement.

Additionally, KV work is performed on a SQL goroutine in some cases
(e.g. when there is a single-range request for a local range). This makes
it necessary to track CPU time spent fulfilling KV requests on a SQL
goroutine so it can be subtracted from the total measured CPU time.
This logic is handled by the `kvBatchFetcherHelper` for the operators
that only perform reads (e.g. lookup joins and scans).

Because mutations do not record stats, they currently have no way to
differentiate KV CPU time from SQL CPU time. For this reason, a plan that
contains mutations will not output CPU time.

Informs: cockroachdb#87213

Release note (sql change): CPU time spent during SQL execution is now visible
in the output of queries run with `EXPLAIN ANALYZE`. This measure does not
include CPU time spent while serving KV requests, and CPU time is not shown for
queries that perform mutations. This can be useful for diagnosing performance
issues and optimizing SQL queries.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jan 18, 2023
This commit adds tracking for CPU time spent during SQL execution. The CPU
time is tracked at the operator granularity when statistics collection is
enabled, similar to execution time.

For now, the CPU time is only surfaced in the output of `EXPLAIN ANALYZE`
variants. A future PR will add support for logging this value in the
statement statistics.

Note that the main goroutines of the streamer, hash router, and outbox are
not currently tracked by this work. However, it is expected that these should
be relatively cheap, and shouldn't significantly impact the measurement.

Additionally, KV work is performed on a SQL goroutine in some cases
(e.g. when there is a single-range request for a local range). This makes
it necessary to track CPU time spent fulfilling KV requests on a SQL
goroutine so it can be subtracted from the total measured CPU time.
This logic is handled by the `cFetcher` and `rowFetcherStatCollector`
for the operators that only perform reads.

Finally, because mutations do not record stats, they currently have no way
to differentiate KV CPU time from SQL CPU time. For this reason, a plan that
contains mutations will not output CPU time.

Informs: cockroachdb#87213

Release note (sql change): CPU time spent during SQL execution is now visible
in the output of queries run with `EXPLAIN ANALYZE`. This measure does not
include CPU time spent while serving KV requests, and CPU time is not shown for
queries that perform mutations. This can be useful for diagnosing performance
issues and optimizing SQL queries.
maryliag added a commit to maryliag/cockroach that referenced this issue Jan 31, 2023
Part Of cockroachdb#87213

Adds CPU Time column on Statement and Transaction
Insights pages and their respective details pages.

Release note (ui change): Add CPU Time to Statement and
Transaction Insights.
craig bot pushed a commit that referenced this issue Jan 31, 2023
96204: sql: add cpu nanos to insights tables r=maryliag a=maryliag

Part Of #87213

Add `cpu_sql_nanos` to
`crdb_internal.{cluster,nodes}_execution_insights` and `crdb_internal.{cluster,nodes}_txn_execution_insights`.

Release note: None

96229: sql: move RandString to util to break coldata->rangden->sem/tree dep r=cucaroach a=cucaroach

Epic: CRDB-18892
Informs: #91831
Release note: None


96234: server: properly initialize the cert manager in secondary tenants r=dhartunian a=knz

First commit from #96222.
Epic: CRDB-14537.
Fixes #96226.
Fixes #94920.

Prior to this patch, the cert manager initialization was incomplete in
servers for secondary tenants, in two ways:

- it did not properly register its metrics.
- it did not properly listen to the SIGHUP signal for cert rotation.

Additionally, there was a bug whereby a missing tenant client cert
could cause a crash of the process. This is now averted by proper
error checking.
(A more general fix requires a larger refactor and will be performed
in #96233.)

96242: ccl/sqlproxyccl: fix goroutine leak in TestDenylistUpdate due to pgconn r=darinpp a=jaylim-crl

Fixes #94450.

Previously, there was a possibility for TestDenylistUpdate to fail under stress due to a goroutine that was spun up by asyncClose in pgconn. Based on experiments, it appears that pgconn will invoke an asyncClose if the connection was closed abruptly by the server. In the proxy handler, this can only happen in one of four [cases](https://github.com/cockroachdb/cockroach/blob/10ef5d9d7a646831799faa0f0611834fe032b93b/pkg/ccl/sqlproxyccl/proxy_handler.go#L477-L493):
1. context cancellation
2. forwarder error
3. denylist triggers
4. stopper quiescing

We don't really test (1) and (4) specifically as they are usually embedded into most of the tests, and this code path is only called during a proxy shutdown. Forwarder errors due to connections closing abruptly can happen during connection transfers, but the tests that do that do not use pgconn (so this leak isn't there to begin with). The only end-to-end test for (3) is TestDenylistUpdate, and that uses pgconn. When the tests terminate, pgconn appears to invoke asyncClose, spinning up an additional goroutine.

A possible fix (which was tested) would be to wait for the goroutine to terminate:

```go
select {
case <-conn.PgConn().CleanupDone():
case <-time.After(20 * time.Second):
    // 20 seconds was picked because pgconn has an internal deadline of
    // 15 seconds when performing the async close request.
    t.Fatalf("pgconn asyncClose did not clean up on time")
}
```

However, this commit takes a different approach by using database/sql instead of pgconn, so the problem won't be there in the first place. There is no specific requirement to use pgconn here, so we'll skip that.

Epic: none

Release note: None

Co-authored-by: maryliag <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Jay <[email protected]>
craig bot pushed a commit that referenced this issue Feb 2, 2023
96279: ui: add CPU column on Insights r=maryliag a=maryliag

Part Of #87213

Adds CPU Time column on Statement and Transaction
Insights pages and their respective details pages.

Transaction Overview
<img width="1398" alt="Screen Shot 2023-01-31 at 12 37 20 PM" src="https://user-images.githubusercontent.com/1017486/215839478-c5d999f3-4322-49bf-9467-8ee60d5ceb85.png">

Transaction Details
<img width="769" alt="Screen Shot 2023-01-31 at 12 37 52 PM" src="https://user-images.githubusercontent.com/1017486/215839559-c3893236-c22b-4b24-8186-098cc6e809af.png">

Statements Overview
<img width="757" alt="Screen Shot 2023-01-31 at 12 37 36 PM" src="https://user-images.githubusercontent.com/1017486/215839496-81b409e8-21c0-4d04-9bd2-8a1b94858304.png">

Statement Details
<img width="774" alt="Screen Shot 2023-01-31 at 12 37 43 PM" src="https://user-images.githubusercontent.com/1017486/215839518-0161df45-627c-4eed-9653-583de391c55e.png">


Release note (ui change): Add CPU Time to Statement and
Transaction Insights.

96348: multiregionccl: add another missing log scope r=ajwerner a=ajwerner

The test output files for this package were huge because of this.

Epic: none

Release note: None

96384: changefeedccl: shorten random expressions test and remove npe r=HonoreDB a=HonoreDB

This test is mysteriously being marked failed occasionally, which I suspect is somehow a result of the needless null pointer exception being thrown if a changefeed errors on create. This fixes that, and also reduces the test size by 90% to ensure no timeouts--I'll bump that up again later.

Fixes #96052

Release note: None

Co-authored-by: maryliag <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Aaron Zinger <[email protected]>
@mgartner
Copy link
Collaborator

@DrewKimball what work is remaining here?

@DrewKimball
Copy link
Collaborator

The biggest remaining item will be to also record KV CPU time. There are also some other things left to do - e.g. lifting current limitations on which queries we can collect CPU time for.

@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Jun 6, 2023
@mgartner mgartner moved this to 23.2 Release in SQL Queries Jul 24, 2023
@mgartner mgartner moved this from 23.2 Release to 24.1 Release in SQL Queries Aug 3, 2023
@michae2
Collaborator

michae2 commented Aug 14, 2023

The biggest remaining item will be to also record KV CPU time.

@DrewKimball will this require help from the KV team?

@DrewKimball
Copy link
Collaborator

@DrewKimball will this require help from the KV team?

I think it's probably best that we do it, but KV should take a look at the PR(s).

@yuzefovich yuzefovich added the T-sql-queries SQL Queries Team label Nov 21, 2023
@mgartner mgartner moved this from 24.1 Release to 24.2 Release in SQL Queries Nov 27, 2023
@mgartner mgartner moved this from 24.2 Release to New Backlog in SQL Queries Mar 28, 2024
@exalate-issue-sync exalate-issue-sync bot added T-sql-queries SQL Queries Team and removed T-observability labels May 1, 2024
Projects
Status: Backlog

6 participants