-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: EXPLAIN ANALYZE (DEBUG) 4-12x slower for queries with large lookup joins #90739
Labels
A-sql-explain
Issues related to EXPLAIN and EXPLAIN ANALYZE improvements
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-sql-queries
SQL Queries Team
X-nostale
Marks an issue/pr that should be ignored by the stale bot
Comments
michae2
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
A-sql-explain
Issues related to EXPLAIN and EXPLAIN ANALYZE improvements
labels
Oct 26, 2022
(Note that this is different from #80671 which was fixed in 21.2.) |
5-second cpu and heap profiles from running the explain analyze query repeatedly against a single-node cluster: |
From the team meeting:
|
yuzefovich
changed the title
EXPLAIN ANALYZE 4-12x slower for queries with large lookup joins
sql: EXPLAIN ANALYZE (DEBUG) 4-12x slower for queries with large lookup joins
Nov 2, 2022
craig bot
pushed a commit
that referenced
this issue
Nov 3, 2022
89632: go.mod: bump raft r=nvanbenschoten a=tbg ``` go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66 go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead. go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0 go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66 ``` This picks up - etcd-io/etcd#14413 - etcd-io/etcd#14538 Compared single-node performance on gceworker via ```bash #!/bin/bash set -euxo pipefail pkill -9 cockroach || true rm -rf cockroach-data cr=./cockroach-$1 $cr start-single-node --background --insecure $cr workload init kv $cr workload run kv --splits 100 --max-rate 2000 --duration 10m --read-percent 0 --min-block-bytes 10 --max-block-bytes 10 | tee $1.txt ``` ``` _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 600.0s 0 1199604 1999.3 6.7 7.1 10.0 11.0 75.5 write #master 600.0s 0 1199614 1999.4 6.8 7.1 10.0 11.0 79.7 write #PR ``` Closes #87264. - [x] [make it build](#88985 (comment)) - [x] remove the maxIndex param and handling from Task.AckCommittedEntriesBeforeApplication - [x] check that single node write latencies don't regress Release note: None 91117: sql: reduce the overhead of EXPLAIN ANALYZE r=yuzefovich a=yuzefovich In order to propagate the execution stats across the distributed query plan we use the tracing infrastructure, where each stats object is added as "structured metadata" to the trace. Thus, whenever we're collecting the exec stats for a statement, we must enable tracing. Previously, in many cases we would enable it at the highest verbosity level which has non-trivial overhead. In some cases this was an overkill (e.g. in `EXPLAIN ANALYZE` we don't really care about the trace containing all of the gory details - we won't expose it anyway), so this is now fixed by using the less verbose "structured" verbosity level. As a concrete example of the difference: for a stmt that without `EXPLAIN ANALYZE` takes around 190ms, with `EXPLAIN ANALYZE` it would previously run for about 1.8s and now it takes around 210ms. This required some minor changes to the row-by-row outbox and router setups to collect thats even if the recording is not verbose. Addresses: #90739. Epic: None Release note (performance improvement): The overhead of running `EXPLAIN ANALYZE` and `EXPLAIN ANALYZE (DISTSQL)` has been significantly reduced. The overhead of `EXPLAIN ANALYZE (DEBUG)` didn't change. 91119: roachprod: improve error in ParallelE r=smg260 a=tbg Prior to this commit, the error's stack trace did not link back to the caller of `ParallelE`. Now it does. Epic: none Release note: None 91126: dev: allow whitespace separated regexps for testlogic files r=ajwerner a=ajwerner This was a feature of `make testlogic` and it was liked. Fixes #91125 Release note: None Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Andrew Werner <[email protected]>
This was referenced Nov 3, 2022
This comment was marked as resolved.
This comment was marked as resolved.
michae2
added
X-nostale
Marks an issue/pr that should be ignored by the stale bot
and removed
no-issue-activity
labels
Apr 29, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-sql-explain
Issues related to EXPLAIN and EXPLAIN ANALYZE improvements
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-sql-queries
SQL Queries Team
X-nostale
Marks an issue/pr that should be ignored by the stale bot
Queries that perform large lookup joins (tens of thousands of rows) are roughly 4-12x slower to
EXPLAIN ANALYZE
than to execute normally. This seems to be true on 21.2, 22.1, 22.2, and probably earlier releases as well. Here's a demonstration on 22.2.0:On a one-node cockroach demo the normal execution takes 274 ms, and the
EXPLAIN ANALYZE
takes 3.29s:We record many events in the trace for each lookup of the lookup join, which happens 50k times for this query.
Jira issue: CRDB-20918
The text was updated successfully, but these errors were encountered: