Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpchvec/perf failed #50029

Closed
cockroach-teamcity opened this issue Jun 9, 2020 · 6 comments
Closed

roachtest: tpchvec/perf failed #50029

cockroach-teamcity opened this issue Jun 9, 2020 · 6 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).tpchvec/perf failed on provisional_202006091546_v20.1.2@92670dd72acad03bd9de54a197d66ef423a90183:

The test failed on branch=provisional_202006091546_v20.1.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:289,tpchvec.go:456,tpchvec.go:466,test_runner.go:753: [q7] vec ON is slower by 89.30% than vec OFF
		vec ON times: [12.26 19.29 20.24]
		vec OFF times: [9.85 10.19 19.73]

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-provisional_202006091546_v20.1.2 labels Jun 9, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Jun 9, 2020
@yuzefovich yuzefovich assigned yuzefovich and unassigned andreimatei Jun 10, 2020
@yuzefovich
Copy link
Member

All 6 runs of EXPLAIN ANALYZE (3 with on, 3 with off) finished in about 13 seconds accordingly to the timing of the log messages, and this issue is a duplicate of #47401.

I'll temporarily keep it open so that release folks will see this: it is not a release blocker.

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on provisional_202006091546_v20.1.2@92670dd72acad03bd9de54a197d66ef423a90183:

The test failed on branch=provisional_202006091546_v20.1.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:306,tpchvec.go:476,tpchvec.go:488,test_runner.go:753: [q7] vec ON is slower by 82.81% than vec OFF
		vec ON times: [18.63 19.36 19.48]
		vec OFF times: [10.39 10.59 20.78]

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@yuzefovich
Copy link
Member

Interestingly, this occurred with running vectorize=off config first, and the slow run is the third one, and then with vectorize=on. Also, running EXPLAIN ANALYZE with vectorize=off took about 24s on the first run, then 15s on the second, but the third run as well all three runs of EXPLAIN ANALYZE with vectorize=on took roughly 12 seconds.

The vast majority of the runtime of the query is spent in index stall time of a lookup join (about 21s in the first, 12s in the second, and 10s in all consequent runs). @jordanlewis @asubiotto do you have any insight into why we would get this much variance (note that the slow times occurred with vectorize=off) for this?

@asubiotto
Copy link
Contributor

I don't really know. One explanation (sort of speculative) I had was that it could be the lease location of the range we're performing a lookup against. Join readers are planned on the node where the left side is read and doesn't pay attention to the right (this is what #47472 mentions in the "out of scope" section). Maybe the lease for the right range gets transferred for some reason (possibly too much load), which causes a performance hit?

The time slice we care about is:

I200610 18:37:43.075892 10 workload/tpch/tpch.go:470  [q7] returned 4 rows after 10.39 seconds

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
I200610 18:38:03.856409 10 workload/tpch/tpch.go:470  [q7] returned 4 rows after 20.78 seconds

There's a log message after this event:

I200610 18:38:31.075541 158 kv/kvserver/store_rebalancer.go:236  [n1,s1,store-rebalancer] considering load-based lease transfers for s1 with 222.86 qps (mean=117.14, upperThreshold=217.14)
I200610 18:38:31.078189 158 kv/kvserver/store_rebalancer.go:272  [n1,s1,store-rebalancer] load-based lease transfers successfully brought s1 down to 209.59 qps (mean=117.14, upperThreshold=217.14)

But this happened after the time jump and between two q7 runs that don't have any difference in performance so I don't know if this tells us much. I would try enabling logging for lease movements (I think you can set a high vmodule on replicate_queue, store_rebalancer @nvanbenschoten is there anything else to enable?).

One other possibly useful thing to do is to swap out calls to EXPLAIN ANALYZE with a full trace collection. We'll still have stats in the trace and be able to tell which join reader had a long stall time. The trace will then serve to show us how the lookups were performed and where time was spent.

@yuzefovich
Copy link
Member

One other possibly useful thing to do is to swap out calls to EXPLAIN ANALYZE with a full trace collection. We'll still have stats in the trace and be able to tell which join reader had a long stall time. The trace will then serve to show us how the lookups were performed and where time was spent.

That's a good point, I opened a PR to use EXPLAIN ANALYZE (DEBUG) that will contain the trace.

craig bot pushed a commit that referenced this issue Jun 16, 2020
49934: colexec: optimize aggregate functions r=yuzefovich a=yuzefovich

**colexec: simplify aggregateFunc interface a bit**

This commit simplifies `SetOutputIndex` to no longer be required to
unset nulls after the desired output index. The unsetting is only
necessary when we have "overflow" tuples in the ordered aggregator, so
it makes that it should be the one resetting the nulls.

Additionally, the implementation of `SetOutputIndex` no longer checks
whether `curIdx` is not -1 (meaning the function has already performed
some aggregation). I think this was an artifact from the past and is no
longer necessary.

This commit also makes a minor performance optimization to unwrap the
output batch in the ordered aggregator.

Release note: None

**colexec: optimize any_not_null func for hash aggregation**

When we're performing hash aggregation, we know that there will be
exactly one group. This insight allows us optimize the aggregate
functions, and this commit performs the optimization of any_not_null.
Now we will have two separate files for both aggregator kinds (hash and
ordered), and that part is handled by a preprocessing step that replaces
`_AGGKIND` template "variable" with the corresponding aggregator kind.

I looked into introducing some base struct for aggregates to reuse, but
it is rather complicated at the moment - some aggregate functions
iterate over `[]*oneArgOverload`, and we could flatten that out.
However, other functions operate with a custom "tmpl info" struct.
Furthermore, extending the struct to include `AggKind` variable would
require refactoring the templates themselves because we assume that `.`
(dot) of the template usually is at `lastArgWidthOverload` struct.
Probably in order to provide some "aggregate base tmpl info" struct that
would help with hash vs ordered aggregation generation we would need to
introduce an interface which custom tmpl infos will implement.

Release note: None

**colexec: optimize all aggregate funcs for hash**

This commit adds a simple optimization of omitting the check for a start
of a new group in all aggregate functions if we're performing hash
aggregation.

This change also allows us to slightly simplify the hash aggregator.

Release note: None

**colexec: optimize aggregateFunc contract a bit**

This commit adjusts the contract of `aggregateFunc.Init` method to
require that the very first tuple in the whole input should *not* have
a `true` value set in `groups` vector. This requirement allows us to
optimize all aggregate functions for the case of ordered aggregator. The
ordered aggregator has been adjusted accordingly to satisfy the
contract.

This commit also marks `oneShotOp` as `NonExplainable`.

Release note: None

50103: sql: add a view dep on sequence string arguments r=RichardJCai a=RichardJCai

Add a dependency on sequences that are used via string argument when creating a view.

Also fixed minor bug where column dependencies were not being added when a sequence is referred to in a currval call.

Fixes (part of) #24556 and #50033

Release note (sql change): Sequences passed as a string argument into views
are now tracked as a dependency. Example: CREATE VIEW v AS SELECT nextval('s')
will now add a dependency on sequence s.

50247: colexec: fix resetting of the hashtable with distinct build mode r=yuzefovich a=yuzefovich

Some time ago we introduced "distinct" build mode to the hash table
which is used by the unordered distinct. In that addition we forgot to
add one thing to be reset whenever the hash table is reset which causes
partially ordered distinct to behave incorrectly. The damage, however,
is negligible because the bug doesn't affect "default" build mode nor
unordered distinct as is, only when we have partially ordered distinct
which uses the unordered distinct internally and resets it between
different "chunks". And we don't currently plan partially ordered
distinct, so only our unit tests are able to hit it, and there is no
need for a backport.

Fixes: #50006.

Release note: None

50253: roachtest: run EXPLAIN ANALYZE (DEBUG) in tpchvec/perf on failure r=yuzefovich a=yuzefovich

In order to get more information that could explain occasional slowness
on query 7 this commit updates the usage of `EXPLAIN ANALYZE` to
`EXPLAIN ANALYZE (DEBUG)` which will provide us with traces and other
useful information. The bundles are `curl`ed into `logs` folder so that
the test runner fetched it alongside normal log files.

Addresses: #50029.

Release note: None

50263: roachtest: skip rotten cancel/tpcc test r=yuzefovich a=asubiotto

This test is pretty flaky due to a number of reasons. It fails pretty noisiliy
so skip it until we fix the underlying causes. Fixing this test is tracked in
#42103

Release note: None

50267: logictest: add -print-blocklist-issues flag r=yuzefovich a=asubiotto

Previously all issues associated with a blocklist were printed out, even when
running a single test. This commit turns that printing off by default, which
can be enabled by specifying the -print-blocklist-issues flag.

Release note: None (testing change)

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: richardjcai <[email protected]>
Co-authored-by: Alfonso Subiotto Marques <[email protected]>
@asubiotto
Copy link
Contributor

#47401

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

No branches or pull requests

4 participants