roachtest: tpchvec/perf failed #50029

cockroach-teamcity · 2020-06-09T22:55:09Z

(roachtest).tpchvec/perf failed on provisional_202006091546_v20.1.2@92670dd72acad03bd9de54a197d66ef423a90183:

The test failed on branch=provisional_202006091546_v20.1.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:289,tpchvec.go:456,tpchvec.go:466,test_runner.go:753: [q7] vec ON is slower by 89.30% than vec OFF
		vec ON times: [12.26 19.29 20.24]
		vec OFF times: [9.85 10.19 19.73]

More

Artifacts: /tpchvec/perf
Related:

roachtest: tpchvec/perf failed #49955 roachtest: tpchvec/perf failed C-test-failure O-roachtest O-robot branch-release-20.1 release-blocker
roachtest: tpchvec/perf failed #47401 roachtest: tpchvec/perf failed C-test-failure O-roachtest O-robot branch-master release-blocker

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

The text was updated successfully, but these errors were encountered:

yuzefovich · 2020-06-10T04:51:43Z

All 6 runs of EXPLAIN ANALYZE (3 with on, 3 with off) finished in about 13 seconds accordingly to the timing of the log messages, and this issue is a duplicate of #47401.

I'll temporarily keep it open so that release folks will see this: it is not a release blocker.

cockroach-teamcity · 2020-06-10T19:20:15Z

(roachtest).tpchvec/perf failed on provisional_202006091546_v20.1.2@92670dd72acad03bd9de54a197d66ef423a90183:

The test failed on branch=provisional_202006091546_v20.1.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:306,tpchvec.go:476,tpchvec.go:488,test_runner.go:753: [q7] vec ON is slower by 82.81% than vec OFF
		vec ON times: [18.63 19.36 19.48]
		vec OFF times: [10.39 10.59 20.78]

More

Artifacts: /tpchvec/perf
Related:

roachtest: tpchvec/perf failed #49955 roachtest: tpchvec/perf failed C-test-failure O-roachtest O-robot branch-release-20.1 release-blocker
roachtest: tpchvec/perf failed #47401 roachtest: tpchvec/perf failed C-test-failure O-roachtest O-robot branch-master release-blocker

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

yuzefovich · 2020-06-11T21:53:41Z

Interestingly, this occurred with running vectorize=off config first, and the slow run is the third one, and then with vectorize=on. Also, running EXPLAIN ANALYZE with vectorize=off took about 24s on the first run, then 15s on the second, but the third run as well all three runs of EXPLAIN ANALYZE with vectorize=on took roughly 12 seconds.

The vast majority of the runtime of the query is spent in index stall time of a lookup join (about 21s in the first, 12s in the second, and 10s in all consequent runs). @jordanlewis @asubiotto do you have any insight into why we would get this much variance (note that the slow times occurred with vectorize=off) for this?

asubiotto · 2020-06-12T09:30:09Z

I don't really know. One explanation (sort of speculative) I had was that it could be the lease location of the range we're performing a lookup against. Join readers are planned on the node where the left side is read and doesn't pay attention to the right (this is what #47472 mentions in the "out of scope" section). Maybe the lease for the right range gets transferred for some reason (possibly too much load), which causes a performance hit?

The time slice we care about is:

I200610 18:37:43.075892 10 workload/tpch/tpch.go:470  [q7] returned 4 rows after 10.39 seconds

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
I200610 18:38:03.856409 10 workload/tpch/tpch.go:470  [q7] returned 4 rows after 20.78 seconds

There's a log message after this event:

I200610 18:38:31.075541 158 kv/kvserver/store_rebalancer.go:236  [n1,s1,store-rebalancer] considering load-based lease transfers for s1 with 222.86 qps (mean=117.14, upperThreshold=217.14)
I200610 18:38:31.078189 158 kv/kvserver/store_rebalancer.go:272  [n1,s1,store-rebalancer] load-based lease transfers successfully brought s1 down to 209.59 qps (mean=117.14, upperThreshold=217.14)

But this happened after the time jump and between two q7 runs that don't have any difference in performance so I don't know if this tells us much. I would try enabling logging for lease movements (I think you can set a high vmodule on replicate_queue, store_rebalancer @nvanbenschoten is there anything else to enable?).

One other possibly useful thing to do is to swap out calls to EXPLAIN ANALYZE with a full trace collection. We'll still have stats in the trace and be able to tell which join reader had a long stall time. The trace will then serve to show us how the lookups were performed and where time was spent.

yuzefovich · 2020-06-16T03:06:12Z

One other possibly useful thing to do is to swap out calls to EXPLAIN ANALYZE with a full trace collection. We'll still have stats in the trace and be able to tell which join reader had a long stall time. The trace will then serve to show us how the lookups were performed and where time was spent.

That's a good point, I opened a PR to use EXPLAIN ANALYZE (DEBUG) that will contain the trace.

49934: colexec: optimize aggregate functions r=yuzefovich a=yuzefovich **colexec: simplify aggregateFunc interface a bit** This commit simplifies `SetOutputIndex` to no longer be required to unset nulls after the desired output index. The unsetting is only necessary when we have "overflow" tuples in the ordered aggregator, so it makes that it should be the one resetting the nulls. Additionally, the implementation of `SetOutputIndex` no longer checks whether `curIdx` is not -1 (meaning the function has already performed some aggregation). I think this was an artifact from the past and is no longer necessary. This commit also makes a minor performance optimization to unwrap the output batch in the ordered aggregator. Release note: None **colexec: optimize any_not_null func for hash aggregation** When we're performing hash aggregation, we know that there will be exactly one group. This insight allows us optimize the aggregate functions, and this commit performs the optimization of any_not_null. Now we will have two separate files for both aggregator kinds (hash and ordered), and that part is handled by a preprocessing step that replaces `_AGGKIND` template "variable" with the corresponding aggregator kind. I looked into introducing some base struct for aggregates to reuse, but it is rather complicated at the moment - some aggregate functions iterate over `[]*oneArgOverload`, and we could flatten that out. However, other functions operate with a custom "tmpl info" struct. Furthermore, extending the struct to include `AggKind` variable would require refactoring the templates themselves because we assume that `.` (dot) of the template usually is at `lastArgWidthOverload` struct. Probably in order to provide some "aggregate base tmpl info" struct that would help with hash vs ordered aggregation generation we would need to introduce an interface which custom tmpl infos will implement. Release note: None **colexec: optimize all aggregate funcs for hash** This commit adds a simple optimization of omitting the check for a start of a new group in all aggregate functions if we're performing hash aggregation. This change also allows us to slightly simplify the hash aggregator. Release note: None **colexec: optimize aggregateFunc contract a bit** This commit adjusts the contract of `aggregateFunc.Init` method to require that the very first tuple in the whole input should *not* have a `true` value set in `groups` vector. This requirement allows us to optimize all aggregate functions for the case of ordered aggregator. The ordered aggregator has been adjusted accordingly to satisfy the contract. This commit also marks `oneShotOp` as `NonExplainable`. Release note: None 50103: sql: add a view dep on sequence string arguments r=RichardJCai a=RichardJCai Add a dependency on sequences that are used via string argument when creating a view. Also fixed minor bug where column dependencies were not being added when a sequence is referred to in a currval call. Fixes (part of) #24556 and #50033 Release note (sql change): Sequences passed as a string argument into views are now tracked as a dependency. Example: CREATE VIEW v AS SELECT nextval('s') will now add a dependency on sequence s. 50247: colexec: fix resetting of the hashtable with distinct build mode r=yuzefovich a=yuzefovich Some time ago we introduced "distinct" build mode to the hash table which is used by the unordered distinct. In that addition we forgot to add one thing to be reset whenever the hash table is reset which causes partially ordered distinct to behave incorrectly. The damage, however, is negligible because the bug doesn't affect "default" build mode nor unordered distinct as is, only when we have partially ordered distinct which uses the unordered distinct internally and resets it between different "chunks". And we don't currently plan partially ordered distinct, so only our unit tests are able to hit it, and there is no need for a backport. Fixes: #50006. Release note: None 50253: roachtest: run EXPLAIN ANALYZE (DEBUG) in tpchvec/perf on failure r=yuzefovich a=yuzefovich In order to get more information that could explain occasional slowness on query 7 this commit updates the usage of `EXPLAIN ANALYZE` to `EXPLAIN ANALYZE (DEBUG)` which will provide us with traces and other useful information. The bundles are `curl`ed into `logs` folder so that the test runner fetched it alongside normal log files. Addresses: #50029. Release note: None 50263: roachtest: skip rotten cancel/tpcc test r=yuzefovich a=asubiotto This test is pretty flaky due to a number of reasons. It fails pretty noisiliy so skip it until we fix the underlying causes. Fixing this test is tracked in #42103 Release note: None 50267: logictest: add -print-blocklist-issues flag r=yuzefovich a=asubiotto Previously all issues associated with a blocklist were printed out, even when running a single test. This commit turns that printing off by default, which can be enabled by specifying the -print-blocklist-issues flag. Release note: None (testing change) Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: richardjcai <[email protected]> Co-authored-by: Alfonso Subiotto Marques <[email protected]>

asubiotto · 2020-07-01T14:48:12Z

#47401

cockroach-teamcity added this to the 20.2 milestone Jun 9, 2020

cockroach-teamcity assigned andreimatei Jun 9, 2020

yuzefovich assigned yuzefovich and unassigned andreimatei Jun 10, 2020

yuzefovich mentioned this issue Jun 16, 2020

roachtest: run EXPLAIN ANALYZE (DEBUG) in tpchvec/perf on failure #50253

Merged

asubiotto closed this as completed Jul 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: tpchvec/perf failed #50029

roachtest: tpchvec/perf failed #50029

cockroach-teamcity commented Jun 9, 2020

yuzefovich commented Jun 10, 2020

cockroach-teamcity commented Jun 10, 2020

yuzefovich commented Jun 11, 2020

asubiotto commented Jun 12, 2020

yuzefovich commented Jun 16, 2020

asubiotto commented Jul 1, 2020

roachtest: tpchvec/perf failed #50029

roachtest: tpchvec/perf failed #50029

Comments

cockroach-teamcity commented Jun 9, 2020

yuzefovich commented Jun 10, 2020

cockroach-teamcity commented Jun 10, 2020

yuzefovich commented Jun 11, 2020

asubiotto commented Jun 12, 2020

yuzefovich commented Jun 16, 2020

asubiotto commented Jul 1, 2020