Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpchvec/perf failed #47401

Closed
cockroach-teamcity opened this issue Apr 12, 2020 · 18 comments
Closed

roachtest: tpchvec/perf failed #47401

cockroach-teamcity opened this issue Apr 12, 2020 · 18 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).tpchvec/perf failed on master@7b0f60fe2034ba8677242dbcdad86d3e5587c0d4:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200412-1869957/tpchvec/perf/run_1
	tpchvec.go:644,tpchvec.go:659,test_runner.go:753: [q12] vec ON is slower by 57.50% than vec OFF
		vec ON times: [9.81 10.19 10.41]
		vec OFF times: [6.28 6.47 6.52]

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 12, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Apr 12, 2020
@yuzefovich yuzefovich assigned yuzefovich and unassigned andreimatei Apr 12, 2020
@yuzefovich
Copy link
Member

Hm, I cannot reproduce it locally. The run times for the last five days of query 12:

vec ON times: [9.12 9.19 9.35]	 vec OFF times: [8.81 8.82 8.82]
vec ON times: [6.74 6.84 10.81]	 vec OFF times: [6.6 6.7 6.82]
vec ON times: [6.12 6.22 9.93]	 vec OFF times: [5.89 5.89 6.24]
vec ON times: [6.4 6.4 6.55]	 vec OFF times: [6.32 6.37 6.45]
vec ON times: [6.37 6.76 7.02]	 vec OFF times: [6.22 6.35 6.45]

So we had one slow run in two out of five cases, and it's always the first one (the runtimes are sorted). I'm not sure what to make of it though.

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on master@b644071166e3a9db1fd595154a82e472d052b258:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200413-1870981/tpchvec/perf/run_1
	tpchvec.go:644,tpchvec.go:659,test_runner.go:753: [q15] vec ON is slower by 15.44% than vec OFF
		vec ON times: [4.4 4.56 4.57]
		vec OFF times: [3.92 3.95 4.05]

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@asubiotto
Copy link
Contributor

I also ran into this when running ./bin/roachtest on my refactor branch.

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on master@b5b4a9a55f1122ef9c82b968aa5c8cc137c7e281:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:610,tpchvec.go:715,tpchvec.go:725,test_runner.go:753: [q6] vec ON is slower by 23.55% than vec OFF
		vec ON times: [9.23 9.39 9.91]
		vec OFF times: [7.56 7.6 8.21]

More

Artifacts: /tpchvec/perf

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on master@666a0ac62832b6884bc6b039b4c944dbb42924aa:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:610,tpchvec.go:715,tpchvec.go:725,test_runner.go:753: [q7] vec ON is slower by 97.57% than vec OFF
		vec ON times: [18.16 18.67 19.15]
		vec OFF times: [9.26 9.45 19.19]

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on master@d620f6242ad43481e61a6af19416733cf05233a4:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:226,tpchvec.go:321,tpchvec.go:331,test_runner.go:753: [q7] vec ON is slower by 96.64% than vec OFF
		vec ON times: [9.85 18.13 18.5]
		vec OFF times: [9.18 9.22 14.1]

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on master@c3bb2bce3d69d0ceef20797d856a6ac107a3ef47:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:239,tpchvec.go:400,tpchvec.go:410,test_runner.go:753: [q1] unexpectedly wrong number of run times recorded with vec ON config: []

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on master@683f0d561bf9b73902edb27d681bca5333bdad60:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:230,tpchvec.go:400,tpchvec.go:410,test_runner.go:753: [q7] vec ON is slower by 99.89% than vec OFF
		vec ON times: [17.83 18.29 18.37]
		vec OFF times: [9.1 9.15 10.43]

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

craig bot pushed a commit that referenced this issue May 8, 2020
48522: geo/geo*fn: error when comparing two geometries of different SRIDs r=sumeerbhola a=otan

When operating on more than one geo type, we must ensure that if we compare
SRIDs that mismatch, we report an error. We've done this for all binary
predicates implemented so far in Geometry and Geography.

Furthermore, this PR adds an interface for Geospatial types. It makes
things a little bit neater, and we can extend this as we go along.

Release note: None

48540: tpchvec: enable some logging to debug perf test failure r=yuzefovich a=yuzefovich

My current theory is that we're seeing random performance failures due
to spilling to disk. To check that we now will enable some verbose
logging for the components that can spill to disk.

Addresses: #47401.

Release note: None

Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@cockroach-teamcity
Copy link
Member Author

(roachtest).tpchvec/perf failed on master@c1abb272c94a437f1df9175fc30dc6a6729d3338:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpchvec/perf/run_1
	tpchvec.go:246,tpchvec.go:416,tpchvec.go:426,test_runner.go:753: [q7] vec ON is slower by 96.12% than vec OFF
		vec ON times: [18.89 19.69 20.15]
		vec OFF times: [9.89 10.04 18.99]

More

Artifacts: /tpchvec/perf
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@yuzefovich
Copy link
Member

Hm, logs indicate that the vectorized flow was not setup for query 7 for some reason - there should be messages like vectorized flow setup succeeded around these times:

I200510 09:17:30.923854 10 workload/tpch/tpch.go:465  [q6] returned 1 rows after 10.84 seconds
I200510 09:17:51.446629 25 workload/tpch/tpch.go:465  [q7] returned 4 rows after 18.89 seconds
I200510 09:18:11.594030 25 workload/tpch/tpch.go:465  [q7] returned 4 rows after 20.15 seconds

but all we have is the runtime stats at that period of time.

@asubiotto
Copy link
Contributor

Something must be wrong with the logging, right? Because we also print failed to vectorize in vectorized_flow with verbosity 1 if there's an issue.

@yuzefovich
Copy link
Member

Something must be wrong with the logging, right? Because we also print failed to vectorize in vectorized_flow with verbosity 1 if there's an issue.

I don't think so - we will print this message only if we're trying to set up a vectorized flow and we fail, but if SupportsVectorized check fails, then we will be planning row-based flow right away. I have no idea why SupportsVectorized check would fail though.

craig bot pushed a commit that referenced this issue May 22, 2020
49397: roachtest: run EXPLAIN ANALYZE when tpchvec/perf fails r=yuzefovich a=yuzefovich

**workload: switch TPCH queries mapping from using string to int**

It also exports `NumQueries` and `QueriesByNumber` fields of `tpch`
package to be used by `tpchvec` roachtest.

Release note: None

**roachtest: minor cleanup of tpchbench**

This commit refactors `tpchbench` test slightly in order to hide `tpch`
bench type from the whole `roachtest` package.

Release note: None

**roachtest: run EXPLAIN ANALYZE when tpchvec/perf fails**

We have a mysterious rare failure on query 7 in which `vectorize=on`
performs significantly worse than `vectorize=off`. I'm out of possible
explanations for this, so this commit adds an ability to run `EXPLAIN
ANALYZE` on the query with both `vectorize` options when the slowness
threshold is exceeded. Hopefully it'll give us some insight into the
perf failures.

Addresses: #47401.

Release note: None

49429: *: replace test usages of context.TODO() with context.Background() r=otan a=otan

*: replace test usages of context.TODO() with context.Background()

This commit replaces all usages of context.TODO() with context.Background()
in tests, and adds a lint rule ensuring context.TODO() does not appear.
Tests should always be initiating a new context if they are calling a
function for the first time - it is now "unknown" and hence not a "TODO"
application.

Also added a new test in `lint_test.go` for this.

Release note: None



Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
@yuzefovich
Copy link
Member

I'm inclined to close this issue as well because it appears to be occurring as if there is some infrastructure flake going on rarely. Additionally if it does occur again, we should have some useful info to dig into. @asubiotto thoughts?

@asubiotto
Copy link
Contributor

Maybe it's useful to keep this "history" because it's still an issue. If it is some infrastructure flake, it'd be useful to rope dev-inf in.

@yuzefovich
Copy link
Member

Well, we will remember that this particular github issue exists (even if it's closed), so we'll be able to inspect it if necessary. To me this issue seems unactionable at the moment, so I'm in favor of closing it.

@asubiotto
Copy link
Contributor

unactionable at the moment

I think this is the point. It might become actionable later on. I think it's only useful to close issues if the problem is resolved, we've decided we won't fix it or it's unactionable and always will be so. I think it's desirable to keep issues open where we're currently waiting for something to happen so that we don't lose the context (and don't have to keep everything in our mind).

@yuzefovich
Copy link
Member

Ok, I don't feel too strongly about it.

@asubiotto
Copy link
Contributor

I think I've changed my mind. Closing this and others. If it happens again, we'll see the trace and can discuss on that issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

No branches or pull requests

4 participants