Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: timeouts due to missing cancellation checks (causes costfuzz failure) #87482

Closed
cockroach-teamcity opened this issue Sep 7, 2022 · 12 comments · Fixed by #88302
Closed
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Sep 7, 2022

roachtest.costfuzz failed with artifacts on master @ 2372698da1dfacb90f60c6a63f2c1298d1db16b8:

test artifacts and logs in: /artifacts/costfuzz/run_1
	test_runner.go:1028,test_runner.go:927: test timed out (1h0m0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-19365

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 7, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Sep 7, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Sep 7, 2022
@DrewKimball
Copy link
Collaborator

Heap profile doesn't look suspicious. Here's the cpu profile:
Screen Shot 2022-09-07 at 12 09 23 PM
Maybe it would be a good idea to add some cancel-checking in the diskqueue methods?

@yuzefovich
Copy link
Member

it would be a good idea to add some cancel-checking in the diskqueue methods?

Yes, I think so. I actually just ran into the missing cancellation checks in the external sorter when debugging #87510 where the external seems to proceed even after the connection that needed it is closed. We should also backport those cancel checks.

@DrewKimball
Copy link
Collaborator

I think we may have to add cancel checking for all operators that buffer before performing a lot of work, specifically in the loops that process the buffered data.

@cockroach-teamcity
Copy link
Member Author

roachtest.costfuzz failed with artifacts on master @ e39111b2e714375faa0facc05e51e8f619a55b21:

		  |   	strings.Join({
		  |   		"-522806371908919723000000000000000.057361179212762389",
		  | - 		"0",
		  |   		",true",
		  |   	}, ""),
		  |   	"-5281573567465919405000050343168773.00290717,true",
		  |   	"-53008036059.78642014,true",
		  |   	"-53012568.62456803803,true",
		  |   	strings.Join({
		  |   		"-53930584384214236340831128.935",
		  | - 		"0",
		  |   		",true",
		  |   	}, ""),
		  | - 	"-54151595530771640900000000000000,true",
		  |   	"-553376140143201345099999531541229259.7143749,true",
		  |   	"-58828056393909638100000000010.1,true",
		  |   	"-598243635.3590026986,true",
		  |   	"-6.5336706868357837291E+37,true",
		  | + 	"-6.720430994736931567E+36,true",
		  |   	"-60667836961626125801.3612886988,true",
		  |   	"-625381854099381688.1239,true",
		  |   	"-63683.76094893934931,true",
		  |   	"-6601987234273.212696,true",
		  |   	"-6704655693005500528.2,true",
		  | - 	"-6720430994736931567000000000000000000,true",
		  |   	"-68479622903227596.51,true",
		  |   	strings.Join({
		  |   		"-68930104522260656.56688190538679",
		  | - 		"0",
		  |   		",true",
		  |   	}, ""),
		  | - 	"-716364259164750370200000000000000000,true",
		  | + 	"-7.163642591647503702E+35,true",
		  |   	... // 192 identical, 1 removed, 1 inserted, and 16 modified elements
		  |   }
		  | sql: SELECT
		  | 	tab_556.col2_13 AS col_1555, true AS col_1556
		  | FROM
		  | 	defaultdb.public.table2@[0] AS tab_556
		  | GROUP BY
		  | 	tab_556.col2_5, tab_556.col2_13
		  | ORDER BY
		  | 	tab_556.col2_5,
		  | 	tab_556.col2_5 ASC,
		  | 	tab_556.col2_5,
		  | 	tab_556.col2_13 ASC,
		  | 	tab_556.col2_13 DESC,
		  | 	tab_556.col2_13,
		  | 	tab_556.col2_5 ASC
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.leafError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@DrewKimball
Copy link
Collaborator

More decimal precision/format trouble, but the sorting looks suspicious; look at this snippet:

		  |   	"-6.5336706868357837291E+37,true",
		  | + 	"-6.720430994736931567E+36,true",
		  |   	"-60667836961626125801.3612886988,true",
		  |   	"-625381854099381688.1239,true",
		  |   	"-63683.76094893934931,true",
		  |   	"-6601987234273.212696,true",
		  |   	"-6704655693005500528.2,true",
		  | - 	"-6720430994736931567000000000000000000,true",
		  |   	"-68479622903227596.51,true",

@mgartner
Copy link
Collaborator

@yuzefovich @DrewKimball Is the timeout a release blocker, i.e., a regression?

@yuzefovich
Copy link
Member

Missing cancellation checks is not a release blocker - it's a long standing problem, removing the label.

@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 12, 2022
@mgartner
Copy link
Collaborator

Thanks!

@cockroach-teamcity

This comment was marked as duplicate.

@mgartner mgartner changed the title roachtest: costfuzz failed roachtest: costfuzz failed - timeouts due to missing cancellation checks Sep 13, 2022
@mgartner
Copy link
Collaborator

I created #87919 to track the last failure (hidden now). It looks like an actual correctness bug, not a flake.

@cockroach-teamcity

This comment was marked as duplicate.

@mgartner mgartner changed the title roachtest: costfuzz failed - timeouts due to missing cancellation checks roachtest: timeouts due to missing cancellation checks (causes costfuzz failure) Sep 15, 2022
@DrewKimball
Copy link
Collaborator

Most recent failure is #87919

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants