Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trino-hive -P test-fault-tolerant-execution times out #10773

Closed
sopel39 opened this issue Jan 25, 2022 · 8 comments · Fixed by #10987
Closed

trino-hive -P test-fault-tolerant-execution times out #10773

sopel39 opened this issue Jan 25, 2022 · 8 comments · Fixed by #10987
Assignees
Labels

Comments

@sopel39
Copy link
Member

sopel39 commented Jan 25, 2022

f3d6ef9a814b] :: elapsed 24ms :: planning 5ms :: waiting 1ms :: scheduling 5ms :: running 2ms :: finishing 12ms :: begin 2022-01-24T16:55:11.274-06:00 :: end 2022-01-24T16:55:11.298-06:00
2022-01-24T22:55:11.3505406Z 2022-01-24T16:55:11.338-0600	INFO	dispatcher-query-141	io.trino.event.QueryMonitor	TIMELINE: Query 20220124_225511_04198_3c2xn :: Transaction:[84485524-a80b-4a29-a149-360d2ddeabb2] :: elapsed 32ms :: planning 5ms :: waiting 1ms :: scheduling 10ms :: running 2ms :: finishing 15ms :: begin 2022-01-24T16:55:11.303-06:00 :: end 2022-01-24T16:55:11.335-06:00
2022-01-24T22:55:11.3506814Z 2022-01-24T16:55:11.342-0600	INFO	dispatcher-query-48	io.trino.event.QueryMonitor	TIMELINE: Query 20220124_225511_04199_3c2xn :: Transaction:[9a01988b-0f60-4d79-8a32-b01547dc390c] :: elapsed 1ms :: planning 0ms :: waiting 0ms :: scheduling 1ms :: running 0ms :: finishing 1ms :: begin 2022-01-24T16:55:11.340-06:00 :: end 2022-01-24T16:55:11.341-06:00
2022-01-24T22:55:11.3508072Z 2022-01-24T16:55:11.342-0600	INFO	pool-3-thread-1	io.trino.testng.services.ProgressLoggingListener	[TEST SUCCESS] io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest.testSelectWithNoColumns; (took: 1.5 seconds)
2022-01-24T22:55:11.3509493Z 2022-01-24T16:55:11.343-0600	INFO	pool-3-thread-1	io.trino.testng.services.ProgressLoggingListener	[TEST START] io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest.testShowColumnMetadata
2022-01-24T23:03:13.5473374Z 2022-01-24T16:55:11.349-0600	INFO	dispatcher-query-141	io.trino.event.QueryMonitor	TIMELINE: Query 20220124_225511_04200_3c2xn :: Transaction:[f28ca484-f339-46fe-b4f5-fd9e701ac032] :: elapsed 2ms :: planning 0ms :: waiting 0ms :: scheduling 2ms :: running 0ms :: finishing 2ms :: begin 2022-01-24T16:55:11.346-06:00 :: end 2022-01-24T16:55:11.348-06:00
2022-01-24T23:03:13.5498209Z 2022-01-24T17:03:13.537-0600	WARN	TestHangMonitor	io.trino.testng.services.LogTestDurationListener	No test started or completed in 8.00m. Running tests:
2022-01-24T23:03:13.5499088Z 	io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest::testInvalidPartitionValue running for 9.34m
2022-01-24T23:03:13.5500400Z 	io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest running for 15.71m
2022-01-24T23:03:13.5501104Z 	io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest::testShowColumnMetadata running for 8.03m.
2022-01-24T23:03:13.5574467Z 
2022-01-24T23:03:13.5575020Z Full Thread Dump:
2022-01-24T23:03:13.5575596Z "main" prio=5 Id=1 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1967da55
2022-01-24T23:03:13.5576141Z 	at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
2022-01-24T23:03:13.5755700Z 	-  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1967da55
2022-01-24T23:03:13.5756456Z 	at [email protected]/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
2022-01-24T23:03:13.5757218Z 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
2022-01-24T23:03:13.5758136Z 	at [email protected]/java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1454)
2022-01-24T23:03:13.5758808Z 	at app//org.testng.TestRunner.privateRun(TestRunner.java:735)
2022-01-24T23:03:13.5759307Z 	at app//org.testng.TestRunner.run(TestRunner.java:610)
2022-01-24T23:03:13.5759706Z 	at app//org.testng.SuiteRunner.runTest(SuiteRunner.java:387)
2022-01-24T23:03:13.5760116Z 	at app//org.testng.SuiteRunner.runSequentially(SuiteRunner.java:382)
2022-01-24T23:03:13.5760550Z 	at app//org.testng.SuiteRunner.privateRun(SuiteRunner.java:340)
2022-01-24T23:03:13.5760953Z 	at app//org.testng.SuiteRunner.run(SuiteRunner.java:289)
2022-01-24T23:03:13.5761379Z 	at app//org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
2022-01-24T23:03:13.5761815Z 	at app//org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
2022-01-24T23:03:13.5762249Z 	at app//org.testng.TestNG.runSuitesSequentially(TestNG.java:1293)
2022-01-24T23:03:13.5762668Z 	at app//org.testng.TestNG.runSuitesLocally(TestNG.java:1218)
2022-01-24T23:03:13.5763057Z 	at app//org.testng.TestNG.runSuites(TestNG.java:1133)
2022-01-24T23:03:13.5763395Z 	at app//org.testng.TestNG.run(TestNG.java:1104)
2022-01-24T23:03:13.5764082Z 	at app//org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:136)
2022-01-24T23:03:13.5764731Z 	at app//org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeMulti(TestNGDirectoryTestSuite.java:193)
2022-01-24T23:03:13.5765424Z 	at app//org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:94)
2022-01-24T23:03:13.5766032Z 	at app//org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:145)
2022-01-24T23:03:13.5766614Z 	at app//org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
2022-01-24T23:03:13.5767184Z 	at app//org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
2022-01-24T23:03:13.5767704Z 	at app//org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
2022-01-24T23:03:13.5768214Z 	at app//org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)
2022-01-24T23:03:13.5768498Z 
2022-01-24T23:03:13.5768653Z 

https://github.com/trinodb/trino/runs/4928406693?check_suite_focus=true

@sopel39 sopel39 added the test label Jan 25, 2022
@sopel39
Copy link
Member Author

sopel39 commented Jan 25, 2022

cc @losipiuk @arhimondr

@arhimondr arhimondr self-assigned this Jan 25, 2022
@arhimondr
Copy link
Contributor

I will take a look

@arhimondr
Copy link
Contributor

I spent a couple of hours yesterday trying to reproduce it locally. I tried to throttle CPU and play with OS scheduling parameters and it still passed more than 100000 times.

Reproducing concurrency problems is a real challenge.

I found this research from Microsoft when they explore a scheduling algorithm designed to maximize a chance of concurrency bug being discovered: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/asplos277-pct.pdf

Ideally it would be great if we had tooling to help us discover concurrency related problems. I will try to give it a read during the weekend and see if it is anyhow applicable:

@findepi
Copy link
Member

findepi commented Feb 2, 2022

2022-02-02T03:30:12.303-0600	WARN	TestHangMonitor	io.trino.testng.services.LogTestDurationListener	No test started or completed in 8.00m. Running tests:
	io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest::testInvalidPartitionValue running for 9.73m
	io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest running for 18.32m
	io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest::testShowColumnMetadata running for 8.04m.

full thread dump @ https://github.com/trinodb/trino/runs/5033675081?check_suite_focus=true

@arhimondr
Copy link
Contributor

arhimondr commented Feb 2, 2022

Hmm, it looks like something could be wrong specifically when these two tests are running in parallel

	io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest::testInvalidPartitionValue running for 9.73m
	io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest running for 18.32m
	io.trino.plugin.hive.TestHiveFaultTolerantExecutionConnectorTest::testShowColumnMetadata running for 8.04m.

The other build got stuck in exactly same way.

@arhimondr
Copy link
Contributor

arhimondr commented Feb 8, 2022

I think I got to the bottom of it

  1. Jetty propagates interrupted flag incorrectly between independent requests: Interrupt flag is not always cleared in between requests jetty/jetty.project#7548
  2. Task 1 interrupts owner threads of its drivers, and one of the threads happens to be a Jetty thread:
  3. The same Jetty Thread is reused to process TaskResource#createOrUpdateTask of a different task. Because of the interrupt flag being set the createOrUpdateTask fails to deliver splits to operators: https://gist.github.com/arhimondr/81d80f27722d5429bf5e210034660369
  4. Since with fault tolerant execution tasks are created with all splits right away no subsequent createOrUpdateTask requests are sent. As the splits are not delivered to operators the task is getting stuck.

@sopel39
Copy link
Member Author

sopel39 commented Feb 8, 2022

Nice!. So this is really pre-existing bug then. Can it happen in non-fault tolerant execution?

@arhimondr
Copy link
Contributor

@sopel39 It can, but it is significantly less likely, as in normal execution there's usually multiple task updates being sent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

3 participants