
TPC-DS q23a.sql will fail in the next few rounds when we run the power test for 20 rounds with the master branch. #412

Open
haojinIntel opened this issue Jul 14, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@haojinIntel
Collaborator

haojinIntel commented Jul 14, 2021

We ran the TPC-DS power test for 20 rounds to verify the stability of gazelle_plugin. The cluster contains 3 workers, and each worker has 512 GB of DRAM. The data scale is 1.5 TB. q23a.sql fails within the next few rounds, which causes the thrift-server to fail. The error message is shown below:


. . . . . . . . . . . . . . > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 5112 (run at AccessController.java:0) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: /disk/1/data/nm/usercache/root/appcache/application_1626231515702_0002/blockmgr-01d2a6f0-a284-495d-8ae8-a7e1e619cf41/09/shuffle_1368_867103_0.index    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:770)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:685)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)      at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)       at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)       at com.intel.oap.expression.ColumnarSorter$$anon$1.hasNext(ColumnarSorter.scala:143)    at com.intel.oap.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:47)         at com.intel.oap.execution.ColumnarWholeStageCodegenExec.$anonfun$doExecuteColumnar$5(ColumnarWholeStageCodegenExec.scala:408)  at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:106)    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)   at org.apache.spark.scheduler.Task.run(Task.scala:131)  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)      at java.lang.Thread.run(Thread.java:745) Caused by: java.nio.file.NoSuchFileException: /disk/1/data/nm/usercache/root/appcache/application_1626231515702_0002/blockmgr-01d2a6f0-a284-495d-8ae8-a7e1e619cf41/09/shuffle_1368_867103_0.index      at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)       at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)    at java.nio.file.Files.newByteChannel(Files.java:361)   at java.nio.file.Files.newByteChannel(Files.java:407)   at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:351)         at org.apache.spark.storage.BlockManager.getHostLocalShuffleData(BlockManager.scala:629)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlock(ShuffleBlockFetcherIterator.scala:450)      at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2(ShuffleBlockFetcherIterator.scala:529)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2$adapted(ShuffleBlockFetcherIterator.scala:528)  at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)      at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)     at scala.collection.immutable.List.forall(List.scala:89)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1(ShuffleBlockFetcherIterator.scala:528)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1$adapted(ShuffleBlockFetcherIterator.scala:527)  at scala.collection.Iterator.forall(Iterator.scala:953)         at scala.collection.Iterator.forall$(Iterator.scala:951)        at scala.collection.AbstractIterator.forall(Iterator.scala:1429)        at scala.collection.IterableLike.forall(IterableLike.scala:77)  at scala.collection.IterableLike.forall$(IterableLike.scala:76)         at scala.collection.AbstractIterable.forall(Iterable.scala:56)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchMultipleHostLocalBlocks(ShuffleBlockFetcherIterator.scala:527)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlocks(ShuffleBlockFetcherIterator.scala:516)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$initialize$4(ShuffleBlockFetcherIterator.scala:562)    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$initialize$4$adapted(ShuffleBlockFetcherIterator.scala:562)    at scala.Option.foreach(Option.scala:407)       at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:562)       at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:171)   at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)      at org.apache.spark.sql.execution.ShuffledColumnarBatchRDD.compute(ShuffledColumnarBatchRDD.scala:128)  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     ... 28 more
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:361)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:263)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
        at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:263)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:258)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:272)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 5112 (run at AccessController.java:0) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: /disk/1/data/nm/usercache/root/appcache/application_1626231515702_0002/blockmgr-01d2a6f0-a284-495d-8ae8-a7e1e619cf41/09/shuffle_1368_867103_0.index         at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:770)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:685)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)      at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)       at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)       at com.intel.oap.expression.ColumnarSorter$$anon$1.hasNext(ColumnarSorter.scala:143)    at com.intel.oap.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:47)         at com.intel.oap.execution.ColumnarWholeStageCodegenExec.$anonfun$doExecuteColumnar$5(ColumnarWholeStageCodegenExec.scala:408)  at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:106)    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)   at org.apache.spark.scheduler.Task.run(Task.scala:131)  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)      at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)      at java.lang.Thread.run(Thread.java:745) Caused by: java.nio.file.NoSuchFileException: /disk/1/data/nm/usercache/root/appcache/application_1626231515702_0002/blockmgr-01d2a6f0-a284-495d-8ae8-a7e1e619cf41/09/shuffle_1368_867103_0.index      at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)       at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)    at java.nio.file.Files.newByteChannel(Files.java:361)   at java.nio.file.Files.newByteChannel(Files.java:407)   at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:351)         at org.apache.spark.storage.BlockManager.getHostLocalShuffleData(BlockManager.scala:629)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlock(ShuffleBlockFetcherIterator.scala:450)      at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2(ShuffleBlockFetcherIterator.scala:529)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2$adapted(ShuffleBlockFetcherIterator.scala:528)  at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)      at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)     at scala.collection.immutable.List.forall(List.scala:89)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1(ShuffleBlockFetcherIterator.scala:528)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1$adapted(ShuffleBlockFetcherIterator.scala:527)  at scala.collection.Iterator.forall(Iterator.scala:953)         at scala.collection.Iterator.forall$(Iterator.scala:951)        at scala.collection.AbstractIterator.forall(Iterator.scala:1429)        at scala.collection.IterableLike.forall(IterableLike.scala:77)  at scala.collection.IterableLike.forall$(IterableLike.scala:76)         at scala.collection.AbstractIterable.forall(Iterable.scala:56)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchMultipleHostLocalBlocks(ShuffleBlockFetcherIterator.scala:527)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlocks(ShuffleBlockFetcherIterator.scala:516)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$initialize$4(ShuffleBlockFetcherIterator.scala:562)    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$initialize$4$adapted(ShuffleBlockFetcherIterator.scala:562)    at scala.Option.foreach(Option.scala:407)       at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:562)       at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:171)   at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)      at org.apache.spark.sql.execution.ShuffledColumnarBatchRDD.compute(ShuffledColumnarBatchRDD.scala:128)  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     ... 28 more
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1763)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2437)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) (state=,code=0)
Closing: 0: jdbc:hive2://vsr215:10000

Our test passes when the latest commit is "e4782f0aaa6cee899bed5a3cb1d9ba2b2c219461".
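For context, a power-test round here replays the TPC-DS queries against the Spark Thrift Server over JDBC. A minimal sketch of such a loop is shown below; it is not the actual harness used in this report, and the query path, user, and statement splitting are assumptions.

```scala
import java.sql.DriverManager
import scala.io.Source

// Minimal sketch of a 20-round power-test loop for q23a.sql against the
// Spark Thrift Server (not the actual harness used in this report).
object PowerTestLoop {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // JDBC URL taken from the log above; user and password are placeholders.
    val conn = DriverManager.getConnection("jdbc:hive2://vsr215:10000", "root", "")
    val stmt = conn.createStatement()

    // Hypothetical location of the generated TPC-DS query; adjust to your layout.
    val statements = Source.fromFile("/path/to/tpcds/queries/q23a.sql").mkString
      .split(";").map(_.trim).filter(_.nonEmpty)

    try {
      for (round <- 1 to 20) {
        println(s"Round $round: running q23a.sql")
        // q23a.sql may contain more than one statement, so run each in turn.
        statements.foreach(s => stmt.execute(s))
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
```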

haojinIntel added the bug (Something isn't working) label on Jul 14, 2021
haojinIntel changed the title from "TPC-DS q23a.sql will fail in 2nd round when we run power test for 20 rounds with master branch." to "TPC-DS q23a.sql will fail in the next few round when we run power test for 20 rounds with master branch." on Jul 14, 2021
@haojinIntel
Collaborator Author

@zhouyuan @zhixingheyi-tian Please help track this issue. Thanks!

@zhixingheyi-tian
Collaborator

cc @weiting-chen

@haojinIntel
Collaborator Author

If we use another cluster that contains 1 master and 3 workers, where each worker has 384 GB of DRAM, q23a.sql fails in the 1st round. This is the tpcds-kit version I used: https://github.com/davies/tpcds-kit.git.

@zhouyuan
Collaborator

zhouyuan commented Jul 15, 2021

@haojinIntel
Thanks for reporting. The log shows that some containers were killed during the tests, probably due to a memory issue.

This is due to the large memory footprint and the lack of spill support. We have two pending PRs that should fix this; a sketch of settings that may help in the meantime follows the PR links:
#387
#369
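
Until those PRs land, a few standard Spark settings may be worth experimenting with. The sketch below is illustrative only: the values are assumptions, not tuned for this cluster, and none of them is a substitute for proper spill support.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative mitigations while spill support is pending (values are assumptions).
val spark = SparkSession.builder()
  .appName("tpcds-power-test")
  // Extra off-heap headroom for native/columnar allocations, so YARN is less
  // likely to kill the container for exceeding its memory limit.
  .config("spark.executor.memoryOverhead", "16g")
  // Fetch shuffle blocks over the network instead of reading a neighbouring
  // executor's local disk (the failing path in the trace above is the
  // host-local read of a shuffle index file).
  .config("spark.shuffle.readHostLocalDisk", "false")
  // Allow more consecutive stage attempts before the job is aborted
  // (the default of 4 is what aborted ShuffleMapStage 5112 above).
  .config("spark.stage.maxConsecutiveAttempts", "8")
  .enableHiveSupport()
  .getOrCreate()
```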
