
[BUG] q95 odd task failure in query95 at 30TB #8939

Closed
abellina opened this issue Aug 6, 2023 · 2 comments
Assignees
Labels
bug Something isn't working reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@abellina
Collaborator

abellina commented Aug 6, 2023

While testing #8936 I ran into an issue in one out of the 6 runs I did. It looks like we have reached a case where LazySpillableColumnarBatchImpl has neither a spilled nor a cached batch, so we get the suppressed exception: batch is closed (scroll below).

It is also surprising where the top-level exception comes from; I am a bit confused by this. I haven't really reviewed the code at this point, but I am filing this since I saw it. Query 95 finished with a task failure.
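To illustrate the suspected state, here is a minimal, hypothetical Java sketch (not the actual spark-rapids code; all names and types here are illustrative stand-ins) of a lazy-spillable holder that throws "batch is closed" once both its cached batch and its spillable handle are gone:

```java
import java.util.Optional;

// Hypothetical simplification of LazySpillableColumnarBatchImpl: the real class
// lives in spark-rapids (JoinGatherer.scala); names and types here are
// illustrative only, with a String standing in for a ColumnarBatch.
class LazySpillableBatchSketch {
    private Optional<String> cached = Optional.of("batch-data");
    private Optional<String> spillable = Optional.empty();

    // Move the cached batch into the spill store so it becomes spillable.
    void allowSpilling() {
        if (cached.isPresent() && spillable.isEmpty()) {
            spillable = cached; // the real code registers with the buffer catalog here
            cached = Optional.empty();
        }
    }

    // Fetch the batch: prefer the cache, fall back to the spill store,
    // and fail if neither holds anything -- the state hit in this bug.
    String getBatch() {
        if (cached.isPresent()) {
            return cached.get();
        }
        return spillable.map(s -> {
            cached = Optional.of(s); // re-materialize into the cache
            return s;
        }).orElseThrow(() -> new IllegalStateException("batch is closed"));
    }

    void close() {
        cached = Optional.empty();
        spillable = Optional.empty();
    }

    public static void main(String[] args) {
        LazySpillableBatchSketch b = new LazySpillableBatchSketch();
        b.allowSpilling();
        System.out.println(b.getBatch()); // recovered from the spill store
        b.close();
        try {
            b.getBatch(); // both slots empty: the failure mode reported here
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the real code the failure is a race or lifecycle bug, not a simple double close, but the end state is the same: neither slot holds a batch, so getBatch has nothing to return.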

Executor task launch worker for task 126.0 in stage 42.0 (TID 14511) 23/08/06 12:16:21:336 WARN RapidsBufferCatalog: Targeting a host memory size of 34299716096. Current total 27486502912. Current spillable 27486502912
Executor task launch worker for task 126.0 in stage 42.0 (TID 14511) 23/08/06 12:16:21:357 WARN RapidsBufferCatalog: Targeting a host memory size of 34292557376. Current total 27546525184. Current spillable 27546525184
Executor task launch worker for task 126.0 in stage 42.0 (TID 14511) 23/08/06 12:16:21:380 INFO DeviceMemoryEventHandler: Spilled 6359537936 bytes from the device store
Executor task launch worker for task 102.0 in stage 42.0 (TID 14487) 23/08/06 12:16:29:03 ERROR Executor: Exception in task 102.0 in stage 42.0 (TID 14487)
java.util.NoSuchElementException: Cannot locate buffers associated with ID: TempSpillBufferId(18060,temp_local_112999a3-2b92-4792-948b-fb42f51d9882)
	at com.nvidia.spark.rapids.RapidsBufferCatalog.$anonfun$acquireBuffer$1(RapidsBufferCatalog.scala:409)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at com.nvidia.spark.rapids.RapidsBufferCatalog.acquireBuffer(RapidsBufferCatalog.scala:405)
	at com.nvidia.spark.rapids.RapidsBufferCatalog.com$nvidia$spark$rapids$RapidsBufferCatalog$$updateUnderlyingRapidsBuffer(RapidsBufferCatalog.scala:387)
	at com.nvidia.spark.rapids.RapidsBufferCatalog.trackNewHandle(RapidsBufferCatalog.scala:148)
	at com.nvidia.spark.rapids.RapidsBufferCatalog.makeNewHandle(RapidsBufferCatalog.scala:129)
	at com.nvidia.spark.rapids.RapidsBufferCatalog.$anonfun$addBuffer$1(RapidsBufferCatalog.scala:231)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.RapidsBufferCatalog.addBuffer(RapidsBufferCatalog.scala:230)
	at com.nvidia.spark.rapids.RapidsBufferCatalog$.addBuffer(RapidsBufferCatalog.scala:882)
	at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:204)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:189)
	at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:142)
	at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$allowSpilling$1(JoinGatherer.scala:304)
	at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$allowSpilling$1$adapted(JoinGatherer.scala:300)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.allowSpilling(JoinGatherer.scala:300)
	at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.checkpoint(JoinGatherer.scala:323)
	at com.nvidia.spark.rapids.JoinGathererImpl.checkpoint(JoinGatherer.scala:553)
	at com.nvidia.spark.rapids.MultiJoinGather.checkpoint(JoinGatherer.scala:672)
	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$nextCbFromGatherer$2(AbstractGpuJoinIterator.scala:144)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$nextCbFromGatherer$1(AbstractGpuJoinIterator.scala:137)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.nextCbFromGatherer(AbstractGpuJoinIterator.scala:134)
	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$hasNext$5(AbstractGpuJoinIterator.scala:97)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:150)
	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:97)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$9(GpuSubPartitionHashJoin.scala:537)
	at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$9$adapted(GpuSubPartitionHashJoin.scala:537)
	at scala.Option.exists(Option.scala:376)
	at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.hasNext(GpuSubPartitionHashJoin.scala:537)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:177)
	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:176)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:176)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.populateCandidateBatches(GpuCoalesceBatches.scala:433)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$next$1(GpuCoalesceBatches.scala:595)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:575)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:248)
	at scala.collection.Iterator$$anon$22.next(Iterator.scala:1095)
	at com.nvidia.spark.rapids.CloseableBufferedIterator.next(CloseableBufferedIterator.scala:37)
	at com.nvidia.spark.rapids.CloseableBufferedIterator.next(CloseableBufferedIterator.scala:29)
	at scala.collection.Iterator$ConcatIterator.next(Iterator.scala:232)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$ConcatIterator.next(Iterator.scala:232)
	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$next$1(GpuExec.scala:183)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:182)
	at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:171)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.rapids.execution.GpuBatchSubPartitioner.partitionBatches(GpuSubPartitionHashJoin.scala:189)
	at org.apache.spark.sql.rapids.execution.GpuBatchSubPartitioner.initPartitions(GpuSubPartitionHashJoin.scala:169)
	at org.apache.spark.sql.rapids.execution.GpuBatchSubPartitioner.batchesCount(GpuSubPartitionHashJoin.scala:126)
	at org.apache.spark.sql.rapids.execution.GpuSubPartitionPairIterator.$anonfun$hasNextBatch$1(GpuSubPartitionHashJoin.scala:410)
	at org.apache.spark.sql.rapids.execution.GpuSubPartitionPairIterator.tryPullNextPair(GpuSubPartitionHashJoin.scala:419)
	at org.apache.spark.sql.rapids.execution.GpuSubPartitionPairIterator.hasNext(GpuSubPartitionHashJoin.scala:378)
	at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$5(GpuSubPartitionHashJoin.scala:522)
	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
	at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:150)
	at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.hasNext(GpuSubPartitionHashJoin.scala:522)
	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:177)
	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:176)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:176)
	at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1089)
	at com.nvidia.spark.rapids.CloseableBufferedIterator.hasNext(CloseableBufferedIterator.scala:38)
	at org.apache.spark.sql.rapids.execution.GpuBroadcastHashJoinExecBase.$anonfun$getBroadcastBuiltBatchAndStreamIter$2(GpuBroadcastHashJoinExecBase.scala:148)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at org.apache.spark.sql.rapids.execution.GpuBroadcastHashJoinExecBase.$anonfun$getBroadcastBuiltBatchAndStreamIter$1(GpuBroadcastHashJoinExecBase.scala:147)
	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:88)
	at org.apache.spark.sql.rapids.execution.GpuBroadcastHashJoinExecBase.getBroadcastBuiltBatchAndStreamIter(GpuBroadcastHashJoinExecBase.scala:146)
	at org.apache.spark.sql.rapids.execution.GpuBroadcastHashJoinExecBase.$anonfun$doColumnarBroadcastJoin$1(GpuBroadcastHashJoinExecBase.scala:181)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.lang.IllegalStateException: batch is closed
		at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$getBatch$3(JoinGatherer.scala:287)
		at scala.Option.getOrElse(Option.scala:189)
		at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.getBatch(JoinGatherer.scala:287)
		at com.nvidia.spark.rapids.JoinGathererImpl.$anonfun$gatherNext$1(JoinGatherer.scala:583)
		at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
		at com.nvidia.spark.rapids.JoinGathererImpl.gatherNext(JoinGatherer.scala:582)
		at com.nvidia.spark.rapids.MultiJoinGather.gatherNext(JoinGatherer.scala:653)
		at com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$nextCbFromGatherer$4(AbstractGpuJoinIterator.scala:148)
		at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRestoreOnRetry(RmmRapidsRetryIterator.scala:229)
		at com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$nextCbFromGatherer$3(AbstractGpuJoinIterator.scala:146)
		at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:431)
		at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:542)
		at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:468)
		at com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$nextCbFromGatherer$2(AbstractGpuJoinIterator.scala:145)
		at scala.Option.map(Option.scala:230)
		at com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$nextCbFromGatherer$1(AbstractGpuJoinIterator.scala:137)
		at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
		at com.nvidia.spark.rapids.AbstractGpuJoinIterator.nextCbFromGatherer(AbstractGpuJoinIterator.scala:134)
		at com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$hasNext$2(AbstractGpuJoinIterator.scala:86)
		at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
		at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:150)
		at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:86)
		at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
		at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$3(GpuSubPartitionHashJoin.scala:518)
		at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$3$adapted(GpuSubPartitionHashJoin.scala:518)
		at scala.Option.exists(Option.scala:376)
		at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.hasNext(GpuSubPartitionHashJoin.scala:518)
		at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
		at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:177)
		at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:176)
		at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
		at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:176)
		at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:309)
		at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:326)
		at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1089)
		at com.nvidia.spark.rapids.CloseableBufferedIterator.hasNext(CloseableBufferedIterator.scala:38)
		at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:224)
		at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
		at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:224)
		at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:177)
		at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:176)
		at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
		at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:176)
		at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
		at org.apache.spark.sql.rapids.execution.GpuBatchSubPartitioner.partitionBatches(GpuSubPartitionHashJoin.scala:188)
		at org.apache.spark.sql.rapids.execution.GpuBatchSubPartitioner.initPartitions(GpuSubPartitionHashJoin.scala:169)
		at org.apache.spark.sql.rapids.execution.GpuBatchSubPartitioner.batchesCount(GpuSubPartitionHashJoin.scala:126)
		at org.apache.spark.sql.rapids.execution.GpuSubPartitionPairIterator.$anonfun$hasNextBatch$1(GpuSubPartitionHashJoin.scala:410)
		at org.apache.spark.sql.rapids.execution.GpuSubPartitionPairIterator.tryPullNextPair(GpuSubPartitionHashJoin.scala:419)
		at org.apache.spark.sql.rapids.execution.GpuSubPartitionPairIterator.hasNext(GpuSubPartitionHashJoin.scala:378)
		at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$5(GpuSubPartitionHashJoin.scala:522)
		at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
		at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:150)
		at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.hasNext(GpuSubPartitionHashJoin.scala:522)
		at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:177)
		at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:176)
		at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
		at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:176)
		at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1089)
		at scala.collection.BufferedIterator.headOption(BufferedIterator.scala:32)
		at scala.collection.BufferedIterator.headOption$(BufferedIterator.scala:32)
		at scala.collection.Iterator$$anon$22.headOption(Iterator.scala:1076)
		at com.nvidia.spark.rapids.CloseableBufferedIterator.headOption(CloseableBufferedIterator.scala:36)
		at com.nvidia.spark.rapids.CloseableBufferedIterator.close(CloseableBufferedIterator.scala:41)
		at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:55)
		at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:94)
		... 71 more
dispatcher-Executor 23/08/06 12:16:29:36 INFO CoarseGrainedExecutorBackend: Got assigned task 14577
Executor task launch worker for task 192.0 in stage 42.0 (TID 14577) 23/08/06 12:16:29:36 INFO Executor: Running task 192.0 in stage 42.0 (TID 14577)
Executor task launch worker for task 192.0 in stage 42.0 (TID 14577) 23/08/06 12:16:29:42 INFO RapidsShuffleBlockFetcherIterator: Getting 1029 (1900.5 MiB) non-empty blocks including 135 (253.4 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 894 (1647.2 MiB) remote blocks
Executor task launch worker for task 192.0 in stage 42.0 (TID 14577) 23/08/06 12:16:29:42 INFO RapidsShuffleBlockFetcherIterator: Started 4 remote fetches in 0 ms
Executor task launch worker for task 192.0 in stage 42.0 (TID 14577) 23/08/06 12:16:29:45 INFO RapidsShuffleBlockFetcherIterator: Getting 1029 (512.3 MiB) non-empty blocks including 117 (58.0 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 912 (454.3 MiB) remote blocks
Executor task launch worker for task 192.0 in stage 42.0 (TID 14577) 23/08/06 12:16:29:45 INFO RapidsShuffleBlockFetcherIterator: Started 5 remote fetches in 0 ms
Executor task launch worker for task 192.0 in stage 42.0 (TID 14577) 23/08/06 12:16:29:47 INFO RapidsShuffleBlockFetcherIterator: Getting 1029 (512.3 MiB) non-empty blocks including 117 (58.0 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 912 (454.3 MiB) remote blocks
Executor task launch worker for task 192.0 in stage 42.0 (TID 14577) 23/08/06 12:16:29:47 INFO RapidsShuffleBlockFetcherIterator: Started 5 remote fetches in 0 ms
Executor task launch worker for task 70.0 in stage 42.0 (TID 14455) 23/08/06 12:16:29:252 INFO ConditionalHashJoinIterator: Split stream batch into 10 batches of about 10805389 rows
Executor task launch worker for task 70.0 in stage 42.0 (TID 14455) 23/08/06 12:16:29:892 INFO GpuShuffledHashJoinExec: LeftSemi hash join is executed by sub-partitioning in task 14455
Executor task launch worker for task 70.0 in stage 42.0 (TID 14455) 23/08/06 12:16:29:894 INFO RapidsShuffleBlockFetcherIterator: Getting 132 (43.6 MiB) non-empty blocks including 16 (5.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 116 (37.9 MiB) remote blocks
Executor task launch worker for task 70.0 in stage 42.0 (TID 14455) 23/08/06 12:16:29:894 INFO RapidsShuffleBlockFetcherIterator: Started 7 remote fetches in 0 ms
Executor task launch worker for task 70.0 in stage 42.0 (TID 14455) 23/08/06 12:16:29:895 INFO RapidsShuffleBlockFetcherIterator: Getting 1029 (520.5 MiB) non-empty blocks including 117 (58.9 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 912 (461.7 MiB) remote blocks
Executor task launch worker for task 70.0 in stage 42.0 (TID 14455) 23/08/06 12:16:29:896 INFO RapidsShuffleBlockFetcherIterator: Started 4 remote fetches in 0 ms
Executor task launch worker for task 70.0 in stage 42.0 (TID 14455) 23/08/06 12:16:29:898 INFO RapidsShuffleBlockFetcherIterator: Getting 1029 (520.5 MiB) non-empty blocks including 117 (58.9 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 912 (461.7 MiB) remote blocks
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify reliability Features to improve reliability or bugs that severely impact the reliability of the plugin labels Aug 6, 2023
@abellina abellina self-assigned this Aug 8, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Aug 8, 2023
@abellina
Collaborator Author

Because this is an aliased buffer that is later found not to be in the catalog (given this line in the stack):

	at com.nvidia.spark.rapids.RapidsBufferCatalog.$anonfun$addBuffer$1(RapidsBufferCatalog.scala:231)

This looks to be another version of #9082.
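The failure mode described here, where an alias still points at a buffer ID whose underlying entry is already gone, can be sketched as follows. This is a hypothetical simplification, not the real RapidsBufferCatalog; the class, method, and ID here are illustrative only (the ID 18060 is taken from the log above):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical sketch of the aliasing problem: a second handle ("alias")
// keeps referring to a buffer ID after the catalog entry was removed, so
// the next acquire fails with NoSuchElementException.
class BufferCatalogSketch {
    private final Map<Integer, String> buffers = new HashMap<>();

    void addBuffer(int id, String data) {
        buffers.put(id, data);
    }

    void removeBuffer(int id) {
        buffers.remove(id);
    }

    String acquireBuffer(int id) {
        String buf = buffers.get(id);
        if (buf == null) {
            throw new NoSuchElementException(
                "Cannot locate buffers associated with ID: " + id);
        }
        return buf;
    }

    public static void main(String[] args) {
        BufferCatalogSketch catalog = new BufferCatalogSketch();
        catalog.addBuffer(18060, "spilled-batch");
        int aliasedId = 18060;       // a second handle aliases the same ID
        catalog.removeBuffer(18060); // the original entry is freed first
        try {
            catalog.acquireBuffer(aliasedId); // the alias now dangles
        } catch (NoSuchElementException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The fix referenced below (#9084) addresses the aliasing lifecycle in the real catalog; this sketch only shows why the lookup throws once the entry is gone.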

@abellina
Collaborator Author

I have run this query 20 times without my patch and I cannot reproduce this. That said, given the stack trace and that this is also an aliasing issue, I am going to close this as solved by #9084.
