
[BUG] Failed to create memory map on query14_part1 at 100TB with spark.executor.cores=64 #9223

Closed
abellina opened this issue Sep 12, 2023 · 2 comments · Fixed by #9211
Assignees
Labels
bug Something isn't working reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@abellina
Collaborator

I ran into the following issue while testing a PR I have open (#9211). It is not clear whether my PR caused it or whether it is pre-existing, so I figured I'd document it.

java.io.IOException: Error creating memory map for spark-008f985a-2488-473a-9ff9-1ddffb78d8b8/executor-4ddc72ac-a983-4cc1-a1e4-e5b0e76b0bb4/blockmgr-df6c7391-c8aa-4672-b15a-3ecfc577baf3/15/temp_local_973a5d08-f9e5-4666-b922-ff8644c66154
        at ai.rapids.cudf.HostMemoryBuffer.mapFile(HostMemoryBuffer.java:175)
        at com.nvidia.spark.rapids.RapidsDiskStore$RapidsDiskBuffer.getMemoryBuffer(RapidsDiskStore.scala:123)
        at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.materializeMemoryBuffer(RapidsBufferStore.scala:431)
        at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.$anonfun$getDeviceMemoryBuffer$1(RapidsBufferStore.scala:506)
        at org.apache.spark.sql.rapids.GpuTaskMetrics.$anonfun$timeIt$1(GpuTaskMetrics.scala:126)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at org.apache.spark.sql.rapids.GpuTaskMetrics.timeIt(GpuTaskMetrics.scala:124)
        at org.apache.spark.sql.rapids.GpuTaskMetrics.readSpillTime(GpuTaskMetrics.scala:137)
        at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.getDeviceMemoryBuffer(RapidsBufferStore.scala:482)
        at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.getColumnarBatch(RapidsBufferStore.scala:445)
        at com.nvidia.spark.rapids.SpillableColumnarBatchImpl.$anonfun$getColumnarBatch$1(SpillableColumnarBatch.scala:112)
        at com.nvidia.spark.rapids.SpillableColumnarBatchImpl.$anonfun$withRapidsBuffer$1(SpillableColumnarBatch.scala:95)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at com.nvidia.spark.rapids.SpillableColumnarBatchImpl.withRapidsBuffer(SpillableColumnarBatch.scala:94)
        at com.nvidia.spark.rapids.SpillableColumnarBatchImpl.getColumnarBatch(SpillableColumnarBatch.scala:110)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$getBatch$2(JoinGatherer.scala:284)
        at scala.Option.map(Option.scala:230)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$getBatch$1(JoinGatherer.scala:284)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$getBatch$1$adapted(JoinGatherer.scala:283)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.getBatch(JoinGatherer.scala:283)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$createGatherer$4(GpuHashJoin.scala:314)
        at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:88)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$createGatherer$3(GpuHashJoin.scala:308)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRestoreOnRetry(RmmRapidsRetryIterator.scala:267)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$createGatherer$2(GpuHashJoin.scala:307)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$NoInputSpliterator.next(RmmRapidsRetryIterator.scala:376)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:568)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:494)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:286)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:184)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.createGatherer(GpuHashJoin.scala:305)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$2(AbstractGpuJoinIterator.scala:245)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$1(AbstractGpuJoinIterator.scala:227)
        at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:150)
        at com.nvidia.spark.rapids.SplittableJoinIterator.setupNextGatherer(AbstractGpuJoinIterator.scala:227)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:101)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$9(GpuSubPartitionHashJoin.scala:538)
        at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$9$adapted(GpuSubPartitionHashJoin.scala:538)
        at scala.Option.exists(Option.scala:376)
        at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.hasNext(GpuSubPartitionHashJoin.scala:538)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
        at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
        at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
        at scala.collection.AbstractIterator.to(Iterator.scala:1431)
        at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
        at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
        at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
        at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
        at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Invalid argument
        at ai.rapids.cudf.HostMemoryBufferNativeUtils.mmap(Native Method)
        at ai.rapids.cudf.HostMemoryBuffer.mapFile(HostMemoryBuffer.java:172)
        ... 69 more
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Sep 12, 2023
@abellina abellina self-assigned this Sep 12, 2023
@abellina abellina added the reliability Features to improve reliability or bugs that severely impact the reliability of the plugin label Sep 12, 2023
@abellina
Collaborator Author

This is interesting. This looks to be a 0-byte object that, for some reason, is in the catalog:

Executor task launch worker for task 14.0 in stage 1737.0 (TID 344796) 23/09/12 03:03:20:73 INFO RapidsDeviceMemoryStore: Spilled TempSpillBufferId(988025,temp_local_973a5d08-f9e5-4666-b922-ff8644c66154) from device memory to Some(host memory)
Executor task launch worker for task 14.0 in stage 1737.0 (TID 344796) 23/09/12 03:03:20:73 INFO RapidsBufferCatalog: Spilled TempSpillBufferId(988025,temp_local_973a5d08-f9e5-4666-b922-ff8644c66154) from tier device memory. Removing. Registering TempSpillBufferId(988025,temp_local_973a5d08-f9e5-4666-b922-ff8644c66154) Some(host memory buffer size=0)
Executor task launch worker for task 14.0 in stage 1737.0 (TID 344796) 23/09/12 03:03:25:118 INFO RapidsHostMemoryStore: Spilled TempSpillBufferId(988025,temp_local_973a5d08-f9e5-4666-b922-ff8644c66154) from host memory to Some(local disk)
Executor task launch worker for task 14.0 in stage 1737.0 (TID 344796) 23/09/12 03:03:25:118 INFO RapidsBufferCatalog: Spilled TempSpillBufferId(988025,temp_local_973a5d08-f9e5-4666-b922-ff8644c66154) from tier host memory. Removing. Registering TempSpillBufferId(988025,temp_local_973a5d08-f9e5-4666-b922-ff8644c66154) Some(local disk buffer size=0)
java.io.IOException: Error creating memory map for 2488-473a-9ff9-1ddffb78d8b8/executor-4ddc72ac-a983-4cc1-a1e4-e5b0e76b0bb4/blockmgr-df6c7391-c8aa-4672-b15a-3ecfc577baf3/15/temp_local_973a5d08-f9e5-4666-b922-ff8644c66154

@abellina
Collaborator Author

abellina commented Sep 12, 2023

Specifically, this is a 0-byte build batch.
