
[BUG] test_numeric_running_sum_window_no_part_unbounded failed in MT tests #9071

Closed
abellina opened this issue Aug 17, 2023 · 5 comments
Assignees: andygrove
Labels
bug Something isn't working
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@abellina (Collaborator) commented Aug 17, 2023

This is likely an issue with running this in a distributed fashion:

[2023-08-17T19:20:50.219Z] ../../src/main/python/window_function_test.py::test_numeric_running_sum_window_no_part_unbounded[Decimal(38,1)][IGNORE_ORDER, APPROXIMATE_FLOAT] 23/08/17 19:20:50 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
[2023-08-17T19:20:50.473Z] 23/08/17 19:20:50 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
[2023-08-17T19:20:50.473Z] 23/08/17 19:20:50 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
[2023-08-17T19:20:50.473Z] 23/08/17 19:20:50 WARN GpuOverrides:
[2023-08-17T19:20:50.473Z]           ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
[2023-08-17T19:20:50.473Z]             @Expression <AttributeReference> a#1293590L could run on GPU
[2023-08-17T19:20:50.473Z]             @Expression <AttributeReference> b#1293591 could run on GPU
[2023-08-17T19:20:50.473Z]
[2023-08-17T19:20:51.035Z] 23/08/17 19:20:50 WARN TaskSetManager: Lost task 0.0 in stage 91788.0 (TID 2749182) (executor 0): java.lang.AssertionError: Type conversion is not allowed from Table{columns=[ColumnVector{rows=584, type=INT64, nullCount=Optional.empty, offHeap=(ID: 19973678 7fa8e9fda870)}, ColumnVector{rows=584, type=DECIMAL128 scale:-1, nullCount=Optional.empty, offHeap=(ID: 19973679 7fa8e812f110)}, ColumnVector{rows=584, type=DECIMAL64 scale:0, nullCount=Optional.empty, offHeap=(ID: 19974281 7fa8ea02bd20)}, ColumnVector{rows=584, type=DECIMAL128 scale:1, nullCount=Optional.empty, offHeap=(ID: 19974283 7fa8e8120a40)}], cudfTable=140363424643760, rows=584} to [LongType, DecimalType(38,1), DecimalType(10,0), DecimalType(38,1)] columns 0 to 4
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:649)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:530)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.$anonfun$makeBatch$1(GpuWindowExec.scala:1826)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.makeBatch(GpuWindowExec.scala:1825)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.$anonfun$postProcess$3(GpuWindowExec.scala:1818)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:65)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.$anonfun$postProcess$2(GpuWindowExec.scala:1806)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.Arm$.withResourceIfAllowed(Arm.scala:74)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.$anonfun$postProcess$1(GpuWindowExec.scala:1805)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.postProcess(GpuWindowExec.scala:1803)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.$anonfun$next$9(GpuWindowExec.scala:1976)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.$anonfun$next$8(GpuWindowExec.scala:1975)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.$anonfun$next$7(GpuWindowExec.scala:1973)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:444)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:558)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:484)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:276)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:129)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.next(GpuWindowExec.scala:1972)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuCachedDoublePassWindowIterator.next(GpuWindowExec.scala:1715)
[2023-08-17T19:20:51.035Z]      at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[2023-08-17T19:20:51.035Z]      at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.firstPassReadBatches(GpuSortExec.scala:402)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:563)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:238)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:264)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:261)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:234)
[2023-08-17T19:20:51.035Z]      at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:278)
[2023-08-17T19:20:51.035Z]      at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[2023-08-17T19:20:51.035Z]      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
[2023-08-17T19:20:51.035Z]      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-08-17T19:20:51.035Z]      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-08-17T19:20:51.035Z]      at java.lang.Thread.run(Thread.java:750)
[2023-08-17T19:20:51.035Z]
[2023-08-17T19:20:51.035Z] 23/08/17 19:20:50 ERROR TaskSetManager: Task 0 in stage 91788.0 failed 1 times; aborting job
[2023-08-17T19:20:51.291Z] FAILED [ 97%]
@abellina added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Aug 17, 2023
@revans2 (Collaborator) commented Aug 17, 2023

Looks like the scale is off on the last decimal column. cuDF has a scale of 1 and so does Spark, but the cuDF scale should be the negated Spark scale (i.e., -1).
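
For anyone skimming, a minimal sketch of the convention at play, not the plugin's actual code: expectedCudfScale is a hypothetical helper, and the statement that libcudf's decimal scale is the negation of Spark's is my reading of the mismatch in the log above.

    import org.apache.spark.sql.types.DecimalType

    // libcudf stores a decimal value as unscaled * 10^scale, while Spark's
    // DecimalType(precision, scale) means unscaled * 10^(-scale), so the cuDF
    // scale expected for a Spark decimal column is the negated Spark scale.
    // expectedCudfScale is illustrative only, not a real plugin method.
    def expectedCudfScale(dt: DecimalType): Int = -dt.scale

    val sparkType = DecimalType(38, 1)          // type of the failing column
    assert(expectedCudfScale(sparkType) == -1)  // cuDF should report DECIMAL128 scale:-1
    // The batch in the log instead carried DECIMAL128 scale:1, which is why
    // the check in GpuColumnVector.from rejected the conversion.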

@andygrove (Contributor)

For reference, here is the PR that introduced this code:

#8934

@abellina (Collaborator, Author)

During the revert review, @revans2 pointed out where he thought the bug was: #9072 (comment). I will merge the revert and keep this issue open for the follow-on work.

@abellina (Collaborator, Author)

I assigned this to you for now, @andygrove, so we don't lose track of it.

@sameerz removed the ? - Needs Triage (Need team to review and classify) label on Aug 22, 2023
@sameerz added the reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) label on Aug 23, 2023
@andygrove (Contributor) commented Sep 1, 2023

I think we can close this issue now because the code was reverted, fixing the test failure. Also, the original issue was re-opened, so we can use that to track re-implementing the feature.
