
[BUG] test_floor_scale_zero fails with DATAGEN_SEED=1700009407 #9722

Closed
abellina opened this issue Nov 15, 2023 · 3 comments · Fixed by #9874
Labels: bug
abellina commented Nov 15, 2023

FAILED ../../../../integration_tests/src/main/python/arithmetic_ops_test.py::test_floor_scale_zero[Double][DATAGEN_SEED=1700009407, INJECT_OOM]

Repro:

SPARK_RAPIDS_TEST_DATAGEN_SEED=1700009407 ./run_pyspark_from_build.sh -k test_floor_scale_zero
----------------------------- Captured stdout call -----------------------------
### CPU RUN ###
### GPU RUN ###
### COLLECT: GPU TOOK 0.10657858848571777 CPU TOOK 0.14875459671020508 ###
--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1944,7 +1944,7 @@
 Row(floor(a, 0)=Decimal('0'))
 Row(floor(a, 0)=None)
 Row(floor(a, 0)=Decimal('0'))
-Row(floor(a, 0)=Decimal('-940459146980391'))
+Row(floor(a, 0)=Decimal('-940459146980392'))
 Row(floor(a, 0)=None)
 Row(floor(a, 0)=None)
 Row(floor(a, 0)=None)
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 15, 2023
@revans2 revans2 removed the ? - Needs Triage Need team to review and classify label Nov 15, 2023
@jlowe jlowe changed the title [BUG] test_floor_scale_zero fails with DATAGEN_SEED=1700004494 [BUG] test_floor_scale_zero fails with DATAGEN_SEED=1700009407 Nov 28, 2023
@jlowe jlowe self-assigned this Nov 28, 2023
jlowe commented Nov 28, 2023

test_floor is working with the same datagen and seed, so this is ultimately a difference in floor(a) vs floor(a, 0). Here's a snippet from a Spark 3.3.2 shell session showing the issue:

scala> val df2 = sql("select a, floor(a) as f, floor(a, 0) as fa from df")
scala> df2.filter("abs(a) < 940459146980400 and abs(a) > 940459146980000").show(truncate=false)
23/11/28 10:25:12 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU

+--------------+----------------+----------------+
|a             |f               |fa              |
+--------------+----------------+----------------+
|-9.40459147E14|-940459146980391|-940459146980392|
+--------------+----------------+----------------+

jlowe commented Nov 28, 2023

The difference is tied to the output types and operations involved. floor(a) produces a long, whereas floor(a, 0) produces a decimal(16,0). The latter is computed first by casting a as a decimal(30,15) and then computing floor at scale=0 on that. What's interesting is that we start to diverge from Spark CPU on the cast to decimal(30,15):

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> Seq(-9.40459146980391E14).toDF("a").repartition(1).createOrReplaceTempView("df")

scala> sql("select a, cast(a as decimal(30,15)) as c, floor(a) as f, floor(a, 0) as fa from df").collect
res22: Array[org.apache.spark.sql.Row] = Array([-9.40459146980391E14,-940459146980391.000000000000000,-940459146980391,-940459146980391])

scala> sql("select a, cast(a as decimal(30,15)) as c, floor(a) as f, floor(a, 0) as fa from df").printSchema
root
 |-- a: double (nullable = false)
 |-- c: decimal(30,15) (nullable = true)
 |-- f: long (nullable = true)
 |-- fa: decimal(16,0) (nullable = true)

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> Seq(-9.40459146980391E14).toDF("a").repartition(1).createOrReplaceTempView("df")

scala> sql("select a, cast(a as decimal(30,15)) as c, floor(a) as f, floor(a, 0) as fa from df").collect
23/11/28 11:07:33 WARN GpuOverrides: 
  !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
    @Partitioning <SinglePartition$> could run on GPU
    ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
      @Expression <AttributeReference> a#102 could run on GPU

(the same GpuOverrides warning is repeated three more times)

res26: Array[org.apache.spark.sql.Row] = Array([-9.40459146980391E14,-940459146980391.021204383727616,-940459146980391,-940459146980392])
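One plausible source of the divergence (an illustrative sketch, not necessarily how the plugin actually performs the cast) is scaling the double by 10^15 in floating point before extracting the decimal's unscaled integer. The double -9.40459146980391E14 is itself exact, but the scaled product needs about 100 bits of mantissa, so rounding to a 53-bit double corrupts the low digits:

```python
from decimal import Decimal

x = -9.40459146980391e14

# The double itself is exact: |x| < 2**53, so no precision is lost yet.
assert Decimal(x) == Decimal(-940459146980391)

# A naive cast to decimal(30,15) might scale by 10**15 in double precision
# before taking the unscaled integer. The exact product does not fit in a
# 53-bit mantissa, so the double rounds it and the low digits become garbage.
scaled = x * 1e15
print(int(scaled))
assert int(scaled) != -940459146980391 * 10 ** 15
```

This matches the shape of the GPU result above, where the decimal(30,15) cast picks up nonzero digits (.021204383727616) that the CPU cast does not.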

jlowe commented Nov 28, 2023

So this ends up being solely about the known imprecision when casting floating point values to decimals. The test has to explicitly set spark.rapids.sql.castFloatToDecimal.enabled=true to keep everything on the GPU, and that leads to the difference of -940459146980391.021204383727616 vs. -940459146980391.000000000000000 when casting the double to a decimal(30,15). Performing a floor on those values results in a difference, because floor moves to the next integral value towards negative infinity.

If the GPU matched the CPU when casting floating point to decimals, the test would pass as written.
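The floor-at-scale-0 step can be reproduced with Python's decimal module, using the CPU and GPU cast results from the session above (ROUND_FLOOR rounds toward negative infinity, matching Spark's floor semantics):

```python
from decimal import Decimal, ROUND_FLOOR

# Cast results observed above: CPU vs GPU cast of the double to decimal(30,15)
cpu_cast = Decimal('-940459146980391.000000000000000')
gpu_cast = Decimal('-940459146980391.021204383727616')

# floor(a, 0): round to scale 0, toward negative infinity
cpu_floor = cpu_cast.to_integral_value(rounding=ROUND_FLOOR)
gpu_floor = gpu_cast.to_integral_value(rounding=ROUND_FLOOR)

print(cpu_floor)  # -940459146980391
print(gpu_floor)  # -940459146980392
```

Because the GPU cast lands just below the integral value, the floor steps one further toward negative infinity, producing the off-by-one result in the test diff.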

@jlowe jlowe linked a pull request Nov 28, 2023 that will close this issue
@jlowe jlowe closed this as completed Nov 29, 2023