
Support format_number #9281

Merged: 22 commits, Sep 29, 2023
Conversation

@thirtiseven (Collaborator) commented Sep 21, 2023

Partially supports #9173

This PR supports format_number for integral and decimal types. Float/double support is still WIP; I plan to refactor that part with a string-operations-based solution to avoid precision errors.
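
For reference, Spark's format_number(x, d) rounds x to d decimal places and inserts thousands separators. A quick spark-shell sanity check (a sketch; the expected outputs in the comments are what the CPU implementation produces, not results from this PR):

spark.range(1).selectExpr(
  "format_number(1234567.891, 2)",  // "1,234,567.89"
  "format_number(1234567.891, 0)"   // "1,234,568"
).show(false)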

For float/double types, the results will mismatch between Spark and the plugin for very large or very small numbers.

This is because we must first convert a float/double to a string correctly before formatting it, but in the plugin, casting float/double to string doesn't match Spark/Java's result; see the compatibility doc. We may need a custom kernel for float-to-string casting, see #4204.

The solution is quite long and calls the cuDF API many times, which might make it slower than expected, but in my performance tests it ran faster than the CPU.

Performance test results:

10,000,000 random numbers generated by BigDataGen:

val dataTable = DBGen().addTable("data", "a {{{type}}}", 10000000)
dataTable.toDF(spark).write.mode("overwrite").parquet("{{{type}}}_for")

Test code:

spark.time(df.selectExpr("COUNT(format_number(a, -1)) as a", "COUNT(format_number(a, 0)) as b", "COUNT(format_number(a, 5)) as c", "COUNT(format_number(a, 50)) as d").show())
Data Type | GPU Time (ms) | CPU Time (ms)
----------|---------------|--------------
double    | 11016         | 22905
float     | 1464          | 10307
int       | 2743          | 5127
short     | 545           | 3967
byte      | 536           | 3481
long      | 3251          | 7002

@thirtiseven thirtiseven changed the title WIP: Support format_number Support format_number Sep 25, 2023
@thirtiseven thirtiseven self-assigned this Sep 25, 2023
@thirtiseven thirtiseven marked this pull request as ready for review September 25, 2023 14:43
@revans2 (Collaborator) left a comment:


I only got partway through. I'll finish the review when it is no longer in draft.

@sameerz sameerz added the feature request New feature or request label Sep 25, 2023
@sameerz (Collaborator) commented Sep 25, 2023

> I did some performance test on long type, it ran faster than CPU. I will do more tests and update soon.

Can you please add the performance test results here?

}
}
(intPartExp, decPartExp)
Collaborator review comment:

If an exception occurs at line 2371, then intPartExp will be leaked.
This resource pair should be handled as a whole.

withResource(ArrayBuffer.empty[AutoCloseable]) { resources =>
  // register each resource as soon as it is created, so everything
  // accumulated so far gets closed when the block exits, including
  // when a later genRes() throws
  val res1 = genRes()
  resources += res1

  val res2 = genRes()
  resources += res2
  // ... use res1 / res2 ...
}
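
If the pair needs to be returned to the caller, a closeOnExcept-based variant of the same idea might look like this (a minimal sketch; genIntPart and genDecPart are hypothetical placeholders, not names from this PR):

val (intPartExp, decPartExp) = closeOnExcept(genIntPart()) { intPart =>
  // if genDecPart() throws here, closeOnExcept closes intPart,
  // so neither half of the pair can leak
  (intPart, genDecPart())
}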

@thirtiseven (Author) replied:

Thanks, updated all such cases I could find.

}
}
(intPart, decPart)
Collaborator review comment:

Handle resource pair as a whole.

@revans2 (Collaborator) left a comment:

I started to go through this, but the complexity, especially in the float/double code, is rather hard to follow. In my own testing I also saw a lot of problems with the float/double results. It looks like it is related to rounding and to the amount of precision we can get when casting a float to a string with cuDF. I'm not sure we can fix that without a custom kernel.

@firestarman (Collaborator) left a comment:

More comments explaining the processing would make this easier to review.

@thirtiseven (Author) replied:
> I did some performance test on long type, it ran faster than CPU. I will do more tests and update soon.

> Can you please add the performance test results here?

Ok, updated them in the PR description.

}
val substrs = closeOnExcept(sepCol) { _ =>
(0 until maxstrlen by 3).map { i =>
@firestarman (Collaborator) commented Sep 27, 2023:

NIT:
I think we do not need to reverse the input string and reverse it back if we slice strings from end to start, something like:

var curEndsCol: ColumnVector = strlen

val substrs = try {
  (0 until maxstrlen by 3).safeMap { _ =>
    val startCol = curEndsCol - 3  // pseudocode, do this column-wise (see below)
    val sub = closeOnExcept(startCol) { _ =>
      str.substring(startCol, curEndsCol)
    }
    curEndsCol.close()
    curEndsCol = startCol
    sub
  }.reverse
} finally {
  // closes whatever curEndsCol points at when the loop exits:
  // strlen if no iteration ran, otherwise the last startCol
  curEndsCol.close()
}

You need to do strlen - 3 in a columnar way, e.g.

withResource(Scalar.fromInt(3)) { scalar3 =>
   strlen.sub(scalar3)
}

This is just another option, not sure if it would have better perf.

@revans2 (Collaborator) left a comment:

I think this is very close.

@revans2 (Collaborator) commented Sep 27, 2023:

Sorry that this is late, but I am seeing some errors with the latest code for decimal. It looks like the rounding is off in some cases.

In the spark-shell I ran:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def compare(left: DataFrame, right: DataFrame): DataFrame = {
  // count duplicates on each side, then null-safe-join the grouped rows
  // both ways; whatever survives a left_anti join exists on only one side
  val leftCount = left.groupBy(left.columns.map(col(_)): _*).count
  val rightCount = right.groupBy(right.columns.map(col(_)): _*).count
  val joinOn = leftCount.columns.map(c => leftCount(c) <=> rightCount(c)).reduceLeft(_ and _)
  val onlyRight = rightCount.join(leftCount, joinOn, joinType="left_anti").withColumn("_in_column", lit("right"))
  val onlyLeft = leftCount.join(rightCount, joinOn, joinType="left_anti").withColumn("_in_column", lit("left"))
  onlyRight.union(onlyLeft)
}

spark.conf.set("spark.rapids.sql.enabled", false)
spark.time(spark.range(100000000L).selectExpr("*", "format_number(1 / CAST(id AS DECIMAL(38,0)), 4) as fnid", "1 / CAST(id as DECIMAL(38, 0))").write.mode("overwrite").parquet("/data/tmp/TEST_OUT_CPU"))
spark.conf.set("spark.rapids.sql.enabled", true)
spark.time(spark.range(100000000L).selectExpr("*", "format_number(1 / CAST(id AS DECIMAL(38,0)), 4) as fnid", "1 / CAST(id as DECIMAL(38, 0))").write.mode("overwrite").parquet("/data/tmp/TEST_OUT"))
spark.time(compare(spark.read.parquet("/data/tmp/TEST_OUT"), spark.read.parquet("/data/tmp/TEST_OUT_CPU")).orderBy("id", "_in_column").show(false))

It produced the following

+-----+------+---------------------------------------+-----+----------+         
|id   |fnid  |(1 / CAST(id AS DECIMAL(38,0)))        |count|_in_column|
+-----+------+---------------------------------------+-----+----------+
|32   |0.0313|0.0312500000000000000000000000000000000|1    |left      |
|32   |0.0312|0.0312500000000000000000000000000000000|1    |right     |
|160  |0.0063|0.0062500000000000000000000000000000000|1    |left      |
|160  |0.0062|0.0062500000000000000000000000000000000|1    |right     |
|800  |0.0013|0.0012500000000000000000000000000000000|1    |left      |
|800  |0.0012|0.0012500000000000000000000000000000000|1    |right     |
|4000 |0.0003|0.0002500000000000000000000000000000000|1    |left      |
|4000 |0.0002|0.0002500000000000000000000000000000000|1    |right     |
|20000|0.0001|0.0000500000000000000000000000000000000|1    |left      |
|20000|0.0000|0.0000500000000000000000000000000000000|1    |right     |
+-----+------+---------------------------------------+-----+----------+

I compared the results to bround, and it looks like we have a bug in there somewhere. I'll file a separate issue for that. Just giving you a heads-up.
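
For context on the mismatch above (my own observation, not something stated in this thread): Spark's format_number goes through java.text.DecimalFormat, whose default rounding mode is HALF_EVEN, and that matches the CPU column in the table (0.03125 -> 0.0312). A small Scala sketch, assuming DecimalFormat's defaults:

import java.text.DecimalFormat

val fmt = new DecimalFormat("#,##0.0000")
// DecimalFormat defaults to RoundingMode.HALF_EVEN: a tie like 0.03125
// rounds to the even neighbor 0.0312, not up to 0.0313.
println(fmt.format(new java.math.BigDecimal("0.03125"))) // prints 0.0312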

val numberToRoundStr = withResource(zeroPointCv) { _ =>
withResource(leadingZeros) { _ =>
ColumnVector.stringConcatenate(Array(zeroPointCv, leadingZeros, intPart, decPart))
Collaborator review comment:

nit: scalar version is better.

@thirtiseven (Author) commented Sep 28, 2023:

@revans2 Thanks for the review, I think I fixed the memory issues.

> I'm not sure we can fix this without a custom kernel.

I don't think we can fully match Spark's results on the plugin side for float/decimal yet, considering these cuDF issues. This PR produces correct results within a limited precision and aims to require minimal changes to fully support double/float once float-to-string is ready in JNI.
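
For anyone trying the float/double path: my reading of the diff below is that it is gated behind RapidsConf.ENABLE_FLOAT_FORMAT_NUMBER. The exact config key here is an assumption inferred from that constant name, so check the generated configs doc to confirm:

// Assumption: key inferred from RapidsConf.ENABLE_FLOAT_FORMAT_NUMBER
// referenced in the diff below; not confirmed in this thread.
spark.conf.set("spark.rapids.sql.formatNumberFloat.enabled", "true")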

@revans2 (Collaborator) left a comment:

Looks good to me.

@@ -3102,6 +3102,12 @@ object GpuOverrides extends Logging {
s" ${RapidsConf.ENABLE_FLOAT_FORMAT_NUMBER} to true.")
}
}
case dt: DecimalType => {
Collaborator review comment:

Once rapidsai/cudf#14210 is fixed we should come back and retest to be sure that it is working properly.

@revans2 (Collaborator) commented Sep 28, 2023:

build

@thirtiseven (Author) commented:
Thanks all for the review and help! Merging this...

@thirtiseven thirtiseven merged commit 7bffb16 into NVIDIA:branch-23.10 Sep 29, 2023
28 checks passed