Support float case of format_number with format_float kernel #9790

thirtiseven · 2023-11-20T04:09:34Z

This PR adds better support for the float case of format_number, which makes it good enough to be enabled by default.

It has known compatibility issues inherited from ryu, in the same way as the float-to-string cast.
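
For context, here is a minimal illustration (not taken from this PR) of why two float-to-string algorithms can disagree on the printed digits while agreeing on the value. Several decimal strings round to the same float32, so an implementation that picks digits differently from Java can produce a different but numerically equivalent string. The helper name `to_float32` and the candidate strings are assumptions chosen for the example.

```python
import struct

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Hypothetical candidate strings: all of them denote the same float32 value.
candidates = ["0.1", "0.10000000", "0.100000001", "0.1000000015"]
values = {s: to_float32(float(s)) for s in candidates}

# True when every candidate string maps to one and the same float32.
same = len(set(values.values())) == 1
print(values, same)
```

A string-equality comparison between CPU and GPU output would flag such cases as mismatches even though parsing either string recovers the same number.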

Depends on: NVIDIA/spark-rapids-jni#1572

Performance test results

10,000,000 random numbers generated by BigDataGen:

val dataTable = DBGen().addTable("data", "a float, b double", 10000000)
dataTable.toDF(spark).write.mode("overwrite").parquet("format_float_data")

test code:

spark.time(df.selectExpr("COUNT(format_number(a, -1)) as a", "COUNT(format_number(a, 0)) as b", "COUNT(format_number(a, 5)) as c", "COUNT(format_number(a, 50)) as d").show())

| Data Type | GPU Time (ms) | CPU Time (ms) | Speedup |
|-----------|---------------|---------------|---------|
| double    | 324           | 12,160        | 37.53x  |
| float     | 222           | 6,101         | 27.48x  |

~~I plan to move special case checking to the kernel next, but personally I think it is a nit for this PR.~~ Done, and it runs much faster than before the change.

Results before moving NaN/inf replacement to the kernel:

| Data Type | GPU Time (ms) | CPU Time (ms) | Speedup |
|-----------|---------------|---------------|---------|
| double    | 1,326.8       | 9,841.6       | 7.42x   |
| float     | 991.4         | 4,663.2       | 4.70x   |
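
The speedup column in both tables is CPU time divided by GPU time. A small sketch that recomputes it from the reported numbers, and also compares the GPU times before and after the NaN/inf change, might look like this. The dictionary layout and function name are assumptions made for the example; the CPU times differ between the two runs, so the before/after GPU ratio is only a rough indication.

```python
# Timings in milliseconds as (gpu_ms, cpu_ms), copied from the two tables.
results = {
    "after": {"double": (324.0, 12160.0), "float": (222.0, 6101.0)},
    "before": {"double": (1326.8, 9841.6), "float": (991.4, 4663.2)},
}

def speedup(gpu_ms, cpu_ms):
    """Speedup of the GPU over the CPU: CPU time divided by GPU time."""
    return cpu_ms / gpu_ms

for run, rows in results.items():
    for dtype, (gpu_ms, cpu_ms) in rows.items():
        print(f"{run:6s} {dtype:6s} {speedup(gpu_ms, cpu_ms):.2f}x")

# Rough GPU-side improvement from moving NaN/inf replacement into the kernel.
gpu_gain = {
    dtype: results["before"][dtype][0] / results["after"][dtype][0]
    for dtype in ("double", "float")
}
print(gpu_gain)
```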

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2023-12-28T10:58:21Z

@revans2 please take a look at this too if you have time, thanks!

thirtiseven · 2023-12-28T10:59:00Z

build

Signed-off-by: Haoyang Li <[email protected]>

revans2 · 2024-01-03T14:28:15Z

docs/compatibility.md

 This configuration is enabled by default. To disable this operation on the GPU set
 [`spark.rapids.sql.castFloatToString.enabled`](additional-functionality/advanced_configs.md#sql.castFloatToString.enabled) to `false`.

+The `format_number` function also use ryu as the solution when formatting floating-point data types to 


nit: also uses ryu

revans2 · 2024-01-03T14:30:13Z

integration_tests/src/main/python/string_test.py

+            'format_number(a, 5)').collect(), conf = float_format_number_conf)
+    gpu = with_gpu_session(lambda spark: unary_op_df(spark, gen).selectExpr('*',
+            'format_number(a, 5)').collect(), conf = float_format_number_conf)
+    mismatched = sum(x[0] != x[1] for x in zip(cpu, gpu))


I preferred the version that checked that when we parsed them back to a float the numbers were within the error bounds instead of saying that we cannot be wrong more than some set percentage.
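
The approach described in this comment could be sketched roughly as follows: instead of counting string mismatches, parse both formatted strings back to floats and accept the pair when the values agree within a tolerance. The helper names and the tolerance values are hypothetical and are not the code from this PR. Strings that cannot be parsed are accepted only when they are identical.

```python
import math

def parse_formatted(s):
    """Parse a format_number-style string such as '1,234.50000' into a float."""
    return float(s.replace(",", ""))

def within_error_bounds(cpu_str, gpu_str, rel_tol=1e-6, abs_tol=1e-5):
    """Return True when the CPU and GPU strings denote close enough values."""
    if cpu_str == gpu_str:
        # Identical strings (including two nulls or two 'NaN's) always match.
        return True
    if cpu_str is None or gpu_str is None:
        return False
    try:
        cpu_val = parse_formatted(cpu_str)
        gpu_val = parse_formatted(gpu_str)
    except ValueError:
        # Unparseable and not identical: treat as a real mismatch.
        return False
    return math.isclose(cpu_val, gpu_val, rel_tol=rel_tol, abs_tol=abs_tol)

# Hypothetical CPU/GPU result pairs for illustration.
pairs = [("1,234.56789", "1,234.56790"), ("NaN", "NaN"), ("0.10000", "0.10000")]
print(all(within_error_bounds(c, g) for c, g in pairs))
```

Compared with a fixed mismatch percentage, this check fails on any single value that is numerically wrong, while still tolerating last-digit differences.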

thirtiseven · 2024-01-04T14:08:01Z

The test code is really not strong enough, so I made a fix on the kernel side: NVIDIA/spark-rapids-jni#1676

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2024-01-09T12:02:47Z

build

Use format_float kernel

ada8d7a

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven self-assigned this Nov 20, 2023

thirtiseven mentioned this pull request Nov 20, 2023

Adding format_float kernel NVIDIA/spark-rapids-jni#1572

Merged

thirtiseven marked this pull request as draft November 20, 2023 14:30

Add tests and doc

238c061

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven marked this pull request as ready for review November 21, 2023 10:46

thirtiseven changed the base branch from branch-23.12 to branch-24.02 November 22, 2023 13:40

Merge branch 'NVIDIA:branch-23.12' into format_float

bc08d57

thirtiseven requested review from jlowe, revans2, tgravescs, GaryShen2008, NvTimLiu and pxLi as code owners November 27, 2023 10:11

thirtiseven added 2 commits November 27, 2023 18:17

Merge branch 'branch-24.02' into format_float

a375433

use new name from jni change

92845cc

Signed-off-by: Haoyang Li <[email protected]>

sameerz added the feature request New feature or request label Nov 28, 2023

thirtiseven and others added 4 commits December 12, 2023 12:18

Merge branch 'branch-24.02' into format_float

40450e8

move inf/nan replacement to kernel

7140c9f

Signed-off-by: Haoyang Li <[email protected]>

Merge branch 'branch-24.02' into format_float

fcca63c

Merge branch 'NVIDIA:branch-24.02' into format_float

6cac1a9

claen up

4fb4691

Signed-off-by: Haoyang Li <[email protected]>

revans2 reviewed Jan 3, 2024


Address comments

7729c32

Signed-off-by: Haoyang Li <[email protected]>

revans2 approved these changes Jan 9, 2024


thirtiseven merged commit af91522 into NVIDIA:branch-24.02 Jan 9, 2024
39 checks passed
