[FEA] Support format_number #9173

viadea · 2023-09-01T18:47:13Z

I wish we could support the format_number function.
For example:

select format_number(ss_wholesale_cost, 5) from store_sales limit 5;

      ! <FormatNumber> format_number(ss_wholesale_cost#95, 5) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.FormatNumber


revans2 · 2023-09-05T16:34:13Z

This is fun because we essentially have to match Java's DecimalFormat code

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/text/DecimalFormat.html

because that is what Spark uses. Happily, it looks like the formatting is hard-coded to Locale.US, so we don't have to worry about other locales and their ways of formatting numbers.
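For illustration, here is a minimal JVM-side sketch of the integer case, assuming the behavior boils down to a DecimalFormat with US symbols, comma grouping, and exactly d fraction digits. The class name `FormatNumberSketch`, the helper `format`, and the pattern `#,##0` are assumptions made for this example and are not taken from Spark's source.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class FormatNumberSketch {
    // Hypothetical helper: formats value with US grouping and exactly d fraction digits.
    static String format(double value, int d) {
        // Assumed pattern: comma grouping plus d mandatory fraction digits.
        StringBuilder pattern = new StringBuilder("#,##0");
        if (d > 0) {
            pattern.append('.');
            for (int i = 0; i < d; i++) {
                pattern.append('0');
            }
        }
        // Locale.US pins the grouping separator to ',' and the decimal separator to '.'.
        DecimalFormat df = new DecimalFormat(pattern.toString(), new DecimalFormatSymbols(Locale.US));
        return df.format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234567.891, 2)); // 1,234,567.89
        System.out.println(format(100.0, 5));       // 100.00000
    }
}
```

Pinning the symbols to Locale.US keeps the output independent of the JVM's default locale, which is the property that makes a GPU implementation tractable.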

thirtiseven · 2023-09-07T08:44:26Z

format_number accepts either an integer or a string (Spark SQL only) as its second parameter. In the string case, input numbers are formatted with a custom pattern. The pattern-handling logic in Java's DecimalFormat code is quite complicated, and we may need to write a CUDA kernel for it. But I think the integer case can be supported more easily in the plugin.

@viadea Is it sufficient, as a first step for the customer's request, to support only a literal integer as the second parameter? We can fully support the string pattern (maybe in cuDF or spark-rapids-jni) later.
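To give a feel for why the string case is harder, here is a small sketch of what a custom pattern can express when handed to DecimalFormat directly. The class `PatternFormatSketch` and its `format` helper are hypothetical names for this example; whether Spark forwards the pattern exactly like this is an assumption.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class PatternFormatSketch {
    // Hypothetical helper: applies a user-supplied DecimalFormat pattern with US symbols.
    static String format(double value, String pattern) {
        return new DecimalFormat(pattern, new DecimalFormatSymbols(Locale.US)).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, "#,##0.00")); // 1,234.50 (grouping, two mandatory fraction digits)
        System.out.println(format(1234.5, "0.###"));    // 1234.5   (no grouping, optional fraction digits)
        System.out.println(format(0.125, "0.0%"));      // 12.5%    (percent scaling)
    }
}
```

Each pattern changes grouping, mandatory versus optional digits, or scaling, so a GPU implementation would have to interpret the pattern rather than just count fraction digits.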

thirtiseven · 2023-09-08T02:59:05Z

Since much of the complexity in Java's DecimalFormat code lies in parsing the format string, I think it is possible to implement it on the plugin side with many substr/concat calls if the format string is a literal.

Most of the format-string parsing logic can follow or call DecimalFormat, but we need to rewrite the logic that formats the input numbers to fit the cuDF API.

I plan to support an integer as the second parameter first to verify my solution; most of the code can be reused for the string format case.
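As a rough CPU-side sketch of the substr/concat idea for integer inputs, the code below groups an already-stringified integer in threes and appends d zero fraction digits. It operates on one Java string rather than a cuDF column, and the class `DigitStringFormatSketch` and its `format` helper are hypothetical names, not plugin code.

```java
class DigitStringFormatSketch {
    // Hypothetical helper: digits is the decimal string of an integer, e.g. "-1234567".
    static String format(String digits, int d) {
        boolean negative = digits.startsWith("-");
        String body = negative ? digits.substring(1) : digits;

        StringBuilder out = new StringBuilder();
        // The leading group holds the 1 or 2 digits left over before the 3-digit groups.
        int firstGroup = body.length() % 3;
        if (firstGroup > 0) {
            out.append(body, 0, firstGroup);
        }
        for (int i = firstGroup; i < body.length(); i += 3) {
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(body, i, i + 3);
        }
        // Integer inputs have no fraction, so the fraction part is d zeros.
        if (d > 0) {
            out.append('.');
            for (int i = 0; i < d; i++) {
                out.append('0');
            }
        }
        return (negative ? "-" : "") + out;
    }

    public static void main(String[] args) {
        System.out.println(format("1234567", 2)); // 1,234,567.00
        System.out.println(format("-123", 0));    // -123
    }
}
```

On the GPU the same steps would presumably be expressed as column-wide substring and concatenate operations instead of a per-row loop.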

viadea · 2023-09-11T18:00:28Z

The above is just my example.
The real customer use case is:

<FormatNumber> format_number(somecol, 5) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.FormatNumber

where somecol is of type double.

thirtiseven · 2023-09-20T07:42:53Z

Unfortunately, I think the float/double type for the first parameter also cannot be fully supported on the plugin side, because casting float/double to string doesn't match Spark/Java's result; see the compatibility doc.

Normally, we would first convert the float/double to a string correctly and then format it, and I haven't found a workaround that gives me the information I need.

I will create a PR soon to support the other types and partially support float/double. We may need a custom kernel for float-to-string casting; see #4204.

thirtiseven · 2023-10-18T11:07:11Z

Hi @viadea, do we have more information about the range/precision of the doubles the customer will use? It could be difficult to exactly match Spark's behavior for high-precision doubles, but it would be easy to match at limited precision.

viadea · 2023-10-20T17:52:59Z

> Hi @viadea, do we have more information about the range/precision of the double customer will use? It could be difficult to exactly match Spark's behavior for doubles with high precision, but it will be easy to match them in a limited precision.

Let me check and I will update you internally once I get the answer.

thirtiseven · 2023-10-30T07:48:46Z

To implement the float-to-string part of format_number, the solution from cuDF (draft PR NVIDIA/spark-rapids-jni#1508) does not look good enough because of rounding errors at high precision.

Instead, I found https://github.com/ulfjack/ryu, a popular float-to-string solution with a C implementation under the Apache license. If we can get approval to use some code from it, developing the custom kernel will be easier.

However, it also can't fully match Java's results, and the mismatches occur where Java's result is not good enough. (Java actually switched to a new float-to-string solution in later JDK versions to fix those issues.) Since we can't use Java's code because of its license, it will be difficult to match Java's bugs exactly.

@revans2 Do you think it's OK to use this solution for the float-to-string casting kernel?
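As a hedged illustration of the JDK dependence, the sketch below prints how the running JVM renders 2.0E23. To my understanding this value is a commonly cited case (OpenJDK bug JDK-4511638) where older JDKs print a longer string than newer ones, so the code states no single expected output. The class name `JdkToStringSketch` and the `render` helper are just for this example.

```java
class JdkToStringSketch {
    // Renders a double with whatever algorithm the running JDK ships.
    static String render(double value) {
        return Double.toString(value);
    }

    public static void main(String[] args) {
        // The digits printed may differ between JDK versions,
        // but either rendering parses back to the same double.
        System.out.println(System.getProperty("java.version") + " -> " + render(2.0E23));
    }
}
```

Running this on two different JDK versions is a quick way to see why a GPU kernel cannot match "Java's result" as a single fixed target.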

revans2 · 2023-10-30T14:01:22Z

@thirtiseven I am fine with a different implementation for float/double to string and string to float/double, so long as:

- It is documented exactly what we are doing and when things might be different. For example: the result you get back in Java differs based on the JDK version used, but ours does not, and here is why.
- It is self-consistent, in that we get the same result back for the same input each time.

It would also be nice if we could round-trip the data consistently, so that float -> string -> float and double -> string -> double each produce the same value as the input (bit for bit). But that is not a requirement in any way.
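The bit-for-bit round-trip property can be written down as a small check. The sketch below runs it against the JVM's own Double/Float conversions purely as a reference; the class `RoundTripSketch` and its `roundTrips` helpers are hypothetical, and applying the same comparison to strings produced by a GPU kernel is an assumption about how such a test might look.

```java
class RoundTripSketch {
    // True if double -> string -> double reproduces the input bit for bit.
    static boolean roundTrips(double v) {
        double back = Double.parseDouble(Double.toString(v));
        return Double.doubleToLongBits(back) == Double.doubleToLongBits(v);
    }

    // True if float -> string -> float reproduces the input bit for bit.
    static boolean roundTrips(float v) {
        float back = Float.parseFloat(Float.toString(v));
        return Float.floatToIntBits(back) == Float.floatToIntBits(v);
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(0.1));
        System.out.println(roundTrips(1.1f));
    }
}
```

Comparing bit patterns rather than using `==` distinguishes 0.0 from -0.0, which is what a bit-for-bit requirement implies.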

viadea added the "feature request" and "? - Needs Triage" labels Sep 1, 2023

mattahrens removed the "? - Needs Triage" label Sep 6, 2023

GaryShen2008 assigned thirtiseven Sep 6, 2023

thirtiseven mentioned this issue Sep 21, 2023

Support format_number #9281

Merged

revans2 mentioned this issue Sep 29, 2023

[BUG] Match JDK behavior when formatting double/float #9343

Open

thirtiseven mentioned this issue Nov 6, 2023

Use float to string kernel #9470

Merged

This was referenced Nov 20, 2023

Support float case of format_number with format_float kernel #9790

Merged

Adding format_float kernel NVIDIA/spark-rapids-jni#1572

Merged

thirtiseven closed this as completed in #9790 Jan 9, 2024
