No way to define data class with Decimal(38, 0) in Spark schema #181

There seems to be no way to define data classes where the data class encoder produces a Spark schema with fields of type Decimal(38, 0). The natural approach would be to define a data class with a field of type BigInteger, but this is unsupported by the data class encoder. This can be seen by the following code, which throws java.lang.IllegalArgumentException: java.math.BigInteger is unsupported.
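(The original snippet was not preserved in this copy of the thread; a minimal reproduction, using the same kotlin-spark-api helpers that appear later in the thread, would presumably look like this:)

    import org.jetbrains.kotlinx.spark.api.withSpark
    import java.math.BigInteger

    data class A(val value: BigInteger)

    fun main() = withSpark {
        // Building an encoder for A fails because the data class encoder
        // has no mapping for java.math.BigInteger:
        // java.lang.IllegalArgumentException: java.math.BigInteger is unsupported
        dsOf(A(BigInteger.ONE)).show()
    }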
Comments
Hi! […]
Hi! Yes, I've noticed that

    data class A(@DecimalType(38, 0) val value: BigDecimal)

[…]
I see... So you actually would need a way to give the DataType in case it doesn't exist in, or is different from, the […]
Hmm... So, @jkylling, would an annotation approach still help? Because I think for all unsupported classes, UDTs would do the trick just fine (since you need both an encoder and a data type anyway). I could maybe see if I can make @SQLUserDefinedType work for data class properties too, instead of registering them for entire classes everywhere... but I'm not sure how many people would use that.
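(Editorial sketch, not from the thread: the UDT approach mentioned above could look roughly like the following. BigIntegerUDT is a hypothetical name, and as far as I know Spark's UserDefinedType API is internal in Spark 3 but still reachable from Kotlin, since its Scala access modifier is not enforced at the bytecode level:)

    import org.apache.spark.sql.types.DataType
    import org.apache.spark.sql.types.Decimal
    import org.apache.spark.sql.types.DecimalType
    import org.apache.spark.sql.types.UserDefinedType
    import java.math.BigDecimal
    import java.math.BigInteger

    // Hypothetical UDT that stores java.math.BigInteger as DECIMAL(38, 0).
    class BigIntegerUDT : UserDefinedType<BigInteger>() {
        override fun sqlType(): DataType = DecimalType(38, 0)
        // Catalyst's internal representation of DecimalType values is
        // org.apache.spark.sql.types.Decimal
        override fun serialize(obj: BigInteger): Any = Decimal.apply(BigDecimal(obj))
        override fun deserialize(datum: Any): BigInteger =
            (datum as Decimal).toJavaBigDecimal().toBigIntegerExact()
        override fun userClass(): Class<BigInteger> = BigInteger::class.java
    }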
Just support for BigInteger […]
Well yes, but it gets converted to […]
Interesting, this is what is used for […]
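(Editorial aside, not from the thread: Spark's own Java bean encoder already maps java.math.BigInteger to DecimalType(38, 0), which is presumably the conversion being discussed here. A quick way to check, using a hypothetical Wrapper bean:)

    import org.apache.spark.sql.Encoders
    import java.math.BigInteger

    // A bean-style class, since Encoders.bean needs getters/setters
    // and a no-arg constructor.
    class Wrapper {
        var value: BigInteger = BigInteger.ZERO
    }

    fun main() {
        // Should print something like:
        // StructType(StructField(value,DecimalType(38,0),true))
        println(Encoders.bean(Wrapper::class.java).schema())
    }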
Have you got an example for me to try? :)
It turns out the particular error I got was related to the code running in the Databricks runtime, probably because the Databricks runtime has stricter conversion checks than the open source Spark runtime, which produces null values instead of throwing an exception. A minimal example of this behavior is below:

    import org.jetbrains.kotlinx.spark.api.`as`
    import org.jetbrains.kotlinx.spark.api.map
    import org.jetbrains.kotlinx.spark.api.withSpark
    import java.math.BigDecimal

    data class A(val value: BigDecimal)

    fun main() = withSpark {
        val table = "tbl"
        // a DECIMAL(38, 0) table with one small and one 38-digit value
        spark.sql("CREATE TABLE $table (value DECIMAL(38, 0)) USING parquet")
        spark.sql("INSERT INTO $table VALUES (2)")
        spark.sql("INSERT INTO $table VALUES (1${"0".repeat(37)})")
        val df = spark.sql("select * from $table order by 1 asc limit 1")
        // adding 10^19 keeps the result within the encoder's decimal range
        df.`as`<A>()
            .map { A(it.value.add(BigDecimal("1" + "0".repeat(19)))) }
            .also { println(it.schema()) }
            .write().insertInto(table)
        // adding 10^20 overflows it, and open source Spark writes null instead
        df.`as`<A>()
            .map { A(it.value.add(BigDecimal("1" + "0".repeat(20)))) }
            .write().insertInto(table)
        spark.sql("select * from $table").show()
    }

This outputs: […]

In the Databricks runtime this example should instead throw an exception. Is there a way to write transforms like the above which are able to write Decimal(38, 0) values?
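(Editorial aside, not from the thread: the null/exception boundary in the example above is consistent with the BigDecimal encoder using Spark's default DecimalType(38, 18), which leaves room for only 20 integer digits. Spark's Decimal.changePrecision shows the mechanism; fits is a hypothetical helper:)

    import org.apache.spark.sql.types.Decimal
    import java.math.BigDecimal

    // changePrecision returns false when a value cannot be represented at the
    // requested precision and scale, which is where silent nulls would come from.
    fun fits(v: BigDecimal): Boolean = Decimal.apply(v).changePrecision(38, 18)

    fun main() {
        println(fits(BigDecimal("1" + "0".repeat(19))))  // true:  20 digits + 18 scale = 38
        println(fits(BigDecimal("1" + "0".repeat(20))))  // false: 21 digits + 18 scale = 39 > 38
    }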
I'm afraid I don't know enough about this specific part of Spark to give a helpful answer. Maybe you should try StackOverflow for that too :)
Let me rephrase the question: How would I use the Kotlin Spark API to get a Spark data frame with schema Decimal(38, 0)?
I just played around with it a bit. If you add

    data class A(val value: BigInteger)

    val df = dsOf(A(2.toBigInteger()))
        .showDS()
        .also { println(it.dtypes().toList()) }  // the printed dtypes should now report decimal(38,0)

    df
        .map { A(it.value.add(BigInteger("1" + "0".repeat(20)))) }
        .showDS(truncate = false)
        .also { println(it.dtypes().toList()) }

[…]
Great! Thank you for fixing this! I'll give it a go soon.