[SPARK-18772][SQL] NaN/Infinite float parsing in JSON is inconsistent #16199
Conversation
Hello @HyukjinKwon, can you take a look at this one? I am unsure if we should be accepting lowercased values like
@NathanHowell Thank you for cc'ing me. I will try my best to take a look by tomorrow.
Test build #69828 has finished for PR 16199 at commit
@NathanHowell, while tracking down the history, I found a similar PR including this change in https://github.com/apache/spark/pull/9759/files#diff-8affe5ec7d691943a88e43eb30af656e (which seems to have been reverted due to conflicts). Would it make sense to take out the valid changes from there? It seems safe to follow, as the change there was already approved by several reviewers (and I also like that change).
```diff
@@ -1764,4 +1764,37 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     val df2 = spark.read.option("PREfersdecimaL", "true").json(records)
     assert(df2.schema == schema)
   }
+
+  test("SPARK-18772: Special floats") {
+    val records = sparkContext
```
I think it would be nicer if it had some roundtrip tests for reading and writing.
What do you think about my suggestion, @NathanHowell?
@HyukjinKwon Good idea, I'll take another stab and try to revive the original pull request.
Gentle ping @NathanHowell, how is it going?
@NathanHowell, please let me know. I can pick up the commits and take over.
…s in JSON

## What changes were proposed in this pull request?

This PR is based on #16199 and extracts the valid change from #9759 to resolve SPARK-18772.

This avoids an additional conversion attempt with `toFloat` and `toDouble`. For the avoided conversions, please refer to the examples below:

**Before**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
```

**After**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...
```

## How was this patch tested?

Unit tests added in `JsonSuite`.

Closes #16199

Author: hyukjinkwon <[email protected]>
Author: Nathan Howell <[email protected]>

Closes #17956 from HyukjinKwon/SPARK-18772.

(cherry picked from commit 3f98375) Signed-off-by: Wenchen Fan <[email protected]>
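The error-message change above reflects matching the special float tokens explicitly rather than falling back to a blanket `toDouble` conversion. Below is a minimal sketch of such token matching; the helper name `parseSpecialDouble` and the exact set of accepted spellings are illustrative assumptions, not Spark's actual internals.

```scala
// Illustrative sketch: map a JSON string token for a special float to a
// Double value, returning None for anything unrecognized so the caller can
// route the record to corrupt-record handling (or fail in FAILFAST mode).
// The accepted token set here is an assumption for illustration.
def parseSpecialDouble(token: String): Option[Double] = token match {
  case "NaN"                             => Some(Double.NaN)
  case "Infinity" | "+Infinity" | "+INF" => Some(Double.PositiveInfinity)
  case "-Infinity" | "-INF"              => Some(Double.NegativeInfinity)
  case _                                 => None // e.g. lowercase "nan" is rejected
}
```

Matching case-sensitively like this is what makes `"nan"` fail fast with a clear parse error rather than a raw `NumberFormatException`.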
What changes were proposed in this pull request?

This relaxes the parsing of `Float` and `Double` columns to properly support mixed-case values of `NaN` and (+/-)`Infinity`, as well as properly supporting (+/-)`Inf`. Currently a string literal of `Nan` or `InfinitY` will cause a task to fail instead of placing the record in the corrupt record column, and `Inf` causes a failure instead of being parsed as a valid double.

How was this patch tested?

Additional unit tests have been added.
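For context on why `Inf` needs special treatment at all: Java's built-in `Double.parseDouble` already understands the exact-case tokens `NaN` and `Infinity`, but it throws on the shorter `Inf` spelling, so those variants have to be recognized separately. A small standalone illustration:

```scala
import scala.util.Try

// java.lang.Double.parseDouble accepts "NaN" and "Infinity" verbatim...
val inf = java.lang.Double.parseDouble("Infinity") // positive infinity
val nan = java.lang.Double.parseDouble("NaN")      // NaN

// ...but "Inf" is not recognized and raises NumberFormatException.
val shortInf = Try(java.lang.Double.parseDouble("Inf"))

assert(inf.isPosInfinity && nan.isNaN && shortInf.isFailure)
```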