
[SPARK-18772][SQL] NaN/Infinite float parsing in JSON is inconsistent #16199

Closed
wants to merge 1 commit

Conversation

NathanHowell

What changes were proposed in this pull request?

This relaxes the parsing of Float and Double columns to properly support mixed-case values of NaN and (+/-)Infinity, as well as (+/-)Inf. Currently a string literal such as Nan or InfinitY causes a task to fail instead of placing the record in the corrupt-record column, and Inf causes a failure instead of being parsed as a valid double.
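The relaxed matching described above can be sketched in plain Scala (no Spark; `SpecialFloats.parseDouble` is a hypothetical helper, and case-insensitive acceptance of spellings like `nan` is an assumption this PR leaves open):

```scala
// Hypothetical sketch (not Spark's actual parser): match special
// spellings case-insensitively, then fall back to numeric parsing.
object SpecialFloats {
  def parseDouble(s: String): Option[Double] = s.trim.toLowerCase match {
    case "nan" =>
      Some(Double.NaN)
    case "inf" | "+inf" | "infinity" | "+infinity" =>
      Some(Double.PositiveInfinity)
    case "-inf" | "-infinity" =>
      Some(Double.NegativeInfinity)
    case other =>
      // Anything else must be an ordinary numeric literal.
      try Some(other.toDouble)
      catch { case _: NumberFormatException => None }
  }
}
```

A `None` here would correspond to routing the record to the corrupt-record column rather than failing the task.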

How was this patch tested?

Additional unit tests have been added.

@NathanHowell
Author

Hello @HyukjinKwon, can you take a look at this one? I am unsure if we should be accepting lowercased values like nan (versus strictly testing for NaN), but I think this PR matches the original intent of the code.

@HyukjinKwon
Member

@NathanHowell Thank you for cc'ing me. I will try my best to take a look by tomorrow.

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69828 has finished for PR 16199 at commit 11ac443.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Dec 8, 2016

@NathanHowell, while tracking down the history, I found a similar PR that includes this change: https://github.com/apache/spark/pull/9759/files#diff-8affe5ec7d691943a88e43eb30af656e (it seems to have been reverted due to conflicts in dev/deps/spark-deps-hadoop*, which are not related to this PR).

Would it make sense to take out the valid changes from there? It seems safe to follow, as that change was already approved by several reviewers (and I also like it).

```scala
@@ -1764,4 +1764,37 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
    val df2 = spark.read.option("PREfersdecimaL", "true").json(records)
    assert(df2.schema == schema)
  }

  test("SPARK-18772: Special floats") {
    val records = sparkContext
```

I think it would be nicer if it had some roundtrip tests for reading and writing.
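The roundtrip idea can be illustrated in plain Scala without Spark (`writeSpecial` and `readSpecial` are hypothetical stand-ins for the JSON writer and parser):

```scala
// Hypothetical stand-ins for the JSON writer and parser, to show
// the write-then-read roundtrip such a test would exercise.
object RoundTrip {
  def writeSpecial(d: Double): String =
    if (d.isNaN) "NaN"
    else if (d.isPosInfinity) "Infinity"
    else if (d.isNegInfinity) "-Infinity"
    else d.toString

  def readSpecial(s: String): Double = s match {
    case "NaN"       => Double.NaN
    case "Infinity"  => Double.PositiveInfinity
    case "-Infinity" => Double.NegativeInfinity
    case other       => other.toDouble
  }
}
```

A roundtrip test then asserts that `readSpecial(writeSpecial(d))` recovers `d` for each special value (comparing NaN via `isNaN`, since `NaN != NaN`).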

@HyukjinKwon
Member

What do you think about my suggestion @NathanHowell ?

@NathanHowell
Author

@HyukjinKwon Good idea, I'll take another stab and try to revive the original pull request.

@HyukjinKwon
Member

Gentle ping @NathanHowell, how is it going?

@HyukjinKwon
Member

@NathanHowell, please let me know. I can pick up the commits and take over.

asfgit pushed a commit that referenced this pull request May 13, 2017
…s in JSON

## What changes were proposed in this pull request?

This PR is based on #16199 and extracts the valid change from #9759 to resolve SPARK-18772.

This avoids an additional conversion attempt with `toFloat` and `toDouble`.

For the resulting behavior change, please refer to the examples below:

**Before**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
```

**After**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...
```
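The before/after difference above can be sketched with a hypothetical helper in plain Scala (the exact set of accepted spellings is an assumption; the point is that special values are matched directly, and a failed parse raises a `RuntimeException` naming the target type instead of a bare `NumberFormatException`):

```scala
// Hypothetical sketch of the merged behavior: exact-case special
// spellings parse directly; anything else falls back to toDouble,
// and failures surface as "Cannot parse ... as DoubleType.".
object ParseErrors {
  def toDoubleOrFail(s: String): Double = s match {
    case "NaN"                => Double.NaN
    case "Infinity" | "+INF"  => Double.PositiveInfinity
    case "-Infinity" | "-INF" => Double.NegativeInfinity
    case other =>
      try other.toDouble
      catch {
        case _: NumberFormatException =>
          throw new RuntimeException(s"Cannot parse $other as DoubleType.")
      }
  }
}
```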

## How was this patch tested?

Unit tests added in `JsonSuite`.

Closes #16199

Author: hyukjinkwon <[email protected]>
Author: Nathan Howell <[email protected]>

Closes #17956 from HyukjinKwon/SPARK-18772.

(cherry picked from commit 3f98375)
Signed-off-by: Wenchen Fan <[email protected]>
asfgit closed this in 3f98375 May 13, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017