
[SPARK-18772][SQL] NaN/Infinite float parsing in JSON is inconsistent #16199

Closed
wants to merge 1 commit

Conversation

NathanHowell

What changes were proposed in this pull request?

This relaxes the parsing of Float and Double columns to properly support mixed-case values of NaN and (+/-)Infinity, as well as (+/-)Inf. Currently a string literal such as Nan or InfinitY causes a task to fail instead of placing the record in the corrupt-record column, and Inf causes a failure instead of being parsed as a valid double.
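The relaxed matching described above can be sketched in plain Scala (no Spark; `SpecialFloats.parseDouble` is a hypothetical helper, and case-insensitive acceptance of spellings like `nan` is an assumption this PR leaves open):

```scala
// Hypothetical sketch (not Spark's actual parser): match special
// spellings case-insensitively, then fall back to numeric parsing.
object SpecialFloats {
  def parseDouble(s: String): Option[Double] = s.trim.toLowerCase match {
    case "nan" =>
      Some(Double.NaN)
    case "inf" | "+inf" | "infinity" | "+infinity" =>
      Some(Double.PositiveInfinity)
    case "-inf" | "-infinity" =>
      Some(Double.NegativeInfinity)
    case other =>
      // Anything else must be an ordinary numeric literal.
      try Some(other.toDouble)
      catch { case _: NumberFormatException => None }
  }
}
```

A `None` here would correspond to routing the record to the corrupt-record column rather than failing the task.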

How was this patch tested?

Additional unit tests have been added.

@NathanHowell
Author

Hello @HyukjinKwon, can you take a look at this one? I am unsure if we should be accepting lowercased values like nan (versus strictly testing for NaN), but I think this PR matches the original intent of the code.

@HyukjinKwon
Member

@NathanHowell Thank you for cc'ing me. I will try my best to take a look by tomorrow.

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69828 has finished for PR 16199 at commit 11ac443.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Dec 8, 2016

@NathanHowell, while tracking down the history, I found a similar PR that includes this change: https://github.com/apache/spark/pull/9759/files#diff-8affe5ec7d691943a88e43eb30af656e (it seems to have been reverted due to conflicts in dev/deps/spark-deps-hadoop*, which are not related to this PR).

Would it make sense to take out the valid changes from there? It seems safe to follow, as that change was already approved by several reviewers (and I also like it).

```scala
@@ -1764,4 +1764,37 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
    val df2 = spark.read.option("PREfersdecimaL", "true").json(records)
    assert(df2.schema == schema)
  }

  test("SPARK-18772: Special floats") {
    val records = sparkContext
```

I think it would be nicer if it had some roundtrip tests for reading and writing.
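The roundtrip idea can be illustrated in plain Scala without Spark (`writeSpecial` and `readSpecial` are hypothetical stand-ins for the JSON writer and parser):

```scala
// Hypothetical stand-ins for the JSON writer and parser, to show
// the write-then-read roundtrip such a test would exercise.
object RoundTrip {
  def writeSpecial(d: Double): String =
    if (d.isNaN) "NaN"
    else if (d.isPosInfinity) "Infinity"
    else if (d.isNegInfinity) "-Infinity"
    else d.toString

  def readSpecial(s: String): Double = s match {
    case "NaN"       => Double.NaN
    case "Infinity"  => Double.PositiveInfinity
    case "-Infinity" => Double.NegativeInfinity
    case other       => other.toDouble
  }
}
```

A roundtrip test then asserts that `readSpecial(writeSpecial(d))` recovers `d` for each special value (comparing NaN via `isNaN`, since `NaN != NaN`).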

@HyukjinKwon
Member

What do you think about my suggestion @NathanHowell ?

@NathanHowell
Author

@HyukjinKwon Good idea, I'll take another stab and try to revive the original pull request.

@HyukjinKwon
Member

Gentle ping @NathanHowell, how is it going?

@HyukjinKwon
Member

@NathanHowell, please let me know. I can pick up the commits and take over.

asfgit pushed a commit that referenced this pull request May 13, 2017
…s in JSON

## What changes were proposed in this pull request?

This PR is based on #16199 and extracts the valid change from #9759 to resolve SPARK-18772.

This avoids an additional conversion attempt with `toFloat` and `toDouble`.

For the resulting behavior change, please refer to the examples below:

**Before**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
```

**After**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...
```
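The before/after difference above can be sketched with a hypothetical helper in plain Scala (the exact set of accepted spellings is an assumption; the point is that special values are matched directly, and a failed parse raises a `RuntimeException` naming the target type instead of a bare `NumberFormatException`):

```scala
// Hypothetical sketch of the merged behavior: exact-case special
// spellings parse directly; anything else falls back to toDouble,
// and failures surface as "Cannot parse ... as DoubleType.".
object ParseErrors {
  def toDoubleOrFail(s: String): Double = s match {
    case "NaN"                => Double.NaN
    case "Infinity" | "+INF"  => Double.PositiveInfinity
    case "-Infinity" | "-INF" => Double.NegativeInfinity
    case other =>
      try other.toDouble
      catch {
        case _: NumberFormatException =>
          throw new RuntimeException(s"Cannot parse $other as DoubleType.")
      }
  }
}
```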

## How was this patch tested?

Unit tests added in `JsonSuite`.

Closes #16199

Author: hyukjinkwon <[email protected]>
Author: Nathan Howell <[email protected]>

Closes #17956 from HyukjinKwon/SPARK-18772.

(cherry picked from commit 3f98375)
Signed-off-by: Wenchen Fan <[email protected]>
asfgit closed this in 3f98375 May 13, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017