
[SPARK-9101] [PySpark] Add missing NullType #7499

Closed
wants to merge 2 commits into from

Conversation

@sixers commented Jul 18, 2015

@rxin (Contributor) commented Jul 18, 2015

Jenkins, test this please.

@rxin (Contributor) commented Jul 18, 2015

cc @davies

@davies (Contributor) commented Jul 18, 2015

I'm just wondering whether there is a real use case that needs NullType. Currently, it's only used during type inference.

@SparkQA commented Jul 18, 2015

Test build #37728 has finished for PR 7499 at commit 97e3f2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Expression extends TreeNode[Expression]
    • case class IsNaN(child: Expression) extends UnaryExpression
    • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging
    • abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product
    • abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
    • case class ConvertToUnsafe(child: SparkPlan) extends UnaryNode
    • case class ConvertToSafe(child: SparkPlan) extends UnaryNode

@rxin (Contributor) commented Jul 18, 2015

It can happen if there is a null literal -- I'm not sure what happens in Python though.

@JoshRosen (Contributor)

@rxin, @davies, the JIRA ticket contains an example of a query that fails due to this issue: https://issues.apache.org/jira/browse/SPARK-9101

@sixers, it might be nice to add a regression test based on the simple example you gave in the JIRA.
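
For context, a minimal reproduction along these lines exercises the same code path (a sketch against the 1.4-era SQLContext API; the exact snippet from SPARK-9101 is not reproduced here):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[1]", "spark-9101-repro")  # assumed local setup
sqlContext = SQLContext(sc)

# A null literal gives the column a NullType in the JVM-side schema, and
# collect() forces PySpark to parse that schema back into Python types.
df = sqlContext.sql("SELECT NULL AS col")
print(df.collect())

# Before this patch the parse failed, because the JSON type name 'null'
# had no matching class registered in pyspark.sql.types.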

@davies (Contributor) commented Jul 19, 2015

@JoshRosen I see, thanks!

@sixers (Author) commented Jul 19, 2015

@JoshRosen @davies @rxin

This is my first contribution to Spark. Could you give me some direction on where to put this test?

In general, what is broken is parsing the schema of a Java DataFrame that contains a NullType. It's done lazily here:

def schema(self):

which eventually uses _parse_datatype_json_value to parse the schema:

def _parse_datatype_json_value(json_value):

So it also breaks in other cases, like these:

sqlContext.createDataFrame(sc.parallelize([(None,1),(None,2), (None,3), (None, 4)]), samplingRatio=0.5).collect()
sqlContext.createDataFrame([[None]], schema=StructType([StructField("col", NullType(), True)])).collect()

Because of that, I think tests should be written for _parse_datatype_json_value.

There are some tests in _parse_datatype_json_string:

def _parse_datatype_json_string(json_string):

Tests for the simple types are dynamic, created by iterating over _all_atomic_types, where NullType was missing. Now it's included in those tests:

>>> for cls in _all_atomic_types.values():
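
In essence, that dynamic check round-trips every registered atomic type through its JSON form. A simplified, pure-Python sketch of the idea, using the private helpers named above and skipping the JVM round trip the real doctest performs:

from pyspark.sql.types import _all_atomic_types, _parse_datatype_json_string

# Serialize every registered atomic type to JSON and parse it back; NullType
# used to be absent from _all_atomic_types, so this loop never exercised it.
for cls in _all_atomic_types.values():
    assert _parse_datatype_json_string(cls().json()) == cls(), cls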

In general, I think there are two options:

  • unroll those dynamic checks and write an explicit check for each atomic type
  • add an additional NullType field to the complex_structtype test:
    >>> complex_structtype = StructType([

I'm not sure if it brings any value.

What do you think? Should I go with one of those, or do you see other places where I could introduce a test for this?

@JoshRosen (Contributor)

@sixers, my suggestion was to add an end-to-end test, like sqlContext.sql("select null").collect(), to PySpark SQL's SQLTests unittest suite:

class SQLTests(ReusedPySparkTestCase):

This could be a new test case, named something like test_select_null_literal.

The fact that this bug went unnoticed for so long implies that our Python suite doesn't contain any tests that try to select null literals, which is why I wanted to add such a test.
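
A sketch of how such a case might look inside that suite (the method name follows the suggestion above; the sqlCtx fixture and the exact assertion are assumptions, not necessarily the test this PR ended up adding):

from pyspark.sql import Row

# Inside pyspark.sql.tests.SQLTests (sketch only):
def test_select_null_literal(self):
    # End-to-end check: the null literal gets a NullType column on the JVM
    # side, and collect() forces PySpark to parse that schema.
    df = self.sqlCtx.sql("SELECT null AS col")
    self.assertEqual([Row(col=None)], df.collect())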

@sixers (Author) commented Jul 20, 2015

@JoshRosen

I see, thanks for the suggestion. I've added the test.

@JoshRosen (Contributor)

Jenkins, this is ok to test.

@SparkQA commented Jul 20, 2015

Test build #37847 has finished for PR 7499 at commit dd75aa6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jul 20, 2015

Thanks! Going to merge this.

@rxin (Contributor) commented Jul 20, 2015

Actually I'm having some trouble with ASF git. I will merge when that works.

@rxin (Contributor) commented Jul 20, 2015

I merged it.

asfgit pushed a commit that referenced this pull request Jul 20, 2015
JIRA: https://issues.apache.org/jira/browse/SPARK-9101

Author: Mateusz Buśkiewicz <[email protected]>

Closes #7499 from sixers/spark-9101 and squashes the following commits:

dd75aa6 [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Test for selecting null literal
97e3f2f [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Add missing NullType to _atomic_types in pyspark.sql.types

(cherry picked from commit 02181fb)
Signed-off-by: Reynold Xin <[email protected]>
asfgit closed this in 02181fb Jul 20, 2015
@rxin (Contributor) commented Jul 20, 2015

Note: I merged it into master (1.5.0), as well as branch-1.4 (1.4.2).
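
For reference, the substance of the fix is the atomic-type registry in pyspark/sql/types.py. A simplified sketch of its shape with NullType included (the type classes are already in scope inside that module; the exact list follows the Spark source and is not reproduced verbatim here):

# Every atomic DataType class is registered under its typeName(), and
# _parse_datatype_json_value resolves JSON type names against this dict.
# NullType (type name 'null') was missing, so schemas containing a null
# column could not be parsed on the Python side.
_atomic_types = [StringType, BinaryType, BooleanType, DecimalType, FloatType,
                 DoubleType, ByteType, ShortType, IntegerType, LongType,
                 DateType, TimestampType, NullType]
_all_atomic_types = dict((t.typeName(), t) for t in _atomic_types)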
