
[SPARK-9101] [PySpark] Add missing NullType #7499

Closed
wants to merge 2 commits into from

Conversation

@sixers commented Jul 18, 2015

@rxin (Contributor) commented Jul 18, 2015

Jenkins, test this please.

@rxin (Contributor) commented Jul 18, 2015

cc @davies

@davies (Contributor) commented Jul 18, 2015

I'm just wondering whether there is a real use case that needs NullType. Currently, it's only used during type inference.

@SparkQA commented Jul 18, 2015

Test build #37728 has finished for PR 7499 at commit 97e3f2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Expression extends TreeNode[Expression]
    • case class IsNaN(child: Expression) extends UnaryExpression
    • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging
    • abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product
    • abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
    • case class ConvertToUnsafe(child: SparkPlan) extends UnaryNode
    • case class ConvertToSafe(child: SparkPlan) extends UnaryNode

@rxin (Contributor) commented Jul 18, 2015

It can happen if there is a null literal -- I'm not sure what happens in Python though.

@JoshRosen (Contributor)

@rxin, @davies, the JIRA ticket contains an example of a query that fails due to this issue: https://issues.apache.org/jira/browse/SPARK-9101

@sixers, it might be nice to add a regression test based on the simple example you gave in the JIRA.
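
For context, a minimal reproduction along these lines exercises the same code path (a sketch against the 1.4-era SQLContext API; the exact snippet from SPARK-9101 is not reproduced here):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[1]", "spark-9101-repro")  # assumed local setup
sqlContext = SQLContext(sc)

# A null literal gives the column a NullType in the JVM-side schema, and
# collect() forces PySpark to parse that schema back into Python types.
df = sqlContext.sql("SELECT NULL AS col")
print(df.collect())

# Before this patch the parse failed, because the JSON type name 'null'
# had no matching class registered in pyspark.sql.types.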

@davies (Contributor) commented Jul 19, 2015

@JoshRosen I see, thanks!

@sixers (Author) commented Jul 19, 2015

@JoshRosen @davies @rxin

This is my first contribution to Spark. Could you give me some direction on where to put this test?

In general, what is broken is parsing the schema of a Java DataFrame that contains a NullType. It's done lazily here:

def schema(self):

which eventually uses _parse_datatype_json_value to parse the schema:

def _parse_datatype_json_value(json_value):

So it also breaks in other cases, like these:

sqlContext.createDataFrame(sc.parallelize([(None,1),(None,2), (None,3), (None, 4)]), samplingRatio=0.5).collect()
sqlContext.createDataFrame([[None]], schema=StructType([StructField("col", NullType(), True)])).collect()

Because of that, I think tests should be written for _parse_datatype_json_value.

There are some tests in _parse_datatype_json_string:

def _parse_datatype_json_string(json_string):

Tests for the simple types are dynamic, created by iterating over _all_atomic_types, where NullType was missing. Now it's included in those tests:

>>> for cls in _all_atomic_types.values():
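
In essence, that dynamic check round-trips every registered atomic type through its JSON form. A simplified, pure-Python sketch of the idea, using the private helpers named above and skipping the JVM round trip the real doctest performs:

from pyspark.sql.types import _all_atomic_types, _parse_datatype_json_string

# Serialize every registered atomic type to JSON and parse it back; NullType
# used to be absent from _all_atomic_types, so this loop never exercised it.
for cls in _all_atomic_types.values():
    assert _parse_datatype_json_string(cls().json()) == cls(), cls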

In general, I think there are two options:

  • unroll those dynamic checks and write an explicit check for each atomic type
  • add an additional NullType field to the complex_structtype test:
    >>> complex_structtype = StructType([

I'm not sure if it brings any value.

What do you think? Should I go with one of those, or do you see other places where I could introduce a test for this?

@JoshRosen (Contributor)

@sixers, my suggestion was to add an end-to-end test, like sqlContext.sql("select null").collect(), to PySpark SQL's SQLTests unittest suite:

class SQLTests(ReusedPySparkTestCase):

This could be a new test case, named something like test_select_null_literal.

The fact that this bug went unnoticed for so long implies that our Python suite doesn't contain any tests that try to select null literals, which is why I wanted to add such a test.
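
A sketch of how such a case might look inside that suite (the method name follows the suggestion above; the sqlCtx fixture and the exact assertion are assumptions, not necessarily the test this PR ended up adding):

from pyspark.sql import Row

# Inside pyspark.sql.tests.SQLTests (sketch only):
def test_select_null_literal(self):
    # End-to-end check: the null literal gets a NullType column on the JVM
    # side, and collect() forces PySpark to parse that schema.
    df = self.sqlCtx.sql("SELECT null AS col")
    self.assertEqual([Row(col=None)], df.collect())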

@sixers (Author) commented Jul 20, 2015

@JoshRosen

I see, thanks for the suggestion. I've added the test.

@JoshRosen (Contributor)

Jenkins, this is ok to test.

@SparkQA commented Jul 20, 2015

Test build #37847 has finished for PR 7499 at commit dd75aa6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jul 20, 2015

Thanks! Going to merge this.

@rxin (Contributor) commented Jul 20, 2015

Actually I'm having some trouble with ASF git. I will merge when that works.

@rxin (Contributor) commented Jul 20, 2015

I merged it.

asfgit pushed a commit that referenced this pull request Jul 20, 2015
JIRA: https://issues.apache.org/jira/browse/SPARK-9101

Author: Mateusz Buśkiewicz <[email protected]>

Closes #7499 from sixers/spark-9101 and squashes the following commits:

dd75aa6 [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Test for selecting null literal
97e3f2f [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Add missing NullType to _atomic_types in pyspark.sql.types

(cherry picked from commit 02181fb)
Signed-off-by: Reynold Xin <[email protected]>
asfgit closed this in 02181fb Jul 20, 2015
@rxin (Contributor) commented Jul 20, 2015

Note: I merged it into master (1.5.0), as well as branch-1.4 (1.4.2).
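
For reference, the substance of the fix is the atomic-type registry in pyspark/sql/types.py. A simplified sketch of its shape with NullType included (the type classes are already in scope inside that module; the exact list follows the Spark source and is not reproduced verbatim here):

# Every atomic DataType class is registered under its typeName(), and
# _parse_datatype_json_value resolves JSON type names against this dict.
# NullType (type name 'null') was missing, so schemas containing a null
# column could not be parsed on the Python side.
_atomic_types = [StringType, BinaryType, BooleanType, DecimalType, FloatType,
                 DoubleType, ByteType, ShortType, IntegerType, LongType,
                 DateType, TimestampType, NullType]
_all_atomic_types = dict((t.typeName(), t) for t in _atomic_types)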
