[SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specifi… · nemccarthy/spark@4588b97

[SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row accessors

This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features.

At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConverter`. In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods. This refactoring is a prerequisite for unit testing new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances, which don't support the generic `get()`.
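The difference between generic and type-specific access can be illustrated with a minimal, self-contained sketch. None of the names below (`SketchRow`, `PackedSketchRow`, `SketchConverter`, and so on) are Spark classes; they are hypothetical stand-ins that only assume what the description above states: one converter subclass per type, and a row implementation that cannot serve the generic `get()`.

```java
// Hypothetical stand-in for a Catalyst row; this is NOT Spark's actual API.
interface SketchRow {
    Object get(int ordinal);        // generic accessor, returns boxed values
    int getInt(int ordinal);        // type-specific accessors
    double getDouble(int ordinal);
}

// A row backed by an Object[] can support both access styles.
final class GenericSketchRow implements SketchRow {
    private final Object[] values;
    GenericSketchRow(Object... values) { this.values = values; }
    public Object get(int ordinal) { return values[ordinal]; }
    public int getInt(int ordinal) { return (Integer) values[ordinal]; }
    public double getDouble(int ordinal) { return (Double) values[ordinal]; }
}

// A row backed by raw 8-byte words (loosely inspired by the idea of UnsafeRow).
// It has no per-field type information, so the generic get() is unsupported.
final class PackedSketchRow implements SketchRow {
    private final long[] words;
    PackedSketchRow(long... words) { this.words = words; }
    public Object get(int ordinal) {
        throw new UnsupportedOperationException("generic get() is not supported");
    }
    public int getInt(int ordinal) { return (int) words[ordinal]; }
    public double getDouble(int ordinal) { return Double.longBitsToDouble(words[ordinal]); }
}

// One converter subclass per type; each one calls the matching getTYPE() method.
abstract class SketchConverter<T> {
    abstract T toScala(SketchRow row, int column);
}

final class IntSketchConverter extends SketchConverter<Integer> {
    Integer toScala(SketchRow row, int column) { return row.getInt(column); }
}

final class DoubleSketchConverter extends SketchConverter<Double> {
    Double toScala(SketchRow row, int column) { return row.getDouble(column); }
}

class RowAccessSketch {
    public static void main(String[] args) {
        SketchRow packed = new PackedSketchRow(42L, Double.doubleToLongBits(1.5));
        // Both conversions succeed because they never touch the generic get().
        System.out.println(new IntSketchConverter().toScala(packed, 0));
        System.out.println(new DoubleSketchConverter().toScala(packed, 1));
    }
}
```

Because each converter knows its own type, it works unchanged on both row implementations in this sketch, whereas a converter written against `get()` would fail on the packed row.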

The stricter use of types here has uncovered some bugs in other parts of Spark SQL:

- apache#6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
- apache#6218: DataFrame.describe() should cast all aggregates to String
- apache#6400: Use output schema, not relation schema, for data source input conversion

Spark SQL currently has undefined behavior when you try to create a DataFrame from user-specified rows whose values don't match the declared schema. According to the `createDataFrame()` Scaladoc:

>  It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.

Given this, it sounds like failing fast when the data types don't match is technically not a break of our API contract. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hashingTF` passes a column of integer values for a "label" column which is supposed to contain doubles. This column isn't actually read or modified as part of query processing, so its concrete type doesn't seem to matter. In other cases, generic numeric aggregates may tolerate being called with numeric types other than those the schema specifies, which can be okay due to numeric conversions.

In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows.  Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch.  Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.
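To make the fail-fast behavior concrete, here is a minimal sketch in plain Java. The class and method names (`InboundDoubleSketch`, `toCatalystStrict`, `toCatalystWidening`) are hypothetical and are not taken from the patch; the sketch only assumes that a strict inbound conversion casts the incoming value to the declared type. The widening variant is shown purely as a contrast for the "implicit type conversions / widening" semantics mentioned above, not as something this patch implements.

```java
// Hypothetical inbound converter for a column declared as a double.
// Names are illustrative only; null handling is deliberately omitted.
final class InboundDoubleSketch {
    // Strict: cast to the declared type, so a mismatched value fails immediately
    // with a ClassCastException (for example, when handed a boxed Integer).
    static double toCatalystStrict(Object value) {
        return (Double) value;
    }

    // Lenient contrast: accept any Number and widen it to double.
    // This is NOT what the patch does; it only shows what widening could look like.
    static double toCatalystWidening(Object value) {
        return ((Number) value).doubleValue();
    }
}

class FailFastSketch {
    public static void main(String[] args) {
        Object label = 0; // boxed Integer, like the label in RowFactory.create(0, ...)
        try {
            InboundDoubleSketch.toCatalystStrict(label);
        } catch (ClassCastException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        // A value of the declared type converts without error.
        System.out.println(InboundDoubleSketch.toCatalystStrict(0.0));
    }
}
```

This is the same mismatch the `JavaHashingTFSuite` fix addresses: an integer literal boxes to `Integer`, which cannot be cast to `Double`, so the test data had to use `0.0` and `1.0`.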

Author: Josh Rosen <[email protected]>

Closes apache#6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:

740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
befc613 [Josh Rosen] Add tests to document Option-handling behavior.
5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
677ff27 [Josh Rosen] Fix null handling bug; add tests.
8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
7ca7fcb [Josh Rosen] Comments and cleanup
1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters

JoshRosen authored and nemccarthy committed Jun 19, 2015

1 parent 1567993 commit 4588b97

mllib/src/test/java/org/apache/spark/ml/feature/JavaHashingTFSuite.java

            
                  
@@ -55,9 +55,9 @@ public void tearDown() {
   @Test
   public void hashingTF() {
     JavaRDD<Row> jrdd = jsc.parallelize(Lists.newArrayList(
-      RowFactory.create(0, "Hi I heard about Spark"),
-      RowFactory.create(0, "I wish Java could use case classes"),
-      RowFactory.create(1, "Logistic regression models are neat")
+      RowFactory.create(0.0, "Hi I heard about Spark"),
+      RowFactory.create(0.0, "I wish Java could use case classes"),
+      RowFactory.create(1.0, "Logistic regression models are neat")
     ));
     StructType schema = new StructType(new StructField[]{
       new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
