[SPARK-7858] [SQL] Use output schema, not relation schema, for data source input conversion #6400

JoshRosen · 2015-05-26T02:33:58Z

In DataSourceStrategy.createPhysicalRDD, we use the relation schema as the target schema for converting incoming rows into Catalyst rows. However, we should be using the output schema instead, since our scan might return a subset of the relation's columns.
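
The mismatch described above can be sketched in plain Scala with no Spark dependency. Everything in this block is an assumption for illustration: `Field`, `converterFor`, and `convert` are hypothetical stand-ins (the converter only loosely plays the role of `CatalystTypeConverters.createToCatalystConverter`), not the actual code in `createPhysicalRDD`.

```scala
// Hypothetical stand-ins for illustration; none of these are Spark classes.
object SchemaMismatchSketch {
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType

  final case class Field(name: String, dataType: DataType)

  // Stand-in for a per-type converter that rejects values of the wrong runtime type.
  def converterFor(dt: DataType): Any => Any = dt match {
    case IntType => {
      case i: Int => i
      case other  => throw new ClassCastException(s"expected Int, got $other")
    }
    case StringType => {
      case s: String => s
      case other     => throw new ClassCastException(s"expected String, got $other")
    }
  }

  // Full schema of the relation: three columns.
  val relationSchema: Seq[Field] =
    Seq(Field("a", IntType), Field("b", StringType), Field("p1", IntType))

  // Convert a scanned row position by position, using converters built from `fields`.
  def convert(row: Seq[Any], fields: Seq[Field]): Seq[Any] =
    row.zip(fields).map { case (value, field) => converterFor(field.dataType)(value) }
}
```

In this sketch, a scan that returns only `(b, p1)` converts cleanly when `fields` is the output schema `Seq(Field("b", StringType), Field("p1", IntType))`. Passing `relationSchema` instead applies the converter for `a` to the string in position 0 and throws.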

This patch incorporates #6414 by @liancheng, which fixes an issue in SimpleTextRelation that prevented this bug from being caught by our old tests:

In SimpleTextRelation, we specified needsConversion to true, indicating that values produced by this testing relation should be of Scala types, and need to be converted to Catalyst types when necessary. However, we also used Cast to convert strings to expected data types. And Cast always produces values of Catalyst types, thus no conversion is done at all. This PR makes SimpleTextRelation produce Scala values so that data conversion code paths can be properly tested.

Closes #5986.

JoshRosen · 2015-05-26T02:41:06Z

The tests for the first commit should fail the assertion that it added, demonstrating the bug. My second commit should fix it.

JoshRosen · 2015-05-26T02:45:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala

-        val converters = schemaFields.map {
-          f => CatalystTypeConverters.createToCatalystConverter(f.dataType)
-        }
+        val mutableRow = new SpecificMutableRow(outputTypes)


Whoops, I copy-pasted a bit too eagerly here; this should be a GenericMutableRow for consistency with the old code. Will fix now.

SparkQA · 2015-05-26T02:57:28Z

Test build #33488 has finished for PR 6400 at commit 8e547a2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-05-26T03:05:13Z

I went ahead and removed the use of BufferedIterator; based on looking through the Git history, it looks like this was used in a very old version of this code where we didn't have a schema and thus did not know how many columns to expect, so we had to peek at the first row in order to construct converters.

yhuai · 2015-05-26T03:26:10Z

How about we add a test using SimpleTextSource (since it does not override needsConversion)? We can test accessing a subset of columns, and projecting columns with different data types in a different order (for example, if the schema orders the columns a, b, c, d, we access them as d, c, b, a).

JoshRosen · 2015-05-26T03:32:29Z

Good idea; I can do this and remove the asserts, which might be expensive. EDIT: they're probably pretty cheap, actually, compared to the other costs, so I might leave them in.

SparkQA · 2015-05-26T04:37:54Z

Test build #33489 has finished for PR 6400 at commit df80bfa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-05-26T05:39:30Z

Test build #33492 timed out for PR 6400 at commit cc85825 after a configured wait of 150m.

JoshRosen · 2015-05-26T06:03:55Z

I added a simple regression test in HadoopFsRelationTest. Projecting more columns than the input relation contains seems to trigger an ArrayIndexOutOfBoundsException when the wrong schema / types are used.

JoshRosen · 2015-05-26T06:27:07Z

Well, that wasn't the failure I was hoping for, but you should be able to check out e6d97fb and run SimpleTextHadoopFsRelationSuite to see that this test can actually reproduce the bug.

SparkQA · 2015-05-26T08:38:13Z

Test build #33505 timed out for PR 6400 at commit f5cdec6 after a configured wait of 150m.

liancheng · 2015-05-26T15:41:32Z

sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala

@@ -76,6 +76,12 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils {
      df.filter('a > 1 && 'p1 < 2).select('b, 'p1),
      for (i <- 2 to 3; _ <- Seq("foo", "bar")) yield Row(s"val_$i", 1))


@JoshRosen Found that I made a mistake in this checkAnswer test, which should have caught this bug... Values of p1 should be either "foo" or "bar", but I put a 1 there. The correct version of this test should be:

// Simple projection and partition pruning
checkAnswer(
  df.filter('a > 1 && 'p1 < 2).select('b, 'p1),
  for (i <- 2 to 3; j <- Seq("foo", "bar")) yield Row(s"val_$i", j))

Are you sure that p1 is string-typed? I think that p2 is, but there's other code which implies that p1 should be an int.

e.g. at the top of HadoopFsRelationTest:

val partitionedTestDF1 = (for {
  i <- 1 to 3
  p2 <- Seq("foo", "bar")
} yield (i, s"val_$i", 1, p2)).toDF("a", "b", "p1", "p2")

val partitionedTestDF2 = (for {
  i <- 1 to 3
  p2 <- Seq("foo", "bar")
} yield (i, s"val_$i", 2, p2)).toDF("a", "b", "p1", "p2")

Yeah, I think the type of p1 is int.

cc @liancheng

Ah, sorry my bad.

SparkQA · 2015-05-26T19:39:56Z

Test build #860 has finished for PR 6400 at commit f5cdec6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2015-05-26T23:08:22Z

@JoshRosen Would you please also help fix the minor bug in the test case mentioned in #6400 (comment)? Thanks!

SparkQA · 2015-05-26T23:50:15Z

Test build #33543 has finished for PR 6400 at commit e71c866.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-05-27T03:15:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala

+      val numColumns = outputTypes.length
+      val mutableRow = new GenericMutableRow(numColumns)
+      val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
+      iterator.map { r =>


Maybe a while loop will be better?

This is actually the return value of the mapPartitions call, which must be a Scala iterator. Note that we do use a while loop to iterate over the columns.

Actually, I am fine with the current version. We are not slower than the previous version.

This is probably faster than the older version since it doesn't use an unnecessary bufferedIterator :)

Yeah, I also realized that after I made my comment.
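
The shape discussed in this thread can be sketched without Spark as follows. This is an illustrative assumption, not the code in ExistingRDD.scala: `convertPartition` is a hypothetical stand-in for the function passed to mapPartitions, and a plain `Array[Any]` stands in for GenericMutableRow.

```scala
object MapPartitionsSketch {
  // Hypothetical stand-in for the per-partition function handed to mapPartitions.
  // It must return an Iterator, so the row loop is iterator.map; the column
  // loop inside it is a plain while loop writing into a reused buffer.
  def convertPartition(
      rows: Iterator[Seq[Any]],
      converters: Array[Any => Any]): Iterator[Array[Any]] = {
    val numColumns = converters.length
    val buffer = new Array[Any](numColumns) // one buffer, reused for every row
    rows.map { r =>
      var i = 0
      while (i < numColumns) {
        buffer(i) = converters(i)(r(i))
        i += 1
      }
      buffer
    }
  }
}
```

Because the same buffer is returned for every row in this sketch, a caller that retains rows has to copy each one as it is consumed, for example with `.map(_.toList)`.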

yhuai · 2015-05-27T03:24:01Z

LGTM. I am merging it to master and branch 1.4.

…ource input conversion

In `DataSourceStrategy.createPhysicalRDD`, we use the relation schema as the target schema for converting incoming rows into Catalyst rows. However, we should be using the output schema instead, since our scan might return a subset of the relation's columns.

This patch incorporates #6414 by liancheng, which fixes an issue in `SimpleTextRelation` that prevented this bug from being caught by our old tests:

> In `SimpleTextRelation`, we specified `needsConversion` to `true`, indicating that values produced by this testing relation should be of Scala types, and need to be converted to Catalyst types when necessary. However, we also used `Cast` to convert strings to expected data types. And `Cast` always produces values of Catalyst types, thus no conversion is done at all. This PR makes `SimpleTextRelation` produce Scala values so that data conversion code paths can be properly tested.

Closes #5986.

Author: Josh Rosen
Author: Cheng Lian

Closes #6400 from JoshRosen/SPARK-7858 and squashes the following commits:

e71c866 [Josh Rosen] Re-fix bug so that the tests pass again
56b13e5 [Josh Rosen] Add regression test to hadoopFsRelationSuites
2169a0f [Josh Rosen] Remove use of SpecificMutableRow and BufferedIterator
6cd7366 [Josh Rosen] Fix SPARK-7858 by using output types for conversion.
5a00e66 [Josh Rosen] Add assertions in order to reproduce SPARK-7858
8ba195c [Cheng Lian] Merge 9968fba into 6166473
9968fba [Cheng Lian] Tests the data type conversion code paths

(cherry picked from commit 0c33c7b)
Signed-off-by: Yin Huai

…c row accessors

This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features. At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConverter`. In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods. This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.

The stricter use of types here has uncovered some bugs in other parts of Spark SQL:

- #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
- #6218: DataFrame.describe() should cast all aggregates to String
- #6400: Use output schema, not relation schema, for data source input conversion

Spark SQL currently has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema. According to the `createDataFrame()` Scaladoc:

> It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.

Given this, it sounds like it's technically not a break of our API contract to fail fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hashingTF` passes a column of integer values for a "label" column which is supposed to contain floats. This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter.

In other cases, there could be situations where we have generic numeric aggregates that tolerate being called with different numeric types than the schema specified, but this can be okay due to numeric conversions. In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows. Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch. Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.

Author: Josh Rosen

Closes #6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:

740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
befc613 [Josh Rosen] Add tests to document Option-handling behavior.
5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
677ff27 [Josh Rosen] Fix null handling bug; add tests.
8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
7ca7fcb [Josh Rosen] Comments and cleanup
1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
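
The idea of reading fields through type-specific getters, and failing fast when a value does not match the declared type, might look roughly like the sketch below. All names here (`TypedRow`, `ArrayBackedRow`, `reader`, `toScala`) are hypothetical and only loosely modelled on the description above; they are not the refactored `CatalystTypeConverters` API or Spark's row classes.

```scala
object TypedAccessorSketch {
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType

  // Hypothetical row that exposes only type-specific getters and no generic get().
  trait TypedRow {
    def getInt(i: Int): Int
    def getString(i: Int): String
  }

  // Array-backed implementation; each getter casts, so a value of the wrong
  // runtime type fails immediately with a ClassCastException.
  final class ArrayBackedRow(values: Array[Any]) extends TypedRow {
    def getInt(i: Int): Int = values(i).asInstanceOf[Int]
    def getString(i: Int): String = values(i).asInstanceOf[String]
  }

  // Choose the accessor for a column from its declared type.
  def reader(dt: DataType): (TypedRow, Int) => Any = dt match {
    case IntType    => (row, i) => row.getInt(i)
    case StringType => (row, i) => row.getString(i)
  }

  // Read every column of `row` according to the declared `schema`.
  def toScala(row: TypedRow, schema: Seq[DataType]): Seq[Any] =
    schema.zipWithIndex.map { case (dt, i) => reader(dt)(row, i) }
}
```

In this sketch a schema that agrees with the stored values reads back cleanly, while a schema that declares the wrong type for a column surfaces the mismatch at the first access rather than later in query processing.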

JoshRosen mentioned this pull request May 26, 2015

[SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row accessors #6222

Closed

JoshRosen reviewed May 26, 2015

liancheng reviewed May 26, 2015

Tests the data type conversion code paths

9968fba

Merge 9968fba into 6166473

8ba195c

JoshRosen mentioned this pull request May 26, 2015

[SQL] [Minor] Refactor SimpleTextRelation to test the data type conversion code paths #6414

Closed

JoshRosen added 5 commits May 26, 2015 14:42

Add assertions in order to reproduce SPARK-7858

5a00e66

Fix SPARK-7858 by using output types for conversion.

6cd7366

Remove use of SpecificMutableRow and BufferedIterator

2169a0f

Add regression test to hadoopFsRelationSuites

56b13e5

Re-fix bug so that the tests pass again

e71c866

JoshRosen force-pushed the SPARK-7858 branch from f5cdec6 to e71c866 Compare May 26, 2015 21:42

yhuai reviewed May 27, 2015

asfgit closed this in 0c33c7b May 27, 2015

JoshRosen deleted the SPARK-7858 branch May 27, 2015 03:27
