[SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table #16313

Closed
wants to merge 3 commits into apache:master from cloud-fan:bug1

Conversation

cloud-fan (Contributor) commented Dec 16, 2016

What changes were proposed in this pull request?

When we append data to an existing table with DataFrameWriter.saveAsTable, we do various checks to make sure the appended data is consistent with the existing data.

However, we get the information about the existing table by matching the table relation instead of looking at the table metadata. This is error-prone: for example, we only check the number of columns for HadoopFsRelation, and we forget to check bucketing, etc.

This PR refactors the error checking to look at the metadata of the existing table, and fixes several bugs:

  • SPARK-18899: We forgot to check whether the specified bucketing matches the existing table, which may lead to a problematic table that has different bucketing in different data files (see the sketch after this list).
  • SPARK-18912: We forgot to check the number of columns for non-file-based data source tables.
  • SPARK-18913: We didn't support appending data to a table with special column names.
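
For illustration, a minimal spark-shell sketch of the SPARK-18899 case, assuming a hypothetical table name bucketed_tab (the names are not taken from the PR):

    import org.apache.spark.sql.SaveMode

    val df = spark.range(10).selectExpr("id", "id AS j")

    // Create a table bucketed by `id` into 4 buckets.
    df.write.bucketBy(4, "id").saveAsTable("bucketed_tab")

    // Before this PR, appending with different bucketing (8 buckets) could
    // silently produce data files with inconsistent bucketing; with this PR
    // it fails with an AnalysisException.
    df.write.mode(SaveMode.Append).bucketBy(8, "id").saveAsTable("bucketed_tab")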

How was this patch tested?

A new regression test.

@cloud-fan cloud-fan changed the title [SPARK-18899][SQL] append a bucketed table using DataFrameWriter with mismatched bucketing should fail [SPARK-18899][SQL] append data to a bucketed table with mismatched bucketing should fail Dec 16, 2016
cloud-fan (Contributor, Author) commented:

cc @yhuai @gatorsmile


SparkQA commented Dec 16, 2016

Test build #70257 has finished for PR 16313 at commit 370bdc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-18899][SQL] append data to a bucketed table with mismatched bucketing should fail [SPARK-18899][SPARK-18912][SPARK-18913][SQL] fix error checking logic when append data to an existing table Dec 17, 2016
@cloud-fan cloud-fan changed the title [SPARK-18899][SPARK-18912][SPARK-18913][SQL] fix error checking logic when append data to an existing table [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table Dec 17, 2016

SparkQA commented Dec 17, 2016

Test build #70307 has finished for PR 16313 at commit 7a9bfaa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 17, 2016

Test build #70309 has finished for PR 16313 at commit b1dbd0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -157,39 +156,74 @@ case class CreateDataSourceTableAsSelectCommand(
        // Since the table already exists and the save mode is Ignore, we will just return.
        return Seq.empty[Row]
      case SaveMode.Append =>
        val existingTable = sessionState.catalog.getTableMetadata(tableIdentWithDB)
        if (existingTable.tableType == CatalogTableType.VIEW) {
          throw new AnalysisException("Saving data into a view is not allowed.")
gatorsmile (Member) commented:

We already have an assert at the beginning of this function. We need to improve the error message there. That can cover more cases.

So far, the error message is pretty confusing.

assertion failed
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:156)

We can add a test case for it.

        val df = spark.range(1, 10).toDF("id1")
        df.write.saveAsTable("tab1")
        spark.sql("create view view1 as select * from tab1")
        df.write.mode(SaveMode.Append).format("parquet").saveAsTable("view1")

cloud-fan (Contributor, Author) replied:

I'll remove this check as it's unreachable now, and fix the error message in another PR.

@@ -157,39 +156,74 @@ case class CreateDataSourceTableAsSelectCommand(
        // Since the table already exists and the save mode is Ignore, we will just return.
        return Seq.empty[Row]
      case SaveMode.Append =>
        val existingTable = sessionState.catalog.getTableMetadata(tableIdentWithDB)
gatorsmile (Member) commented:

Is it possible to directly use the input parameter table?

cloud-fan (Contributor, Author) replied: [comment body not captured]

gatorsmile (Member) replied:

uh, I see. Will do the final review today.

@@ -133,6 +133,16 @@ case class BucketSpec(
  if (numBuckets <= 0) {
    throw new AnalysisException(s"Expected positive number of buckets, but got `$numBuckets`.")
  }

  override def toString: String = {
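    // (Body elided by the diff view; the following is a plausible sketch of a
    // single-line rendering, not necessarily the exact code merged in this PR.)
    val bucketString = s"bucket columns: [${bucketColumnNames.mkString(", ")}]"
    val sortString = if (sortColumnNames.nonEmpty) {
      s", sort columns: [${sortColumnNames.mkString(", ")}]"
    } else {
      ""
    }
    s"$numBuckets buckets, $bucketString$sortString"
  }
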
gatorsmile (Member) commented:

Since we implement toString here, we can simplify our logic in describeBucketingInfo.

cloud-fan (Contributor, Author) replied:

How? toString returns a single line, while describeBucketingInfo generates three result lines.

gatorsmile (Member) replied:

: ) If we want to keep the existing format (three lines), then we can't do it.
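
To make the contrast concrete, a hedged illustration (values hypothetical, output formats approximated from the discussion rather than copied from Spark):

    // BucketSpec.toString, a single line:
    //   4 buckets, bucket columns: [id], sort columns: [ts]

    // describeBucketingInfo, three separate result lines:
    //   Num Buckets: 4
    //   Bucket Columns: [id]
    //   Sort Columns: [ts]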

@@ -459,7 +459,7 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
test("saveAsTable()/load() - partitioned table - ErrorIfExists") {
Seq.empty[(Int, String)].toDF().createOrReplaceTempView("t")

withTempView("t") {
gatorsmile (Member) commented:

We need withTempView("t") to drop the temp view created at the beginning of this test case, right?

cloud-fan (Contributor, Author) replied:

Yeah, my mistake.


SparkQA commented Dec 18, 2016

Test build #70321 has finished for PR 16313 at commit d5a464d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      if (specifiedPartCols != existingTable.partitionColumnNames) {
        throw new AnalysisException(
          s"""
             |Specified partitioning does not match the existing table $tableName.
gatorsmile (Member) commented:

Nit: grammar issue.

Specified partitioning does not match the existing table $tableName.
->
Specified partitioning does not match that of the existing table $tableName.

Found a reference link in the Sybase Adaptive Server docs.
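
For context, a hedged sketch of the mismatch this message reports (table and column names hypothetical, assuming df has year and month columns):

    // Create a table partitioned by `year`.
    df.write.partitionBy("year").saveAsTable("part_tab")

    // Appending with different partitioning now fails with the
    // "Specified partitioning does not match ..." AnalysisException.
    df.write.mode(SaveMode.Append).partitionBy("month").saveAsTable("part_tab")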

existingTable.bucketSpec.map(_.toString).getOrElse("not bucketed")
throw new AnalysisException(
s"""
|Specified bucketing does not match the existing table $tableName.
gatorsmile (Member) commented:

Nit: The same here.


      if (existingTable.provider.get == DDLUtils.HIVE_PROVIDER) {
        throw new AnalysisException(s"Saving data in the Hive serde table $tableName is " +
          s"not supported yet. Please use the insertInto() API as an alternative.")
gatorsmile (Member) commented:

Nit: the string interpolation is not needed (the second string contains no interpolated variables, so its s prefix can be dropped).
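
As a usage note, a hedged sketch of the alternative the error message points to (table name hypothetical):

    // saveAsTable in Append mode is rejected for Hive serde tables;
    // insertInto writes into the existing table definition instead.
    df.write.mode(SaveMode.Append).insertInto("hive_serde_tab")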

throw new AnalysisException(s"Saving data in the Hive serde table $tableName is " +
s"not supported yet. Please use the insertInto() API as an alternative.")
}

// Check if the specified data source match the data source of the existing table.
gatorsmile (Member) commented Dec 19, 2016:

Now the checking logic is split across two places for CTAS of data source tables using Append mode. Maybe we can improve the comment to explain that AnalyzeCreateTable verifies the consistency between the user-specified table schema/definition and the SELECT query, while here we verify the consistency between the user-specified table schema/definition and the existing table's, and between the existing table's schema/definition and the SELECT query.
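
A possible wording for that comment, sketched from the suggestion above (not the text actually merged):

    // AnalyzeCreateTable checks the consistency between the user-specified table
    // definition and the SELECT query. Here we check the user-specified table
    // definition against the existing table's, and the existing table's
    // definition against the SELECT query.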

gatorsmile (Member) commented:
LGTM except a few minor comments.


SparkQA commented Dec 19, 2016

Test build #70331 has finished for PR 16313 at commit 32857e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:
LGTM

gatorsmile (Member) commented:
retest this please


SparkQA commented Dec 20, 2016

Test build #70392 has finished for PR 16313 at commit 32857e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:
Thanks! Merging to master/2.1.

asfgit closed this in f923c84 on Dec 20, 2016
asfgit pushed a commit that referenced this pull request Jan 20, 2017
…ing when append data to an existing table


Author: Wenchen Fan <[email protected]>

Closes #16313 from cloud-fan/bug1.

(cherry picked from commit f923c84)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan (Contributor, Author) commented:

Actually this PR was not backported to 2.1; I've backported it now.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ing when append data to an existing table


Author: Wenchen Fan <[email protected]>

Closes apache#16313 from cloud-fan/bug1.