[SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table #16313

Closed
wants to merge 3 commits into apache:master from cloud-fan:bug1

Conversation

cloud-fan (Contributor) commented Dec 16, 2016

What changes were proposed in this pull request?

When we append data to an existing table with DataFrameWriter.saveAsTable, we do various checks to make sure the appended data is consistent with the existing data.

However, we get the information about the existing table by matching the table relation instead of looking at the table metadata. This is error-prone: for example, we only check the number of columns for HadoopFsRelation, and we forget to check bucketing, etc.

This PR refactors the error checking to look at the metadata of the existing table, and fixes several bugs:

  • SPARK-18899: We forgot to check whether the specified bucketing matches the existing table, which may lead to a problematic table that has different bucketing in different data files (see the sketch after this list).
  • SPARK-18912: We forgot to check the number of columns for non-file-based data source tables.
  • SPARK-18913: We didn't support appending data to a table with special column names.
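
For illustration, a minimal spark-shell sketch of the SPARK-18899 case, assuming a hypothetical table name bucketed_tab (the names are not taken from the PR):

    import org.apache.spark.sql.SaveMode

    val df = spark.range(10).selectExpr("id", "id AS j")

    // Create a table bucketed by `id` into 4 buckets.
    df.write.bucketBy(4, "id").saveAsTable("bucketed_tab")

    // Before this PR, appending with different bucketing (8 buckets) could
    // silently produce data files with inconsistent bucketing; with this PR
    // it fails with an AnalysisException.
    df.write.mode(SaveMode.Append).bucketBy(8, "id").saveAsTable("bucketed_tab")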

How was this patch tested?

A new regression test.

@cloud-fan cloud-fan changed the title [SPARK-18899][SQL] append a bucketed table using DataFrameWriter with mismatched bucketing should fail [SPARK-18899][SQL] append data to a bucketed table with mismatched bucketing should fail Dec 16, 2016
cloud-fan (Contributor, Author) commented:

cc @yhuai @gatorsmile


SparkQA commented Dec 16, 2016

Test build #70257 has finished for PR 16313 at commit 370bdc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-18899][SQL] append data to a bucketed table with mismatched bucketing should fail [SPARK-18899][SPARK-18912][SPARK-18913][SQL] fix error checking logic when append data to an existing table Dec 17, 2016
@cloud-fan cloud-fan changed the title [SPARK-18899][SPARK-18912][SPARK-18913][SQL] fix error checking logic when append data to an existing table [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table Dec 17, 2016

SparkQA commented Dec 17, 2016

Test build #70307 has finished for PR 16313 at commit 7a9bfaa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 17, 2016

Test build #70309 has finished for PR 16313 at commit b1dbd0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -157,39 +156,74 @@ case class CreateDataSourceTableAsSelectCommand(
        // Since the table already exists and the save mode is Ignore, we will just return.
        return Seq.empty[Row]
      case SaveMode.Append =>
        val existingTable = sessionState.catalog.getTableMetadata(tableIdentWithDB)
        if (existingTable.tableType == CatalogTableType.VIEW) {
          throw new AnalysisException("Saving data into a view is not allowed.")
gatorsmile (Member) commented:

We already have an assert at the beginning of this function. We need to improve the error message there. That can cover more cases.

So far, the error message is pretty confusing.

assertion failed
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:156)

We can add a test case for it.

        val df = spark.range(1, 10).toDF("id1")
        df.write.saveAsTable("tab1")
        spark.sql("create view view1 as select * from tab1")
        df.write.mode(SaveMode.Append).format("parquet").saveAsTable("view1")

cloud-fan (Contributor, Author) replied:

I'll remove this check as it's unreachable now, and fix the error message in another PR.

@@ -157,39 +156,74 @@ case class CreateDataSourceTableAsSelectCommand(
        // Since the table already exists and the save mode is Ignore, we will just return.
        return Seq.empty[Row]
      case SaveMode.Append =>
        val existingTable = sessionState.catalog.getTableMetadata(tableIdentWithDB)
gatorsmile (Member) commented:

Is it possible to directly use the input parameter table?

cloud-fan (Contributor, Author) replied: [comment body not captured]

gatorsmile (Member) replied:

uh, I see. Will do the final review today.

@@ -133,6 +133,16 @@ case class BucketSpec(
  if (numBuckets <= 0) {
    throw new AnalysisException(s"Expected positive number of buckets, but got `$numBuckets`.")
  }

  override def toString: String = {
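    // (Body elided by the diff view; the following is a plausible sketch of a
    // single-line rendering, not necessarily the exact code merged in this PR.)
    val bucketString = s"bucket columns: [${bucketColumnNames.mkString(", ")}]"
    val sortString = if (sortColumnNames.nonEmpty) {
      s", sort columns: [${sortColumnNames.mkString(", ")}]"
    } else {
      ""
    }
    s"$numBuckets buckets, $bucketString$sortString"
  }
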
gatorsmile (Member) commented:

Since we implement toString here, we can simplify our logic in describeBucketingInfo.

cloud-fan (Contributor, Author) replied:

How? toString returns a single line, while describeBucketingInfo generates three result lines.

gatorsmile (Member) replied:

: ) If we want to keep the existing format (three lines), then we can't do it.
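
To make the contrast concrete, a hedged illustration (values hypothetical, output formats approximated from the discussion rather than copied from Spark):

    // BucketSpec.toString, a single line:
    //   4 buckets, bucket columns: [id], sort columns: [ts]

    // describeBucketingInfo, three separate result lines:
    //   Num Buckets: 4
    //   Bucket Columns: [id]
    //   Sort Columns: [ts]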

@@ -459,7 +459,7 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
test("saveAsTable()/load() - partitioned table - ErrorIfExists") {
Seq.empty[(Int, String)].toDF().createOrReplaceTempView("t")

withTempView("t") {
gatorsmile (Member) commented:

We need withTempView("t") to drop the temp view created at the beginning of this test case, right?

cloud-fan (Contributor, Author) replied:

Yeah, my mistake.


SparkQA commented Dec 18, 2016

Test build #70321 has finished for PR 16313 at commit d5a464d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      if (specifiedPartCols != existingTable.partitionColumnNames) {
        throw new AnalysisException(
          s"""
             |Specified partitioning does not match the existing table $tableName.
gatorsmile (Member) commented:

Nit: grammar issue.

Specified partitioning does not match the existing table $tableName.
->
Specified partitioning does not match that of the existing table $tableName.

Found a reference link in the Sybase Adaptive Server docs.
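
For context, a hedged sketch of the mismatch this message reports (table and column names hypothetical, assuming df has year and month columns):

    // Create a table partitioned by `year`.
    df.write.partitionBy("year").saveAsTable("part_tab")

    // Appending with different partitioning now fails with the
    // "Specified partitioning does not match ..." AnalysisException.
    df.write.mode(SaveMode.Append).partitionBy("month").saveAsTable("part_tab")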

existingTable.bucketSpec.map(_.toString).getOrElse("not bucketed")
throw new AnalysisException(
s"""
|Specified bucketing does not match the existing table $tableName.
gatorsmile (Member) commented:

Nit: The same here.


      if (existingTable.provider.get == DDLUtils.HIVE_PROVIDER) {
        throw new AnalysisException(s"Saving data in the Hive serde table $tableName is " +
          s"not supported yet. Please use the insertInto() API as an alternative.")
gatorsmile (Member) commented:

Nit: the string interpolation is not needed (the second string contains no interpolated variables, so its s prefix can be dropped).
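
As a usage note, a hedged sketch of the alternative the error message points to (table name hypothetical):

    // saveAsTable in Append mode is rejected for Hive serde tables;
    // insertInto writes into the existing table definition instead.
    df.write.mode(SaveMode.Append).insertInto("hive_serde_tab")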

throw new AnalysisException(s"Saving data in the Hive serde table $tableName is " +
s"not supported yet. Please use the insertInto() API as an alternative.")
}

// Check if the specified data source match the data source of the existing table.
gatorsmile (Member) commented Dec 19, 2016:

Now the checking logic is split across two places for CTAS of data source tables using Append mode. Maybe we can improve the comment to explain that AnalyzeCreateTable verifies the consistency between the user-specified table schema/definition and the SELECT query, while here we verify the consistency between the user-specified table schema/definition and the existing table's, and between the existing table's schema/definition and the SELECT query.
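
A possible wording for that comment, sketched from the suggestion above (not the text actually merged):

    // AnalyzeCreateTable checks the consistency between the user-specified table
    // definition and the SELECT query. Here we check the user-specified table
    // definition against the existing table's, and the existing table's
    // definition against the SELECT query.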

gatorsmile (Member) commented:
LGTM except a few minor comments.


SparkQA commented Dec 19, 2016

Test build #70331 has finished for PR 16313 at commit 32857e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:
LGTM

gatorsmile (Member) commented:
retest this please


SparkQA commented Dec 20, 2016

Test build #70392 has finished for PR 16313 at commit 32857e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:
Thanks! Merging to master/2.1.

asfgit closed this in f923c84 on Dec 20, 2016
asfgit pushed a commit that referenced this pull request Jan 20, 2017
…ing when append data to an existing table


Author: Wenchen Fan <[email protected]>

Closes #16313 from cloud-fan/bug1.

(cherry picked from commit f923c84)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan (Contributor, Author) commented:

Actually this PR was not backported to 2.1; I've backported it now.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ing when append data to an existing table


Author: Wenchen Fan <[email protected]>

Closes apache#16313 from cloud-fan/bug1.