Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables #16296

Closed
wants to merge 5 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Dec 15, 2016

What changes were proposed in this pull request?

Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source.

Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details.

TODO(for follow-up PRs):

  1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later.
  2. SHOW CREATE TABLE should be updated to use the new syntax.
  3. we should decide if we wanna change the behavior of SET LOCATION.

How was this patch tested?

new tests

@cloud-fan
Copy link
Contributor Author

cc @yhuai

@SparkQA
Copy link

SparkQA commented Dec 15, 2016

Test build #70195 has finished for PR 16296 at commit 234d935.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

bucketSpec? (AS? query)? #createTableUsing
bucketSpec?
(TBLPROPERTIES properties=tablePropertyList)?
(COMMENT comment=STRING)?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to keep the same order? For example, moving (COMMENT comment=STRING)? before (PARTITIONED BY partitionColumnNames=identifierList)??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more natural to write the table definition first, then the comment, i.e. important things first.

@cloud-fan cloud-fan changed the title [SPARK-18885][SQL][WIP] unify CREATE TABLE syntax for data source and hive serde tables [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables Dec 19, 2016
@SparkQA
Copy link

SparkQA commented Dec 19, 2016

Test build #70366 has finished for PR 16296 at commit 8dafb9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

@SparkQA
Copy link

SparkQA commented Dec 20, 2016

Test build #70395 has finished for PR 16296 at commit 631edf7.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

@SparkQA
Copy link

SparkQA commented Dec 20, 2016

Test build #70398 has started for PR 16296 at commit 4049645.

@SparkQA
Copy link

SparkQA commented Dec 20, 2016

Test build #70410 has finished for PR 16296 at commit a553366.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

@SparkQA
Copy link

SparkQA commented Dec 21, 2016

Test build #70447 has finished for PR 16296 at commit 7b5f226.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

* USING table_provider
* [OPTIONS table_property_list]
* [PARTITIONED BY (col_name, col_name, ...)]
* [CLUSTERED BY (col_name, col_name, ...)
* [SORTED BY (col_name [ASC|DESC], ...)]
* INTO num_buckets BUCKETS
* ]
* [TBLPROPERTIES (property_name=property_value, ...)]
Copy link
Member

@gatorsmile gatorsmile Dec 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we need an update. In the recent commit, we removed TBLPROPERTIES but added new locationSpec

@gatorsmile
Copy link
Member

Just read the design doc. What is the decision about adding (TBLPROPERTIES tablePropertyList)? in the CREATE TABLE syntax? So far, .g4 file does not have it.

@gatorsmile
Copy link
Member

There is a syntax difference in partition column definition between Hive serde tables and data source tables. In Hive serde tables, the partitioning columns cannot be part of the table schema. Do we need to document the difference, or we can assume users understand this when they convert it?

@gatorsmile
Copy link
Member

We might need another PR for updating the output of SHOW CREATE TABLE, since we recommend users use the new syntax.

@gatorsmile
Copy link
Member

CREATE TEMPORARY TABLE is not supported for all the types of hive serde tables. However, in CREATE TEMPORARY TABLE is allowed for creating data souce tables if AS query is not specified.

@gatorsmile
Copy link
Member

Hive does not allow to use a CTAS statement to create a partitioned table, but we allow it in the Create Data Source table syntax.

@cloud-fan
Copy link
Contributor Author

CREATE TEMPORARY TABLE is not supported for all the types of hive serde tables. However, in CREATE TEMPORARY TABLE is allowed for creating data souce tables if AS query is not specified.

CREATE TEMPORARY TABLE is also not allowed for data source tables, we will convert it to temp view and give an error message. Can we remove it in spark 2.2? cc @yhuai @liancheng

Hive does not allow to use a CTAS statement to create a partitioned table, but we allow it in the Create Data Source table syntax.

We can add an extra check to make sure we don't use CTAS to create partitioned table.

@SparkQA
Copy link

SparkQA commented Dec 30, 2016

Test build #70736 has finished for PR 16296 at commit a1dbf61.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

@cloud-fan
Copy link
Contributor Author

There is a syntax difference in partition column definition between Hive serde tables and data source tables. In Hive serde tables, the partitioning columns cannot be part of the table schema. Do we need to document the difference, or we can assume users understand this when they convert it?

the hive syntax has data schema and partition schema, while the new syntax only has a table schema(logically data schema + partition schema). This syntax difference already exists between data source table syntax and hive table syntax, and users don't need to convert their old SQL statements, the legacy hive syntax still exists.

@SparkQA
Copy link

SparkQA commented Dec 30, 2016

Test build #70742 has finished for PR 16296 at commit 25af2cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

In the original Create Hive Serde Table command, users are allowed to specify the serde properties for ROW FORMAT SERDE. It sounds like the unified Create Table command is missing such a capability.

@cloud-fan
Copy link
Contributor Author

@gatorsmile , OPTIONS is serde properties, I should document this.

val tableType = if (storage.locationUri.isDefined) {

if (location.isDefined && storage.locationUri.isDefined) {
throw new ParseException("Cannot specify LOCATION when there is 'path' in OPTIONS.", ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be more specific at here. These two approaches are the same and we only want users to use one, right?

/**
* Options for the Hive data source.
*/
class HiveOptions(@transient private val parameters: CaseInsensitiveMap) extends Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also mention that DetermineHiveSerde will fill in default values based on the file format.


def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters))

val format = parameters.get(FORMAT).map(_.toLowerCase)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file format?


val serde = parameters.get(SERDE)

for (f <- format if serde.isDefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe using if is easier to read?

case None =>
throw new IllegalArgumentException(s"invalid format: '${options.format.get}'")
}
} else if (options.inputFormat.isDefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use a helper function to know if inputFormat and outputFormat are set? The current version assumes that the reader know the internal of HiveOptions.


val v2 = "CREATE TABLE t (c1 int, c2 int) USING hive CLUSTERED BY (c2) INTO 4 BUCKETS"
val e = intercept[AnalysisException](analyzeCreateTable(v2))
assert(e.message.contains("Cannot create bucketed Hive serde table"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also have a test using both partitioning and bucketing.

assert(table2.storage.serde == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))
checkAnswer(spark.table("t2"), Row(1, "a"))
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also exercise partitioning.

Let's also test a orc's option.

.orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")),
compressed = false,
properties = Map())
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create a function to generate defaultStorage? We are having the duplicate codes in the parser.

@cloud-fan cloud-fan force-pushed the create-table branch 3 times, most recently from 90c8520 to 83ecc24 Compare January 5, 2017 05:41
@SparkQA
Copy link

SparkQA commented Jan 5, 2017

Test build #70905 has finished for PR 16296 at commit 83ecc24.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 5, 2017

Test build #70904 has finished for PR 16296 at commit 91d173d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 5, 2017

Test build #70912 has finished for PR 16296 at commit 09dce41.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 5, 2017

Test build #70916 has finished for PR 16296 at commit 08ec4a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

outputFormat = defaultHiveSerde.flatMap(_.outputFormat)
.orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")),
serde = defaultHiveSerde.flatMap(_.serde)
.orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a little different from what it was. Previously we don't set default serde at parser, but set it during analysis. However, this doesn't make sense and I think we should just provide the default serde at the beginning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this version break any test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, because they are same fundamentally.

@yhuai
Copy link
Contributor

yhuai commented Jan 6, 2017

LGTM.

@yhuai
Copy link
Contributor

yhuai commented Jan 6, 2017

Merged to master.

@asfgit asfgit closed this in cca945b Jan 6, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 9, 2017
… serde tables

## What changes were proposed in this pull request?

Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source.

Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for  details.

TODO(for follow-up PRs):
1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later.
2. `SHOW CREATE TABLE` should be updated to use the new syntax.
3. we should decide if we wanna change the behavior of `SET LOCATION`.

## How was this patch tested?

new tests

Author: Wenchen Fan <[email protected]>

Closes apache#16296 from cloud-fan/create-table.
asfgit pushed a commit that referenced this pull request Jan 10, 2017
…nd Catalog

## What changes were proposed in this pull request?

After unifying the CREATE TABLE syntax in #16296, it's pretty easy to support creating hive table with `DataFrameWriter` and `Catalog` now.

This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and add tests.

## How was this patch tested?

new tests in `HiveDDLSuite`

Author: Wenchen Fan <[email protected]>

Closes #16487 from cloud-fan/hive-table.
ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 17, 2017
## What changes were proposed in this pull request?

In apache#16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path.

This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options.

## How was this patch tested?

new tests in `CatalogSuite`

Author: Wenchen Fan <[email protected]>

Closes apache#16528 from cloud-fan/create-table.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… serde tables

## What changes were proposed in this pull request?

Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source.

Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for  details.

TODO(for follow-up PRs):
1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later.
2. `SHOW CREATE TABLE` should be updated to use the new syntax.
3. we should decide if we wanna change the behavior of `SET LOCATION`.

## How was this patch tested?

new tests

Author: Wenchen Fan <[email protected]>

Closes apache#16296 from cloud-fan/create-table.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…nd Catalog

## What changes were proposed in this pull request?

After unifying the CREATE TABLE syntax in apache#16296, it's pretty easy to support creating hive table with `DataFrameWriter` and `Catalog` now.

This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and add tests.

## How was this patch tested?

new tests in `HiveDDLSuite`

Author: Wenchen Fan <[email protected]>

Closes apache#16487 from cloud-fan/hive-table.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

In apache#16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path.

This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options.

## How was this patch tested?

new tests in `CatalogSuite`

Author: Wenchen Fan <[email protected]>

Closes apache#16528 from cloud-fan/create-table.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…nd Catalog

## What changes were proposed in this pull request?

After unifying the CREATE TABLE syntax in apache#16296, it's pretty easy to support creating hive table with `DataFrameWriter` and `Catalog` now.

This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and add tests.

## How was this patch tested?

new tests in `HiveDDLSuite`

Author: Wenchen Fan <[email protected]>

Closes apache#16487 from cloud-fan/hive-table.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

In apache#16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path.

This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options.

## How was this patch tested?

new tests in `CatalogSuite`

Author: Wenchen Fan <[email protected]>

Closes apache#16528 from cloud-fan/create-table.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants