[SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables #16296

cloud-fan · 2016-12-15T17:07:21Z

What changes were proposed in this pull request?

Today we have different syntaxes for creating data source tables and Hive serde tables. We should unify them to avoid confusing users and to take a step toward making Hive a data source.

Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details.

TODO (for follow-up PRs):

1. TBLPROPERTIES is not added to the new syntax; we should decide whether to add it later.
2. SHOW CREATE TABLE should be updated to use the new syntax.
3. We should decide whether to change the behavior of SET LOCATION.

How was this patch tested?

new tests

cloud-fan · 2016-12-15T17:08:57Z

cc @yhuai

SparkQA · 2016-12-15T17:14:13Z

Test build #70195 has finished for PR 16296 at commit 234d935.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

gatorsmile · 2016-12-15T17:14:48Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

-        bucketSpec? (AS? query)?                                       #createTableUsing
+        bucketSpec?
+        (TBLPROPERTIES properties=tablePropertyList)?
+        (COMMENT comment=STRING)?


Do we need to keep the same order? For example, moving (COMMENT comment=STRING)? before (PARTITIONED BY partitionColumnNames=identifierList)??

I think it's more natural to write the table definition first, then the comment, i.e. important things first.

SparkQA · 2016-12-19T19:22:59Z

Test build #70366 has finished for PR 16296 at commit 8dafb9d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

SparkQA · 2016-12-20T06:13:20Z

Test build #70395 has finished for PR 16296 at commit 631edf7.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

SparkQA · 2016-12-20T07:52:37Z

Test build #70398 has started for PR 16296 at commit 4049645.

SparkQA · 2016-12-20T15:55:32Z

Test build #70410 has finished for PR 16296 at commit a553366.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

SparkQA · 2016-12-21T04:52:50Z

Test build #70447 has finished for PR 16296 at commit 7b5f226.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

gatorsmile · 2016-12-25T07:38:26Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

   *   USING table_provider
   *   [OPTIONS table_property_list]
   *   [PARTITIONED BY (col_name, col_name, ...)]
   *   [CLUSTERED BY (col_name, col_name, ...)
   *    [SORTED BY (col_name [ASC|DESC], ...)]
   *    INTO num_buckets BUCKETS
   *   ]
+   *   [TBLPROPERTIES (property_name=property_value, ...)]


Here, we need an update. In the recent commit, we removed TBLPROPERTIES but added a new locationSpec.

gatorsmile · 2016-12-25T08:08:31Z

Just read the design doc. What is the decision about adding (TBLPROPERTIES tablePropertyList)? in the CREATE TABLE syntax? So far, the .g4 file does not have it.

gatorsmile · 2016-12-25T08:21:27Z

There is a syntax difference in partition column definition between Hive serde tables and data source tables. In Hive serde tables, the partitioning columns cannot be part of the table schema. Do we need to document the difference, or can we assume users understand this when they convert it?

gatorsmile · 2016-12-25T08:21:37Z

We might need another PR for updating the output of SHOW CREATE TABLE, since we recommend users use the new syntax.

gatorsmile · 2016-12-25T08:37:06Z

CREATE TEMPORARY TABLE is not supported for all the types of hive serde tables. However, CREATE TEMPORARY TABLE is allowed for creating data source tables if AS query is not specified.

gatorsmile · 2016-12-25T08:40:58Z

Hive does not allow using a CTAS statement to create a partitioned table, but we allow it in the Create Data Source table syntax.

cloud-fan · 2016-12-26T10:51:17Z

> CREATE TEMPORARY TABLE is not supported for all the types of hive serde tables. However, CREATE TEMPORARY TABLE is allowed for creating data source tables if AS query is not specified.

CREATE TEMPORARY TABLE is also not allowed for data source tables; we will convert it to a temp view and give an error message. Can we remove it in Spark 2.2? cc @yhuai @liancheng

> Hive does not allow using a CTAS statement to create a partitioned table, but we allow it in the Create Data Source table syntax.

We can add an extra check to make sure we don't use CTAS to create a partitioned table.

SparkQA · 2016-12-30T06:03:07Z

Test build #70736 has finished for PR 16296 at commit a1dbf61.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan]

cloud-fan · 2016-12-30T08:45:15Z

> There is a syntax difference in partition column definition between Hive serde tables and data source tables. In Hive serde tables, the partitioning columns cannot be part of the table schema. Do we need to document the difference, or can we assume users understand this when they convert it?

The Hive syntax has a data schema and a partition schema, while the new syntax has only a table schema (logically, data schema + partition schema). This difference already exists between the data source table syntax and the Hive table syntax. Users don't need to convert their old SQL statements, because the legacy Hive syntax still exists.
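To make the "table schema = data schema + partition schema" relationship concrete, here is a minimal, Spark-free sketch of how a single table schema could be split using the PARTITIONED BY column names. `SchemaSplit`, `Column`, and `split` are hypothetical names introduced for illustration only, and the exact (case-sensitive) name matching is a simplifying assumption; this is not the code from the PR.

```scala
// Illustrative sketch only: SchemaSplit, Column and split are hypothetical names.
object SchemaSplit {
  final case class Column(name: String, dataType: String)

  /**
   * Splits one table schema into (dataSchema, partitionSchema).
   * Partition columns are returned in the order given by PARTITIONED BY;
   * data columns keep their order from the table schema.
   * Assumption: names are matched exactly (case-sensitively).
   */
  def split(
      tableSchema: Seq[Column],
      partitionColumnNames: Seq[String]): (Seq[Column], Seq[Column]) = {
    val partitionSchema =
      partitionColumnNames.flatMap(n => tableSchema.find(_.name == n))
    val dataSchema =
      tableSchema.filterNot(c => partitionColumnNames.contains(c.name))
    (dataSchema, partitionSchema)
  }

  def main(args: Array[String]): Unit = {
    val schema = Seq(Column("id", "int"), Column("day", "string"), Column("v", "int"))
    println(split(schema, Seq("day")))
  }
}
```

Under this reading, the legacy Hive syntax corresponds to supplying the two halves separately, while the new syntax supplies the combined schema plus the partition column names.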

SparkQA · 2016-12-30T11:09:21Z

Test build #70742 has finished for PR 16296 at commit 25af2cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-12-31T08:03:04Z

In the original Create Hive Serde Table command, users are allowed to specify the serde properties for ROW FORMAT SERDE. It sounds like the unified Create Table command is missing such a capability.

cloud-fan · 2016-12-31T11:13:03Z

@gatorsmile, the OPTIONS are the serde properties; I should document this.

yhuai · 2017-01-05T01:05:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

-    val tableType = if (storage.locationUri.isDefined) {
+
+    if (location.isDefined && storage.locationUri.isDefined) {
+      throw new ParseException("Cannot specify LOCATION when there is 'path' in OPTIONS.", ctx)


Let's be more specific here. These two approaches are the same and we only want users to use one, right?
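A minimal, Spark-free sketch of the check being discussed, assuming that a LOCATION clause and a 'path' entry in OPTIONS are two spellings of the same custom table path. `LocationCheck` and `resolveLocation` are hypothetical names, the error wording is only one possible "more specific" message, and the lookup uses the exact key "path" rather than a case-insensitive map.

```scala
// Illustrative sketch only: LocationCheck and resolveLocation are hypothetical names.
object LocationCheck {
  /**
   * Returns the single custom table path, if any, or an error message when
   * both LOCATION and 'path' in OPTIONS are given.
   * Assumption: the option key is looked up as the exact string "path".
   */
  def resolveLocation(
      locationClause: Option[String],
      options: Map[String, String]): Either[String, Option[String]] = {
    val pathOption = options.get("path")
    (locationClause, pathOption) match {
      case (Some(_), Some(_)) =>
        // Explain why the two cannot be combined, not just that they cannot.
        Left("LOCATION and 'path' in OPTIONS both specify the custom table path; " +
          "please specify only one of them.")
      case _ =>
        Right(locationClause.orElse(pathOption))
    }
  }

  def main(args: Array[String]): Unit = {
    println(resolveLocation(Some("/data/t"), Map("path" -> "/other/t")))
    println(resolveLocation(None, Map("path" -> "/other/t")))
  }
}
```

Returning an `Either` keeps the sketch free of parser types; in the diff above the same condition raises a ParseException instead.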

yhuai · 2017-01-05T01:16:44Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala

+/**
+ * Options for the Hive data source.
+ */
+class HiveOptions(@transient private val parameters: CaseInsensitiveMap) extends Serializable {


Let's also mention that DetermineHiveSerde will fill in default values based on the file format.

yhuai · 2017-01-05T01:19:16Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala

+
+  def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters))
+
+  val format = parameters.get(FORMAT).map(_.toLowerCase)


file format?

yhuai · 2017-01-05T01:22:16Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala

+
+  val serde = parameters.get(SERDE)
+
+  for (f <- format if serde.isDefined) {


maybe using if is easier to read?
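For readers less familiar with the idiom, the sketch below contrasts a guarded `for` over an `Option` with a plain `if`. The loop body is not visible in the diff hunk, so the sketch assumes it rejects the combination of a file format and a serde; `FormatSerdeCheck`, `conflict`, and the error text are hypothetical.

```scala
// Illustrative sketch only: the real loop body is not shown in the diff hunk.
object FormatSerdeCheck {
  // Hypothetical stand-in for whatever the loop body does.
  private def conflict(f: String): Nothing =
    throw new IllegalArgumentException(
      s"Cannot specify both a file format ('$f') and a serde")

  // Guarded for over an Option: the body runs only when `format` is defined
  // and the guard `serde.isDefined` holds.
  def checkWithFor(format: Option[String], serde: Option[String]): Unit =
    for (f <- format if serde.isDefined) conflict(f)

  // The same condition written as an explicit if.
  def checkWithIf(format: Option[String], serde: Option[String]): Unit =
    if (format.isDefined && serde.isDefined) conflict(format.get)

  def main(args: Array[String]): Unit = {
    checkWithFor(Some("orc"), None) // no conflict, returns normally
    checkWithIf(None, Some("some.Serde")) // no conflict, returns normally
    println("no conflict detected")
  }
}
```

Both forms behave identically; the `if` version states the two-part condition directly, which is the readability point raised in the review.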

yhuai · 2017-01-05T01:25:12Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala

+          case None =>
+            throw new IllegalArgumentException(s"invalid format: '${options.format.get}'")
+        }
+      } else if (options.inputFormat.isDefined) {


Maybe we should use a helper function to know if inputFormat and outputFormat are set? The current version assumes that the reader knows the internals of HiveOptions.
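One way such a helper could look, as a sketch against a simplified stand-in for the options class. `HiveOptionsSketch` and `hasInputOutputFormat` are hypothetical names, and the sketch assumes the helper should require both formats to be set; it does not reproduce the class from the PR.

```scala
// Illustrative sketch only: a simplified, hypothetical stand-in for HiveOptions.
final case class HiveOptionsSketch(
    format: Option[String] = None,
    inputFormat: Option[String] = None,
    outputFormat: Option[String] = None,
    serde: Option[String] = None) {

  /** True when the user spelled out both the input and the output format. */
  def hasInputOutputFormat: Boolean =
    inputFormat.isDefined && outputFormat.isDefined
}
```

A caller can then branch on `options.hasInputOutputFormat` without needing to know which fields back the answer.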

yhuai · 2017-01-05T01:27:13Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala

+
+    val v2 = "CREATE TABLE t (c1 int, c2 int) USING hive CLUSTERED BY (c2) INTO 4 BUCKETS"
+    val e = intercept[AnalysisException](analyzeCreateTable(v2))
+    assert(e.message.contains("Cannot create bucketed Hive serde table"))


Let's also have a test using both partitioning and bucketing.

yhuai · 2017-01-05T01:30:04Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

+      assert(table2.storage.serde == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))
+      checkAnswer(spark.table("t2"), Row(1, "a"))
+    }
+  }


Let's also exercise partitioning.

Let's also test an ORC option.

gatorsmile · 2017-01-05T01:34:36Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala

+            .orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")),
+          compressed = false,
+          properties = Map())
+      }


Can we create a function to generate defaultStorage? We have duplicate code in the parser.
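A sketch of what a shared default-storage function might look like, so the fallback values live in one place. `DefaultStorage`, `StorageFormat`, and `defaultStorage` are hypothetical, Spark-free names. The output format and serde fallbacks are the class names quoted in this thread; the `TextInputFormat` fallback for the input format is an assumption added here for completeness.

```scala
// Illustrative sketch only: DefaultStorage, StorageFormat and defaultStorage
// are hypothetical names, not the types used in the PR.
object DefaultStorage {
  final case class StorageFormat(
      inputFormat: Option[String],
      outputFormat: Option[String],
      serde: Option[String])

  /**
   * Fills in text-format defaults for anything the configured default
   * serde information leaves unset.
   */
  def defaultStorage(configured: Option[StorageFormat]): StorageFormat =
    StorageFormat(
      // Assumption: TextInputFormat is used as the input format fallback.
      inputFormat = configured.flatMap(_.inputFormat)
        .orElse(Some("org.apache.hadoop.mapred.TextInputFormat")),
      outputFormat = configured.flatMap(_.outputFormat)
        .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")),
      serde = configured.flatMap(_.serde)
        .orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")))

  def main(args: Array[String]): Unit =
    println(defaultStorage(None))
}
```

With a single function like this, the parser and the analysis rule could both call it instead of repeating the `orElse` chains.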

SparkQA · 2017-01-05T07:09:32Z

Test build #70905 has finished for PR 16296 at commit 83ecc24.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-05T07:28:46Z

Test build #70904 has finished for PR 16296 at commit 91d173d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-05T09:45:52Z

Test build #70912 has finished for PR 16296 at commit 09dce41.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-05T12:24:09Z

Test build #70916 has finished for PR 16296 at commit 08ec4a7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-01-05T12:32:47Z

sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala

+      outputFormat = defaultHiveSerde.flatMap(_.outputFormat)
+        .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")),
+      serde = defaultHiveSerde.flatMap(_.serde)
+        .orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")))


It's a little different from what it was. Previously we didn't set the default serde in the parser, but set it during analysis. However, that doesn't make sense, and I think we should just provide the default serde from the beginning.

Does this version break any test?

No, because they are fundamentally the same.

yhuai · 2017-01-06T01:39:38Z

LGTM.

yhuai · 2017-01-06T01:41:15Z

Merged to master.

… serde tables ## What changes were proposed in this pull request? Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source. Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details. TODO(for follow-up PRs): 1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later. 2. `SHOW CREATE TABLE` should be updated to use the new syntax. 3. we should decide if we wanna change the behavior of `SET LOCATION`. ## How was this patch tested? new tests Author: Wenchen Fan <[email protected]> Closes apache#16296 from cloud-fan/create-table.

…nd Catalog ## What changes were proposed in this pull request? After unifying the CREATE TABLE syntax in #16296, it's pretty easy to support creating hive table with `DataFrameWriter` and `Catalog` now. This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and add tests. ## How was this patch tested? new tests in `HiveDDLSuite` Author: Wenchen Fan <[email protected]> Closes #16487 from cloud-fan/hive-table.

## What changes were proposed in this pull request? In apache#16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path. This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options. ## How was this patch tested? new tests in `CatalogSuite` Author: Wenchen Fan <[email protected]> Closes apache#16528 from cloud-fan/create-table.

gatorsmile reviewed Dec 15, 2016

cloud-fan changed the title ~~[SPARK-18885][SQL][WIP] unify CREATE TABLE syntax for data source and hive serde tables~~ [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables Dec 19, 2016

cloud-fan force-pushed the create-table branch from 234d935 to 8dafb9d Compare December 19, 2016 17:52

cloud-fan force-pushed the create-table branch from 8dafb9d to 631edf7 Compare December 20, 2016 04:04

cloud-fan force-pushed the create-table branch from 631edf7 to 4049645 Compare December 20, 2016 07:50

cloud-fan force-pushed the create-table branch from 4049645 to a553366 Compare December 20, 2016 14:23

cloud-fan force-pushed the create-table branch from a553366 to 7b5f226 Compare December 21, 2016 02:03

gatorsmile reviewed Dec 25, 2016

unify CREATE TABLE syntax for data source and hive serde tables

a1dbf61

cloud-fan force-pushed the create-table branch from 7b5f226 to a1dbf61 Compare December 30, 2016 04:26

address comments

25af2cb

Merge remote-tracking branch 'origin/master' into create-table

ec4ab37

yhuai reviewed Jan 5, 2017

gatorsmile reviewed Jan 5, 2017

cloud-fan force-pushed the create-table branch 3 times, most recently from 90c8520 to 83ecc24 Compare January 5, 2017 05:41

cloud-fan force-pushed the create-table branch from 83ecc24 to 09dce41 Compare January 5, 2017 08:14

address comments

08ec4a7

cloud-fan force-pushed the create-table branch from 09dce41 to 08ec4a7 Compare January 5, 2017 09:54

cloud-fan commented Jan 5, 2017

asfgit closed this in cca945b Jan 6, 2017

cloud-fan mentioned this pull request Jan 6, 2017

[SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog #16487

Closed

