-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables #16296
Conversation
cc @yhuai |
Test build #70195 has finished for PR 16296 at commit
|
bucketSpec? (AS? query)? #createTableUsing | ||
bucketSpec? | ||
(TBLPROPERTIES properties=tablePropertyList)? | ||
(COMMENT comment=STRING)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to keep the same order? For example, moving (COMMENT comment=STRING)?
before (PARTITIONED BY partitionColumnNames=identifierList)?
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more natural to write the table definition first, then the comment, i.e. important things first.
234d935
to
8dafb9d
Compare
Test build #70366 has finished for PR 16296 at commit
|
8dafb9d
to
631edf7
Compare
Test build #70395 has finished for PR 16296 at commit
|
631edf7
to
4049645
Compare
Test build #70398 has started for PR 16296 at commit |
4049645
to
a553366
Compare
Test build #70410 has finished for PR 16296 at commit
|
a553366
to
7b5f226
Compare
Test build #70447 has finished for PR 16296 at commit
|
* USING table_provider | ||
* [OPTIONS table_property_list] | ||
* [PARTITIONED BY (col_name, col_name, ...)] | ||
* [CLUSTERED BY (col_name, col_name, ...) | ||
* [SORTED BY (col_name [ASC|DESC], ...)] | ||
* INTO num_buckets BUCKETS | ||
* ] | ||
* [TBLPROPERTIES (property_name=property_value, ...)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we need an update. In the recent commit, we removed TBLPROPERTIES
but added new locationSpec
Just read the design doc. What is the decision about adding |
There is a syntax difference in partition column definition between Hive serde tables and data source tables. In Hive serde tables, the partitioning columns cannot be part of the table schema. Do we need to document the difference, or we can assume users understand this when they convert it? |
We might need another PR for updating the output of |
|
Hive does not allow to use a CTAS statement to create a partitioned table, but we allow it in the Create Data Source table syntax. |
We can add an extra check to make sure we don't use CTAS to create partitioned table. |
7b5f226
to
a1dbf61
Compare
Test build #70736 has finished for PR 16296 at commit
|
the hive syntax has data schema and partition schema, while the new syntax only has a table schema(logically data schema + partition schema). This syntax difference already exists between data source table syntax and hive table syntax, and users don't need to convert their old SQL statements, the legacy hive syntax still exists. |
Test build #70742 has finished for PR 16296 at commit
|
In the original |
@gatorsmile , |
val tableType = if (storage.locationUri.isDefined) { | ||
|
||
if (location.isDefined && storage.locationUri.isDefined) { | ||
throw new ParseException("Cannot specify LOCATION when there is 'path' in OPTIONS.", ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be more specific at here. These two approaches are the same and we only want users to use one, right?
/** | ||
* Options for the Hive data source. | ||
*/ | ||
class HiveOptions(@transient private val parameters: CaseInsensitiveMap) extends Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also mention that DetermineHiveSerde will fill in default values based on the file format.
|
||
def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters)) | ||
|
||
val format = parameters.get(FORMAT).map(_.toLowerCase) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file format?
|
||
val serde = parameters.get(SERDE) | ||
|
||
for (f <- format if serde.isDefined) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe using if is easier to read?
case None => | ||
throw new IllegalArgumentException(s"invalid format: '${options.format.get}'") | ||
} | ||
} else if (options.inputFormat.isDefined) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should use a helper function to know if inputFormat and outputFormat are set? The current version assumes that the reader know the internal of HiveOptions.
|
||
val v2 = "CREATE TABLE t (c1 int, c2 int) USING hive CLUSTERED BY (c2) INTO 4 BUCKETS" | ||
val e = intercept[AnalysisException](analyzeCreateTable(v2)) | ||
assert(e.message.contains("Cannot create bucketed Hive serde table")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also have a test using both partitioning and bucketing.
assert(table2.storage.serde == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")) | ||
checkAnswer(spark.table("t2"), Row(1, "a")) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also exercise partitioning.
Let's also test a orc's option.
.orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")), | ||
compressed = false, | ||
properties = Map()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we create a function to generate defaultStorage
? We are having the duplicate codes in the parser.
90c8520
to
83ecc24
Compare
Test build #70905 has finished for PR 16296 at commit
|
Test build #70904 has finished for PR 16296 at commit
|
83ecc24
to
09dce41
Compare
Test build #70912 has finished for PR 16296 at commit
|
09dce41
to
08ec4a7
Compare
Test build #70916 has finished for PR 16296 at commit
|
outputFormat = defaultHiveSerde.flatMap(_.outputFormat) | ||
.orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")), | ||
serde = defaultHiveSerde.flatMap(_.serde) | ||
.orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this version break any test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, because they are same fundamentally.
LGTM. |
Merged to master. |
… serde tables ## What changes were proposed in this pull request? Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source. Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details. TODO(for follow-up PRs): 1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later. 2. `SHOW CREATE TABLE` should be updated to use the new syntax. 3. we should decide if we wanna change the behavior of `SET LOCATION`. ## How was this patch tested? new tests Author: Wenchen Fan <[email protected]> Closes apache#16296 from cloud-fan/create-table.
…nd Catalog ## What changes were proposed in this pull request? After unifying the CREATE TABLE syntax in #16296, it's pretty easy to support creating hive table with `DataFrameWriter` and `Catalog` now. This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and add tests. ## How was this patch tested? new tests in `HiveDDLSuite` Author: Wenchen Fan <[email protected]> Closes #16487 from cloud-fan/hive-table.
## What changes were proposed in this pull request? In apache#16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path. This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options. ## How was this patch tested? new tests in `CatalogSuite` Author: Wenchen Fan <[email protected]> Closes apache#16528 from cloud-fan/create-table.
… serde tables ## What changes were proposed in this pull request? Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source. Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details. TODO(for follow-up PRs): 1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later. 2. `SHOW CREATE TABLE` should be updated to use the new syntax. 3. we should decide if we wanna change the behavior of `SET LOCATION`. ## How was this patch tested? new tests Author: Wenchen Fan <[email protected]> Closes apache#16296 from cloud-fan/create-table.
…nd Catalog ## What changes were proposed in this pull request? After unifying the CREATE TABLE syntax in apache#16296, it's pretty easy to support creating hive table with `DataFrameWriter` and `Catalog` now. This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and add tests. ## How was this patch tested? new tests in `HiveDDLSuite` Author: Wenchen Fan <[email protected]> Closes apache#16487 from cloud-fan/hive-table.
## What changes were proposed in this pull request? In apache#16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path. This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options. ## How was this patch tested? new tests in `CatalogSuite` Author: Wenchen Fan <[email protected]> Closes apache#16528 from cloud-fan/create-table.
…nd Catalog ## What changes were proposed in this pull request? After unifying the CREATE TABLE syntax in apache#16296, it's pretty easy to support creating hive table with `DataFrameWriter` and `Catalog` now. This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and add tests. ## How was this patch tested? new tests in `HiveDDLSuite` Author: Wenchen Fan <[email protected]> Closes apache#16487 from cloud-fan/hive-table.
## What changes were proposed in this pull request? In apache#16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path. This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options. ## How was this patch tested? new tests in `CatalogSuite` Author: Wenchen Fan <[email protected]> Closes apache#16528 from cloud-fan/create-table.
What changes were proposed in this pull request?
Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source.
Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details.
TODO(for follow-up PRs):
SHOW CREATE TABLE
should be updated to use the new syntax.SET LOCATION
.How was this patch tested?
new tests