
CreateDataSourceTableCommand: It is not recommended to create a table with overlapped data and partition columns #425

Closed
emavgl opened this issue May 15, 2020 · 1 comment


emavgl commented May 15, 2020

Version:

  • pyspark 2.4.5
  • delta 0.5

I get the following warning message while trying to create a Delta table from an existing location.

20/05/13 15:25:07 WARN CreateDataSourceTableCommand: It is not recommended to create a table with overlapped data and partition columns, as Spark cannot store a valid table schema and has to infer it at runtime, which hurts performance. Please check your data files and remove the partition columns in it.

Here is the code to reproduce the issue:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName('Test') \
    .config('spark.jars.packages', 'io.delta:delta-core_2.11:0.5.0') \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .getOrCreate()

data = [
    {"entityType": "goal_message", "arrivalHour": 2020051417, "value": 2},
    {"entityType": "goal_message", "arrivalHour": 2020051415, "value": 1},
]
df = spark.createDataFrame(Row(**x) for x in data)
df.write.format('delta').mode('append').partitionBy('arrivalHour').save("/tmp/table.delta")
spark.sql("CREATE TABLE example2 USING DELTA LOCATION '/tmp/table.delta'")

The CREATE TABLE statement then logs:

20/05/15 14:16:36 WARN CreateDataSourceTableCommand: It is not recommended to create a table with overlapped data and partition columns, as Spark cannot store a valid table schema and has to infer it at runtime, which hurts performance. Please check your data files and remove the partition columns in it.
20/05/15 14:16:36 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider DELTA. Persisting data source table `default`.`example2` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

Am I doing something wrong? Thanks.
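For context on what the warning means: Spark flags columns that appear both in the data files and in the partition spec, because it then cannot persist a fixed table schema and must infer it at runtime. A loose, pure-Python sketch of that check (the helper is hypothetical, not Spark's actual implementation):

```python
def overlapped_columns(data_columns, partition_columns):
    """Return data-file columns that also appear in the partition spec.

    Loosely mirrors the condition behind Spark's CreateDataSourceTableCommand
    warning; Spark compares column names case-insensitively by default.
    """
    partition_set = {c.lower() for c in partition_columns}
    return [c for c in data_columns if c.lower() in partition_set]

# The DataFrame above has columns entityType, arrivalHour, value and is
# partitioned by arrivalHour, so the overlap is non-empty:
print(overlapped_columns(["entityType", "arrivalHour", "value"], ["arrivalHour"]))
# → ['arrivalHour']
```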

liwensun (Contributor) commented May 19, 2020
@emavgl
DDL commands like CREATE TABLE are not properly supported right now. Here is the tracking issue: #85

So I wouldn't use CREATE TABLE right now even if your command was able to run...
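Until CREATE TABLE support landed (tracked in #85), the usual workaround was to skip the metastore entirely and load the Delta table by its storage path. A minimal sketch, assuming pyspark 2.4 with the delta-core package on the classpath; the helper name is mine, not from the thread:

```python
def read_delta_by_path(spark, path):
    """Load a Delta table directly from its storage path.

    Avoids CREATE TABLE (and the resulting warnings) entirely: no metastore
    entry is created, so nothing has to be persisted in a Hive-compatible way.
    """
    return spark.read.format("delta").load(path)

# Usage (requires a live SparkSession configured as in the repro above):
# df = read_delta_by_path(spark, "/tmp/table.delta")
# df.createOrReplaceTempView("example2")  # session-scoped; no metastore DDL
# spark.sql("SELECT * FROM example2").show()
```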

@emavgl emavgl closed this as completed May 19, 2020
tdas pushed a commit to tdas/delta that referenced this issue May 31, 2023