Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for registering Delta tables in the HiveMetastore #85

Closed
marmbrus opened this issue Jul 8, 2019 · 32 comments
Closed

Support for registering Delta tables in the HiveMetastore #85

marmbrus opened this issue Jul 8, 2019 · 32 comments
Labels
enhancement New feature or request
Milestone

Comments

@marmbrus
Copy link
Contributor

marmbrus commented Jul 8, 2019

While delta tracks its own metadata in the transaction log, the Hive Metastore is still important as it enables users to find tables without knowing the path to the data.

This ticket tracks adding the ability to run CREATE TABLE to create a new metastore table, or to register an existing table in the metastore.

@marmbrus marmbrus added the enhancement New feature or request label Jul 8, 2019
@flvndh
Copy link

flvndh commented Jul 25, 2019

Is there any chance that this would be part of the 0.4.0 milestone?

@tdas
Copy link
Contributor

tdas commented Jul 25, 2019

Hard to say. It all depends on the timing of the Spark 3.0.0 release. We are working with the Spark community to add the necessary Datasource V2 APIs that would allow us to plug into all the DDLs like CREATE TABLE, ALTER TABLE, etc. and make them work with Hive metastore + Delta combined setup. Those new APIs are targetted for Spark 3.0.0.

@yucai
Copy link
Contributor

yucai commented Oct 15, 2019

@marmbrus @tdas

    spark.range(10).write.format("delta").save("/tmp/events")
    spark.sql("create table events using delta location '/tmp/events'")
    spark.sql("select * from events").show(100)

Registering an existing table in the metastore seems work already, so this feature is mainly for creating new table like #177?

@dgreenshtein
Copy link

@marmbrus @tdas

    spark.range(10).write.format("delta").save("/tmp/events")
    spark.sql("create table events using delta location '/tmp/events'")
    spark.sql("select * from events").show(100)

Registering an existing table in the metastore seems work already, so this feature is mainly for creating new table like #177?

@yucai I had exactly the exception as it is appears in #177 when I run your code

@dgreenshtein
Copy link

Is there any workaround to register Delta table in Hive metastore using Spark API?

@yucai
Copy link
Contributor

yucai commented Nov 13, 2019

@dgreenshtein it works for me perfectly. I run them like below:

branch-2.4 git:(branch-2.4) ✗ ./bin/spark-shell --packages io.delta:delta-core_2.11:0.3.0

@dgreenshtein
Copy link

dgreenshtein commented Nov 13, 2019

@dgreenshtein it works for me perfectly. I run them like below:

branch-2.4 git:(branch-2.4) ✗ ./bin/spark-shell --packages io.delta:delta-core_2.11:0.3.0

hm, the only difference is delta.io version. I am running io.delta:delta-core_2.11:0.4.0.

Did not work with 0.3.0 as well.

@MLikeWater
Copy link

@dgreenshtein
In fact, when we use the "create table xxx" sql statement, prompt as follows

WARN hive.HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `default`.`events` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

Just created metastore_db locally, can't access Hive metastore

@CarreauClement
Copy link

CarreauClement commented Jan 28, 2020

Any news on this ?

Also trying to register my delta table to hive metastore without success.

I'm running Spark 2.4.4 as follows :

./spark-shell --packages io.delta:delta-core_2.11:0.5.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension

And trying to run

spark.range(10).write.format("delta").save("/tmp/events")
spark.sql("create table events using delta location '/tmp/events'")
spark.sql("select * from events").show(100)

I'm having the following error :
hive.HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta.

Any idea ?

@mukulmurthy
Copy link
Collaborator

No updates; we can't support this until Spark 3.0 is released. It's on our future roadmap to support once 3.0 comes out.

@CarreauClement
Copy link

Is there any metastore that is currently supported with the 0.5.0 version ? Does Glue or any other work ? Is there a standalone metastore for delta lake ? I can't seem to find it in the documentation and I need to register my tables to perform the auto compaction optimizations.

Thanks

@mukulmurthy
Copy link
Collaborator

If you're running on Databricks (which I assume you mean by running auto-compaction optimizations, which are only available there), Hive and Glue metastores are both supported. See https://docs.databricks.com/delta/delta-batch.html#create-a-table.

This project is for discussion around open-source Delta Lake. If you have any questions about Databricks, feel free to open a support ticket or ask on Slack.

@CarreauClement
Copy link

I'm using the open-source version. Do I have any metastore available ? Is there a list of features only available on the databricks distribution ? Thanks

@mukulmurthy
Copy link
Collaborator

Delta Lake on Databricks has some performance optimizations as a result of being part of the Databricks Runtime; we're aiming for full API compatibility in OSS Delta Lake (though for some things like metastore support that requires changes only coming in Spark 3.0). Auto compaction is only available in Databricks; if you're talking about the Hive-ACID compaction, that won't work with Delta Lake. You can implement your own compaction using something like https://docs.delta.io/latest/best-practices.html#compact-files.

If you have any further questions, please create a new issue.

@abiratsis
Copy link

We just migrated to Databricks Delta from parquet using Hive metastore. So far everything seems to
work fine, when I try to print out the location of the new Delta table using DESCRIBE EXTENDED my_table the location is correct although it is different than the one found in the hiveMetastore database. When I access the hiveMetastore database I can successfully identify the target table (also provider is correctly set to Delta). To retrieve the previous information I am executing a join between sds, dbs, tbls and table_params tables from hiveMetastore db, filtering by table name as shown next:

val sdsDF = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.SDS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()

val tblsDf = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.TBLS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()

val dbsDf = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.DBS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()

val paramsDf = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.TABLE_PARAMS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()

val resDf = sdsDF.join(tblsDf, "SD_ID")
     .join(dbsDf, "DB_ID") 
     .join(paramsDf, "TBL_ID") 
     .where('TBL_NAME.rlike("mytable"))
     .select($"TBL_NAME", $"TBL_TYPE", $"NAME".as("DB_NAME"), $"DB_LOCATION_URI", $"LOCATION".as("TABLE_LOCATION"), $"PARAM_KEY", $"PARAM_VALUE")

All the previous are executed from a databricks notebook.

My question is why I am getting two different locations even if the table name is the same? Where is the correct location for the Delta tables stored if not on hiveMetastore db?

@tdas
Copy link
Contributor

tdas commented Mar 10, 2020

What are the two different locations? can you show the output of resDF?

@abiratsis
Copy link

abiratsis commented Mar 11, 2020

I can't show the actual values although they are completely different S3 paths @tdas none of them is somehow related to the other. i.e root of the other etc. Is there anything I should try? Where should the Delta table location be stored in this case?

@tdas tdas modified the milestones: Future Roadmap, 0.7.0 Mar 11, 2020
@tdas
Copy link
Contributor

tdas commented Mar 11, 2020

Neither of them matches the actual one? Either way, this issue is not related to the root issue of metastore support, so maybe we should make a different issue for this. Though I doubt we can do anything without taking a deeper look. It might better to actually contact databricks support so that we can look at what is going on.

@zsxwing
Copy link
Member

zsxwing commented Mar 11, 2020

One path should look like dbfs:/user/hive/warehouse/XXXXXX-__PLACEHOLDER__. I don't know why Spark needs to write this placeholder path, maybe some codes require something to be there. The other path should be the real table location.

@tdas
Copy link
Contributor

tdas commented Mar 20, 2020

Just a status update on the support for defining Delta-format tables in Hive Metastore. We are going to add support for defining tables and all the associated DDL commands (CREATE, ALTER, DROP, etc.) in Delta Lake 0.7.0 when we will add support for Apache Spark 3.0. Delta Lake 0.7.0 is expected to come out roughly in June/July, whenever the Apache Spark community votes and decides to release 3.0.

@abiratsis
Copy link

abiratsis commented Mar 21, 2020

My excuses for the late reply, I have some updates regarding this one. I tried to execute the next code:

val sdsDF = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.SDS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()

val tblsDf = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.TBLS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()

val dbsDf = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.DBS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()

val paramsDf = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.TABLE_PARAMS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()

val resDf = sdsDF.join(tblsDf, "SD_ID")
     .join(dbsDf, "DB_ID") 
     .join(paramsDf, "TBL_ID") 
     .where('TBL_NAME === "my_table" && 'NAME === "db_production")
     .select($"TBL_NAME", $"TBL_TYPE", $"NAME".as("DB_NAME"), $"DB_LOCATION_URI", $"LOCATION".as("TABLE_LOCATION"), $"PARAM_KEY", $"PARAM_VALUE")

Which is similar to the previous code with the difference that in addition I join with hiveMetastore.TABLE_PARAMS. Here is how the output looks like:

+--------+--------------+-------------+-------------------+--------------------------------------------+---------------------------------+-----------------------------+
|TBL_NAME|TBL_TYPE      |DB_NAME      |DB_LOCATION        |TABLE_LOCATION                              |PARAM_KEY                        |PARAM_VALUE                  |
+--------+--------------+-------------+-------------------+--------------------------------------------+---------------------------------+-----------------------------+
|my_table|EXTERNAL_TABLE|db_production|s3://invalid_db_uri|s3://invalid_db_uri/my_table-__PLACEHOLDER__|EXTERNAL                         |TRUE                         |
|my_table|EXTERNAL_TABLE|db_production|s3://invalid_db_uri|s3://invalid_db_uri/my_table-__PLACEHOLDER__|spark.sql.create.version         |2.4.3                        |
|my_table|EXTERNAL_TABLE|db_production|s3://invalid_db_uri|s3://invalid_db_uri/my_table-__PLACEHOLDER__|spark.sql.sources.provider       |DELTA                        |
|my_table|EXTERNAL_TABLE|db_production|s3://invalid_db_uri|s3://invalid_db_uri/my_table-__PLACEHOLDER__|spark.sql.sources.schema.numParts|1                            |
|my_table|EXTERNAL_TABLE|db_production|s3://invalid_db_uri|s3://invalid_db_uri/my_table-__PLACEHOLDER__|spark.sql.sources.schema.part.0  |{"type":"struct","fields":[]}|
|my_table|EXTERNAL_TABLE|db_production|s3://invalid_db_uri|s3://invalid_db_uri/my_table-__PLACEHOLDER__|transient_lastDdlTime            |1582204169                   |
+--------+--------------+-------------+-------------------+--------------------------------------------+---------------------------------+-----------------------------+

As you can see DB_LOCATION and TABLE_LOCATION have invalid values and they don't correspond to the actual S3 path. @zsxwing as you can see the path is still S3 not dbfs. @tdas when I do DESCRIBE EXTENDED my_table the path is the correct Delta path. Actually, this is one of my question why I don't see the same/valid path in the hiveMetastore?

Thank you both

@lnmohankumar
Copy link

Is there any tentative timeline to have the functionality to creat the DeltaTable API for manged tables?

@tdas
Copy link
Contributor

tdas commented Jun 8, 2020

We are hoping to make a 0.7.0-preview release on Spark 3.0/RCx in the next couple of weeks. Most of the code changes for metastore support have merged into master. Hence, I am closing this ticket.

@tdas tdas closed this as completed Jun 8, 2020
@tdas tdas unpinned this issue Jun 26, 2020
@BigDataDaddy
Copy link

Was this issue closed on the hope of some code getting integrated into a Spark 3.0 preview release and not from a verified and tested publicly available general release?

If that really poor release management!!!

Please specify exactly what release combinations have been proven to work.

@tdas
Copy link
Contributor

tdas commented Jul 6, 2020

This has been released in Delta 0.7.0 on Spark 3.0. Please see the attached milestone and the corresponding release notes in https://github.com/delta-io/delta/releases

LantaoJin added a commit to LantaoJin/delta that referenced this issue Mar 12, 2021
@0brstmgns
Copy link

Currently using spark-core_2.12:3.1.2, spark-hive_2.12:3.1.2 and delta-core_2.12:1.0.1.

When deploying the table definitions like

spark.sql("""CREATE TABLE XY USING DELTA LOCATION 'XY'""")

I still get the message:

WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table <table_name> into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

Is this the correct behavior?

When querying the table like below using HiveContext I get a result, so it looks like the table is at least registered in Hive...

spark-shell --packages io.delta:delta-core_2.12:1.0.1 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

val sqlContext = new org.apache.spark.sql.hive.HiveContext(spark.sparkContext)
val result = sqlContext.sql("select count(*) FROM <table_name>")

result.show()

@zsxwing
Copy link
Member

zsxwing commented Sep 22, 2022

WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table <table_name> into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

Please ignore this error. This is just saying you cannot use Hive to query this table. As long as you are using Spark, it should be fine.

@cometta
Copy link

cometta commented Sep 25, 2022

instead of executing this command spark.sql("""CREATE TABLE XY USING DELTA LOCATION 'XY'""") to create table in hive metastore, is there other command that supported when creating table using spark.sql(..) so that i can query with hive to query the table?

@zsxwing
Copy link
Member

zsxwing commented Sep 27, 2022

instead of executing this command spark.sql("""CREATE TABLE XY USING DELTA LOCATION 'XY'""") to create table in hive metastore, is there other command that supported when creating table using spark.sql(..) so that i can query with hive to query the table?

No. This is not supported today. We have an open issue for this in #1045. Please add new comments in this open ticket instead. Thanks!

@danzafar
Copy link

Where is the correct location for the Delta tables stored if not on hiveMetastore db?

To follow up on the side-issue here for future readers, you can get it from the SERDE_PARAMS table in the Hive Metastore, there is an entry path that corresponds to the external location.

@caoergou
Copy link

caoergou commented Sep 25, 2023

WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table <table_name> into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

Please ignore this error. This is just saying you cannot use Hive to query this table. As long as you are using Spark, it should be fine.

It appears that this error should not be disregarded.
We encountered significant traffic issues related to this last Friday.

Here are the details:

  1. We are using Delta as the storage layer and Hive 3.x as the metastore.
  2. We obtained a dataframe with a complex structure as follows:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

val innerStruct = StructType(Array(
  StructField("image", StringType),
  StructField("content", StringType)
))

val outerStruct = StructType(Array(
  StructField("some_detail", innerStruct)
))

val schema = StructType(Array(
  StructField("detail", outerStruct)
))

val data = Seq(
  Row(Row(Row("image_example", "content_text_example")))
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

df.write.format("delta").save("s3://path")

We then created a delta table:

CREATE TABLE test.test_table_name USING delta LOCATION 's3://path'

However, we ended up with a record in the Hive metastore database sds with a Location row like:

hdfs://[our-emr-cluster-id]/xxxxx/yyy

instead of s3://path!
This is a significant issue because when you want to release the current cluster and create a new one, you will encounter a major bug!
You might think that your data and computing resources are separated, but you will find that the metastore still refers to the old, released cluster!
I sincerely hope to receive a response from the community on this issue.

@alberttwong
Copy link

Here's an example of getting it to work with Spark SQL, HMS, MinIO S3 and StarRocks. https://github.com/StarRocks/demo/tree/master/documentation-samples/deltalake

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests