
[Feature Request] Hive Metastore compatibility between different systems #1045

Open
zsxwing opened this issue Apr 1, 2022 · 13 comments
Labels
enhancement New feature or request

Comments

@zsxwing
Member

zsxwing commented Apr 1, 2022

Feature request

Overview

Currently, when we create a Delta table in the Hive Metastore using different systems, we store different formats in the Hive Metastore. This causes the following issues:

  • After creating a Delta table in the Hive Metastore using Spark, the table can only be accessed by Spark. For example, using Hive to access this table will fail.
  • After creating a Delta table in the Hive Metastore using Hive, the table can only be accessed by Hive. Using Spark to access this table will fail.

Similar issues happen with Presto and Flink as well. It would be great if we could define a unified format in the Hive Metastore for Delta.

Motivation

If we define a unified format in the Hive Metastore, and all systems (Spark, Hive, Presto, Flink) use the same format, then no matter which system creates a table, it can be accessed by all of them.

@zsxwing zsxwing added the enhancement New feature or request label Apr 1, 2022
@dnskr
Contributor

dnskr commented Apr 5, 2022

Hi @zsxwing,
I use Spark 3.2.1 (Spark Thrift Server) to write Delta tables and Presto 271 to read them. Both Spark and Presto use a shared Hive Metastore 3.1.2. As far as I can tell, I don't have any problems, so could you please elaborate on the feature request?

@zsxwing
Member Author

zsxwing commented Apr 5, 2022

For example, when reading Delta tables using Hive, we need to run the following command to create an external table:

CREATE EXTERNAL TABLE deltaTable(col1 INT, col2 STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '/delta/table/path'

In other words, the Delta connector for Hive requires io.delta.hive.DeltaStorageHandler to be registered in the Hive Metastore.

However, if we use Spark to create a Delta table today, Spark won't write io.delta.hive.DeltaStorageHandler into the Hive Metastore. Then if you use Hive to read such a table, it will fail because Hive doesn't know it needs to use io.delta.hive.DeltaStorageHandler to read it.
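
For comparison, creating the table from Spark typically looks like the following (a sketch; the table name and path simply mirror the Hive example above). This registers the table in the Hive Metastore without any reference to a Hive storage handler:

CREATE TABLE deltaTable (col1 INT, col2 STRING)
USING DELTA
LOCATION '/delta/table/path'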

In addition, if we run the CREATE EXTERNAL TABLE command above in Hive, it will write io.delta.hive.DeltaStorageHandler into the Hive Metastore. When we then use Spark to read such a table, it will throw a ClassNotFoundException because io.delta.hive.DeltaStorageHandler doesn't exist on Spark's classpath.

I tried this before and hit the above issue. Even after we solve it, there may be other incompatibility issues I'm not aware of as well.

@TCGOGOGO

TCGOGOGO commented Feb 6, 2023

Hi @zsxwing, is there any plan for when this feature could be done and released?

@zsxwing
Member Author

zsxwing commented Feb 16, 2023

@TCGOGOGO this is a complicated issue. We haven't finished the entire investigation of how the HMS works across different engines. There is no ETA right now.

@mtthsbrr

Hi @zsxwing,
I use Spark 3.2.1 (Spark Thrift Server) to write Delta tables and Presto 271 to read them. Both Spark and Presto use a shared Hive Metastore 3.1.2. As far as I can tell, I don't have any problems, so could you please elaborate on the feature request?

@dnskr
This is exactly what I'm trying to achieve. Could you share some details about your setup? I would be especially interested in your Spark cluster with all its configs. Right now I'm struggling with the "Spark writing Delta tables via the Hive Metastore in a way that Presto/Trino can read them" part.

@dnskr
Contributor

dnskr commented Apr 1, 2023

@dnskr This is exactly what I'm trying to achieve. Could you share some details about your setup? I would be especially interested in your Spark cluster with all its configs. Right now I'm struggling with the "Spark writing Delta tables via the Hive Metastore in a way that Presto/Trino can read them" part.

@mtthsbrr I'm running Spark Thrift Server in Kubernetes as a Deployment (effectively a single Pod) with the following command:

/opt/spark/sbin/start-thriftserver.sh --name sts --conf spark.driver.host=$(hostname -I) --conf spark.kubernetes.driver.pod.name=$(hostname) && tail -f /opt/spark/logs/*.out

There is a basic config injected into the Spark pods through a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  ...
data:
  spark-defaults.conf: >-
    spark.master                                      k8s://https://kubernetes.default.svc
    spark.kubernetes.namespace                        namespace-to-use
    spark.kubernetes.container.image                  private.registry/spark:v3.3.2-delta2.2.0
    spark.kubernetes.container.image.pullSecrets      private.registry.secret.key.name
    spark.kubernetes.authenticate.serviceAccountName  spark-service-account
    spark.hive.metastore.uris                         thrift://hive-metastore.hive-metastore-namespace.svc.cluster.local:9083
    spark.hive.server2.enable.doAs                    false
    spark.sql.catalog.spark_catalog                   org.apache.spark.sql.delta.catalog.DeltaCatalog
    spark.sql.extensions                              io.delta.sql.DeltaSparkSessionExtension
    # Driver and Executor resources properties
    ...

The custom private.registry/spark:v3.3.2-delta2.2.0 image is the official Spark image plus the delta-core and delta-storage jars.

Catalog definition for Presto (see the Delta Lake Connector documentation for more details):

connector.name=delta
hive.metastore.uri=thrift://hive-metastore.hive-metastore-namespace.svc.cluster.local:9083

Create table query executed in Spark Thrift Server:

CREATE OR REPLACE TABLE my_delta_table
USING DELTA
AS SELECT * FROM my_parquet_table;
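
To read the same table from Presto, a query like the following should work (a sketch; it assumes the catalog file above is named delta.properties and that the table was created in the default schema of the metastore):

-- delta = Presto catalog name, "default" = HMS schema, my_delta_table = the table created above
SELECT * FROM delta."default".my_delta_table LIMIT 10;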

@jinmu0410

Any conclusions about this? I'm having the same problem now:

23/04/07 17:12:42 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table ods.t222 into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

@jinmu0410

@zsxwing

@murggu

murggu commented Apr 21, 2023

@zsxwing I would like to know if this is coming soon. I reproduced the issue as per the doc https://github.com/delta-io/connectors/tree/master/hive:

  • A Delta table created from Hive cannot be queried from Spark
  • A Delta table created from Spark cannot be queried from Hive

Such compute engine (e.g. Spark/Hive) interoperability would be nice, especially in shared HMS scenarios.

@umeshpawar2188

When will this be available to use? One of our biggest customers, who is using Hive, is not able to use Delta data because of this.

We need interoperability between Delta tables created in Spark and Hive.

@caoergou

When will this be available to use? One of our biggest customers, who is using Hive, is not able to use Delta data because of this.

We need interoperability between Delta tables created in Spark and Hive.

I'm running into this bug as well.
I'd appreciate help from anyone who can resolve it.

@alberttwong

Here's an example of getting it to work with Spark SQL, HMS, MinIO S3, and StarRocks: https://github.com/StarRocks/demo/tree/master/documentation-samples/deltalake

@wangchao316

Hi @zsxwing, do we have any plans to implement compatibility across different systems?
