
Athena delta table load it via spark #365

Closed

kkr78 opened this issue Mar 25, 2020 · 3 comments
kkr78 commented Mar 25, 2020

Our EMR cluster is configured to use the Glue Data Catalog as an external Hive metastore, and we have a lot of PySpark scripts written to query against that metastore. We have converted a few datasets to Delta Lake and created tables for them in the Glue Data Catalog. Querying those tables directly through the Spark SQL API fails, but the same data reads fine through the Delta API. Is there any way to make this work?

spark.sql("SELECT * FROM mydb.mytable").show(10)

Caused by: java.lang.RuntimeException: s3://my-buclet/db/schmea/deltatable/_symlink_format_manifest/manifest is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]
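For context, the bytes [117, 101, 116, 10] decode to the ASCII text "uet\n", the tail of a ".parquet" path plus a newline: the manifest is a plain-text file listing Parquet paths, not a Parquet file itself, which is why Spark's Parquet reader rejects it. The workaround the reporter alludes to is reading the Delta table by path rather than through the metastore; a minimal sketch, assuming the same S3 path as in the error above:

# Reading the Delta table directly by path works even when the
# metastore-defined table cannot be queried through spark.sql()
df = spark.read.format("delta").load("s3://my-buclet/db/schmea/deltatable")
df.show(10)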

tdas (Contributor) commented Mar 25, 2020

Tables defined in metastores will not be supported until Apache Spark 3.0 and Delta Lake 0.7.0. See #85
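For reference, once on Spark 3.0+ and Delta Lake 0.7.0+, the Delta documentation requires two session settings before spark.sql() can resolve metastore-defined Delta tables; a minimal sketch:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Both settings are documented as required by Delta Lake 0.7.0+
    # for SQL access to metastore-backed Delta tables
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("SELECT * FROM mydb.mytable").show(10)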

cabral1888 commented
Today, I am using Spark 3.1.1 on AWS EMR integrated with the Glue Data Catalog, together with Delta Lake 1.0.0. I cannot read data using spark.sql("SELECT * FROM mydb.mytable").show(10).

I think the issue still exists on the versions mentioned above. Any ideas?
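A setup like this may also need the EMR-specific metastore routing in addition to the two Delta settings shown above; a hedged sketch, assuming the Glue client factory that EMR documents for Glue Data Catalog integration is on the classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # EMR + Glue Data Catalog integration (assumption: EMR ships this factory class)
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)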

tdas pushed a commit to tdas/delta that referenced this issue on May 31, 2023:
…ersion - bug fix and more tests (delta-io#365)
rohitvarma01 commented Aug 21, 2023

Hi all,

I'm using Spark version 3.2.0 along with Delta Lake version 1.2.1. I've established connectivity between Spark and the Hive metastore. Additionally, I've connected Presto to Hive using a connector. Through Spark SQL, I've created a table in Hive based on a Delta table stored in an S3 location. I can see this newly created table in both the Hive metadata (specifically in the TBLS table) and in Presto.

However, when I attempt to read this table using either Spark SQL or Presto, I encounter the following error:

java.lang.RuntimeException: s3a://db-postgres-data/enc_bank_account_information/_symlink_format_manifest/manifest is not a Parquet file. Expected magic number at tail, but found [117, 101, 116, 10].

Below is my Spark conf:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("presto with Delta Lake")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.jars", "/usr/spark-3.1.2/jars/spark-hive_2.12-3.1.2.jar,/usr/spark-3.1.2/jars/hive-metastore-3.1.3.jar,/usr/spark-3.1.2/jars/postgresql-42.2.27.jre7.jar,/usr/spark-3.1.2/jars/hadoop-aws-3.2.2.jar,/usr/spark-3.1.2/jars/hadoop-common-3.3.6.jar,/usr/spark-3.1.2/jars/aws-java-sdk-bundle-1.11.563.jar,/usr/spark-3.1.2/jars/aws-java-sdk-s3-1.11.563.jar")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.jars.packages", spark_jars_packages)
    .config("spark.executor.memory", "8g")
    .config("spark.worker.memory", "8g")
    .config("spark.sql.warehouse.dir", "work/spark-warehouse")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("hive.metastore.uris", hive_metastore)
    .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "true")
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .config("spark.history.fs.logDirectory", "s3a://storage-for-spark-logs/")
    .config("spark.sql.catalogImplementation", "hive")
    .master(master)
    .enableHiveSupport()
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")

# Keys set directly on hadoopConfiguration() must not carry the "spark.hadoop." prefix
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
hadoop_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.access.key", aws_access_key_id)
hadoop_conf.set("fs.s3a.secret.key", aws_secret_access_key)
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3-eu-north-1.amazonaws.com")

Below is the code for creating the Hive table:
spark.sql("""
CREATE TABLE IF NOT EXISTS bank_info (
__v double,
_id string,
user_id string,
createdAt string,
updatedAt string,
ifsc_code string,
account_number_mac string,
account_number_encryptedData string,
_airbyte_ab_id string,
_airbyte_emitted_at timestamp,
_airbyte_normalized_at timestamp,
_airbyte_bank_account_information_hashid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3a://db-postgres-data/enc_bank_account_information/_symlink_format_manifest'
""")

Please help me resolve this issue.
