
Athena delta table load it via spark #365

Closed

kkr78 opened this issue Mar 25, 2020 · 3 comments
kkr78 commented Mar 25, 2020

Our EMR cluster is configured to use the Glue Data Catalog as an external Hive metastore, and we have a lot of PySpark scripts written to query against that metastore. We have converted a few datasets to Delta Lake and created tables for them in the Glue Data Catalog. Querying those tables directly through the Spark SQL API fails, but the same data reads fine through the Delta API. Is there any way to make this work?

spark.sql("SELECT * FROM mydb.mytable").show(10)

Caused by: java.lang.RuntimeException: s3://my-buclet/db/schmea/deltatable/_symlink_format_manifest/manifest is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]
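For context, the bytes [117, 101, 116, 10] decode to the ASCII text "uet\n", the tail of a ".parquet" path plus a newline: the manifest is a plain-text file listing Parquet paths, not a Parquet file itself, which is why Spark's Parquet reader rejects it. The workaround the reporter alludes to is reading the Delta table by path rather than through the metastore; a minimal sketch, assuming the same S3 path as in the error above:

# Reading the Delta table directly by path works even when the
# metastore-defined table cannot be queried through spark.sql()
df = spark.read.format("delta").load("s3://my-buclet/db/schmea/deltatable")
df.show(10)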

tdas (Contributor) commented Mar 25, 2020

Tables defined in metastores will not be supported until Apache Spark 3.0 and Delta Lake 0.7.0. See #85
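For reference, once on Spark 3.0+ and Delta Lake 0.7.0+, the Delta documentation requires two session settings before spark.sql() can resolve metastore-defined Delta tables; a minimal sketch:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Both settings are documented as required by Delta Lake 0.7.0+
    # for SQL access to metastore-backed Delta tables
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("SELECT * FROM mydb.mytable").show(10)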

cabral1888 commented
Today, I am using Spark 3.1.1 on AWS EMR integrated with the Glue Data Catalog, together with Delta Lake 1.0.0. I cannot read data using spark.sql("SELECT * FROM mydb.mytable").show(10).

I think the issue still exists on the versions mentioned above. Any ideas?
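A setup like this may also need the EMR-specific metastore routing in addition to the two Delta settings shown above; a hedged sketch, assuming the Glue client factory that EMR documents for Glue Data Catalog integration is on the classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # EMR + Glue Data Catalog integration (assumption: EMR ships this factory class)
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)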

tdas pushed a commit to tdas/delta that referenced this issue on May 31, 2023:
…ersion - bug fix and more tests (delta-io#365)
rohitvarma01 commented Aug 21, 2023

Hi all,

I'm using Spark version 3.2.0 along with Delta Lake version 1.2.1. I've established connectivity between Spark and the Hive metastore. Additionally, I've connected Presto to Hive using a connector. Through Spark SQL, I've created a table in Hive based on a Delta table stored in an S3 location. I can see this newly created table in both the Hive metadata (specifically in the TBLS table) and in Presto.

However, when I attempt to read this table using either Spark SQL or Presto, I encounter the following error:

java.lang.RuntimeException: s3a://db-postgres-data/enc_bank_account_information/_symlink_format_manifest/manifest is not a Parquet file. Expected magic number at tail, but found [117, 101, 116, 10].

Below is my Spark conf:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("presto with Delta Lake")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.jars", "/usr/spark-3.1.2/jars/spark-hive_2.12-3.1.2.jar,/usr/spark-3.1.2/jars/hive-metastore-3.1.3.jar,/usr/spark-3.1.2/jars/postgresql-42.2.27.jre7.jar,/usr/spark-3.1.2/jars/hadoop-aws-3.2.2.jar,/usr/spark-3.1.2/jars/hadoop-common-3.3.6.jar,/usr/spark-3.1.2/jars/aws-java-sdk-bundle-1.11.563.jar,/usr/spark-3.1.2/jars/aws-java-sdk-s3-1.11.563.jar")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.jars.packages", spark_jars_packages)
    .config("spark.executor.memory", "8g")
    .config("spark.worker.memory", "8g")
    .config("spark.sql.warehouse.dir", "work/spark-warehouse")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("hive.metastore.uris", hive_metastore)
    .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "true")
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .config("spark.history.fs.logDirectory", "s3a://storage-for-spark-logs/")
    .config("spark.sql.catalogImplementation", "hive")
    .master(master)
    .enableHiveSupport()
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")

# Keys set directly on hadoopConfiguration() must not carry the "spark.hadoop." prefix
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
hadoop_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.access.key", aws_access_key_id)
hadoop_conf.set("fs.s3a.secret.key", aws_secret_access_key)
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3-eu-north-1.amazonaws.com")

Below is the code for creating the Hive table:
spark.sql("""
CREATE TABLE IF NOT EXISTS bank_info (
__v double,
_id string,
user_id string,
createdAt string,
updatedAt string,
ifsc_code string,
account_number_mac string,
account_number_encryptedData string,
_airbyte_ab_id string,
_airbyte_emitted_at timestamp,
_airbyte_normalized_at timestamp,
_airbyte_bank_account_information_hashid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3a://db-postgres-data/enc_bank_account_information/_symlink_format_manifest'
""")

Please help me resolve this issue.
