Athena Delta table: load it via Spark #365
Tables defined in metastores will not be supported until Apache Spark 3.0 and Delta Lake 0.7.0. See #85.
Today I am using Spark 3.1.1 on AWS EMR integrated with the Glue Data Catalog, and I cannot read data using `spark.sql("SELECT * FROM mydb.mytable").show(10)`. I am on Delta 1.0.0, so I believe the issue still exists on the versions mentioned above. Any ideas?
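For metastore-defined tables to resolve at all, the Spark session has to be created with Delta's catalog integration enabled. A minimal sketch, using the settings from the Delta Lake documentation; the Glue metastore factory line is an assumption about an EMR setup and applies only there:

```python
# Minimal PySpark session config for reading metastore-defined Delta tables
# (Spark 3.x + Delta Lake 1.x). The two spark.sql.* settings come from the
# Delta Lake docs; the Glue factory class is EMR-specific.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-metastore-read")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Assumption: EMR cluster using Glue as the external Hive metastore.
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SELECT * FROM mydb.mytable").show(10)
```

This is a configuration sketch, not something runnable without a cluster; without the `DeltaCatalog` setting, Spark SQL falls back to reading whatever files the Hive table definition points at, which leads to the manifest errors reported below.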
Hi all, I'm using Spark 3.2.0 with Delta Lake 1.2.1. I've connected Spark to the Hive metastore, and Presto to Hive via its connector. Through Spark SQL I created a table in Hive based on a Delta table stored in an S3 location. I can see this newly created table in both the Hive metadata (specifically in the TBLS table) and in Presto. However, when I attempt to read the table using either Spark SQL or Presto, I encounter the following error:

```
java.lang.RuntimeException: s3a://db-postgres-data/enc_bank_account_information/_symlink_format_manifest/manifest is not a Parquet file. Expected magic number at tail, but found [117, 101, 116, 10]
```

Below is my Spark conf:

```
spark.sparkContext.setLogLevel("WARN")
```

Below is the code for creating the Hive table:

Please help to resolve this issue.
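The error points at a table defined over `_symlink_format_manifest`. That manifest is a plain-text file listing Parquet paths, intended for Presto/Athena via `SymlinkTextInputFormat`; a Parquet reader pointed at it will fail exactly this way. A sketch of the intended split, assuming the table path from the error message and the `DeltaTable` API documented by Delta Lake:

```python
# Sketch: Spark should read the Delta table directly; the symlink manifest
# is only for Presto/Athena. Regenerating it uses Delta's documented API.
from delta.tables import DeltaTable

path = "s3a://db-postgres-data/enc_bank_account_information"

# Spark side: read the Delta table itself, never the manifest table.
df = spark.read.format("delta").load(path)

# Presto/Athena side: (re)generate the text manifest the external table uses.
DeltaTable.forPath(spark, path).generate("symlink_format_manifest")
```

The design implication: the Hive table over the manifest directory should only ever be queried by engines that honor `SymlinkTextInputFormat`; Spark SQL needs a separate Delta-aware table definition (or a direct path read) instead.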
The EMR cluster is configured to use the Glue Data Catalog as an external Hive metastore, and a lot of PySpark scripts are written to query against that metastore. We have converted a few datasets to Delta Lake and created tables for them in the Glue Data Catalog. We hit an issue when querying those tables directly through the Spark SQL API, although the same works with the Delta API. Is there any way to make this work?

```
spark.sql("SELECT * FROM mydb.mytable").show(10)
```

```
Caused by: java.lang.RuntimeException: s3://my-buclet/db/schmea/deltatable/_symlink_format_manifest/manifest is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]
```
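The byte values in the stack trace tell the whole story: a valid Parquet file ends with the four-byte magic `PAR1`, while the bytes Spark actually found decode to the tail of a text line ending in `.parquet`, confirming that Spark read the plain-text manifest file as if it were Parquet. A small standalone check:

```python
# Decode the magic numbers quoted in the error message.
expected = bytes([80, 65, 82, 49])   # what a Parquet file must end with
found = bytes([117, 101, 116, 10])   # what Spark actually read

assert expected.decode("ascii") == "PAR1"
# "uet\n" is the tail of a text line such as ".../part-00000.parquet\n",
# i.e. the last line of the symlink manifest, not Parquet data.
assert found.decode("ascii") == "uet\n"
```

So the Glue table definition points Spark at the `_symlink_format_manifest` directory (meant for Athena/Presto) rather than at the Delta table itself, which is why the Delta API works but `spark.sql` does not.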