Add flags for Iceberg and Lake Formation and Security Lake as a data source type. #2858
* ``glue.lakeformation.enabled`` determines whether to enable lakeformation for queries. Default value is ``"false"`` if not specified
* ``glue.iceberg.enabled`` determines whether to enable Iceberg for the session. Default value is ``"false"`` if not specified.
* ``glue.lakeformation.enabled`` determines whether to enable Lake Formation for queries when Iceberg is also enabled. If Iceberg is not enabled, then this property has no effect. Default value is ``"false"`` if not specified.
* ``glue.lakeformation.session_tag`` what session tag to use when assuming the data source role. This property is required when both Iceberg and Lake Formation are enabled.
Have we introduced validations for these conditions when creating a Glue data source?
Thanks for catching this! I've added them now.
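For readers following the thread, here is a minimal sketch of what such a validation could look like. It is not the code added in this PR: the class and method names are hypothetical, and only the three property names and their documented defaults come from the diff above. It assumes the only hard error is a missing session tag when both flags are on, since the documentation says the Lake Formation flag simply has no effect without Iceberg.

```java
import java.util.Map;

/** Hypothetical validator for the Glue data source flags; not the PR's implementation. */
class GlueFlagValidator {
  static final String ICEBERG_ENABLED = "glue.iceberg.enabled";
  static final String LAKEFORMATION_ENABLED = "glue.lakeformation.enabled";
  static final String LAKEFORMATION_SESSION_TAG = "glue.lakeformation.session_tag";

  /** Reads a boolean flag, defaulting to "false" when absent, as the documentation states. */
  static boolean flag(Map<String, String> properties, String key) {
    return Boolean.parseBoolean(properties.getOrDefault(key, "false"));
  }

  /** Throws if both flags are enabled but no session tag is supplied. */
  static void validate(Map<String, String> properties) {
    boolean iceberg = flag(properties, ICEBERG_ENABLED);
    boolean lakeFormation = flag(properties, LAKEFORMATION_ENABLED);

    // Lake Formation without Iceberg is documented as a no-op, so it is not rejected here.
    if (iceberg && lakeFormation) {
      String tag = properties.get(LAKEFORMATION_SESSION_TAG);
      if (tag == null || tag.trim().isEmpty()) {
        throw new IllegalArgumentException(
            LAKEFORMATION_SESSION_TAG
                + " is required when both Iceberg and Lake Formation are enabled");
      }
    }
  }
}
```

Whether the real validation also rejects Lake Formation without Iceberg is not stated in this thread, so the sketch leaves that case permissive.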
Failure does not look related to this change.
"glue.auth.role_arn": "role_arn",
"glue.indexstore.opensearch.uri": "http://adsasdf.amazonopensearch.com:9200",
"glue.indexstore.opensearch.auth": "awssigv4",
"glue.indexstore.opensearch.auth.region": "awssigv4",
Same as above. The value should be a region name.
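As an illustration of the point, a small check along these lines would flag a value like ``"awssigv4"`` in the region field. This is only a sketch: the class name and the pattern are assumptions, and the pattern is a loose heuristic for the general shape of AWS region names (for example ``us-west-2``), not an authoritative list of regions.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; the pattern is a shape heuristic, not a list of valid regions. */
class RegionNameCheck {
  // Two-letter prefix, one or more lowercase words, then a number: "us-west-2", "us-gov-west-1".
  private static final Pattern REGION_SHAPE = Pattern.compile("[a-z]{2}(-[a-z]+)+-\\d+");

  /** Returns true when the value has the general shape of an AWS region name. */
  static boolean looksLikeRegion(String value) {
    return value != null && REGION_SHAPE.matcher(value).matches();
  }
}
```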
public static final String SPARK_CATALOG_CATALOG_IMPL =
    "spark.sql.catalog.spark_catalog.catalog-impl";
public static final String ICEBERG_SPARK_JARS =
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.0,software.amazon.awssdk:bundle:2.26.30";
Is it safe to pin a specific version? When would we notice the problem if the versions are inconsistent?
This is hard to say. One thing I observed is that using the Iceberg version bundled with EMR caused issues in EMR versions prior to 7.2 (Spark 3.5.1), so pulling Iceberg from Maven Central is more stable than that. On the AWS SDK version, I'm not sure. The AWS SDK v2 is only used with Iceberg in the EMR 6.x versions, and this version doesn't conflict. That's not to say it couldn't be an issue in EMR 7.x.
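One way to notice a mismatch earlier would be a fail-fast check on the pinned coordinates. The sketch below is hypothetical and not part of this PR: it parses the ``ICEBERG_SPARK_JARS`` string from the diff and compares the Spark line encoded in the Iceberg runtime artifact name against a Spark version supplied by the caller. How the running Spark version would be obtained is left out and is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for inspecting pinned Maven coordinates; not part of the PR. */
class PinnedJars {
  // Value copied from the SparkConstants diff in this thread.
  static final String ICEBERG_SPARK_JARS =
      "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.0,software.amazon.awssdk:bundle:2.26.30";

  /** One group:artifact:version coordinate. */
  static final class Coordinate {
    final String group;
    final String artifact;
    final String version;

    Coordinate(String group, String artifact, String version) {
      this.group = group;
      this.artifact = artifact;
      this.version = version;
    }
  }

  /** Splits a comma-separated package list into coordinates. */
  static List<Coordinate> parse(String packages) {
    List<Coordinate> result = new ArrayList<>();
    for (String entry : packages.split(",")) {
      String[] parts = entry.trim().split(":");
      if (parts.length != 3) {
        throw new IllegalArgumentException("Expected group:artifact:version, got: " + entry);
      }
      result.add(new Coordinate(parts[0], parts[1], parts[2]));
    }
    return result;
  }

  /** Returns the Spark line ("3.3") from iceberg-spark-runtime-3.3_2.12, or null for other artifacts. */
  static String sparkLine(String artifact) {
    String prefix = "iceberg-spark-runtime-";
    if (!artifact.startsWith(prefix)) {
      return null;
    }
    String rest = artifact.substring(prefix.length());
    int underscore = rest.indexOf('_');
    return underscore < 0 ? rest : rest.substring(0, underscore);
  }

  /** True when every Iceberg runtime artifact in the list targets the given Spark version. */
  static boolean matchesSpark(String packages, String runtimeSparkVersion) {
    for (Coordinate c : parse(packages)) {
      String line = sparkLine(c.artifact);
      if (line != null && !runtimeSparkVersion.startsWith(line + ".")) {
        return false;
      }
    }
    return true;
  }
}
```

A check like this only covers the Spark line embedded in the artifact name. It says nothing about AWS SDK conflicts, which is the part the reply above is unsure about.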
Previously, Iceberg catalog was set as the default catalog. This poses problems as the behavior to fall back to default Spark catalog is only correct in some versions of Iceberg. Rather than always opt into Iceberg, Iceberg should be an option. Additionally, the Lake Formation flag enabled Lake Formation for the EMR job. This did not work as expected because EMR system space does not work with Flint. Instead, Lake Formation can be enabled using the Iceberg catalog implementation. Signed-off-by: Adi Suresh <[email protected]>
This change adds Security Lake as a data source type. Security Lake as a data source is simply specific options set on top of the base S3Glue data source. Signed-off-by: Adi Suresh <[email protected]>
The BWC test failed, but it is not related to this PR.
Track it separately.
…source type. (#2858) (cherry picked from commit 05c961e)
…source type. (opensearch-project#2858)
…source type. (#2858) (#2978) (cherry picked from commit 05c961e)
…source type. (#2858) (#2934) (cherry picked from commit 05c961e)
Description
Previously, the Iceberg catalog was set as the default catalog. This poses problems because the fallback to the default Spark catalog only behaves correctly in some versions of Iceberg. Rather than always opting into Iceberg, Iceberg should be an option.
Additionally, the Lake Formation flag enabled Lake Formation for the EMR Serverless job, but this did not work as expected. Enabling Lake Formation for the entire EMR Serverless job results in the EMR Serverless system space being used, and this system space does not work with Flint. The full limitations are documented in https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html. Instead, Lake Formation can be enabled using the Iceberg catalog implementation.
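The opt-in behavior described here can be sketched as a function from the two flags to Spark configuration. This is an illustrative sketch, not the PR's code. Only the ``catalog-impl`` key comes from the diff in this thread. The other keys and class names are assumptions based on Iceberg's AWS integration settings and should be checked against the Iceberg version in use. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of flag-driven Spark configuration; not the PR's implementation. */
class IcebergSparkConf {
  static final String SPARK_CATALOG = "spark.sql.catalog.spark_catalog";
  // Key taken from the SparkConstants diff in this thread.
  static final String SPARK_CATALOG_CATALOG_IMPL = "spark.sql.catalog.spark_catalog.catalog-impl";

  static Map<String, String> build(
      boolean icebergEnabled, boolean lakeFormationEnabled, String roleArn, String sessionTag) {
    Map<String, String> conf = new LinkedHashMap<>();

    // Iceberg is opt-in; without it the Lake Formation flag has no effect.
    if (!icebergEnabled) {
      return conf;
    }

    // Assumed values: Iceberg's session catalog backed by the Glue catalog.
    conf.put(SPARK_CATALOG, "org.apache.iceberg.spark.SparkSessionCatalog");
    conf.put(SPARK_CATALOG_CATALOG_IMPL, "org.apache.iceberg.aws.glue.GlueCatalog");

    if (lakeFormationEnabled) {
      // Assumed keys: Lake Formation is enabled on the catalog, not on the whole EMR job.
      conf.put(SPARK_CATALOG + ".glue.lakeformation-enabled", "true");
      conf.put(
          SPARK_CATALOG + ".client.factory",
          "org.apache.iceberg.aws.lakeformation.LakeFormationAwsClientFactory");
      conf.put(SPARK_CATALOG + ".client.assume-role.arn", roleArn);
      conf.put(
          SPARK_CATALOG + ".client.assume-role.tags.LakeFormationAuthorizedCaller", sessionTag);
    }
    return conf;
  }
}
```

Keeping the Lake Formation settings scoped to the catalog keys reflects the design point above: the job itself is not switched into Lake Formation mode, so the EMR Serverless system space is not involved.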
Testing
Built and deployed to a cluster, then queried Iceberg tables to verify that existing functionality still works and to test the new functionality.
Setup
Created two data sources:
Scenario 1 (S3 IAM permissions; Iceberg flag set, Lake Formation flag not set): the Iceberg table should work.
Scenario 2 (Lake Formation permissions; Iceberg and Lake Formation flags both set): the Iceberg table should work.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.