Add flags for Iceberg and Lake Formation and Security Lake as a data source type. #2858

asuresh8 · 2024-07-24T21:27:32Z

Description

Previously, the Iceberg catalog was set as the default catalog. This causes problems because the fallback to the default Spark catalog only behaves correctly in some versions of Iceberg. Rather than always opting into Iceberg, Iceberg should be an option.

Additionally, the Lake Formation flag enabled Lake Formation for the EMR Serverless job, but this did not work as expected. Enabling Lake Formation for the entire EMR Serverless job results in the EMR Serverless system space being used, and this system space does not work with Flint. The full limitations are documented in https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html. Instead, Lake Formation can be enabled using the Iceberg catalog implementation.
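
To illustrate what enabling Lake Formation through the Iceberg catalog implementation might look like, here is a minimal sketch that builds per-catalog Spark settings instead of a job-wide switch. The property names follow the Apache Iceberg AWS integration documentation as I understand it. The class name, the `build` helper, and the example role ARN are hypothetical, and nothing here is confirmed to match the settings this PR actually emits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: per-catalog Iceberg settings, not the code merged in this PR. */
final class IcebergLakeFormationConf {

  /** Builds Spark conf entries scoped to one catalog (assumed key names, see lead-in). */
  static Map<String, String> build(
      String catalog, String roleArn, String region, String sessionTag, boolean lakeFormation) {
    String prefix = "spark.sql.catalog." + catalog;
    Map<String, String> conf = new LinkedHashMap<>();
    // Register the catalog as an Iceberg catalog backed by AWS Glue.
    conf.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
    conf.put(prefix + ".catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog");
    // Role assumed when talking to Glue and S3.
    conf.put(prefix + ".client.assume-role.arn", roleArn);
    conf.put(prefix + ".client.assume-role.region", region);
    if (lakeFormation) {
      // Lake Formation is switched on for this catalog only, not for the whole job.
      conf.put(prefix + ".client.factory",
          "org.apache.iceberg.aws.lakeformation.LakeFormationAwsClientFactory");
      conf.put(prefix + ".glue.lakeformation-enabled", "true");
      conf.put(prefix + ".client.assume-role.tags.LakeFormationAuthorizedCaller", sessionTag);
    } else {
      conf.put(prefix + ".client.factory", "org.apache.iceberg.aws.AssumeRoleAwsClientFactory");
    }
    return conf;
  }

  public static void main(String[] args) {
    // Placeholder ARN; "gdc3" and "directquery" mirror the test setup in this PR.
    Map<String, String> conf =
        build("gdc3", "arn:aws:iam::123456789012:role/ExampleRole", "us-west-2", "directquery", true);
    conf.forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```

Scoping the settings to a single catalog is the point of the sketch: the rest of the job, including Flint, keeps running outside the EMR Serverless system space.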

Testing

Built and deployed to a cluster, then queried Iceberg tables to verify that existing functionality still works and to test the new functionality.

Setup

Created 2 data sources:

curl \
--request POST \
--url http://localhost:9200/_plugins/_query/_datasources \
--header 'content-type: application/x-ndjson' \
--data '{"name": "gdc2","description": "","connector": "S3GLUE","allowedRoles": [],"properties": {"glue.auth.type": "iam_role","glue.auth.role_arn": "arn:aws:iam::476834799096:role/DirectQueryWithPublicLakeFormation","glue.indexstore.opensearch.uri": "http://ip-10-1-41-51.us-west-2.compute.internal:9200","glue.indexstore.opensearch.auth": "noauth", "glue.iceberg.enabled": "true"}}'

curl \
--request POST \
--url http://localhost:9200/_plugins/_query/_datasources \
--header 'content-type: application/x-ndjson' \
--data '{"name": "gdc3","description": "","connector": "S3GLUE","allowedRoles": [],"properties": {"glue.auth.type": "iam_role","glue.auth.role_arn": "arn:aws:iam::476834799096:role/DirectQueryWithPublicLakeFormation","glue.indexstore.opensearch.uri": "http://ip-10-1-41-51.us-west-2.compute.internal:9200","glue.indexstore.opensearch.auth": "noauth", "glue.iceberg.enabled": "true", "glue.lakeformation.enabled": "true", "glue.lakeformation.session_tag": "directquery"}}'

Scenario 1 (S3 IAM permissions; Iceberg flag set, Lake Formation flag not set): Iceberg table should work

curl --request  POST --url http://localhost:9200/_plugins/_async_query --header 'content-type: application/x-ndjson' --data '{"datasource": "gdc2","lang": "sql","query": "SELECT * FROM gdc2.amazon_security_lake_glue_db_us_west_2.amazon_security_lake_table_us_west_2_vpc_flow_2_0 LIMIT 1;"}'
{
  "queryId": "aWt3aDQyM0JxaGdkYzI=",
  ...
}

curl --request GET --url http://localhost:9200/_plugins/_async_query/aWt3aDQyM0JxaGdkYzI=

Scenario 2 (Lake Formation permissions; Iceberg and Lake Formation flags both set): Iceberg table should work

curl --request  POST --url http://localhost:9200/_plugins/_async_query --header 'content-type: application/x-ndjson' --data '{"datasource": "gdc3","lang": "sql","query": "SELECT * FROM gdc3.amazon_security_lake_glue_db_us_west_2.amazon_security_lake_table_us_west_2_vpc_flow_2_0 LIMIT 1;"}'
{
  "queryId": "T3A3RWlTSGVRSWdkYzM=",
  ...
}

curl --request GET --url http://localhost:9200/_plugins/_async_query/T3A3RWlTSGVRSWdkYzM=
{
  "status": "SUCCESS",
...
}

Check List

New functionality includes testing.
- All tests pass, including unit test, integration test and doctest
New functionality has been documented.
- New functionality has javadoc added
- New functionality has user manual doc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

vmmusings · 2024-07-30T17:22:59Z

docs/user/ppl/admin/connectors/s3glue_connector.rst

-* ``glue.lakeformation.enabled`` determines whether to enable lakeformation for queries. Default value is ``"false"`` if not specified
+* ``glue.iceberg.enabled`` determines whether to enable Iceberg for the session. Default value is ``"false"`` if not specified.
+* ``glue.lakeformation.enabled`` determines whether to enable Lake Formation for queries when Iceberg is also enabled. If Iceberg is not enabled, then this property has no effect. Default value is ``"false"`` if not specified.
+* ``glue.lakeformation.session_tag`` what session tag to use when assuming the data source role. This property is required when both Iceberg and Lake Formation are enabled.


Have we introduced validations on these conditions while creating Glue datasource?

Thanks for catching this! I've added them now.
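
For readers following this thread, the conditions in the documented properties could be checked along these lines. This is a hedged sketch, not the validation that was added to the PR: the class and method names are hypothetical, and rejecting Lake Formation without Iceberg is an assumption, since the documentation only says the property has no effect in that case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical validator for the documented flag combinations; not the PR's code. */
final class GlueFlagValidator {
  static final String ICEBERG_ENABLED = "glue.iceberg.enabled";
  static final String LF_ENABLED = "glue.lakeformation.enabled";
  static final String LF_SESSION_TAG = "glue.lakeformation.session_tag";

  /** Returns a list of human-readable problems; an empty list means the properties are valid. */
  static List<String> validate(Map<String, String> props) {
    List<String> errors = new ArrayList<>();
    // Both flags default to "false" when not specified, as documented.
    boolean iceberg = Boolean.parseBoolean(props.getOrDefault(ICEBERG_ENABLED, "false"));
    boolean lakeFormation = Boolean.parseBoolean(props.getOrDefault(LF_ENABLED, "false"));
    String sessionTag = props.get(LF_SESSION_TAG);

    if (lakeFormation && !iceberg) {
      // Assumption: fail fast rather than silently ignore the flag.
      errors.add(LF_ENABLED + " requires " + ICEBERG_ENABLED + " to be true");
    }
    if (lakeFormation && iceberg && (sessionTag == null || sessionTag.isBlank())) {
      // Documented: the session tag is required when both flags are enabled.
      errors.add(LF_SESSION_TAG + " is required when Iceberg and Lake Formation are enabled");
    }
    return errors;
  }

  public static void main(String[] args) {
    Map<String, String> props = Map.of(ICEBERG_ENABLED, "true", LF_ENABLED, "true");
    validate(props).forEach(System.out::println);
  }
}
```

Returning a list rather than throwing on the first problem lets the data source creation API report every misconfiguration in one response.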

async-query-core/src/main/java/org/opensearch/sql/spark/data/constants/SparkConstants.java

penghuo · 2024-08-09T17:06:41Z

@ykmr1224 could you take a look?
@asuresh8 could you check the Java CI failure?

asuresh8 · 2024-08-09T18:05:42Z

@asuresh8 could you check the Java CI failure?

Failure does not look related to this change.

docs/user/ppl/admin/connectors/s3glue_connector.rst

ykmr1224 · 2024-08-12T16:02:00Z

docs/user/ppl/admin/connectors/security_lake_connector.rst

+                "glue.auth.role_arn": "role_arn",
+                "glue.indexstore.opensearch.uri": "http://adsasdf.amazonopensearch.com:9200",
+                "glue.indexstore.opensearch.auth" :"awssigv4",
+                "glue.indexstore.opensearch.auth.region" :"awssigv4",


Same as above: the value should be a region name.

docs/user/ppl/admin/connectors/security_lake_connector.rst

ykmr1224 · 2024-08-12T16:10:43Z

async-query-core/src/main/java/org/opensearch/sql/spark/data/constants/SparkConstants.java

-  public static final String SPARK_CATALOG_CATALOG_IMPL =
-      "spark.sql.catalog.spark_catalog.catalog-impl";
+  public static final String ICEBERG_SPARK_JARS =
+      "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.0,software.amazon.awssdk:bundle:2.26.30";


Is it safe to pin a specific version? When would we notice the issue if there is a version inconsistency?

This is hard to say. One thing I observed is that using the Iceberg version bundled with EMR caused issues in EMR versions prior to 7.2 (Spark 3.5.1), so specifying Iceberg from Maven Central is more stable than that. On the AWS SDK version, I'm not sure. The AWS SDK v2 is only used with Iceberg in the EMR 6.x versions, and this version doesn't conflict. That's not to say it couldn't be an issue in EMR 7.x.
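
One way a version inconsistency might be noticed earlier is a small check that parses the pinned coordinates, so a test can compare them against whatever versions the target runtime is expected to provide. The sketch below is an assumption about how that could be done. The class and method names are hypothetical, and only the coordinate string comes from the diff above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that splits pinned Maven coordinates into artifact and version. */
final class JarCoordinates {
  // Value copied from the ICEBERG_SPARK_JARS constant in the diff above.
  static final String ICEBERG_SPARK_JARS =
      "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.0,software.amazon.awssdk:bundle:2.26.30";

  /** Maps "group:artifact" to "version" for a comma-separated list of group:artifact:version. */
  static Map<String, String> versionsByArtifact(String jars) {
    Map<String, String> versions = new LinkedHashMap<>();
    for (String coordinate : jars.split(",")) {
      String[] parts = coordinate.trim().split(":");
      if (parts.length != 3) {
        throw new IllegalArgumentException("Expected group:artifact:version but got: " + coordinate);
      }
      versions.put(parts[0] + ":" + parts[1], parts[2]);
    }
    return versions;
  }

  public static void main(String[] args) {
    versionsByArtifact(ICEBERG_SPARK_JARS)
        .forEach((artifact, version) -> System.out.println(artifact + " -> " + version));
  }
}
```

A unit test could then assert that each pinned version matches an expected value, which turns an accidental bump or a mismatch into a build failure rather than a runtime surprise on the cluster.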

Previously, Iceberg catalog was set as the default catalog. This poses problems as the behavior to fall back to default Spark catalog is only correct in some versions of Iceberg. Rather than always opt into Iceberg, Iceberg should be an option. Additionally, the Lake Formation flag enabled Lake Formation for the EMR job. This did not work as expected because EMR system space does not work with Flint. Instead Lake Formation can be enabled using the Iceberg catalog implementation. Signed-off-by: Adi Suresh <[email protected]>

This change adds Security Lake as a data source type. Security Lake as a data source is simply specific options set on top of the base S3Glue data source. Signed-off-by: Adi Suresh <[email protected]>

penghuo

The BWC test failed, but it is not related to this PR.
Track it separately.

asuresh8 marked this pull request as ready for review July 24, 2024 22:02

asuresh8 requested review from ps48, kavithacm, derek-ho, joshuali925, dai-chen, YANG-DB, rupal-bq, mengweieric, vmmusings, Swiddis, penghuo, seankao-az, MaxKsyunz, Yury-Fridlyand, anirudha, forestmvey, acarbonetto, GumpacG, ykmr1224 and LantaoJin as code owners July 24, 2024 22:02

vmmusings assigned vmmusings and asuresh8 and unassigned vmmusings Jul 30, 2024

vmmusings reviewed Jul 30, 2024

asuresh8 force-pushed the iceberg_lf branch 2 times, most recently from 75683cc to 726c24f Compare July 30, 2024 21:30

asuresh8 force-pushed the iceberg_lf branch 2 times, most recently from 79b8f9a to af13394 Compare August 7, 2024 18:56

asuresh8 changed the title ~~Add flag for iceberg and correct flag for Lake Formation.~~ Add flags for Iceberg and Lake Formation and Security Lake as a data source type. Aug 7, 2024

asuresh8 force-pushed the iceberg_lf branch 3 times, most recently from 2a2843d to d5ae63a Compare August 7, 2024 21:55

engechas approved these changes Aug 8, 2024

penghuo reviewed Aug 9, 2024

asuresh8 force-pushed the iceberg_lf branch from d5ae63a to 42efd1b Compare August 9, 2024 17:20

ykmr1224 reviewed Aug 12, 2024

asuresh8 added 2 commits August 13, 2024 13:34

Add Security Lake data source type.

c1027d7

This change adds Security Lake as a data source type. Security Lake as a data source is simply specific options set on top of the base S3Glue data source. Signed-off-by: Adi Suresh <[email protected]>

asuresh8 force-pushed the iceberg_lf branch from 42efd1b to c1027d7 Compare August 13, 2024 13:34

asuresh8 requested review from penghuo and ykmr1224 August 13, 2024 13:39

ykmr1224 approved these changes Aug 13, 2024

penghuo approved these changes Aug 13, 2024

penghuo added the backport 2.x label Aug 13, 2024

penghuo merged commit 05c961e into opensearch-project:main Aug 13, 2024
13 of 15 checks passed

opensearch-trigger-bot bot mentioned this pull request Aug 13, 2024

[Backport 2.x] Add flags for Iceberg and Lake Formation and Security Lake as a data source type. #2934

Merged

ykmr1224 added the backport 2.17 label Sep 4, 2024

opensearch-trigger-bot bot mentioned this pull request Sep 4, 2024

[Backport 2.17] Add flags for Iceberg and Lake Formation and Security Lake as a data source type. #2978

Merged
