Add iceberg support to EMR serverless jobs. #2602

Merged: 1 commit, Apr 8, 2024

asuresh8
Contributor

@asuresh8 asuresh8 commented Apr 3, 2024

Description

This commit adds support for Query Workbench to access Iceberg tables. Iceberg is a commonly used table format for data stored on S3. In particular, Security Lake data is stored in Iceberg format, so this change enables Security Lake data to be queried from Query Workbench.

Testing

Built and deployed to a cluster, then queried both Hive and Iceberg tables to verify that existing functionality still works and to test the new functionality.

First round of testing to make sure Flint is compatible is in opensearch-project/opensearch-spark#301

An end-to-end test using this package was performed with the following steps:

  1. Built this package locally
  2. Started an EC2 instance
  3. Downloaded OpenSearch 2.12 onto the EC2 instance
  4. Replaced opensearch-sql folder with artifacts built locally
  5. Created an EMR application
  6. Ran `echo "plugins.query.executionengine.spark.config: '{\"applicationId\":\"xxxxxxxxxx\",\"executionRoleARN\":\"arn:aws:iam::xxxxxxxxx:role/emr-job-execution-role\",\"region\":\"us-west-2\",\"sparkSubmitParameters\":\"--conf spark.dynamicAllocation.enabled=false\"}'" >> config/opensearch.yml`
  7. Started OpenSearch
  8. Ran a Hive query to verify existing functionality still works
$ curl --request  POST   --url http://localhost:9200/_plugins/_async_query   --header 'content-type: application/x-ndjson'   --data '{"datasource": "maws_s3","lang": "sql","query": "SELECT * FROM maws_s3.amazon_security_lake_glue_db_us_east_1.amazon_security_lake_table_us_east_1_vpc_flow_1_0 LIMIT 1"}'
{
  "queryId": "ZU9qMXFYTm5lQW1hd3NfczM=",
  "sessionId": "TjBIQUZqUEpWSG1hd3NfczM="
}
$ curl --request  GET --url http://localhost:9200/_plugins/_async_query/ZU9qMXFYTm5lQW1hd3NfczM=
{
  "status": "SUCCESS",
  "schema": [
    {
      "name": "metadata",
      "type": "struct"
    },
  ...
  "total": 1,
  "size": 1
}
  9. Ran an Iceberg query to test new functionality
$ curl --request  POST   --url http://localhost:9200/_plugins/_async_query   --header 'content-type: application/x-ndjson'   --data '{"datasource": "maws_s3","lang": "sql","query": "SELECT * FROM maws_s3.amazon_security_lake_glue_db_us_east_1.amazon_security_lake_table_us_east_1_vpc_flow_2_0 LIMIT 1", "sessionId": "TjBIQUZqUEpWSG1hd3NfczM="}'
{
  "queryId": "WkRHN3hZTlB0VW1hd3NfczM=",
  "sessionId": "TjBIQUZqUEpWSG1hd3NfczM="
}
$ curl --request  GET --url http://localhost:9200/_plugins/_async_query/WkRHN3hZTlB0VW1hd3NfczM=
{
  "status": "SUCCESS",
  "schema": [
    {
      "name": "metadata",
      "type": "struct"
    },
  ...
  "total": 1,
  "size": 1
}
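
For anyone repeating the steps above without hand-quoting JSON in the shell, here is a minimal Python sketch of the same flow. It is not part of this PR. The helper names (`spark_config_line`, `query_body`, `run_query`) are hypothetical, the endpoint and request fields are copied from the curl commands above, and treating `FAILED` as a terminal status is an assumption; only `SUCCESS` appears in the output shown here.

```python
import json
import time
import urllib.request

# Endpoint taken from the curl commands in the test steps.
BASE_URL = "http://localhost:9200/_plugins/_async_query"


def spark_config_line(application_id, execution_role_arn, region, spark_submit_parameters):
    """Build the opensearch.yml line from step 6 with the JSON value quoted safely."""
    value = json.dumps({
        "applicationId": application_id,
        "executionRoleARN": execution_role_arn,
        "region": region,
        "sparkSubmitParameters": spark_submit_parameters,
    })
    # A YAML single-quoted scalar escapes a single quote by doubling it.
    return "plugins.query.executionengine.spark.config: '" + value.replace("'", "''") + "'"


def query_body(datasource, query, session_id=None):
    """Build the request body used in steps 8 and 9; sessionId is optional."""
    body = {"datasource": datasource, "lang": "sql", "query": query}
    if session_id is not None:
        body["sessionId"] = session_id
    return body


def _request(method, url, body=None):
    """Send one HTTP request and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"content-type": "application/x-ndjson"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_query(datasource, query, session_id=None, poll_seconds=5.0, timeout_seconds=600.0):
    """Submit an async query, then poll until it reaches a terminal status."""
    submitted = _request("POST", BASE_URL, query_body(datasource, query, session_id))
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        result = _request("GET", BASE_URL + "/" + submitted["queryId"])
        # Assumption: FAILED is terminal; the PR output only shows SUCCESS.
        if str(result.get("status", "")).upper() in ("SUCCESS", "FAILED"):
            return submitted, result
        time.sleep(poll_seconds)
    raise TimeoutError("query " + submitted["queryId"] + " did not finish in time")
```

Calling `run_query("maws_s3", "SELECT ... LIMIT 1")` for the Hive table and then passing the returned `sessionId` into a second call for the Iceberg table would mirror steps 8 and 9, which reuse one session.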

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@vmmusings
Member

@asuresh8 can we add more details to the description? What is Iceberg, and why are we adding this?

@asuresh8
Contributor Author

asuresh8 commented Apr 4, 2024

Added more details to the description.

@vmmusings
Member

Could you please add screenshots, or at least mention the scenarios that we have tested?

codecov bot commented Apr 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.37%. Comparing base (e153609) to head (03b9372).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2602   +/-   ##
=========================================
  Coverage     95.37%   95.37%           
  Complexity     5131     5131           
=========================================
  Files           490      490           
  Lines         14428    14430    +2     
  Branches        968      968           
=========================================
+ Hits          13760    13762    +2     
  Misses          643      643           
  Partials         25       25           
Flag Coverage Δ
sql-engine 95.37% <100.00%> (+<0.01%) ⬆️
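
The percentages in the diff above can be rechecked from the raw counts. This is a small sketch that assumes coverage is computed as hits divided by total lines, with lines equal to hits plus misses plus partials; that matches the numbers shown, but the formula itself is not stated in the report.

```python
def coverage_percent(hits, misses, partials):
    """Coverage as a percentage, assuming lines = hits + misses + partials."""
    lines = hits + misses + partials
    return 100.0 * hits / lines


# Counts copied from the coverage diff: base (main) and head (#2602).
base = coverage_percent(hits=13760, misses=643, partials=25)
head = coverage_percent(hits=13762, misses=643, partials=25)

print(f"base={base:.2f}% head={head:.2f}% delta={head - base:+.4f} points")
```

Both values round to 95.37%, and the difference is positive but below 0.01 points, which is consistent with the `+<0.01%` shown for the sql-engine flag.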


@vmmusings added the enhancement (New feature or request) and backport 2.x labels, Apr 8, 2024
@vmmusings merged commit 39c0222 into opensearch-project:main, Apr 8, 2024
27 of 31 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 8, 2024
Signed-off-by: Adi Suresh <[email protected]>
(cherry picked from commit 39c0222)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
vmmusings pushed a commit that referenced this pull request Apr 22, 2024
(cherry picked from commit 39c0222)

Signed-off-by: Adi Suresh <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>