AWS IAM Profile credentials (#210)
* Temp pom change, to be reverted

* Rely on default credentials providers if not explicitly set

* Increment version

* Retrieve aws creds from env, build identifier incrementally

* PK fix

* Fix with latest version

* Update avr to fix CVE-2023-39410

* Put explicit deps in child poms

* Fix with PK version used in CI

* Pin ubuntu version to fix 'VM Crashed' issue

* Update doc/changes/changes_2.1.3.md

Co-authored-by: Christoph Pirkl <[email protected]>

* Update PK version in CI

* Tests for ExasolS3Table changes

* Rename and remove abstract from base query generator class

* Query generator tests, fix logic bug in query generator

* PK fix

* Try to decrease DB version

* Revert "Try to decrease DB version"

This reverts commit 3904d2d.

* Add jdbc connection timeout

* Doc updates

* Doc ref

* Changes

* Update exasol-s3/src/test/java/com/exasol/spark/s3/S3TableConfTest.java

Co-authored-by: Christoph Pirkl <[email protected]>

* Update exasol-s3/src/main/java/com/exasol/spark/s3/BaseQueryGenerator.java

Co-authored-by: Christoph Pirkl <[email protected]>

* Update doc/user_guide/user_guide.md

Co-authored-by: Christoph Pirkl <[email protected]>

* Doc fixes

* Fix tests

* T1

* T2

* Adding missing Testcontainers annotation

* Remove print

* Let's try to release today :)

* Revert experiments

* Add unit tests to make sonarcloud happy

* Sonarcloud, part 2

* Sonarcloud, part 3

* Revert ubuntu pin, setup java 17 for sonar

---------

Co-authored-by: Christoph Pirkl <[email protected]>
Shmuma and kaklakariada authored Oct 20, 2023
1 parent e3f997e commit 07bb831
Showing 18 changed files with 459 additions and 95 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/broken_links_checker.yml


10 changes: 6 additions & 4 deletions .github/workflows/ci-build.yml
@@ -24,11 +24,13 @@ jobs:
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Set up JDK 11
- name: Set up JDK 11 & 17
uses: actions/setup-java@v3
with:
distribution: 'temurin'
java-version: 11
java-version: |
17
11
cache: 'maven'
- name: Cache SonarCloud packages
uses: actions/cache@v3
@@ -42,7 +44,7 @@
run: docker pull exasol/docker-db:${{ matrix.exasol-docker-version }}
- name: Run tests and build with Maven
run: |
mvn --batch-mode verify ${{ matrix.profile }} \
JAVA_HOME=$JAVA_HOME_11_X64 mvn --batch-mode verify ${{ matrix.profile }} \
-Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn \
-DtrimStackTrace=false
env:
@@ -55,7 +57,7 @@
- name: Sonar analysis
if: ${{ env.SONAR_TOKEN != null && matrix.profile == '-Pspark3.4' }}
run: |
mvn --batch-mode org.sonarsource.scanner.maven:sonar-maven-plugin:sonar \
JAVA_HOME=$JAVA_HOME_17_X64 mvn --batch-mode org.sonarsource.scanner.maven:sonar-maven-plugin:sonar \
-Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn \
-DtrimStackTrace=false \
-Dsonar.organization=exasol \
2 changes: 1 addition & 1 deletion .github/workflows/pk-verify.yml
@@ -32,4 +32,4 @@ jobs:
key: ${{ runner.os }}-sonar
restore-keys: ${{ runner.os }}-sonar
- name: Run Project Keeper Separately
run: mvn --batch-mode -DtrimStackTrace=false com.exasol:project-keeper-maven-plugin:2.9.11:verify --projects .
run: mvn --batch-mode -DtrimStackTrace=false com.exasol:project-keeper-maven-plugin:2.9.12:verify --projects .
22 changes: 13 additions & 9 deletions dependencies.md


1 change: 1 addition & 0 deletions doc/changes/changelog.md


29 changes: 29 additions & 0 deletions doc/changes/changes_2.1.3.md
@@ -0,0 +1,29 @@
# Spark Connector 2.1.3, released 2023-10-20

Code name: More flexibility for AWS Credentials specification in spark-connector-s3

## Summary
In addition to the explicit specification of AWS credentials, we now support environment variables and EC2 instance profiles.
Fixes CVE-2023-39410 in Apache Avro (transitive dependency).

## Features

* #192: Add support for AWS IAM Profile Credentials for the S3 connector.

## Dependency Updates

### Spark Exasol Connector With JDBC

#### Compile Dependency Updates

* Added `org.apache.avro:avro:1.11.3`

### Spark Exasol Connector With S3

#### Compile Dependency Updates

* Added `org.apache.avro:avro:1.11.3`

#### Test Dependency Updates

* Added `org.junit-pioneer:junit-pioneer:2.1.0`
43 changes: 40 additions & 3 deletions doc/user_guide/user_guide.md
@@ -10,6 +10,7 @@ Exasol tables.
- [Versioning](#versioning)
- [Format](#format)
- [Using as Dependency](#using-as-dependency)
- [AWS Authentication](#aws-authentication)
- [Configuration Parameters](#configuration-options)
- [Creating a Spark DataFrame From Exasol Query](#creating-a-spark-dataframe-from-exasol-query)
- [Saving Spark DataFrame to an Exasol Table](#saving-spark-dataframe-to-an-exasol-table)
@@ -31,7 +32,7 @@ Additionally, please make sure that the Exasol nodes are reachable from the Spar

### S3

When using with S3 intermediate storage please make sure that there is access to an S3 bucket. And please prepare AWS access and secret keys with enough permissions for the S3 bucket.
When using S3 intermediate storage, please make sure that there is access to an S3 bucket. AWS authentication is described in detail in the [corresponding section of this document](#aws-authentication).

## Versioning

@@ -145,6 +146,41 @@ For example, S3 variant with version `2.0.0-spark-3.4.1`:
```
spark-shell --jars spark-connector-s3_2.13-2.0.0-spark-3.4.1-assembly.jar
```

## AWS Authentication

If S3 intermediate storage is used, proper AWS authentication parameters have to be provided:

* Spark has to be able to read from and write to S3 (to export and import the DataFrame's data);
* The database has to be able to read from and write to S3 (to perform `IMPORT` and `EXPORT` statements).

There are several ways to provide AWS credentials, and the concrete method depends on the configuration of your cloud infrastructure. Here we cover the main scenarios and the configuration options you can tweak.

### Credential Providers

The first option is `awsCredentialsProvider`, with which you can specify the list of ways credentials are retrieved from your Spark environment. This parameter is not required; if it is not specified, the default list of credential providers is used. At the time of writing, this list includes the following credential providers:

* `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`: credentials are explicitly set with the options `awsAccessKeyId` and `awsSecretAccessKey`.
* `com.amazonaws.auth.EnvironmentVariableCredentialsProvider`: credentials are retrieved from the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (of the Spark process).
* `com.amazonaws.auth.InstanceProfileCredentialsProvider`: credentials are retrieved from the EC2 instance's IAM role.

There are many other credential providers in the Hadoop AWS library and in third-party libraries. If you need to change the default behaviour, set the `awsCredentialsProvider` option to a comma-separated list of class names.

Credential providers are described in detail in [this document](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3).
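
As an illustration only — a minimal sketch in which the `exasol-s3` format name and the connection options `host`, `port`, `username`, `password`, `query` and `s3Bucket` are assumptions taken from the [Configuration Options](#configuration-options) section, with placeholder values — restricting the provider chain to environment variables could look like this:

```scala
// Minimal sketch: read an Exasol query result via S3 intermediate storage,
// limiting credential lookup to environment variables only.
val df = spark.read
  .format("exasol-s3")                          // assumed format name of the S3 variant
  .option("host", "10.0.0.11")                  // placeholder Exasol connection address
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exasol")
  .option("query", "SELECT * FROM RETAIL.SALES") // hypothetical source query
  .option("s3Bucket", "my-intermediate-bucket")  // placeholder bucket name
  .option("awsCredentialsProvider",
    "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
  .load()
```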

### Explicitly Provided Credentials

If you want to specify the Access Key ID and Secret Access Key explicitly, set the `awsAccessKeyId` and `awsSecretAccessKey` options.

Alternatively, you can set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in your Spark cluster configuration.

In both cases, the credentials are used for S3 operations on the Spark side and forwarded to the database in the `IMPORT` and `EXPORT` commands (as `USER 'key' IDENTIFIED BY 'secret_key'` parameters).
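
For illustration, a write with explicitly provided keys might look like the following sketch (format name and connection options are the same assumptions as above; the key values are read from the environment purely as placeholders):

```scala
// Minimal sketch: write a DataFrame to an Exasol table with explicit AWS keys.
// The keys are forwarded to the database inside the generated IMPORT/EXPORT statements.
df.write
  .format("exasol-s3")                          // assumed format name of the S3 variant
  .mode("append")
  .option("host", "10.0.0.11")                  // placeholder connection settings
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exasol")
  .option("table", "RETAIL.SALES_COPY")          // hypothetical target table
  .option("s3Bucket", "my-intermediate-bucket")
  .option("awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))         // or a literal key
  .option("awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY")) // or a literal secret
  .save()
```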

### Using EC2 Instance Profile
In AWS, you can attach permissions to the IAM role associated with the EC2 instances on which your Spark cluster is running. In that case, S3 credentials are extracted from the instance profile automatically by `InstanceProfileCredentialsProvider`, so you don't need to pass any options.

In this scenario, no credentials are put into the `IMPORT` and `EXPORT` database commands, so you need to make sure that the database has proper access to the S3 bucket you're using for intermediate storage.

If the database is running on EC2, it is also possible to use EC2 instance profiles on the database side, but this has to be enabled explicitly, as described in [this document](https://exasol.my.site.com/s/article/Changelog-content-15155?language=en_US).
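
As a sketch under the same assumptions as above, a job relying purely on the instance profile simply omits all AWS credential options:

```scala
// Minimal sketch: no awsAccessKeyId, awsSecretAccessKey or awsCredentialsProvider options;
// on EC2 the default provider chain falls back to InstanceProfileCredentialsProvider.
val salesCount = spark.read
  .format("exasol-s3")                          // assumed format name of the S3 variant
  .option("host", "10.0.0.11")                  // placeholder connection settings
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exasol")
  .option("query", "SELECT * FROM RETAIL.SALES") // hypothetical source query
  .option("s3Bucket", "my-intermediate-bucket")
  .load()
  .count()
```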

## Configuration Options

In this section, we describe the common configuration parameters that are used for both JDBC and S3 variants to facilitate the integration between Spark and Exasol clusters.
@@ -208,8 +244,9 @@ When using the `S3` variant of the connector you should provide the following ad
| Parameter | Default | Required | Description |
|-----------------------|:------------------:|:--------:|-------------------------------------------------------------------- |
| `s3Bucket`             |                    | ✓        | A bucket name for intermediate storage                               |
| `awsAccessKeyId`       |                    | ✓        | AWS Access Key for accessing bucket                                  |
| `awsSecretAccessKey`   |                    | ✓        | AWS Secret Key for accessing bucket                                  |
| `awsAccessKeyId` | | | AWS Access Key for accessing bucket |
| `awsSecretAccessKey` | | | AWS Secret Key for accessing bucket |
| `awsCredentialsProvider` | [default providers](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3) | | List of classes used to extract credentials information from the runtime environment. |
| `numPartitions` | `8` | | Number of partitions that will match number of files in `S3` bucket |
| `awsRegion` | `us-east-1` | | AWS Region for provided bucket |
| `awsEndpointOverride` | (default endpoint) | | AWS S3 Endpoint for bucket, set this value for custom endpoints |
5 changes: 5 additions & 0 deletions exasol-jdbc/pom.xml
@@ -121,6 +121,11 @@
</exclusion>
</exclusions>
</dependency>
<dependency>
<!-- Added here to reference avro version from parent-pom -->
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
</dependency>
<!-- Test Dependencies -->
<dependency>
<groupId>org.scalatest</groupId>
10 changes: 10 additions & 0 deletions exasol-s3/pom.xml
@@ -53,6 +53,11 @@
<artifactId>wildfly-openssl</artifactId>
<version>2.2.5.Final</version>
</dependency>
<dependency>
<!-- Added here to reference avro version from parent-pom -->
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
</dependency>
<!-- Test Dependencies -->
<dependency>
<groupId>org.junit.jupiter</groupId>
@@ -64,6 +69,11 @@
<artifactId>junit-jupiter-api</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit-pioneer</groupId>
<artifactId>junit-pioneer</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.exasol</groupId>
<artifactId>test-db-builder-java</artifactId>


