AWS IAM Profile credentials (#210)
* Temp pom change, to be reverted

* Rely on default credentials providers if not explicitly set

* Increment version

* Retrieve aws creds from env, build identifier incrementally

* PK fix

* Fix with latest version

* Update avr to fix CVE-2023-39410

* Put explicit deps in child poms

* Fix with PK version used in CI

* Pin ubuntu version to fix 'VM Crashed' issue

* Update doc/changes/changes_2.1.3.md

Co-authored-by: Christoph Pirkl <[email protected]>

* Update PK version in CI

* Tests for ExasolS3Table changes

* Rename and remove abstract from base query generator class

* Query generator tests, fix logic bug in query generator

* PK fix

* Try to decrease DB version

* Revert "Try to decrease DB version"

This reverts commit 3904d2d.

* Add jdbc connection timeout

* Doc updates

* Doc ref

* Changes

* Update exasol-s3/src/test/java/com/exasol/spark/s3/S3TableConfTest.java

Co-authored-by: Christoph Pirkl <[email protected]>

* Update exasol-s3/src/main/java/com/exasol/spark/s3/BaseQueryGenerator.java

Co-authored-by: Christoph Pirkl <[email protected]>

* Update doc/user_guide/user_guide.md

Co-authored-by: Christoph Pirkl <[email protected]>

* Doc fixes

* Fix tests

* T1

* T2

* Adding missing Testcontainers annotation

* Remove print

* Let's try to release today :)

* Revert experiments

* Add unit tests to make sonarcloud happy

* Sonarcloud, part 2

* Sonarcloud, part 3

* Revert ubuntu pin, setup java 17 for sonar

---------

Co-authored-by: Christoph Pirkl <[email protected]>
Shmuma and kaklakariada authored Oct 20, 2023
1 parent e3f997e commit 07bb831
Showing 18 changed files with 459 additions and 95 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/broken_links_checker.yml


10 changes: 6 additions & 4 deletions .github/workflows/ci-build.yml
@@ -24,11 +24,13 @@ jobs:
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Set up JDK 11
- name: Set up JDK 11 & 17
uses: actions/setup-java@v3
with:
distribution: 'temurin'
java-version: 11
java-version: |
17
11
cache: 'maven'
- name: Cache SonarCloud packages
uses: actions/cache@v3
@@ -42,7 +44,7 @@
run: docker pull exasol/docker-db:${{ matrix.exasol-docker-version }}
- name: Run tests and build with Maven
run: |
mvn --batch-mode verify ${{ matrix.profile }} \
JAVA_HOME=$JAVA_HOME_11_X64 mvn --batch-mode verify ${{ matrix.profile }} \
-Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn \
-DtrimStackTrace=false
env:
@@ -55,7 +57,7 @@
- name: Sonar analysis
if: ${{ env.SONAR_TOKEN != null && matrix.profile == '-Pspark3.4' }}
run: |
mvn --batch-mode org.sonarsource.scanner.maven:sonar-maven-plugin:sonar \
JAVA_HOME=$JAVA_HOME_17_X64 mvn --batch-mode org.sonarsource.scanner.maven:sonar-maven-plugin:sonar \
-Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn \
-DtrimStackTrace=false \
-Dsonar.organization=exasol \
2 changes: 1 addition & 1 deletion .github/workflows/pk-verify.yml
@@ -32,4 +32,4 @@ jobs:
key: ${{ runner.os }}-sonar
restore-keys: ${{ runner.os }}-sonar
- name: Run Project Keeper Separately
run: mvn --batch-mode -DtrimStackTrace=false com.exasol:project-keeper-maven-plugin:2.9.11:verify --projects .
run: mvn --batch-mode -DtrimStackTrace=false com.exasol:project-keeper-maven-plugin:2.9.12:verify --projects .
22 changes: 13 additions & 9 deletions dependencies.md


1 change: 1 addition & 0 deletions doc/changes/changelog.md


29 changes: 29 additions & 0 deletions doc/changes/changes_2.1.3.md
@@ -0,0 +1,29 @@
# Spark Connector 2.1.3, released 2023-10-20

Code name: More flexibility for AWS Credentials specification in spark-connector-s3

## Summary
In addition to the explicit specification of AWS credentials, we now support environment variables and EC2 instance profiles.
Fixes CVE-2023-39410 in Apache Avro (transitive dependency).

## Features

* #192: Add support for AWS IAM Profile Credentials for the S3 connector.

## Dependency Updates

### Spark Exasol Connector With JDBC

#### Compile Dependency Updates

* Added `org.apache.avro:avro:1.11.3`

### Spark Exasol Connector With S3

#### Compile Dependency Updates

* Added `org.apache.avro:avro:1.11.3`

#### Test Dependency Updates

* Added `org.junit-pioneer:junit-pioneer:2.1.0`
43 changes: 40 additions & 3 deletions doc/user_guide/user_guide.md
@@ -10,6 +10,7 @@ Exasol tables.
- [Versioning](#versioning)
- [Format](#format)
- [Using as Dependency](#using-as-dependency)
- [AWS Authentication](#aws-authentication)
- [Configuration Parameters](#configuration-options)
- [Creating a Spark DataFrame From Exasol Query](#creating-a-spark-dataframe-from-exasol-query)
- [Saving Spark DataFrame to an Exasol Table](#saving-spark-dataframe-to-an-exasol-table)
@@ -31,7 +32,7 @@ Additionally, please make sure that the Exasol nodes are reachable from the Spar

### S3

When using with S3 intermediate storage please make sure that there is access to an S3 bucket. And please prepare AWS access and secret keys with enough permissions for the S3 bucket.
When using S3 intermediate storage, please make sure that there is access to an S3 bucket. AWS authentication is described in detail in the [corresponding section of this document](#aws-authentication).

## Versioning

@@ -145,6 +146,41 @@ For example, S3 variant with version `2.0.0-spark-3.4.1`:
```
spark-shell --jars spark-connector-s3_2.13-2.0.0-spark-3.4.1-assembly.jar
```

## AWS Authentication

If S3 intermediate storage is used, proper AWS authentication parameters have to be provided:

* Spark has to be able to read from and write to S3 (to export and import the DataFrame's data);
* The database has to be able to read from and write to S3 (to perform `IMPORT` and `EXPORT` statements).

There are several ways to provide AWS credentials, and the concrete method depends on the configuration of your cloud infrastructure. Here we cover the main scenarios and the configuration options you can tweak.

### Credential Providers

The first option is `awsCredentialsProvider`, with which you can specify the list of ways credentials are retrieved from your Spark environment. This parameter is not required; if it is not specified, the default list of credential providers is used. At the time of writing, this list includes the following credential providers:

* `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`: credentials are explicitly set with the options `awsAccessKeyId` and `awsSecretAccessKey`.
* `com.amazonaws.auth.EnvironmentVariableCredentialsProvider`: credentials are retrieved from the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (of the Spark process).
* `com.amazonaws.auth.InstanceProfileCredentialsProvider`: credentials are retrieved from the EC2 instance's IAM role.

There are many other credential providers in the Hadoop AWS library and in third-party libraries. If you need to change the default behaviour, set the `awsCredentialsProvider` option to a comma-separated list of class names.

Credential providers are described in detail in [this document](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3).
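
As an illustration only — a minimal sketch in which the `exasol-s3` format name and the connection options `host`, `port`, `username`, `password`, `query` and `s3Bucket` are assumptions taken from the [Configuration Options](#configuration-options) section, with placeholder values — restricting the provider chain to environment variables could look like this:

```scala
// Minimal sketch: read an Exasol query result via S3 intermediate storage,
// limiting credential lookup to environment variables only.
val df = spark.read
  .format("exasol-s3")                          // assumed format name of the S3 variant
  .option("host", "10.0.0.11")                  // placeholder Exasol connection address
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exasol")
  .option("query", "SELECT * FROM RETAIL.SALES") // hypothetical source query
  .option("s3Bucket", "my-intermediate-bucket")  // placeholder bucket name
  .option("awsCredentialsProvider",
    "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
  .load()
```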

### Explicitly Provided Credentials

If you want to specify the Access Key ID and Secret Access Key explicitly, set the `awsAccessKeyId` and `awsSecretAccessKey` options.

Alternatively, you can set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in your Spark cluster configuration.

In both cases, the credentials are used for S3 operations on the Spark side and forwarded to the database in the `IMPORT` and `EXPORT` commands (as `USER 'key' IDENTIFIED BY 'secret_key'` parameters).
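
For illustration, a write with explicitly provided keys might look like the following sketch (format name and connection options are the same assumptions as above; the key values are read from the environment purely as placeholders):

```scala
// Minimal sketch: write a DataFrame to an Exasol table with explicit AWS keys.
// The keys are forwarded to the database inside the generated IMPORT/EXPORT statements.
df.write
  .format("exasol-s3")                          // assumed format name of the S3 variant
  .mode("append")
  .option("host", "10.0.0.11")                  // placeholder connection settings
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exasol")
  .option("table", "RETAIL.SALES_COPY")          // hypothetical target table
  .option("s3Bucket", "my-intermediate-bucket")
  .option("awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))         // or a literal key
  .option("awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY")) // or a literal secret
  .save()
```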

### Using EC2 Instance Profile
In AWS, you can attach permissions to the IAM role associated with the EC2 instances on which your Spark cluster is running. In that case, S3 credentials are extracted from the instance profile automatically by `InstanceProfileCredentialsProvider`, so you don't need to pass any options.

In this scenario, no credentials are put into the `IMPORT` and `EXPORT` database commands, so you need to make sure that the database has proper access to the S3 bucket you're using for intermediate storage.

If the database is running on EC2, it is also possible to use EC2 instance profiles on the database side, but this has to be enabled explicitly, as described in [this document](https://exasol.my.site.com/s/article/Changelog-content-15155?language=en_US).
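
As a sketch under the same assumptions as above, a job relying purely on the instance profile simply omits all AWS credential options:

```scala
// Minimal sketch: no awsAccessKeyId, awsSecretAccessKey or awsCredentialsProvider options;
// on EC2 the default provider chain falls back to InstanceProfileCredentialsProvider.
val salesCount = spark.read
  .format("exasol-s3")                          // assumed format name of the S3 variant
  .option("host", "10.0.0.11")                  // placeholder connection settings
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exasol")
  .option("query", "SELECT * FROM RETAIL.SALES") // hypothetical source query
  .option("s3Bucket", "my-intermediate-bucket")
  .load()
  .count()
```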

## Configuration Options

In this section, we describe the common configuration parameters that are used for both JDBC and S3 variants to facilitate the integration between Spark and Exasol clusters.
@@ -208,8 +244,9 @@ When using the `S3` variant of the connector you should provide the following ad
| Parameter | Default | Required | Description |
|-----------------------|:------------------:|:--------:|-------------------------------------------------------------------- |
| `s3Bucket`             |                    | ✓        | A bucket name for intermediate storage                               |
| `awsAccessKeyId`       |                    | ✓        | AWS Access Key for accessing bucket                                  |
| `awsSecretAccessKey`   |                    | ✓        | AWS Secret Key for accessing bucket                                  |
| `awsAccessKeyId` | | | AWS Access Key for accessing bucket |
| `awsSecretAccessKey` | | | AWS Secret Key for accessing bucket |
| `awsCredentialsProvider` | [default providers](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3) | | List of classes used to extract credentials information from the runtime environment. |
| `numPartitions` | `8` | | Number of partitions that will match number of files in `S3` bucket |
| `awsRegion` | `us-east-1` | | AWS Region for provided bucket |
| `awsEndpointOverride` | (default endpoint) | | AWS S3 Endpoint for bucket, set this value for custom endpoints |
5 changes: 5 additions & 0 deletions exasol-jdbc/pom.xml
@@ -121,6 +121,11 @@
</exclusion>
</exclusions>
</dependency>
<dependency>
<!-- Added here to reference avro version from parent-pom -->
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
</dependency>
<!-- Test Dependencies -->
<dependency>
<groupId>org.scalatest</groupId>
10 changes: 10 additions & 0 deletions exasol-s3/pom.xml
@@ -53,6 +53,11 @@
<artifactId>wildfly-openssl</artifactId>
<version>2.2.5.Final</version>
</dependency>
<dependency>
<!-- Added here to reference avro version from parent-pom -->
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
</dependency>
<!-- Test Dependencies -->
<dependency>
<groupId>org.junit.jupiter</groupId>
@@ -64,6 +69,11 @@
<artifactId>junit-jupiter-api</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit-pioneer</groupId>
<artifactId>junit-pioneer</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.exasol</groupId>
<artifactId>test-db-builder-java</artifactId>


