Standardizing and Enhancing S3 connector Documentation #616

Merged: 5 commits on Nov 6, 2024
223 changes: 63 additions & 160 deletions spiceaidocs/docs/components/data-connectors/s3.md
@@ -4,75 +4,74 @@ sidebar_label: 'S3 Data Connector'
description: 'S3 Data Connector Documentation'
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

If a folder is provided, all child files will be loaded.

File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).

Example `spicepod.yml`:

```yaml
datasets:
  # Using access keys
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_auth: key
      s3_key: ${secrets:S3_KEY}
      s3_secret: ${secrets:S3_SECRET}

  # Using IAM roles or Kubernetes service accounts with assigned IAM roles
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset2.parquet
    name: cool_dataset2
    params:
      s3_auth: iam_role

  # Using a public bucket
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
```

## Configuration

### `from`

The S3-compatible URI to a folder or file, in the format `s3://<bucket>/<path>`.

Example: `from: s3://my-bucket/path/to/file.parquet`

### `name`

The dataset name. This will be used as the table name within Spice.

Example:
```yaml
datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
```

### `params`

| Parameter Name              | Description                                                                                                                                                   |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `file_format`               | Specifies the data file format. Required if the format cannot be inferred from the `from` path. Options: `parquet`, `csv`, `json`.                           |
| `s3_endpoint`               | The S3 endpoint, or equivalent (e.g. a MinIO endpoint), for the S3-compatible storage. Defaults to the region endpoint. E.g. `s3_endpoint: https://my.minio.server` |
| `s3_region`                 | Region of the S3 bucket. Default: `us-east-1`.                                                                                                                |
| `client_timeout`            | Timeout for S3 operations. Default: `30s`. E.g. `client_timeout: 60s`                                                                                         |
| `hive_partitioning_enabled` | Enable hive-style partitioning based on the folder structure. Default: `false`.                                                                               |
| `s3_auth`                   | Authentication type. Options: `public`, `key`, and `iam_role`. Defaults to `public` if `s3_key` and `s3_secret` are not provided, otherwise defaults to `key`. |
| `s3_key`                    | Access key (e.g. `AWS_ACCESS_KEY_ID` for AWS).                                                                                                                |
| `s3_secret`                 | Secret key (e.g. `AWS_SECRET_ACCESS_KEY` for AWS).                                                                                                            |
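
As a sketch, several of these parameters combined for an S3-compatible store (the bucket, endpoint, and timeout values are illustrative):

```yaml
datasets:
  - from: s3://my-bucket/data/ # illustrative bucket and path
    name: my_dataset
    params:
      file_format: csv
      s3_endpoint: https://my.minio.server
      s3_region: us-east-1
      client_timeout: 60s
      hive_partitioning_enabled: false
```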

For additional CSV parameters, see [CSV Parameters](/reference/file_format.md#csv).

## Authentication

No authentication is required for public endpoints. For private buckets, set `s3_auth` to `key` or `iam_role`. If using `iam_role`, the [AWS IAM role](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html) of the running instance is used.

Minimum IAM policy for S3 access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
    }
  ]
}
```

<Tabs>
<TabItem value="env" label="Env">

```bash
SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE \
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
spice run
# Or using the CLI to configure the secrets into an `.env` file
spice login s3 -k AKIAIOSFODNN7EXAMPLE -s wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

`.env`
```bash
SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

`spicepod.yaml`
```yaml
version: v1beta1
kind: Spicepod
name: spice-app

secrets:
  - from: env
    name: env

datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${env:SPICE_S3_KEY}
      s3_secret: ${env:SPICE_S3_SECRET}
```

Learn more about [Env Secret Store](/components/secret-stores/env).

</TabItem>
<TabItem value="k8s" label="Kubernetes">
```bash
kubectl create secret generic s3 \
--from-literal=key='AKIAIOSFODNN7EXAMPLE' \
--from-literal=secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
```

`spicepod.yaml`
```yaml
version: v1beta1
kind: Spicepod
name: spice-app

secrets:
  - from: kubernetes:s3
    name: s3

datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${s3:key}
      s3_secret: ${s3:secret}
```

Learn more about [Kubernetes Secret Store](/components/secret-stores/kubernetes).

</TabItem>
<TabItem value="keyring" label="Keyring">
Add new keychain entries (macOS) for the key and secret:

```bash
# Add the key to the keychain
security add-generic-password -l "S3 Key" \
  -a spiced -s spice_s3_key \
  -w AKIAIOSFODNN7EXAMPLE
# Add the secret to the keychain
security add-generic-password -l "S3 Secret" \
  -a spiced -s spice_s3_secret \
  -w wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

`spicepod.yaml`
```yaml
version: v1beta1
kind: Spicepod
name: spice-app

secrets:
  - from: keyring
    name: keyring

datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${keyring:spice_s3_key}
      s3_secret: ${keyring:spice_s3_secret}
```

Learn more about [Keyring Secret Store](/components/secret-stores/keyring).

</TabItem>
</Tabs>

## Examples

### Public Bucket Example

Create a dataset named `taxi_trips` from a public S3 folder.

```yaml
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet
```

### MinIO Example

Create a dataset named `cool_dataset` from a Parquet file stored in MinIO.

```yaml
- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    file_format: parquet
    s3_endpoint: https://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO
```

### Hive Partitioning Example
Hive partitioning encodes column values in the folder structure (e.g. `.../year=2024/month=03/`). With `hive_partitioning_enabled: true`, these folder names are exposed as queryable columns. The bucket and dataset names below are illustrative:

```yaml
datasets:
  - from: s3://my-bucket/partitioned_data/
    name: partitioned_dataset
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```

## Secrets

Spice supports three types of [secret stores](/components/secret-stores):

* [Environment variables](/components/secret-stores/env)
* [Kubernetes Secret Store](/components/secret-stores/kubernetes)
* [Keyring Secret Store](/components/secret-stores/keyring)

Explore the different options to manage sensitive data securely.
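
For example, with the environment variable secret store, credentials set as `SPICE_S3_KEY` and `SPICE_S3_SECRET` can be referenced from the dataset `params` (mirroring the Env tab above):

```yaml
secrets:
  - from: env
    name: env

datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_key: ${env:SPICE_S3_KEY}
      s3_secret: ${env:SPICE_S3_SECRET}
```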