Standardizing and Enhancing S3 connector Documentation (#616)
* Standardize and enhance the S3 connector documentation

* Removing 'your'

* Typo fix, removing excess detail

* Adding Keyring secrets example

* Conciseness and clarity pass
Scott Lyons authored Nov 6, 2024
1 parent 5d230f6 commit 032d01d
Showing 1 changed file with 63 additions and 160 deletions: `spiceaidocs/docs/components/data-connectors/s3.md`
sidebar_label: 'S3 Data Connector'
description: 'S3 Data Connector Documentation'
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

If a folder is provided, all child files will be loaded.

File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).

Example `spicepod.yml`:

```yaml
datasets:
  # Using access keys
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_auth: key
      s3_key: ${secrets:S3_KEY}
      s3_secret: ${secrets:S3_SECRET}

  # Using IAM roles or Kubernetes service accounts with assigned IAM roles
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset2.parquet
    name: cool_dataset2
    params:
      s3_auth: iam_role

  # Using a public bucket
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
```

## Configuration
### `from`

The S3-compatible URI to a folder or file, in the format `s3://<bucket>/<path>`.

Example: `from: s3://my-bucket/path/to/file.parquet`

### `name`

The dataset name. This will be used as the table name within Spice.

Example:

```yaml
datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+
```

### `params`

| Parameter Name              | Description                                                                                                                                                   |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_format`               | Specifies the data format. Required if it cannot be inferred from the `from` path. Options: `parquet`, `csv`, `json`.                                          |
| `s3_endpoint`               | S3 endpoint URL (e.g., for MinIO). Defaults to the region endpoint. E.g. `s3_endpoint: https://my.minio.server`                                                 |
| `s3_region`                 | S3 bucket region. Default: `us-east-1`.                                                                                                                         |
| `client_timeout`            | Timeout for S3 operations. Default: `30s`.                                                                                                                      |
| `hive_partitioning_enabled` | Enables hive-style partitioning based on the folder structure. Default: `false`.                                                                                |
| `s3_auth`                   | Authentication type. Options: `public`, `key`, and `iam_role`. Defaults to `public` if `s3_key` and `s3_secret` are not provided, otherwise defaults to `key`.  |
| `s3_key`                    | Access key (e.g. `AWS_ACCESS_KEY_ID` for AWS).                                                                                                                  |
| `s3_secret`                 | Secret key (e.g. `AWS_SECRET_ACCESS_KEY` for AWS).                                                                                                              |
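
For example, a minimal configuration combining several of these parameters for an S3-compatible store (the bucket, endpoint, and dataset names below are placeholders):

```yaml
datasets:
  - from: s3://example-bucket/events/
    name: events
    params:
      file_format: parquet
      s3_endpoint: https://s3.example.internal
      s3_region: us-east-1
      client_timeout: 60s
```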

For additional CSV parameters, see [CSV Parameters](/reference/file_format.md#csv).
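
As a sketch, a CSV dataset might combine `file_format: csv` with format-specific options from that reference (the option names `csv_has_header` and `csv_delimiter` below are assumptions — verify them against the reference):

```yaml
datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: taxi_sample
    params:
      file_format: csv
      csv_has_header: true # assumed option name; see the CSV Parameters reference
      csv_delimiter: ','   # assumed option name; see the CSV Parameters reference
```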

## Authentication

No authentication is required for public endpoints. For private buckets, set `s3_auth` to `key` or `iam_role`. If `s3_auth: iam_role` is used, the [AWS IAM role](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html) of the running instance is used.
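
For example, a private bucket read with the instance's IAM role (a minimal sketch; the bucket and dataset names are placeholders):

```yaml
datasets:
  - from: s3://private-bucket/data/
    name: private_data
    params:
      s3_auth: iam_role
```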


Minimum IAM policy for S3 access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
    }
  ]
}
```
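
As a usage sketch, the policy above (saved as `policy.json`) could be attached to the instance's role with the AWS CLI — the role and policy names here are placeholders:

```bash
# Attach an inline policy granting the S3 connector read access
aws iam put-role-policy \
  --role-name spice-runtime-role \
  --policy-name spice-s3-read \
  --policy-document file://policy.json
```

The tabs below show how to supply `s3_key` and `s3_secret` from each supported secret store.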

<Tabs>
<TabItem value="env" label="Env">

```bash
SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE \
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
spice run
# Or using the CLI to configure the secrets into an `.env` file
spice login s3 -k AKIAIOSFODNN7EXAMPLE -s wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

`.env`
```bash
SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

`spicepod.yaml`
```yaml
version: v1beta1
kind: Spicepod
name: spice-app
secrets:
  - from: env
    name: env
datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${env:SPICE_S3_KEY}
      s3_secret: ${env:SPICE_S3_SECRET}
```

Learn more about [Env Secret Store](/components/secret-stores/env).

</TabItem>
<TabItem value="k8s" label="Kubernetes">

```bash
kubectl create secret generic s3 \
  --from-literal=key='AKIAIOSFODNN7EXAMPLE' \
  --from-literal=secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
```

`spicepod.yaml`
```yaml
version: v1beta1
kind: Spicepod
name: spice-app
secrets:
  - from: kubernetes:s3
    name: s3
datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${s3:key}
      s3_secret: ${s3:secret}
```

Learn more about [Kubernetes Secret Store](/components/secret-stores/kubernetes).

</TabItem>
<TabItem value="keyring" label="Keyring">
Add new keychain entries (macOS) for the key and secret:

```bash
# Add the key to the keychain
security add-generic-password -l "S3 Key" \
  -a spiced -s spice_s3_key \
  -w AKIAIOSFODNN7EXAMPLE
# Add the secret to the keychain
security add-generic-password -l "S3 Secret" \
  -a spiced -s spice_s3_secret \
  -w wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
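
To confirm an entry was stored, it can be read back (this prints the secret value to stdout):

```bash
security find-generic-password -a spiced -s spice_s3_key -w
```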

`spicepod.yaml`
```yaml
version: v1beta1
kind: Spicepod
name: spice-app
secrets:
  - from: keyring
    name: keyring
datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${keyring:spice_s3_key}
      s3_secret: ${keyring:spice_s3_secret}
```

Learn more about [Keyring Secret Store](/components/secret-stores/keyring).

</TabItem>
</Tabs>

## Examples

### Public Bucket Example

Create a dataset named `taxi_trips` from a public S3 folder.

```yaml
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet
```

### MinIO Example

Create a dataset named `cool_dataset` from a Parquet file stored in MinIO.

```yaml
- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    s3_endpoint: https://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO
```
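
To try this locally, MinIO can be run via Docker (a sketch; the root credentials below are placeholders and would be supplied via `s3_key`/`s3_secret` if the bucket requires authentication):

```bash
# Start a local MinIO server on port 9000 with placeholder root credentials
docker run -p 9000:9000 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  quay.io/minio/minio server /data
```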

### Hive Partitioning Example

```yaml
datasets:
  # Placeholder values shown; the start of this example is collapsed in the diff.
  - from: s3://bucket-name/partitioned-data/
    name: hive_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```
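
Hive-style partitioning encodes column values in the folder names, and the connector reads them back as table columns. A typical layout looks like this (an illustrative sketch):

```shell
s3://bucket-name/partitioned-data/
├── year=2024/
│   ├── month=01/
│   │   └── data.parquet
│   └── month=02/
│       └── data.parquet
```

With `hive_partitioning_enabled: true`, `year` and `month` become queryable columns inferred from the paths.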

## Secrets

Spice supports three types of [secret stores](/components/secret-stores):

* [Environment variables](/components/secret-stores/env)
* [Kubernetes Secret Store](/components/secret-stores/kubernetes)
* [Keyring Secret Store](/components/secret-stores/keyring)

Explore the different options to manage sensitive data securely.
