diff --git a/spiceaidocs/docs/components/data-connectors/s3.md b/spiceaidocs/docs/components/data-connectors/s3.md
index 6c0f8631..73726bba 100644
--- a/spiceaidocs/docs/components/data-connectors/s3.md
+++ b/spiceaidocs/docs/components/data-connectors/s3.md
@@ -4,75 +4,74 @@ sidebar_label: 'S3 Data Connector'
 description: 'S3 Data Connector Documentation'
 ---
 
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-The S3 Data Connector enables federated SQL query on files stored in S3 or S3-compatible systems (e.g. MinIO, Cloudflare R2).
+The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).
 
 If a folder is provided, all child files will be loaded.
 
 File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).
 
-Example `spicepod.yml`:
-
 ```yaml
 datasets:
-  # Using access keys
-  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
-    name: cool_dataset
-    params:
-      s3_auth: key
-      s3_key: ${secrets:S3_KEY}
-      s3_secret: ${secrets:S3_SECRET}
-
-  # Using IAM roles or Kubernetes service accounts with assigned IAM roles
-  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset2.parquet
-    name: cool_dataset2
-    params:
-      s3_auth: iam_role
-
-  # Using a public bucket
   - from: s3://spiceai-demo-datasets/taxi_trips/2024/
     name: taxi_trips
     params:
       file_format: parquet
 ```
 
-## Dataset Schema Reference
+## Configuration
 
 ### `from`
 
-The S3-compatible URI to a folder or object in form `from: s3://<bucket>/<file>`
+The S3-compatible URI to a folder or file, in the format `s3://<bucket>/<path>`.
 
-Example: `from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet`
+Example: `from: s3://my-bucket/path/to/file.parquet`
 
 ### `name`
 
-The dataset name.
+The dataset name. This will be used as the table name within Spice.
+
+Example:
+```yaml
+datasets:
+  - from: s3://s3-bucket-name/taxi_sample.csv
+    name: cool_dataset
+    params:
+      file_format: csv
+```
+
+```sql
+SELECT COUNT(*) FROM cool_dataset;
+```
+
-Example: `name: cool_dataset`
+```shell
++----------+
+| count(*) |
++----------+
+| 6001215  |
++----------+
+```
 
 ### `params`
 
-- `file_format`: Specifies the data file format. Required if the format cannot be inferred by from the `from` path.
-  - `parquet`: Parquet file format.
-  - `csv`: CSV file format.
-- `s3_endpoint`: The S3 endpoint, or equivalent (e.g. MinIO endpoint), for the S3-compatible storage. Defaults to region endpoint. E.g. `s3_endpoint: https://my.minio.server`
-- `s3_region`: Region of the S3 bucket, if region specific. Default value is `us-east-1` E.g. `s3_region: us-east-1`
-- `client_timeout`: Specifies timeout for S3 operations. Default value is `30s` E.g. `client_timeout: 60s`
-- `hive_partitioning_enabled`: Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false`
+| Parameter Name              | Description                                                                                                                                                     |
+| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `file_format`               | Specifies the data format. Required if the format cannot be inferred from the `from` path. Options: `parquet`, `csv`, `json`.                                  |
+| `s3_endpoint`               | S3 endpoint URL (e.g., a MinIO endpoint). Defaults to the region endpoint, e.g. `s3_endpoint: https://my.minio.server`.                                        |
+| `s3_region`                 | S3 bucket region. Default: `us-east-1`.                                                                                                                         |
+| `client_timeout`            | Timeout for S3 operations. Default: `30s`.                                                                                                                      |
+| `hive_partitioning_enabled` | Enables hive-style partitioning based on the folder structure. Default: `false`.                                                                                |
+| `s3_auth`                   | Authentication type. Options: `public`, `key`, and `iam_role`. Defaults to `public` if `s3_key` and `s3_secret` are not provided, otherwise defaults to `key`. |
+| `s3_key`                    | Access key (e.g. `AWS_ACCESS_KEY_ID` for AWS).                                                                                                                  |
+| `s3_secret`                 | Secret key (e.g. `AWS_SECRET_ACCESS_KEY` for AWS).                                                                                                              |
 
-More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)
+For additional CSV parameters, see [CSV Parameters](/reference/file_format.md#csv).
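+
+For example, several of these parameters can be combined on a single dataset (the bucket path below is illustrative):
+
+```yaml
+datasets:
+  - from: s3://s3-bucket-name/path/to/parquet/
+    name: cool_dataset
+    params:
+      file_format: parquet
+      s3_region: us-east-1
+      client_timeout: 60s
+```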
 
-## Auth
+## Authentication
 
-Optional for public endpoints. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_dremio_pass}`.
+No authentication is required for public endpoints. For private buckets, set `s3_auth` to `key` or `iam_role`. If using `iam_role`, the [AWS IAM role](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html) of the running instance is used.
 
-- `s3_auth`: (Optional) The authentication method to use. Values are `public`, `key` and `iam_role`. Defaults to `public` if `s3_key` and `s3_secret` are not provided, otherwise defaults to `key`.
-- `s3_key`: The access key (e.g. `AWS_ACCESS_KEY_ID` for AWS)
-- `s3_secret`: The secret key (e.g. `AWS_SECRET_ACCESS_KEY` for AWS)
-
-For non-public buckets, `s3_auth: key` or `s3_auth: iam_role` is required. `s3_auth: iam_role` will use the [AWS IAM role](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html) of the currently running instance. The following IAM policy shows the least privileged policy required for the S3 connector:
+Minimum IAM policy for S3 access:
 
 ```json
 {
@@ -81,146 +80,40 @@ For non-public buckets, `s3_auth: key` or `s3_a
     {
       "Effect": "Allow",
       "Action": ["s3:ListBucket"],
-      "Resource": "arn:aws:s3:::yourcompany-bucketname-datasets"
+      "Resource": "arn:aws:s3:::company-bucketname-datasets"
     },
     {
       "Effect": "Allow",
       "Action": ["s3:GetObject"],
-      "Resource": "arn:aws:s3:::yourcompany-bucketname-datasets/*"
+      "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
     }
   ]
 }
 ```
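+
+For example, to query a private bucket using access keys loaded from a secret store, or using the instance's IAM role (bucket paths shown are illustrative):
+
+```yaml
+datasets:
+  # Using access keys
+  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
+    name: cool_dataset
+    params:
+      s3_auth: key
+      s3_key: ${secrets:S3_KEY}
+      s3_secret: ${secrets:S3_SECRET}
+
+  # Using the IAM role of the running instance
+  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset2.parquet
+    name: cool_dataset2
+    params:
+      s3_auth: iam_role
+```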
-
-
-
-
-  ```bash
-  SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE \
-  SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
-  spice run
-  # Or using the CLI to configure the secrets into an `.env` file
-  spice login s3 -k AKIAIOSFODNN7EXAMPLE -s wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
-  ```
-
-  `.env`
-  ```bash
-  SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE
-  SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
-  ```
-
-  `spicepod.yaml`
-  ```yaml
-  version: v1beta1
-  kind: Spicepod
-  name: spice-app
-
-  secrets:
-    - from: env
-      name: env
-
-  datasets:
-    - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
-      name: cool_dataset
-      params:
-        s3_region: us-east-1
-        s3_key: ${env:SPICE_S3_KEY}
-        s3_secret: ${env:SPICE_S3_SECRET}
-  ```
-
-  Learn more about [Env Secret Store](/components/secret-stores/env).
-
-
-
-
-  ```bash
-  kubectl create secret generic s3 \
-    --from-literal=key='AKIAIOSFODNN7EXAMPLE' \
-    --from-literal=secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
-  ```
-
-  `spicepod.yaml`
-  ```yaml
-  version: v1beta1
-  kind: Spicepod
-  name: spice-app
-
-  secrets:
-    - from: kubernetes:s3
-      name: s3
-
-  datasets:
-    - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
-      name: cool_dataset
-      params:
-        s3_region: us-east-1
-        s3_key: ${s3:key}
-        s3_secret: ${s3:secret}
-  ```
-
-  Learn more about [Kubernetes Secret Store](/components/secret-stores/kubernetes).
-
-
-
-
-  Add new keychain entries (macOS) for the key and secret:
-
-  ```bash
-  # Add Key to keychain
-  security add-generic-secret -l "S3 Key" \
-    -a spiced -s spice_s3_key \
-    -w AKIAIOSFODNN7EXAMPLE
-  # Add Secret to keychain
-  security add-generic-secret -l "S3 Secret" \
-    -a spiced -s spice_s3_secret \
-    -w wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
-  ```
-
-  `spicepod.yaml`
-  ```yaml
-  version: v1beta1
-  kind: Spicepod
-  name: spice-app
-
-  secrets:
-    - from: keyring
-      name: keyring
-
-  datasets:
-    - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
-      name: cool_dataset
-      params:
-        s3_region: us-east-1
-        s3_key: ${keyring:spice_s3_key}
-        s3_secret: ${keyring:spice_s3_secret}
-  ```
-
-  Learn more about [Keyring Secret Store](/components/secret-stores/keyring).
-
-
-
 
 ## Examples
 
-### MinIO Example
+### Public Bucket Example
 
-Create a dataset named `cool_dataset` from a Parquet file stored in MinIO.
+Create a dataset named `taxi_trips` from a public S3 folder.
 
 ```yaml
-- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
-  name: cool_dataset
+- from: s3://spiceai-demo-datasets/taxi_trips/2024/
+  name: taxi_trips
   params:
-    s3_endpoint: https://my.minio.server
-    s3_region: 'us-east-1' # Best practice for Minio
+    file_format: parquet
 ```
 
-### S3 Public Example
+### MinIO Example
 
-Create a dataset named `taxi_trips` from a public S3 folder.
+Create a dataset named `cool_dataset` from a Parquet file stored in MinIO.
 
 ```yaml
-- from: s3://spiceai-demo-datasets/taxi_trips/2024/
-  name: taxi_trips
+- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
+  name: cool_dataset
   params:
-    file_format: parquet
+    s3_endpoint: https://my.minio.server
+    s3_region: 'us-east-1' # Best practice for MinIO
 ```
 
 ### Hive Partitioning Example
@@ -248,3 +141,13 @@ datasets:
       file_format: parquet
       hive_partitioning_enabled: true
 ```
+
+## Secrets
+
+Spice supports three types of [secret stores](/components/secret-stores):
+
+* [Environment variables](/components/secret-stores/env)
+* [Kubernetes Secret Store](/components/secret-stores/kubernetes)
+* [Keyring Secret Store](/components/secret-stores/keyring)
+
+Explore the different options to manage sensitive data securely.
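+
+For example, with the environment-variable secret store, the access key and secret can be exported as `SPICE_S3_KEY` and `SPICE_S3_SECRET` and referenced from the spicepod (a brief sketch; see the linked pages for complete examples):
+
+```yaml
+secrets:
+  - from: env
+    name: env
+
+datasets:
+  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
+    name: cool_dataset
+    params:
+      s3_key: ${env:SPICE_S3_KEY}
+      s3_secret: ${env:SPICE_S3_SECRET}
+```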