Standardizing and Enhancing S3 connector Documentation (#616)
* Standardize and enhance the S3 connector documentation

* Removing 'your'

* Typo fix, removing excess detail

* Adding Keyring secrets example

* Conciseness and clarity pass
Scott Lyons authored Nov 6, 2024
1 parent 5d230f6 commit 032d01d
Showing 1 changed file with 63 additions and 160 deletions: `spiceaidocs/docs/components/data-connectors/s3.md`
sidebar_label: 'S3 Data Connector'
description: 'S3 Data Connector Documentation'
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

If a folder is provided, all child files will be loaded.

File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).

Example `spicepod.yml`:

```yaml
datasets:
  # Using access keys
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_auth: key
      s3_key: ${secrets:S3_KEY}
      s3_secret: ${secrets:S3_SECRET}

  # Using IAM roles or Kubernetes service accounts with assigned IAM roles
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset2.parquet
    name: cool_dataset2
    params:
      s3_auth: iam_role

  # Using a public bucket
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
```

## Configuration
### `from`

The S3-compatible URI to a folder or file, in the format `s3://<bucket>/<path>`.

Example: `from: s3://my-bucket/path/to/file.parquet`

### `name`

The dataset name. This will be used as the table name within Spice.

Example:

```yaml
datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+
```

### `params`

| Parameter Name              | Description                                                                                                                                                   |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_format`               | Specifies the data format. Required if it cannot be inferred from the `from` path. Options: `parquet`, `csv`, `json`.                                          |
| `s3_endpoint`               | S3 endpoint URL (e.g., for MinIO). Defaults to the region endpoint. E.g. `s3_endpoint: https://my.minio.server`                                                 |
| `s3_region`                 | S3 bucket region. Default: `us-east-1`.                                                                                                                         |
| `client_timeout`            | Timeout for S3 operations. Default: `30s`.                                                                                                                      |
| `hive_partitioning_enabled` | Enables hive-style partitioning based on the folder structure. Default: `false`.                                                                                |
| `s3_auth`                   | Authentication type. Options: `public`, `key`, and `iam_role`. Defaults to `public` if `s3_key` and `s3_secret` are not provided, otherwise defaults to `key`.  |
| `s3_key`                    | Access key (e.g. `AWS_ACCESS_KEY_ID` for AWS).                                                                                                                  |
| `s3_secret`                 | Secret key (e.g. `AWS_SECRET_ACCESS_KEY` for AWS).                                                                                                              |
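
For example, a minimal configuration combining several of these parameters for an S3-compatible store (the bucket, endpoint, and dataset names below are placeholders):

```yaml
datasets:
  - from: s3://example-bucket/events/
    name: events
    params:
      file_format: parquet
      s3_endpoint: https://s3.example.internal
      s3_region: us-east-1
      client_timeout: 60s
```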

For additional CSV parameters, see [CSV Parameters](/reference/file_format.md#csv).
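
As a sketch, a CSV dataset might combine `file_format: csv` with format-specific options from that reference (the option names `csv_has_header` and `csv_delimiter` below are assumptions — verify them against the reference):

```yaml
datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: taxi_sample
    params:
      file_format: csv
      csv_has_header: true # assumed option name; see the CSV Parameters reference
      csv_delimiter: ','   # assumed option name; see the CSV Parameters reference
```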

## Authentication

No authentication is required for public endpoints. For private buckets, set `s3_auth` to `key` or `iam_role`. If `s3_auth: iam_role` is used, the [AWS IAM role](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html) of the running instance is used.
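
For example, a private bucket read with the instance's IAM role (a minimal sketch; the bucket and dataset names are placeholders):

```yaml
datasets:
  - from: s3://private-bucket/data/
    name: private_data
    params:
      s3_auth: iam_role
```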


Minimum IAM policy for S3 access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
    }
  ]
}
```
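
As a usage sketch, the policy above (saved as `policy.json`) could be attached to the instance's role with the AWS CLI — the role and policy names here are placeholders:

```bash
# Attach an inline policy granting the S3 connector read access
aws iam put-role-policy \
  --role-name spice-runtime-role \
  --policy-name spice-s3-read \
  --policy-document file://policy.json
```

The tabs below show how to supply `s3_key` and `s3_secret` from each supported secret store.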

<Tabs>
<TabItem value="env" label="Env">

```bash
SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE \
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
spice run
# Or using the CLI to configure the secrets into an `.env` file
spice login s3 -k AKIAIOSFODNN7EXAMPLE -s wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

`.env`
```bash
SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

`spicepod.yaml`
```yaml
version: v1beta1
kind: Spicepod
name: spice-app
secrets:
  - from: env
    name: env
datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${env:SPICE_S3_KEY}
      s3_secret: ${env:SPICE_S3_SECRET}
```

Learn more about [Env Secret Store](/components/secret-stores/env).

</TabItem>
<TabItem value="k8s" label="Kubernetes">

```bash
kubectl create secret generic s3 \
  --from-literal=key='AKIAIOSFODNN7EXAMPLE' \
  --from-literal=secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
```

`spicepod.yaml`
```yaml
version: v1beta1
kind: Spicepod
name: spice-app
secrets:
  - from: kubernetes:s3
    name: s3
datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${s3:key}
      s3_secret: ${s3:secret}
```

Learn more about [Kubernetes Secret Store](/components/secret-stores/kubernetes).

</TabItem>
<TabItem value="keyring" label="Keyring">
Add new keychain entries (macOS) for the key and secret:

```bash
# Add the key to the keychain
security add-generic-password -l "S3 Key" \
  -a spiced -s spice_s3_key \
  -w AKIAIOSFODNN7EXAMPLE
# Add the secret to the keychain
security add-generic-password -l "S3 Secret" \
  -a spiced -s spice_s3_secret \
  -w wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
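
To confirm an entry was stored, it can be read back (this prints the secret value to stdout):

```bash
security find-generic-password -a spiced -s spice_s3_key -w
```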

`spicepod.yaml`
```yaml
version: v1beta1
kind: Spicepod
name: spice-app
secrets:
  - from: keyring
    name: keyring
datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${keyring:spice_s3_key}
      s3_secret: ${keyring:spice_s3_secret}
```

Learn more about [Keyring Secret Store](/components/secret-stores/keyring).

</TabItem>
</Tabs>

## Examples

### Public Bucket Example

Create a dataset named `taxi_trips` from a public S3 folder.

```yaml
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet
```

### MinIO Example

Create a dataset named `cool_dataset` from a Parquet file stored in MinIO.

```yaml
- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    s3_endpoint: https://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO
```
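
To try this locally, MinIO can be run via Docker (a sketch; the root credentials below are placeholders and would be supplied via `s3_key`/`s3_secret` if the bucket requires authentication):

```bash
# Start a local MinIO server on port 9000 with placeholder root credentials
docker run -p 9000:9000 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  quay.io/minio/minio server /data
```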

### Hive Partitioning Example

```yaml
datasets:
  # Placeholder values shown; the start of this example is collapsed in the diff.
  - from: s3://bucket-name/partitioned-data/
    name: hive_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```
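
Hive-style partitioning encodes column values in the folder names, and the connector reads them back as table columns. A typical layout looks like this (an illustrative sketch):

```shell
s3://bucket-name/partitioned-data/
├── year=2024/
│   ├── month=01/
│   │   └── data.parquet
│   └── month=02/
│       └── data.parquet
```

With `hive_partitioning_enabled: true`, `year` and `month` become queryable columns inferred from the paths.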

## Secrets

Spice supports three types of [secret stores](/components/secret-stores):

* [Environment variables](/components/secret-stores/env)
* [Kubernetes Secret Store](/components/secret-stores/kubernetes)
* [Keyring Secret Store](/components/secret-stores/keyring)

Explore the different options to manage sensitive data securely.
