Standardizing File, FTP/SFTP and HTTPS connector docs #609

Merged
merged 9 commits into from
Nov 14, 2024
81 changes: 70 additions & 11 deletions spiceaidocs/docs/components/data-connectors/file.md
Expand Up @@ -4,8 +4,6 @@ sidebar_label: 'File Data Connector'
description: 'File Data Connector Documentation'
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

The File Data Connector enables federated SQL queries on files stored by locally accessible filesystems. It supports querying individual files or entire directories, where all child files within the directory will be loaded and queried.

Expand All @@ -19,20 +17,45 @@ datasets:
name: customer
params:
file_format: parquet
```

## Configuration

### `from`

The `from` field for the File connector takes the form `file://path`, where `path` is the path to the file to read. See the [examples](#examples) below for relative and absolute paths.

### `name`

- from: file://path/to/orders.csv
name: orders
The dataset name. This will be used as the table name within Spice.

Example:
```yaml
datasets:
- from: file://path/to/customer.parquet
name: cool_dataset
params:
file_format: csv
csv_has_header: false
...
```

## Parameters
```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
```

| Parameter name | Description |
|------------------------|-------------------------------------------------------------------------------------------------------|
| `file_format` | Specifies the data file format. Required if the format cannot be inferred from the `from` path. |
| `hive_partitioning_enabled`| Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false` |
### `params`

| Parameter name | Description |
| --------------------------- | ------------------------------------------------------------------------------------------------ |
| `file_format` | Specifies the data file format. Required if the format cannot be inferred from the `from` path. |
| `hive_partitioning_enabled` | Enables hive-style partitioning based on the folder structure. Defaults to `false`. |

For CSV-specific parameters, see [CSV Parameters](/reference/file_format.md#csv).
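
As a rough illustration of `hive_partitioning_enabled`, the sketch below points the connector at a directory instead of a single file. The directory layout and the `sales` dataset name are hypothetical and only show the shape of the configuration.

```yaml
# Hypothetical layout, assumed for illustration:
#   path/to/sales/year=2024/month=01/data.parquet
#   path/to/sales/year=2024/month=02/data.parquet
datasets:
  - from: file://path/to/sales/
    name: sales
    params:
      file_format: parquet # set explicitly, since a directory path carries no file extension
      hive_partitioning_enabled: true # defaults to false
```

In hive-style layouts, folder names of the form `key=value` are conventionally treated as partition columns.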

Expand All @@ -52,3 +75,39 @@ datasets:
```

When the file is modified, the acceleration will be refreshed and will include the latest data.

## Examples

### Absolute path

In this example, `path` is an absolute path to the file on the filesystem.

```yaml
datasets:
  - from: file://path/to/customer.parquet
    name: customer
    params:
      file_format: parquet
```

### Relative path

In this example, the path is relative to the directory where the `spicepod.yaml` is located.

```bash
├── foo
│   └── yellow_tripdata_2024-01.parquet
└── spicepod.yaml
```

```yaml
datasets:
  - from: file:foo/yellow_tripdata_2024-01.parquet
    name: trip_data
    params:
      file_format: parquet
```

## Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).
157 changes: 96 additions & 61 deletions spiceaidocs/docs/components/data-connectors/ftp.md
Expand Up @@ -4,69 +4,104 @@ sidebar_label: 'FTP/SFTP Data Connector'
description: 'FTP/SFTP Data Connector Documentation'
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
FTP (File Transfer Protocol) and SFTP (SSH File Transfer Protocol) are network protocols for transferring files between a client and a server. FTP transfers data unencrypted, while SFTP encrypts transfers over SSH.

The FTP/SFTP Data Connector enables federated SQL query across Parquet/CSV files stored in FTP/SFTP servers.
The FTP/SFTP Data Connector enables federated SQL query across [supported file formats](/components/data-connectors/index.md#object-store-file-formats) stored in FTP/SFTP servers.

If a folder is provided, all child Parquet/CSV files will be loaded.
```yaml
datasets:
  - from: sftp://remote-sftp-server.com/path/to/folder/
    name: my_dataset
    params:
      file_format: csv
      sftp_port: 20
      sftp_user: my-sftp-user
      sftp_pass: ${secrets:my_sftp_password}
```

## Configuration

<Tabs>
<TabItem value="ftp" label="FTP" default>
### Parameters

The connection to FTP can be configured by providing the following params:

- `file_format`: Specifies the data file format. Required if the format cannot be inferred from the `from` path. See [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).
- `ftp_port`: Optional, specifies the port of the FTP server. Default is 21. E.g. `ftp_port: 21`
- `ftp_user`: The username for the FTP server. E.g. `ftp_user: my-ftp-user`
- `ftp_pass`: The password for the FTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_ftp_pass}`.
- `client_timeout`: Optional. Specifies timeout for FTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for FTP client.
- `hive_partitioning_enabled`: Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false`

For additional CSV-related parameters, see [CSV Parameters](/reference/file_format.md#csv).

### Examples
```yaml
- from: ftp://remote-ftp-server.com/path/to/folder/
  name: my_dataset
  params:
    file_format: csv
    ftp_user: my-ftp-user
    ftp_pass: ${secrets:my_ftp_password}
    hive_partitioning_enabled: false
```

</TabItem>
<TabItem value="sftp" label="SFTP">
### Parameters

The connection to SFTP can be configured by providing the following params:

- `file_format`: Optional, specifies the requested file format.
- `parquet`: (default) Parquet file format.
- `csv`: CSV file format.
- `sftp_port`: Optional, specifies the port of the SFTP server. Default is 22. E.g. `sftp_port: 22`
- `sftp_user`: The username for the SFTP server. E.g. `sftp_user: my-sftp-user`
- `sftp_pass`: The password for the SFTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_sftp_pass}`.
- `client_timeout`: Optional. Specifies timeout for SFTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for SFTP client.
- `hive_partitioning_enabled`: Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false`

For additional CSV-related parameters, see [CSV Parameters](/reference/file_format.md#csv).

### Examples
```yaml
- from: sftp://remote-sftp-server.com/path/to/folder/
  name: my_dataset
  params:
    file_format: csv
    sftp_port: 20
    sftp_user: my-sftp-user
    sftp_pass: ${secrets:my_sftp_password}
    hive_partitioning_enabled: true
```

</TabItem>
</Tabs>
### `from`

The `from` field takes one of two forms, `ftp://path` or `sftp://path`, where `path` is the server host followed by the path to the file or directory to read from (for example, `sftp://remote-sftp-server.com/path/to/folder/`).

If a folder is provided, all child files will be loaded.
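
The two shapes of `from` can be sketched as follows. The dataset names `single_file` and `whole_folder` and the file path are hypothetical; the hosts and credentials reuse the placeholder values from this page.

```yaml
datasets:
  # A single file: the format may be inferable from the `.parquet` extension.
  - from: ftp://remote-ftp-server.com/path/to/file.parquet
    name: single_file
    params:
      ftp_user: my-ftp-user
      ftp_pass: ${secrets:my_ftp_password}
  # A folder: all child files are loaded, so `file_format` is set explicitly here.
  - from: sftp://remote-sftp-server.com/path/to/folder/
    name: whole_folder
    params:
      file_format: csv
      sftp_user: my-sftp-user
      sftp_pass: ${secrets:my_sftp_password}
```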

### `name`

The dataset name. This will be used as the table name within Spice.

Example:
```yaml
datasets:
  - from: sftp://remote-sftp-server.com/path/to/folder/
    name: cool_dataset
    params:
      ...
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
```

### `params`

#### FTP

| Parameter Name | Description |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_format` | Specifies the data file format. Required if the format cannot be inferred from the `from` path. See [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats). |
| `ftp_port` | Optional, specifies the port of the FTP server. Default is 21. E.g. `ftp_port: 21` |
| `ftp_user` | The username for the FTP server. E.g. `ftp_user: my-ftp-user` |
| `ftp_pass` | The password for the FTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_ftp_pass}`. |
| `client_timeout` | Optional. Specifies timeout for FTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for FTP client. |
| `hive_partitioning_enabled` | Optional. Enables hive-style partitioning based on the folder structure. Defaults to `false`. |
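
A minimal sketch that combines the optional FTP parameters from the table above. The host, user, and secret names are the placeholder values used elsewhere on this page, not real endpoints.

```yaml
datasets:
  - from: ftp://remote-ftp-server.com/path/to/folder/
    name: my_dataset
    params:
      file_format: csv
      ftp_port: 21 # the default, shown only for clarity
      ftp_user: my-ftp-user
      ftp_pass: ${secrets:my_ftp_password}
      client_timeout: 30s # when omitted, no client timeout is configured
```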

#### SFTP
| Parameter Name | Description |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_format` | Specifies the data file format. Required if the format cannot be inferred from the `from` path. See [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats). |
| `sftp_port` | Optional, specifies the port of the SFTP server. Default is 22. E.g. `sftp_port: 22` |
| `sftp_user` | The username for the SFTP server. E.g. `sftp_user: my-sftp-user` |
| `sftp_pass` | The password for the SFTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_sftp_pass}`. |
| `client_timeout` | Optional. Specifies timeout for SFTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for SFTP client. |
| `hive_partitioning_enabled` | Optional. Enables hive-style partitioning based on the folder structure. Defaults to `false`. |

## Examples

### Connecting to FTP

```yaml
- from: ftp://remote-ftp-server.com/path/to/folder/
  name: my_dataset
  params:
    file_format: csv
    ftp_user: my-ftp-user
    ftp_pass: ${secrets:my_ftp_password}
    hive_partitioning_enabled: false
```

### Connecting to SFTP

```yaml
- from: sftp://remote-sftp-server.com/path/to/folder/
  name: my_dataset
  params:
    file_format: csv
    sftp_port: 20
    sftp_user: my-sftp-user
    sftp_pass: ${secrets:my_sftp_password}
    hive_partitioning_enabled: false
```

## Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).
66 changes: 58 additions & 8 deletions spiceaidocs/docs/components/data-connectors/https.md
Expand Up @@ -5,26 +5,76 @@ description: 'HTTP(s) Data Connector Documentation'
pagination_prev: null
---

The HTTP(s) Data Connector enables federated SQL query against a variety of tabular formatted (e.g. Parquet/CSV) files stored at a HTTP endpoint.
The HTTP(s) Data Connector enables federated SQL query across [supported file formats](/components/data-connectors/index.md#object-store-file-formats) stored at an HTTP(s) endpoint.

The connector supports HTTP Basic authentication via `params` values.
```yaml
datasets:
  - from: http://static_username@localhost:3001/report.csv
    name: local_report
    params:
      http_password: ${env:MY_HTTP_PASS}
```

## Configuration

### `from`

The `from` field must contain a valid URI to the location of a [supported file](/components/data-connectors/index.md#object-store-file-formats). For example, `http://static_username@localhost:3001/report.csv`.

### `name`

The dataset name. This will be used as the table name within Spice.

Example:
```yaml
datasets:
  - from: http://static_username@localhost:3001/report.csv
    name: cool_dataset
    params:
      ...
```

### Parameters
```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
```

- `http_port`: Optional. Port to create HTTP(s) connection over. Default: 80 and 443 for HTTP and HTTPS respectively.
- `http_username`: Optional. Username to provide connection for HTTP basic authentication. Default: None.
- `http_password`: Optional. Password to provide connection for HTTP basic authentication. Default: None. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_http_pass}`.
- `client_timeout`: Optional. Specifies timeout for HTTP operations. Default value is `30s`. E.g. `client_timeout: 60s`
### `params`

### Examples
The connector supports HTTP Basic authentication via `params` values.

| Parameter Name | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `http_port` | Optional. Port to create HTTP(s) connection over. Default: 80 and 443 for HTTP and HTTPS respectively. |
| `http_username` | Optional. Username to provide connection for HTTP basic authentication. Default: None. |
| `http_password` | Optional. Password to provide connection for HTTP basic authentication. Default: None. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_http_pass}`. |
| `client_timeout` | Optional. Specifies timeout for HTTP operations. Default value is `30s`. E.g. `client_timeout: 60s` |
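
As a sketch of how the parameters in the table fit together, the fragment below supplies the username through `http_username` rather than embedding it in the URL. The host `example.com`, the port `8443`, the user name, and the dataset name are hypothetical values chosen for illustration.

```yaml
datasets:
  - from: https://example.com/reports/report.csv
    name: remote_report
    params:
      http_port: 8443 # overrides the 443 default for HTTPS
      http_username: my-http-user
      http_password: ${secrets:my_http_pass}
      client_timeout: 60s # default is 30s
```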

## Examples

### Basic example
```yaml
datasets:
  - from: https://github.com/LAION-AI/audio-dataset/raw/7fd6ae3cfd7cde619f6bed817da7aa2202a5bc28/metadata/freesound/parquet/freesound_parquet.parquet
    name: laion_freesound
```

### Using Basic Authentication
```yaml
datasets:
  - from: http://static_username@localhost:3001/report.csv
    name: local_report
    params:
      http_password: ${env:MY_HTTP_PASS}
```

## Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).