Docs update on data connectors & data accelerators #693

Merged · 10 commits · Dec 24, 2024
2 changes: 1 addition & 1 deletion spiceaidocs/docs/api/tls/index.md
@@ -11,7 +11,7 @@ pagination_next: null

## Pre-requisites

A valid TLS certificate and private key in [PEM](https://en.wikipedia.org/wiki/Privacy-Enhanced_Mail) format are required. To generate certificates for testing, follow the [TLS Cookbook](https://github.com/spiceai/cookbook/tree/trunk/tls).

## Enable TLS via command line arguments

4 changes: 4 additions & 0 deletions spiceaidocs/docs/components/data-accelerators/arrow.md
@@ -47,3 +47,7 @@ When accelerating a dataset using the In-Memory Arrow Data Accelerator, some or
In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](./duckdb.md) and [`sqlite`](./sqlite.md) accelerators by specifying `mode: file`.

:::

## Cookbook

- A cookbook recipe to configure In-Memory Arrow as a data accelerator in Spice. [In-Memory Arrow Data Accelerator](https://github.com/spiceai/cookbook/tree/trunk/arrow#readme)
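
For reference, a minimal acceleration sketch using the in-memory Arrow engine is shown below. The source and dataset name are placeholders, and `engine: arrow` is spelled out even though it is the default engine:

```yaml
datasets:
  - from: postgres:my_table   # placeholder source; any supported data connector works here
    name: my_table
    acceleration:
      enabled: true
      engine: arrow           # in-memory Arrow acceleration (the default engine)
```
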
22 changes: 14 additions & 8 deletions spiceaidocs/docs/components/data-accelerators/data-refresh.md
@@ -85,11 +85,11 @@ Typically only a working subset of an entire dataset is used in an application o

### Refresh SQL

| | |
| --------------------------- | ----- |
| Supported in `refresh_mode` | Any |
| Required | No |
| Default Value | Unset |

Refresh SQL supports specifying filters for data accelerated from the connected source using arbitrary SQL.
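
As a rough sketch of where this fits in a spicepod, `refresh_sql` lives under the dataset's `acceleration` section. The source, dataset name, and filter below are placeholders for illustration, not values from this PR:

```yaml
datasets:
  - from: postgres:orders          # placeholder source
    name: orders
    acceleration:
      enabled: true
      refresh_sql: |
        SELECT * FROM orders WHERE status = 'open'   -- only accelerate rows matching this filter
```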

@@ -158,7 +158,7 @@ In this example, `refresh_data_window` is converted into an effective Refresh SQL

This parameter relies on the `time_column` dataset parameter specifying a column that is a timestamp type. Optionally, the `time_format` can be specified to instruct the Spice runtime on how to interpret timestamps in the `time_column`.

_Example with `refresh_sql`:_

```yaml
datasets:
  # ... (the rest of this example is collapsed in the diff view)
```

This example will only accelerate data from the federated source that matches the filter `city = 'Seattle'` and is less than 1 day old.

_Example with `on_zero_results`:_

```yaml
datasets:
  # ... (the rest of this example is collapsed in the diff view)
```

@@ -446,7 +446,13 @@

This acceleration configuration applies a number of different behaviors:

1. A `refresh_data_window` was specified. When Spice starts, it will apply this `refresh_data_window` to the `refresh_sql`, and retrieve only the last day's worth of logs where `asset = 'asset_id'`.
2. Because a `refresh_sql` is specified, every refresh (including initial load) will have the filter applied to the refresh query.
3. 10 minutes after loading, as specified by the `refresh_check_interval`, the first refresh will occur - retrieving new rows where `asset = 'asset_id'`.
4. Running a query to retrieve logs with an `asset` that is _not_ `asset_id` will fall back to the source, because of the `on_zero_results: use_source` parameter.
5. Running a query to retrieve a log older than 1 day will fall back to the source, because of the `on_zero_results: use_source` parameter.
6. Running a query for logs in a range spanning from now back to more than 1 day ago will only return logs from the last day. This is because the `refresh_data_window` only accelerates the last day's worth of logs, which returns some results. Because results are returned, Spice will not fall back to the source even though `on_zero_results: use_source` is specified.
7. Spice will retain newly appended log rows for 7 days before discarding them, as specified by the `retention_*` parameters. A configuration sketch matching these behaviors is shown after this list.
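
For orientation, a configuration along these lines would produce the behaviors above. It is a sketch rather than the PR's own example: the source, dataset name, and `time_column` value are placeholders, `refresh_mode: append` is inferred from the append/retention behavior described, and `retention_check_interval` is assumed to be the parameter that controls how often retention is evaluated.

```yaml
datasets:
  - from: postgres:logs            # placeholder source
    name: logs
    time_column: created_at        # placeholder; a timestamp column is needed for refresh_data_window and retention
    acceleration:
      enabled: true
      refresh_mode: append         # inferred from the append/retention behavior described above
      refresh_check_interval: 10m  # first refresh 10 minutes after load
      refresh_data_window: 1d      # only accelerate the last day's worth of logs
      refresh_sql: |
        SELECT * FROM logs WHERE asset = 'asset_id'
      on_zero_results: use_source  # fall back to the source when the accelerated data returns nothing
      retention_check_enabled: true
      retention_period: 7d         # keep accelerated rows for 7 days
      retention_check_interval: 1h # assumed name for how often retention is evaluated
```
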

## Cookbook

- Configure an accelerated dataset retention policy. [Accelerated Dataset Retention Policy](https://github.com/spiceai/cookbook/tree/trunk/retention#readme)
- Dynamically refresh specific data at runtime by programmatically updating `refresh_sql` and triggering data refreshes. [Advanced Data Refresh](https://github.com/spiceai/cookbook/tree/trunk/acceleration/data-refresh#readme)
- Configure `refresh_data_window` to filter refreshed data to recent data. [Refresh Data Window](https://github.com/spiceai/cookbook/tree/trunk/refresh-data-window#readme)
4 changes: 4 additions & 0 deletions spiceaidocs/docs/components/data-accelerators/duckdb.md
@@ -57,3 +57,7 @@ When accelerating a dataset using `mode: memory` (the default), some or all of t
In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](./duckdb.md) and [`sqlite`](./sqlite.md) accelerators by specifying `mode: file`.

:::

## Cookbook

- A cookbook recipe to configure DuckDB as a data accelerator in Spice. [DuckDB Data Accelerator](https://github.com/spiceai/cookbook/tree/trunk/duckdb/accelerator#readme)
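
A minimal sketch of a file-backed DuckDB acceleration follows; the source and path are placeholders, and `duckdb_file` is assumed to be the parameter that sets the on-disk database location:

```yaml
datasets:
  - from: postgres:my_table   # placeholder source
    name: my_table
    acceleration:
      enabled: true
      engine: duckdb
      mode: file              # persist acceleration data to disk instead of memory
      params:
        duckdb_file: /data/my_table.db   # assumed parameter name for the DuckDB file path
```
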
@@ -110,3 +110,7 @@ The table below lists the supported [Apache Arrow data types](https://arrow.apac
| `Duration` | `BigInteger` | `bigint` |
| `List` / `LargeList` / `FixedSizeList` | `Array` | `array` |
| `Struct` | `N/A` | `Composite` (Custom type) |

## Cookbook

- A cookbook recipe to configure PostgreSQL as a data accelerator in Spice. [PostgreSQL Data Accelerator](https://github.com/spiceai/cookbook/tree/trunk/postgres/accelerator#readme)
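
A hedged sketch of a PostgreSQL-backed acceleration is shown below. The `pg_*` parameter names and the `${secrets:...}` reference are assumptions about the usual PostgreSQL configuration, and the source, host, and database values are placeholders:

```yaml
datasets:
  - from: mysql:my_table      # placeholder source
    name: my_table
    acceleration:
      enabled: true
      engine: postgres
      params:                 # assumed parameter names: pg_host, pg_port, pg_db, pg_user, pg_pass
        pg_host: localhost
        pg_port: "5432"
        pg_db: acceleration
        pg_user: postgres
        pg_pass: ${secrets:PG_PASS}   # assumed secret-reference syntax
```
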
6 changes: 5 additions & 1 deletion spiceaidocs/docs/components/data-accelerators/sqlite.md
@@ -42,7 +42,7 @@ datasets:
- The SQLite accelerator only supports arrow `List` types of primitive data types; lists with structs are not supported.
- The SQLite accelerator doesn't support advanced grouping features such as `ROLLUP` and `GROUPING`.
- In SQLite, `CAST(value AS DECIMAL)` doesn't convert an integer to a floating-point value if the cast value is an integer. Operations like `CAST(1 AS DECIMAL) / CAST(2 AS DECIMAL)` are treated as integer division, resulting in 0 instead of the expected 0.5.
  Use `FLOAT` to ensure conversion to a floating-point value: `CAST(1 AS FLOAT) / CAST(2 AS FLOAT)`.
- Updating a dataset with SQLite acceleration while the Spice Runtime is running (hot-reload) will disable query federation for the SQLite accelerator until the Runtime is restarted.

:::
@@ -54,3 +54,7 @@ When accelerating a dataset using `mode: memory` (the default), some or all of t
In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](./duckdb.md) and [`sqlite`](./sqlite.md) accelerators by specifying `mode: file`.

:::

## Cookbook

- A cookbook recipe to configure SQLite as a data accelerator in Spice. [SQLite Data Accelerator](https://github.com/spiceai/cookbook/tree/trunk/sqlite/accelerator#readme)
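
A minimal sketch of a file-backed SQLite acceleration follows; the source and path are placeholders, and `sqlite_file` is assumed to be the parameter that sets the database file location:

```yaml
datasets:
  - from: postgres:my_table   # placeholder source
    name: my_table
    acceleration:
      enabled: true
      engine: sqlite
      mode: file              # persist acceleration data to disk instead of memory
      params:
        sqlite_file: /data/my_table.sqlite   # assumed parameter name for the SQLite file path
```
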
30 changes: 15 additions & 15 deletions spiceaidocs/docs/components/data-connectors/abfs.md
@@ -4,7 +4,7 @@ sidebar_label: 'Azure BlobFS Data Connector'
description: 'Azure BlobFS Data Connector Documentation'
---

The Azure BlobFS (ABFS) Data Connector enables federated SQL queries on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (`abfss://`) and Azure Data Lake (`adl://`) endpoints.

When a folder path is provided, all the contained files will be loaded.

@@ -58,20 +58,20 @@ SELECT COUNT(*) FROM cool_dataset;

#### Basic parameters

| Parameter name | Description |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_format` | Specifies the data format. Required if not inferrable from `from`. Options: `parquet`, `csv`. Refer to [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats) for details. |
| `abfs_account` | Azure storage account name |
| `abfs_sas_string` | SAS (Shared Access Signature) Token to use for authorization |
| `abfs_endpoint` | Storage endpoint, default: `https://{account}.blob.core.windows.net` |
| `abfs_use_emulator` | Use `true` or `false` to connect to a local emulator |
| `abfs_allow_http` | Allow insecure HTTP connections |
| `abfs_authority_host` | Alternative authority host, default: `https://login.microsoftonline.com` |
| `abfs_proxy_url` | Proxy URL |
| `abfs_proxy_ca_certificate` | CA certificate for the proxy |
| `abfs_proxy_exludes` | A list of hosts to exclude from proxy connections |
| `abfs_disable_tagging` | Disable tagging objects. Use this if your backing store doesn't support tags |
| `hive_partitioning_enabled` | Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false` |
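
Putting the basic parameters together, a dataset definition might look like the sketch below. The container, folder path, account name, and secret are placeholders, and the `${secrets:...}` reference is an assumption about how the SAS token would typically be supplied:

```yaml
datasets:
  - from: abfs://my-container/reports/   # placeholder container and folder path; all contained files are loaded
    name: cool_dataset
    params:
      abfs_account: my_storage_account              # placeholder storage account
      abfs_sas_string: ${secrets:ABFS_SAS_STRING}   # assumed secret-reference syntax
      file_format: parquet                          # required here because the format can't be inferred from a folder path
```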

#### Authentication parameters
