Add docs for zero results fallback, clearer filtered refresh #212

Merged 4 commits on Apr 27, 2024
Changes from 3 commits
70 changes: 67 additions & 3 deletions spiceaidocs/docs/data-accelerators/index.md
@@ -36,15 +36,25 @@ Currently supported Data Accelerators include:
## Data types
Data accelerators may not support all possible Apache Arrow data types. For complete compatibility, see [specifications](../reference/datatypes.md).

## Refresh SQL
## Filtered Refresh

For datasets configured with a `full` refresh mode, this is an optional setting that filters the locally accelerated data to a smaller working set. This can be useful if your application/dashboard only ever uses a subset of the data stored in the federated table.
Often only a subset of the data in a federated table is used in applications or dashboards. Use the following options to filter the data Spice will accelerate to a working subset and reduce the amount of data that needs to be transferred and stored locally.

- [Refresh SQL](#refresh-sql) - Specify the filter as arbitrary SQL to be pushed down to the remote source.
- [Refresh Data Period](#refresh-data-period) - Filters out data from the federated source older than the specified period.

### Refresh SQL

Specify filters for the data accelerated from the federated source via arbitrary SQL. Only supported for datasets configured with a `full` refresh mode (the default).

Filters will be pushed down to the remote source, and only the requested data will be transferred over the network.

Example:

```yaml
datasets:
  - name: accelerated_dataset
  - from: databricks:my_dataset
    name: accelerated_dataset
    acceleration:
      enabled: true
      refresh_mode: full
@@ -61,6 +71,60 @@ For the complete reference, view the `refresh_sql` section of [datasets](../refe
- Queries for data that have been filtered out will not fall back to querying against the federated table.
:::

### Refresh Data Period

Filters data from the federated source older than the specified period. Only supported for datasets configured with a `full` refresh mode (the default).

Use this together with [`time_column`](../reference/spicepod/datasets.md#time_column), which identifies the column containing the timestamps to filter on. The optional [`time_format`](../reference/spicepod/datasets.md#time_format) setting tells the Spice runtime how to interpret the timestamps in the `time_column`.

Can also be combined with `refresh_sql` to further filter the data based on the temporal dimension.

Example:

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: full
      refresh_check_interval: 10m
      refresh_sql: |
        SELECT * FROM accelerated_dataset WHERE city = 'Seattle'
      refresh_data_period: 1d
```

This configuration will only accelerate data from the federated source that matches the filter `city = 'Seattle'` and is less than 1 day old.
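
To make the combined filter concrete, here is a rough Python sketch of the selection logic described above. It is only an illustrative model, not how the Spice runtime is implemented: the `select_rows_to_accelerate` helper, the in-memory rows, the fixed `now` value, and the exact handling of rows that sit right on the cutoff are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def select_rows_to_accelerate(rows, now, data_period, city):
    """Keep rows that match the city filter and are newer than the cutoff.

    Hypothetical helper: models `refresh_sql` (the city filter) combined
    with `refresh_data_period` (an age cutoff on the `time_column`).
    """
    cutoff = now - data_period
    return [
        row
        for row in rows
        # Assumption: "less than 1 day old" is treated as strictly newer
        # than the cutoff.
        if row["city"] == city and row["created_at"] > cutoff
    ]


# Fixed reference time so the example is deterministic.
now = datetime(2024, 4, 27, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "city": "Seattle", "created_at": now - timedelta(hours=2)},
    {"id": 2, "city": "Seattle", "created_at": now - timedelta(days=3)},
    {"id": 3, "city": "Portland", "created_at": now - timedelta(hours=1)},
]

accelerated = select_rows_to_accelerate(rows, now, timedelta(days=1), "Seattle")
print([row["id"] for row in accelerated])
```

In this sketch, row 2 is dropped because it is older than the period and row 3 is dropped because it fails the `city` filter, so only row 1 would be accelerated.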

## Behavior on Zero Results

By default, accelerated datasets only return results that have been accelerated locally. If the locally accelerated data is a subset of the full dataset in the federated source (for example, because of `refresh_sql`, `refresh_data_period`, or retention policies), queries against the accelerated dataset may return zero results where the federated table would return results.

Control this behavior by setting `on_zero_results` in the acceleration configuration.

`on_zero_results`:
- `return_empty` (Default) - Return an empty result set when no data is found in the accelerated dataset.
- `use_source` - Fall back to querying the federated table when no data is found in the accelerated dataset.

Example:

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    acceleration:
      enabled: true
      refresh_sql: SELECT * FROM accelerated_dataset WHERE city = 'Seattle'
      on_zero_results: use_source
```

In this example, a query against `accelerated_dataset` within Spice such as `SELECT * FROM accelerated_dataset WHERE city = 'Portland'` would first run against the accelerated data, see that it returns zero results, and then fall back to querying the federated table in Databricks.

:::warning
It is possible that even though the accelerated table returns some results, it may not contain all the data that would be returned by the federated table. `on_zero_results` only controls the behavior in the simple case where no data is returned by the acceleration for a given query.
:::
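
The two `on_zero_results` modes, and the caveat in the warning above, can be modelled with a small Python sketch. This is a conceptual illustration under stated assumptions, not the Spice implementation: `query_accelerated`, the in-memory tables, and the predicate functions are hypothetical names invented for the example.

```python
def query_accelerated(accelerated, source, predicate, on_zero_results="return_empty"):
    """Hypothetical model of the `on_zero_results` setting (not Spice's code)."""
    if on_zero_results not in ("return_empty", "use_source"):
        raise ValueError(f"unknown on_zero_results value: {on_zero_results!r}")

    # Always try the locally accelerated data first.
    local = [row for row in accelerated if predicate(row)]

    # Only a completely empty local result triggers the fallback.
    if not local and on_zero_results == "use_source":
        return [row for row in source if predicate(row)]
    return local


# The federated source holds every city; only Seattle is accelerated.
source = [{"id": 1, "city": "Seattle"}, {"id": 2, "city": "Portland"}]
accelerated = [row for row in source if row["city"] == "Seattle"]


def in_portland(row):
    return row["city"] == "Portland"


def any_city(row):
    return True


empty = query_accelerated(accelerated, source, in_portland)
fallback = query_accelerated(accelerated, source, in_portland, "use_source")
partial = query_accelerated(accelerated, source, any_city, "use_source")
print(empty, fallback, partial)
```

Here `empty` has no rows, `fallback` contains the Portland row fetched from the source, and `partial` contains only the Seattle row: because the accelerated data returned something, no fallback happens even though the source also holds a Portland row.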

## Data Accelerator Docs

import DocCardList from '@theme/DocCardList';
2 changes: 1 addition & 1 deletion spiceaidocs/docs/reference/datatypes.md
@@ -5,7 +5,7 @@ pagination_prev: 'reference/index'
pagination_next: null
---

Spice adheres to Apache Arrow data [types](https://docs.rs/arrow/latest/arrow/datatypes/index.html). Data accelerators do no support all Arrow data types. The table below outlines the data type compatibility for each accelerator, and datatype used within the accelerator.
Spice adheres to Apache Arrow data [types](https://docs.rs/arrow/latest/arrow/datatypes/index.html). Data accelerators do not support all Arrow data types. The table below outlines the data type compatibility for each accelerator, and datatype used within the accelerator.

| Arrow Type | Description | [DuckDB](https://duckdb.org/docs/sql/data_types/overview) | [SQLite](https://sqlite.org/datatype3.html) | [Postgres](https://www.postgresql.org/docs/current/datatype.html#DATATYPE-TABLE) |
|--------------------------|------------------------------------------------------------------------------|-------------------------------|-------------------|--------------------|
2 changes: 1 addition & 1 deletion spiceaidocs/docs/reference/spicepod/datasets.md
@@ -106,7 +106,7 @@ Optional. The format of the `time_column`. The following values are supported:
- `ISO8601` - [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format.

:::warning[Current Limitations]
- any string-based column is assumed to be ISO8601 format.
- String-based columns are assumed to be ISO8601 format.
:::

## `acceleration`