Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Clarify acceleration refresh options #682

Merged
merged 2 commits into from
Dec 17, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
203 changes: 179 additions & 24 deletions spiceaidocs/docs/components/data-accelerators/data-refresh.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,28 +30,34 @@ datasets:

### Append

If the dataset definition includes a `time_column` and the refresh mode is `append`, data will be incrementally refreshed for data where the `time_column` value in the remote source is greater-than (gt) the `max(time_column)` value in the local acceleration.
Using `refresh_mode: append` requires the use of a [`time_column` dataset parameter](/reference/spicepod/datasets.md#time_column), specifying a column to compare the local acceleration against the remote source. Data will be incrementally refreshed where the `time_column` value in the remote source is greater-than (gt) the `max(time_column)` value in the local acceleration.

E.g.

```yaml
datasets:
- from: databricks:my_dataset
name: accelerated_dataset
time_column: timestamp
time_column: created_at
acceleration:
refresh_mode: append # In conjuction with time_column, only fetch data greater than the latest local timestamp
refresh_mode: append
refresh_check_interval: 10m
```

When using `mode: append`, if late arriving data or clock-skew needs to be accounted for, an optional overlap can also be specified. See [`acceleration.refresh_append_overlap`](/reference/spicepod/datasets#accelerationrefresh_append_overlap).
If late arriving data or clock-skew needs to be accounted for, an optional overlap can also be specified. See [`acceleration.refresh_append_overlap`](/reference/spicepod/datasets#accelerationrefresh_append_overlap).

### Changes (CDC)

Datasets configured with acceleration `refresh_mode: changes` require a [Change Data Capture (CDC)](/features/cdc/index.md) supported data connector. Initial CDC support in Spice is supported by the [Debezium data connector](/components/data-connectors/debezium.md).
Datasets configured with acceleration `refresh_mode: changes` requires a [Change Data Capture (CDC)](/features/cdc/index.md) supported data connector. Initial CDC support in Spice is supported by the [Debezium data connector](/components/data-connectors/debezium.md).

## Ready State

| | |
| --------------------------- | --------- |
| Supported in `refresh_mode` | Any |
| Required | No |
| Default Value | `on_load` |

By default, Spice will return an error for queries against an accelerated dataset that is still loading its initial data. The endpoint [`/v1/ready`](/api/http/ready) is used in production deployments to control when queries are sent to the Spice runtime.

The ready state for an accelerated dataset can be configured using the [`ready_state`](/reference/spicepod/datasets#ready_state) parameter in the dataset configuration.
Expand Down Expand Up @@ -79,7 +85,13 @@ Typically only a working subset of an entire dataset is used in an application o

### Refresh SQL

Specify filters for data accelerated from the connected source using arbitrary SQL. Supported for `full` and `append` refresh modes.
| | |
| --------------------------- | --------- |
| Supported in `refresh_mode` | Any |
| Required | No |
| Default Value | Unset |

Refresh SQL supports specifying filters for data accelerated from the connected source using arbitrary SQL.

Filters will be pushed down to the remote source when possible, so only the requested data will be transferred over the network.

Expand Down Expand Up @@ -112,25 +124,59 @@ curl -i -X PATCH \
127.0.0.1:8090/v1/datasets/accelerated_dataset/acceleration
```

Queries that return zero results will fallback to the behavior specified by the [`on_zero_results` parameter](#behavior-on-zero-results), and will not have the `refresh_sql` applied to the results from the fallback. The `refresh_sql` only applies to acceleration refresh tasks.

For the complete reference, view the `refresh_sql` section of [datasets](/reference/spicepod/datasets.md#accelerationrefresh_sql).

:::warning[Limitations]

- When `refresh_mode: changes` is specified, Refresh SQL can only modify the selected columns and cannot apply filters.
- Running queries while using refresh SQL will not fallback to the source if any query returns more than zero rows, even when querying against columns that are not explicitly filtered by the refresh SQL. This may result in queries returning partial data, depending on the filters applied in the refresh SQL.
- Refresh SQL only supports filtering data from the current dataset - joining across other datasets is not supported.
- Queries for data that have been filtered out will not fallback to querying the federated table.
- Refresh SQL modifications made via API are temporary and will revert after a runtime restart.

:::

### Refresh Data Window

Filters data from the federated source that falls outside the specified time window. The only supported window is a lookback period starting from `now() - refresh_data_window` to `now()`. This flag is supported datasets configured with the default `full` refresh mode.
| | |
| --------------------------- | ---------------- |
| Supported in `refresh_mode` | `full`, `append` |
| Required | No |
| Default Value | Unset |

The `refresh_data_window` parameter supports refreshing data that falls within the specified time window. The `refresh_data_window` is applied cumulatively to any filters specified by the [`refresh_sql`](#refresh-sql), and applies a time filter based on `now() - refresh_data_window`. For example, the following configuration:

This filter works with the `time_column` to identify the column containing timestamps for filtering. Optionally, the `time_format` can be specified to instruct the Spice runtime on how to interpret timestamps in the `time_column`.
```yaml
time_column: column_time
acceleration:
refresh_sql: "SELECT * FROM my_dataset WHERE column_one = 'value'"
refresh_data_window: 1d
```

It can also be used alongside `refresh_sql` to apply additional filtering based on time-related criteria.
In this example, `refresh_data_window` is converted into an effective Refresh SQL of `SELECT * FROM my_dataset WHERE column_one = 'value' AND column_time > (now() - interval '1' day)`. The `time_column` column can be specified in the `refresh_sql` in conjunction with the `refresh_data_window`, and both filters are combined with `AND`.

Example:
This parameter relies on the `time_column` dataset parameter specifying a column that is a timestamp type. Optionally, the `time_format` can be specified to instruct the Spice runtime on how to interpret timestamps in the `time_column`.

*Example with `refresh_sql`:*

```yaml
datasets:
- from: databricks:my_dataset
name: accelerated_dataset
time_column: created_at
acceleration:
enabled: true
refresh_mode: full
refresh_check_interval: 10m
refresh_sql: |
SELECT * FROM accelerated_dataset WHERE city = 'Seattle'
refresh_data_window: 1d
```

This example will only accelerate data from the federated source that matches the filter `city = 'Seattle'` and is less than 1 day old.

*Example with `on_zero_results`:*

```yaml
datasets:
Expand All @@ -144,18 +190,25 @@ datasets:
refresh_sql: |
SELECT * FROM accelerated_dataset WHERE city = 'Seattle'
refresh_data_window: 1d
on_zero_results: use_source
```

This configuration will only accelerate data from the federated source that matches the filter `city = 'Seattle'` and is less than 1 day old.
This example will only accelerate data from the federated source that matches the filter `city = 'Seattle'` and is less than 1 day old. If a query against the accelerated data returns zero results, the query will fallback to the source and return the direct results without any filtering.

If a query against the accelerated data returns some results, the query will not fall back. For example, attempting to query for the last 2 days of data would only return the last 1 day of data without falling back.

## Behavior on Zero Results

| | |
| --------------------------- | ---------------- |
| Supported in `refresh_mode` | `full`, `append` |
| Required | No |
| Default Value | `return_empty` |

By default, accelerated datasets only return locally materialized data. If this local data is a subset of the full dataset in the federated source—due to settings like `refresh_sql`, `refresh_data_window`, or retention policies—queries against the accelerated dataset may return zero results, even when the federated table would return results.

To address this, `on_zero_results: use_source` can be configured in the acceleration configuration. Queries returning zero results will fall back to the federated source, returning results from querying the underlying data.

The `on_zero_results: use_source` setting applies only to `full` and `append` refresh modes (not `changes).

`on_zero_results`:

- `return_empty` (Default) - Return an empty result set when no data is found in the accelerated dataset.
Expand All @@ -176,12 +229,20 @@ datasets:
In this example a query against `accelerated_dataset` within Spice like `SELECT * FROM accelerated_dataset WHERE city = 'Portland'` would initially query against the accelerated data, see that it returns zero results and then fallback to querying against the federated table in Databricks.

:::warning
It is possible that even though the accelerated table returns some results, it may not contain all the data that would be returned by the federated table. `on_zero_results` only controls the behavior in the simple case where no data is returned by the acceleration for a given query.

- It is possible that even though an accelerated table returns some results, it may not contain all the data that would be returned by the federated table. `on_zero_results` only controls the behavior in the simple case where no data is returned by the acceleration for a given query.

:::

## Refresh Interval

For accelerated datasets in `full` mode, the [`refresh_check_interval`](/reference/spicepod/datasets#accelerationrefresh_check_interval) parameter controls how often the accelerated dataset is refreshed.
| | |
| --------------------------- | ---------------- |
| Supported in `refresh_mode` | `full`, `append` |
| Required | No |
| Default Value | Unset |

The [`refresh_check_interval`](/reference/spicepod/datasets#accelerationrefresh_check_interval) parameter controls how often the accelerated dataset is refreshed.

Example:

Expand All @@ -199,9 +260,13 @@ This configuration will refresh `eth.recent_blocks` data every 10 seconds.

## Refresh On-Demand

Accelerated datasets can be refreshed on-demand via the `refresh` CLI command or `POST /v1/datasets/:name/acceleration/refresh` API endpoint.
:::info

Supported for accelerators with a `refresh_mode` of `full` or `append`.

:::

On-demand refresh applies only to `full` and `append` refresh modes (not `changes).
Accelerated datasets can be refreshed on-demand via the `refresh` CLI command or `POST /v1/datasets/:name/acceleration/refresh` API endpoint.

CLI example:

Expand Down Expand Up @@ -232,13 +297,18 @@ On-demand refresh always initiates a new refresh, terminating any in-progress re

## Refresh Retries

By default, data refreshes for accelerated datasets are retried on transient errors (connectivity issues, compute warehouse goes idle, etc.) using [Fibonacci](https://en.wikipedia.org/wiki/Fibonacci_sequence) backoff strategy.
| | |
| ------------------------------------ | ---------------- |
| Supported in `refresh_mode` | `full`, `append` |
| Required | No |
| Default `refresh_retry_enabled` | `false` |
| Default `refresh_retry_max_attempts` | Unset |

Retry behavior can be configured using the [`acceleration.refresh_retry_enabled`](/reference/spicepod/datasets#accelerationrefresh_retry_enabled) and [`acceleration.refresh_retry_max_attempts`](/reference/spicepod/datasets#accelerationrefresh_retry_max_attempts) parameters.
By default, data refreshes for accelerated datasets are retried on transient errors (connectivity issues, compute warehouse goes idle, etc.) using a [Fibonacci](https://en.wikipedia.org/wiki/Fibonacci_sequence) backoff strategy.

Data refresh retry applies to `full` and `append` refresh modes not `changes` which inherently supports data integrity and consistency through the CDC mechanism.
Retry behavior can be configured using the [`acceleration.refresh_retry_enabled`](/reference/spicepod/datasets#accelerationrefresh_retry_enabled) and [`acceleration.refresh_retry_max_attempts`](/reference/spicepod/datasets#accelerationrefresh_retry_max_attempts) parameters.

Example: Disable rertries
Example: Disable retries

```yaml
datasets:
Expand All @@ -262,14 +332,29 @@ datasets:

## Retention Policy

Accelerated datasets can be set to automatically evict time-series data exceeding a retention period by setting a retention policy based on the configured `time_column` and `acceleration.retention_period`.
| | |
| ---------------------------------- | ---------------- |
| Supported in `refresh_mode` | `full`, `append` |
| Required | No |
| Default `retention_check_enabled` | `false` |
| Default `retention_period` | Unset |
| Default `retention_check_interval` | Unset |

Retention policies apply to `full` and `append` refresh modes (not `changes`).
Accelerated datasets can be set to automatically evict time-series data exceeding a retention period by setting a retention policy based on the configured `time_column` and `acceleration.retention_period`.

The policy is set using the [`acceleration.retention_check_enabled`](/reference/spicepod/datasets#accelerationretention_check_enabled), [`acceleration.retention_period`](/reference/spicepod/datasets#accelerationretention_period) and [`acceleration.retention_check_interval`](/reference/spicepod/datasets#accelerationretention_check_interval) parameters, along with the [`time_column`](/reference/spicepod/datasets#time_column) and [`time_format`](/reference/spicepod/datasets#time_format) dataset parameters.

When `retention_check_enabled` is set to `true`, `retention_check_interval` and `retention_period` are required parameters.

## Refresh Jitter

| | |
| -------------------------------- | ---------------- |
| Supported in `refresh_mode` | `full`, `append` |
| Required | No |
| Default `refresh_jitter_enabled` | `false` |
| Default `refresh_jitter_max` | Unset |

Accelerated datasets can include a random jitter in their refresh interval to prevent the [Thundering herd problem](https://en.wikipedia.org/wiki/Thundering_herd_problem), where multiple datasets refresh simultaneously. The jitter is a random value between 0 and `refresh_jitter_max`, which is added to or subtracted from the base `refresh_check_interval`. If `refresh_jitter_max` is not specified, it defaults to 10% of `refresh_check_interval`.

Refresh Jitter applies to the initial dataset load. If multiple similarly configured Spice instances are restarted at the same time, they will load with a jitter between 0 and `refresh_jitter_max`.
Expand All @@ -295,3 +380,73 @@ Refresh jitter configuration:

- [`refresh_jitter_enabled`](/reference/spicepod/datasets#accelerationrefresh_jitter_enabled)
- [`refresh_jitter_max`](/reference/spicepod/datasets#accelerationrefresh_jitter_max)

## Configuration Examples

### Accelerating a full set of data that sometimes changes

In this example, Spice connects with a dataset that changes infrequently and is not configured for CDC. For example, a list of product categories.

```yaml
datasets:
- from: mysql:product_categories
name: product_categories
acceleration:
refresh_mode: full
refresh_check_interval: 8h
```

In this scenario, Spice uses a simple acceleration configuration - full refreshes on an 8 hour schedule. No additional behaviors are enabled, so queries matching for new product codes will return no results until the next refresh cycle.

### Accelerating a subset of data that changes frequently

In this example, Spice connects with a dataset that has frequently changing data that is not configured for CDC. For example, user's posts on a social media platform.

```yaml
datasets:
- from: mysql:posts
name: posts
acceleration:
refresh_mode: full
refresh_check_interval: 10m
refresh_sql: "SELECT * FROM posts WHERE updated_at > now() - interval '1' day"
on_zero_results: use_source
```

With this configuration, Spice will refresh every 10 minutes accelerating posts that have been updated in the last day.

When querying for posts by direct ID, if a post is not accelerated Spice will fallback to retrieving the post from the non-accelerated source due to the behavior of `on_zero_results: use_source`.

However, if querying for a range of posts that includes some which have updated in the last day Spice will only return those results without falling back to the source. This could result in queries for a range of posts excluding posts that exist in the non-accelerated source because they have been filtered out due to their `updated_at` value.

### Accelerating application logs

In this example, Spice connects to a data source that is immutable, receives new rows, and is not configured for CDC. For example, a database that contains some application logs.

```yaml
datasets:
- from: duckdb:logs
name: logs
time_column: created_at
params:
duckdb_open: logs.duckdb
acceleration:
refresh_mode: append
refresh_check_interval: 10m
refresh_sql: "SELECT * FROM logs WHERE asset = 'asset_id'"
refresh_data_window: 1d
on_zero_results: use_source
retention_check_enabled: true
retention_period: 7d
retention_check_interval: 10m
```

This acceleration configuration applies a number of different behaviors:

1. A `refresh_data_window` was specified. When Spice starts, it will apply this `refresh_data_window` to the `refresh_sql`, and retrieve only the last day's worth of logs with an `asset = 'asset_id'`.
2. Because a `refresh_sql` is specified, every refresh (including initial load) will have the filter applied to the refresh query.
3. 10 minutes after loading, as specified by the `refresh_check_interval`, the first refresh will occur - retrieving new rows where `asset = 'asset_id'`.
4. Running a query to retrieve logs with an `asset` that is *not* `asset_id` will fall back to the source, because of the `on_zero_results: use_source` parameter.
5. Running a query to retrieve a log longer than 1 day ago will fall back to the source, because of the `on_zero_results: use_source` parameter.
6. Running a query to retrieve logs within a range of now to longer than 1 day ago will only return logs from the last day. This is due to the `refresh_data_window` only accelerating the last day's worth of logs, which will return some results. Because results are returned, Spice will not fall back to the source even though `on_zero_results: use_source` is specified.
7. Spice will retain newly appended log rows for 7 days before discarding them, as specified by the `retention_*` parameters.
Loading