diff --git a/spiceaidocs/docs/components/data-accelerators/data-refresh.md b/spiceaidocs/docs/components/data-accelerators/data-refresh.md index c2fe9e0d3..0ff323c1a 100644 --- a/spiceaidocs/docs/components/data-accelerators/data-refresh.md +++ b/spiceaidocs/docs/components/data-accelerators/data-refresh.md @@ -30,7 +30,7 @@ datasets: ### Append -If the dataset definition includes a `time_column` and the refresh mode is `append`, data will be incrementally refreshed for data where the `time_column` value in the remote source is greater-than (gt) the `max(time_column)` value in the local acceleration. +Using `refresh_mode: append` requires the use of a [`time_column` dataset parameter](/reference/spicepod/datasets.md#time_column), specifying a column to compare the local acceleration against the remote source. Data will be incrementally refreshed where the `time_column` value in the remote source is greater-than (gt) the `max(time_column)` value in the local acceleration. E.g. @@ -38,17 +38,17 @@ E.g. datasets: - from: databricks:my_dataset name: accelerated_dataset - time_column: timestamp + time_column: created_at acceleration: - refresh_mode: append # In conjuction with time_column, only fetch data greater than the latest local timestamp + refresh_mode: append refresh_check_interval: 10m ``` -When using `mode: append`, if late arriving data or clock-skew needs to be accounted for, an optional overlap can also be specified. See [`acceleration.refresh_append_overlap`](/reference/spicepod/datasets#accelerationrefresh_append_overlap). +If late arriving data or clock-skew needs to be accounted for, an optional overlap can also be specified. See [`acceleration.refresh_append_overlap`](/reference/spicepod/datasets#accelerationrefresh_append_overlap). ### Changes (CDC) -Datasets configured with acceleration `refresh_mode: changes` require a [Change Data Capture (CDC)](/features/cdc/index.md) supported data connector. 
Initial CDC support in Spice is supported by the [Debezium data connector](/components/data-connectors/debezium.md). +Datasets configured with acceleration `refresh_mode: changes` require a [Change Data Capture (CDC)](/features/cdc/index.md) supported data connector. Initial CDC support in Spice is provided by the [Debezium data connector](/components/data-connectors/debezium.md). ## Ready State @@ -79,7 +79,13 @@ Typically only a working subset of an entire dataset is used in an application o ### Refresh SQL -Specify filters for data accelerated from the connected source using arbitrary SQL. Supported for `full` and `append` refresh modes. +:::info + +Supported for accelerators with any `refresh_mode`. + +::: + +Specify filters for data accelerated from the connected source using arbitrary SQL. Filters will be pushed down to the remote source when possible, so only the requested data will be transferred over the network. @@ -112,25 +118,41 @@ curl -i -X PATCH \ 127.0.0.1:8090/v1/datasets/accelerated_dataset/acceleration ``` +Queries that return zero results will fall back to the behavior specified by the [`on_zero_results` parameter](#behavior-on-zero-results), and will not have the `refresh_sql` applied to the results from the fallback. The `refresh_sql` only applies to acceleration refresh tasks. + For the complete reference, view the `refresh_sql` section of [datasets](/reference/spicepod/datasets.md#accelerationrefresh_sql). :::warning[Limitations] +- When `refresh_mode: changes` is specified, Refresh SQL can only modify the selected columns and cannot apply filters. +- Running queries while using refresh SQL will not fall back to the source if any query returns more than zero rows, even when querying against columns that are not explicitly filtered by the refresh SQL. This may result in queries returning partial data, depending on the filters applied in the refresh SQL. 
- Refresh SQL only supports filtering data from the current dataset - joining across other datasets is not supported. -- Queries for data that have been filtered out will not fallback to querying the federated table. - Refresh SQL modifications made via API are temporary and will revert after a runtime restart. ::: ### Refresh Data Window -Filters data from the federated source that falls outside the specified time window. The only supported window is a lookback period starting from `now() - refresh_data_window` to `now()`. This flag is supported datasets configured with the default `full` refresh mode. +:::info -This filter works with the `time_column` to identify the column containing timestamps for filtering. Optionally, the `time_format` can be specified to instruct the Spice runtime on how to interpret timestamps in the `time_column`. +Supported for accelerators with a `refresh_mode` of `full` or `append`. -It can also be used alongside `refresh_sql` to apply additional filtering based on time-related criteria. +::: -Example: +The `refresh_data_window` parameter supports refreshing data that falls within the specified time window. The `refresh_data_window` is applied cumulatively to any filters specified by the [`refresh_sql`](#refresh-sql), and applies a time filter based on `now() - refresh_data_window`. For example, the following configuration: + +```yaml +time_column: column_time +acceleration: + refresh_sql: "SELECT * FROM my_dataset WHERE column_one = 'value'" + refresh_data_window: 1d +``` + +Is converted into an effective Refresh SQL of `SELECT * FROM my_dataset WHERE column_one = 'value' AND column_time > (now() - interval '1' day)`. The `time_column` column can be specified in the `refresh_sql` in conjunction with the `refresh_data_window`, and both filters are combined with `AND`. + +This parameter relies on the `time_column` dataset parameter specifying a column that is a timestamp type. 
Optionally, the `time_format` can be specified to instruct the Spice runtime on how to interpret timestamps in the `time_column`. + +*Example with `refresh_sql`:* ```yaml datasets: @@ -146,16 +168,41 @@ datasets: refresh_data_window: 1d ``` -This configuration will only accelerate data from the federated source that matches the filter `city = 'Seattle'` and is less than 1 day old. +This example will only accelerate data from the federated source that matches the filter `city = 'Seattle'` and is less than 1 day old. + +*Example with `on_zero_results`:* + +```yaml +datasets: + - from: databricks:my_dataset + name: accelerated_dataset + time_column: created_at + acceleration: + enabled: true + refresh_mode: full + refresh_check_interval: 10m + refresh_sql: | + SELECT * FROM accelerated_dataset WHERE city = 'Seattle' + refresh_data_window: 1d + on_zero_results: use_source +``` + +This example will only accelerate data from the federated source that matches the filter `city = 'Seattle'` and is less than 1 day old. If a query against the accelerated data returns zero results, the query will fallback to the source and return the direct results without any filtering. + +If a query against the accelerated data returns some results, the query will not fall back. For example, attempting to query for the last 2 days of data would only return the last 1 day of data without falling back. ## Behavior on Zero Results +:::info + +Supported for accelerators with a `refresh_mode` of `full` or `append`. + +::: + By default, accelerated datasets only return locally materialized data. If this local data is a subset of the full dataset in the federated source—due to settings like `refresh_sql`, `refresh_data_window`, or retention policies—queries against the accelerated dataset may return zero results, even when the federated table would return results. To address this, `on_zero_results: use_source` can be configured in the acceleration configuration. 
Queries returning zero results will fall back to the federated source, returning results from querying the underlying data. -The `on_zero_results: use_source` setting applies only to `full` and `append` refresh modes (not `changes). - `on_zero_results`: - `return_empty` (Default) - Return an empty result set when no data is found in the accelerated dataset. @@ -176,12 +223,20 @@ datasets: In this example a query against `accelerated_dataset` within Spice like `SELECT * FROM accelerated_dataset WHERE city = 'Portland'` would initially query against the accelerated data, see that it returns zero results and then fallback to querying against the federated table in Databricks. :::warning -It is possible that even though the accelerated table returns some results, it may not contain all the data that would be returned by the federated table. `on_zero_results` only controls the behavior in the simple case where no data is returned by the acceleration for a given query. + +- It is possible that even though an accelerated table returns some results, it may not contain all the data that would be returned by the federated table. `on_zero_results` only controls the behavior in the simple case where no data is returned by the acceleration for a given query. + ::: ## Refresh Interval -For accelerated datasets in `full` mode, the [`refresh_check_interval`](/reference/spicepod/datasets#accelerationrefresh_check_interval) parameter controls how often the accelerated dataset is refreshed. +:::info + +Supported for accelerators with a `refresh_mode` of `full` or `append`. + +::: + +The [`refresh_check_interval`](/reference/spicepod/datasets#accelerationrefresh_check_interval) parameter controls how often the accelerated dataset is refreshed. Example: @@ -199,9 +254,13 @@ This configuration will refresh `eth.recent_blocks` data every 10 seconds. 
## Refresh On-Demand -Accelerated datasets can be refreshed on-demand via the `refresh` CLI command or `POST /v1/datasets/:name/acceleration/refresh` API endpoint. +:::info -On-demand refresh applies only to `full` and `append` refresh modes (not `changes). +Supported for accelerators with a `refresh_mode` of `full` or `append`. + +::: + +Accelerated datasets can be refreshed on-demand via the `refresh` CLI command or `POST /v1/datasets/:name/acceleration/refresh` API endpoint. CLI example: @@ -232,13 +291,17 @@ On-demand refresh always initiates a new refresh, terminating any in-progress re ## Refresh Retries -By default, data refreshes for accelerated datasets are retried on transient errors (connectivity issues, compute warehouse goes idle, etc.) using [Fibonacci](https://en.wikipedia.org/wiki/Fibonacci_sequence) backoff strategy. +:::info -Retry behavior can be configured using the [`acceleration.refresh_retry_enabled`](/reference/spicepod/datasets#accelerationrefresh_retry_enabled) and [`acceleration.refresh_retry_max_attempts`](/reference/spicepod/datasets#accelerationrefresh_retry_max_attempts) parameters. +Supported for accelerators with a `refresh_mode` of `full` or `append`. -Data refresh retry applies to `full` and `append` refresh modes not `changes` which inherently supports data integrity and consistency through the CDC mechanism. +::: -Example: Disable rertries +By default, data refreshes for accelerated datasets are retried on transient errors (connectivity issues, compute warehouse goes idle, etc.) using a [Fibonacci](https://en.wikipedia.org/wiki/Fibonacci_sequence) backoff strategy. + +Retry behavior can be configured using the [`acceleration.refresh_retry_enabled`](/reference/spicepod/datasets#accelerationrefresh_retry_enabled) and [`acceleration.refresh_retry_max_attempts`](/reference/spicepod/datasets#accelerationrefresh_retry_max_attempts) parameters. 
+ +Example: Disable retries ```yaml datasets: @@ -262,14 +325,24 @@ datasets: ## Retention Policy -Accelerated datasets can be set to automatically evict time-series data exceeding a retention period by setting a retention policy based on the configured `time_column` and `acceleration.retention_period`. +:::info + +Supported for accelerators with a `refresh_mode` of `full` or `append`. + +::: -Retention policies apply to `full` and `append` refresh modes (not `changes`). +Accelerated datasets can be set to automatically evict time-series data exceeding a retention period by setting a retention policy based on the configured `time_column` and `acceleration.retention_period`. The policy is set using the [`acceleration.retention_check_enabled`](/reference/spicepod/datasets#accelerationretention_check_enabled), [`acceleration.retention_period`](/reference/spicepod/datasets#accelerationretention_period) and [`acceleration.retention_check_interval`](/reference/spicepod/datasets#accelerationretention_check_interval) parameters, along with the [`time_column`](/reference/spicepod/datasets#time_column) and [`time_format`](/reference/spicepod/datasets#time_format) dataset parameters. ## Refresh Jitter +:::info + +Supported for accelerators with a `refresh_mode` of `full` or `append`. + +::: + Accelerated datasets can include a random jitter in their refresh interval to prevent the [Thundering herd problem](https://en.wikipedia.org/wiki/Thundering_herd_problem), where multiple datasets refresh simultaneously. The jitter is a random value between 0 and `refresh_jitter_max`, which is added to or subtracted from the base `refresh_check_interval`. If `refresh_jitter_max` is not specified, it defaults to 10% of `refresh_check_interval`. Refresh Jitter applies to the initial dataset load. If multiple similarly configured Spice instances are restarted at the same time, they will load with a jitter between 0 and `refresh_jitter_max`. 
@@ -295,3 +368,58 @@ Refresh jitter configuration: - [`refresh_jitter_enabled`](/reference/spicepod/datasets#accelerationrefresh_jitter_enabled) - [`refresh_jitter_max`](/reference/spicepod/datasets#accelerationrefresh_jitter_max) + +## Configuration Examples + +### Accelerating a subset of data that changes frequently + +In this example, Spice connects to a dataset that has frequently changing data that is not configured for CDC. For example, users' posts on a social media platform. + +```yaml +datasets: + - from: mysql:posts + name: posts + acceleration: + refresh_mode: full + refresh_check_interval: 10m + refresh_sql: "SELECT * FROM posts WHERE updated_at > now() - interval '1' day" + on_zero_results: use_source +``` + +With this configuration, Spice will refresh every 10 minutes, accelerating posts that have been updated in the last day. + +When querying for posts by direct ID, if a post is not accelerated, Spice will fall back to retrieving the post from the non-accelerated source due to the behavior of `on_zero_results: use_source`. + +However, if querying for a range of posts that includes some that have been updated in the last day, Spice will only return those results without falling back to the source. This could result in queries for a range of posts excluding posts that exist in the non-accelerated source because they have been filtered out due to their `updated_at` value. + +### Accelerating application logs + +In this example, Spice connects to a data source that is immutable, receives new rows, and is not configured for CDC. For example, a database that contains some application logs. 
+ +```yaml +datasets: + - from: duckdb:logs + name: logs + time_column: created_at + params: + duckdb_open: logs.duckdb + acceleration: + refresh_mode: append + refresh_check_interval: 10m + refresh_sql: "SELECT * FROM logs WHERE asset = 'asset_id'" + refresh_data_window: 1d + on_zero_results: use_source + retention_check_enabled: true + retention_period: 7d + retention_check_interval: 10m +``` + +This acceleration configuration applies a number of different behaviors: + +1. A `refresh_data_window` was specified. When Spice starts, it will apply this `refresh_data_window` to the `refresh_sql`, and retrieve only the last day's worth of logs where `asset = 'asset_id'`. +2. Because a `refresh_sql` is specified, every refresh (including initial load) will have the filter applied to the refresh query. +3. 10 minutes after loading, as specified by the `refresh_check_interval`, the first refresh will occur, retrieving new rows where `asset = 'asset_id'`. +4. Running a query to retrieve logs with an `asset` that is *not* `asset_id` will fall back to the source, because of the `on_zero_results: use_source` parameter. +5. Running a query to retrieve a log from more than 1 day ago will fall back to the source, because of the `on_zero_results: use_source` parameter. +6. Running a query to retrieve logs across a range from now to more than 1 day ago will only return logs from the last day. This is due to the `refresh_data_window` only accelerating the last day's worth of logs, which will return some results. Because results are returned, Spice will not fall back to the source even though `on_zero_results: use_source` is specified. +7. Spice will retain newly appended log rows for 7 days before discarding them, as specified by the `retention_*` parameters.
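
The query behaviors in items 4 to 6 above can be illustrated with a small toy model. This is only a sketch for reasoning about the fallback rule, not how the Spice runtime is implemented: the `refresh` and `query` helpers, the `SOURCE` rows, and the fixed `NOW` clock are all hypothetical names made up for this illustration.

```python
from datetime import datetime, timedelta

# Fixed clock so the example is deterministic (hypothetical value).
NOW = datetime(2024, 1, 8, 12, 0, 0)

# Hypothetical rows in the remote source.
SOURCE = [
    {"id": 1, "asset": "asset_id", "created_at": NOW - timedelta(hours=2)},
    {"id": 2, "asset": "asset_id", "created_at": NOW - timedelta(days=3)},
    {"id": 3, "asset": "other", "created_at": NOW - timedelta(hours=1)},
]


def refresh(source, window):
    """Toy refresh: the refresh_sql filter AND the refresh_data_window filter."""
    cutoff = NOW - window
    return [
        row for row in source
        if row["asset"] == "asset_id" and row["created_at"] > cutoff
    ]


def query(accelerated, source, predicate, on_zero_results="return_empty"):
    """Toy query: fall back to the source only when the local result is empty."""
    local = [row for row in accelerated if predicate(row)]
    if local or on_zero_results != "use_source":
        return local
    return [row for row in source if predicate(row)]


def ids(rows):
    return sorted(row["id"] for row in rows)


ACCELERATED = refresh(SOURCE, timedelta(days=1))

# Item 4: a different asset is not accelerated, so the query falls back.
other_asset = ids(query(ACCELERATED, SOURCE, lambda r: r["asset"] == "other", "use_source"))

# Item 5: a log older than the window is not accelerated, so the query falls back.
old_log = ids(query(ACCELERATED, SOURCE, lambda r: r["id"] == 2, "use_source"))

# Item 6: a 7-day range matches one accelerated row, so there is no fallback
# and the older matching row in the source is not returned.
wide_range = ids(query(
    ACCELERATED,
    SOURCE,
    lambda r: r["asset"] == "asset_id" and r["created_at"] > NOW - timedelta(days=7),
    "use_source",
))

print(other_asset, old_log, wide_range)
```

In this model, the wide-range query returns a partial result because the fallback decision depends only on whether the accelerated data produced any rows, which mirrors the limitation described in item 6.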