Add primary key/indexes documentation #283

Merged 3 commits on Jun 17, 2024
1 change: 0 additions & 1 deletion spiceaidocs/docs/features/federated-queries/index.md
@@ -181,7 +181,6 @@ While the query in step 8 successfully returned results from federated remote da
To improve query performance, step 9 demonstrates the same query executed against locally materialized and accelerated datasets using [Data Accelerators](/data-accelerators/index.md), resulting in significant performance gains.

:::warning[Limitations]
- **Query Optimization:** Filter/Join/Aggregation pushdown is not supported, potentially leading to suboptimal query plan.
- **Query Performance:** Without acceleration, federated queries will be slower than local queries due to network latency and data transfer.
- **Query Capabilities:** Not all SQL features and data types are supported across all data sources. More complex data type queries may not work as expected.
:::
144 changes: 144 additions & 0 deletions spiceaidocs/docs/features/local-acceleration/constraints.md
@@ -0,0 +1,144 @@
---
title: 'Constraints'
sidebar_label: 'Constraints'
sidebar_position: 2
description: 'Learn how to add/configure constraints on local acceleration tables in Spice.'
---

Constraints are rules that enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality, and lets you configure how incoming data updates that violate a constraint are handled.

Constraints are specified in the Spicepod via the `primary_key` field in the acceleration configuration. Additional unique constraints are specified via the [`indexes`](./indexes.md) field with the value `unique`.

The `on_conflict` field configures what happens when incoming data violates a constraint: `drop` discards the violating data, while `upsert` writes it into the accelerated table, updating every column that is not part of the constraint to match the incoming data.

If there are multiple rows in the incoming data that violate any constraint, the entire incoming batch of data will be dropped.

Example Spicepod:

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      primary_key: hash # Define a primary key on the `hash` column
      indexes:
        "(number, timestamp)": unique # Add a unique index with a multicolumn key comprised of the `number` and `timestamp` columns
      on_conflict:
        # Upsert the incoming data when the primary key constraint on "hash" is violated,
        # alternatively "drop" can be used instead of "upsert" to drop the data update.
        hash: upsert
```
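
To make the two `on_conflict` behaviors concrete, here is a minimal sketch using Python's standard `sqlite3` module. It assumes that `drop` behaves like SQLite's `ON CONFLICT ... DO NOTHING` and `upsert` like `ON CONFLICT ... DO UPDATE`. That mapping, the `blocks` table, and the `apply_update` and `demo` helpers are illustrative assumptions, not the SQL that Spice actually generates. The sketch works row by row and does not model how Spice treats a whole incoming batch.

```python
import sqlite3


def apply_update(conn, rows, on_conflict):
    """Insert rows into `blocks`, resolving primary-key conflicts on `hash`.

    Hypothetical helper: `on_conflict` mirrors the Spicepod values "upsert" and "drop".
    """
    if on_conflict == "upsert":
        # Overwrite the non-key columns of the existing row with the incoming values.
        action = "DO UPDATE SET number = excluded.number, timestamp = excluded.timestamp"
    elif on_conflict == "drop":
        # Keep the existing row and discard the incoming one.
        action = "DO NOTHING"
    else:
        raise ValueError(f"unknown on_conflict behavior: {on_conflict}")
    conn.executemany(
        "INSERT INTO blocks (hash, number, timestamp) VALUES (?, ?, ?) "
        f"ON CONFLICT (hash) {action}",
        rows,
    )


def demo(on_conflict):
    """Apply two updates that share the same `hash` and return the final table contents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blocks (hash TEXT PRIMARY KEY, number INTEGER, timestamp INTEGER)"
    )
    conn.execute("CREATE UNIQUE INDEX blocks_number_timestamp ON blocks (number, timestamp)")
    apply_update(conn, [("0xabc", 1, 100)], on_conflict)
    apply_update(conn, [("0xabc", 2, 200)], on_conflict)  # same hash: conflict
    return conn.execute("SELECT hash, number, timestamp FROM blocks").fetchall()


print(demo("upsert"))  # [('0xabc', 2, 200)]
print(demo("drop"))    # [('0xabc', 1, 100)]
```

With `upsert` the second update replaces the non-key columns of the existing row; with `drop` the original row is left untouched. The `ON CONFLICT` clause requires SQLite 3.24 or newer.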

## Column References

Column references can be used to specify which columns are part of the constraint. The column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.

Examples

- `number`: Reference a constraint on the `number` column
- `(hash, timestamp)`: Reference a constraint on the `hash` and `timestamp` columns
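
The column reference syntax above can be sketched as a small parser. `parse_column_reference` is a hypothetical helper written for illustration; it is not part of Spice and only assumes the two forms described here (a bare column name, or a parenthesized, comma-separated list).

```python
def parse_column_reference(ref: str) -> list[str]:
    """Split a column reference into its column names (hypothetical helper)."""
    ref = ref.strip()
    if ref.startswith("(") and ref.endswith(")"):
        # Multicolumn key: strip the parentheses and split on commas.
        columns = [column.strip() for column in ref[1:-1].split(",")]
    elif "," in ref or "(" in ref or ")" in ref:
        # A multicolumn key must be fully enclosed in parentheses.
        raise ValueError(f"multicolumn reference must be enclosed in parentheses: {ref!r}")
    else:
        columns = [ref]
    if not all(columns):
        raise ValueError(f"empty column name in reference: {ref!r}")
    return columns


print(parse_column_reference("number"))             # ['number']
print(parse_column_reference("(hash, timestamp)"))  # ['hash', 'timestamp']
```

In a Spicepod, a multicolumn reference used as a map key is quoted, as in the example above, because it contains a comma and parentheses.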

## Limitations

- **Not supported for in-memory Arrow:** The default in-memory Arrow acceleration engine does not support constraints. Use [DuckDB](../../data-accelerators/duckdb.md), [SQLite](../../data-accelerators/sqlite.md), or [PostgreSQL](../../data-accelerators/postgres/index.md) as the acceleration engine to enable constraint checking.
- **Single `on_conflict` target supported:** Only a single `on_conflict` target can be specified, unless all `on_conflict` targets are specified with `drop`.

- <details>
<summary>Examples for valid/invalid `on_conflict` targets</summary>
<div>
The following Spicepod is invalid because it specifies multiple `on_conflict` targets with `upsert`:

:::danger[Invalid]
```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      primary_key: hash
      indexes:
        "(number, timestamp)": unique
      on_conflict:
        hash: upsert
        "(number, timestamp)": upsert
```
:::

The following Spicepod is valid because it specifies multiple `on_conflict` targets with `drop`, which is allowed:

:::tip[Valid]
```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      primary_key: hash
      indexes:
        "(number, timestamp)": unique
      on_conflict:
        hash: drop
        "(number, timestamp)": drop
```
:::


The following Spicepod is invalid because it specifies multiple `on_conflict` targets with `upsert` and `drop`:

:::danger[Invalid]
```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      primary_key: hash
      indexes:
        "(number, timestamp)": unique
      on_conflict:
        hash: upsert
        "(number, timestamp)": drop
```
:::

</div>
</details>

- **DuckDB Limitations:**
  - DuckDB does not support `upsert` for datasets with List or Map types.
  - Standard indexes unexpectedly act like unique indexes and block updates when `upsert` is configured.
- <details>
<summary>Standard indexes blocking updates</summary>
<div>
The following Spicepod specifies a standard index on the `number` column, which blocks updates when `upsert` is configured for the `hash` column:

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: duckdb
      primary_key: hash
      indexes:
        number: enabled
      on_conflict:
        hash: upsert
```

The following error is returned when attempting to upsert data into the `eth.recent_blocks` table:

```bash
ERROR runtime::accelerated_table::refresh: Error adding data for eth.recent_blocks: External error:
Unable to insert into duckdb table: Binder Error: Can not assign to column 'number' because
it has a UNIQUE/PRIMARY KEY constraint
```

This is a limitation in DuckDB.
</div>
</details>
5 changes: 3 additions & 2 deletions spiceaidocs/docs/features/local-acceleration/index.md
@@ -4,14 +4,15 @@ sidebar_label: 'Local Acceleration'
description: 'Learn how to use local acceleration in Spice.'
sidebar_position: 3
pagination_prev: null
pagination_next: null
---

Datasets can be locally accelerated by the Spice runtime, pulling data from any [Data Connector](/data-connectors) and storing it locally in a [Data Accelerator](/data-accelerators) for faster access. Additionally, the data is kept up to date in realtime, so you always have the latest data locally for querying.

## Benefits

When a dataset is locally accelerated by the Spice runtime, the data is stored alongside your application, providing much faster query times by cutting out network latency to make the request. This benefit is accentuated when the result of a query is large because the data does not need to be transferred over the network. Depending on the [Acceleration Engine](/data-accelerators) chosen, the locally accelerated data can also be stored in-memory, further reducing query times. [Indexes](./indexes.md) can also be applied, further speeding up certain types of queries.

Locally accelerated datasets can also have [primary key constraints](./constraints.md) applied. When a constraint is violated, the configured behavior either drops the row that violates the constraint or upserts that row into the accelerated table.

## Example Use Case

44 changes: 44 additions & 0 deletions spiceaidocs/docs/features/local-acceleration/indexes.md
@@ -0,0 +1,44 @@
---
title: 'Indexes'
sidebar_label: 'Indexes'
sidebar_position: 1
description: 'Learn how to add indexes to local acceleration tables in Spice.'
---

Database indexes are an essential tool for optimizing the performance of queries. Learn how to add indexes to the tables that Spice creates to accelerate data locally.

Example Spicepod:

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      indexes:
        number: enabled # Index the `number` column
        "(hash, timestamp)": unique # Add a unique index with a multicolumn key comprised of the `hash` and `timestamp` columns
```

## Column References

Column references can be used to specify which columns to index. The column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.

Examples

- `number`: Index the `number` column
- `(hash, timestamp)`: Index the `hash` and `timestamp` columns

## Index Types

There are two types of indexes that can be specified in a Spicepod:

- `enabled`: Creates a standard index on the specified column(s).
  - Similar to specifying `CREATE INDEX my_index ON my_table (my_column)`.
- `unique`: Creates a unique index on the specified column(s). See [Constraints](./constraints.md) for more information on working with unique constraints on locally accelerated tables.
  - Similar to specifying `CREATE UNIQUE INDEX my_index ON my_table (my_column)`.
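
The difference between the two index types can be seen with plain SQLite. The sketch below uses Python's standard `sqlite3` module and the `CREATE INDEX` / `CREATE UNIQUE INDEX` statements quoted above. The `blocks` table and the index names are made up for illustration, and the exact statements Spice issues are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blocks (hash TEXT, number INTEGER, timestamp INTEGER)")

# Analogous to `number: enabled`: a standard index, duplicates remain allowed.
conn.execute("CREATE INDEX blocks_number ON blocks (number)")
# Analogous to `"(hash, timestamp)": unique`: a unique index over both columns.
conn.execute("CREATE UNIQUE INDEX blocks_hash_timestamp ON blocks (hash, timestamp)")

conn.execute("INSERT INTO blocks VALUES ('0xabc', 1, 100)")
conn.execute("INSERT INTO blocks VALUES ('0xdef', 1, 200)")  # duplicate `number` is accepted

try:
    # Same (hash, timestamp) pair as the first row: the unique index rejects it.
    conn.execute("INSERT INTO blocks VALUES ('0xabc', 2, 100)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM blocks").fetchone()[0]
print(rejected, row_count)  # True 2
```

A standard index only speeds up lookups, while a unique index also acts as a constraint on the indexed columns.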

:::warning[Limitations]
- **Not supported for in-memory Arrow:** The default in-memory Arrow acceleration engine does not support indexes. Use [DuckDB](../../data-accelerators/duckdb.md), [SQLite](../../data-accelerators/sqlite.md), or [PostgreSQL](../../data-accelerators/postgres/index.md) as the acceleration engine to enable indexing.
:::
72 changes: 72 additions & 0 deletions spiceaidocs/docs/reference/spicepod/datasets.md
@@ -200,3 +200,75 @@ Optional. How often the retention policy should be checked.
Required when `acceleration.retention_check_enabled` is `true`.

See [Duration](../duration/index.md)

## `acceleration.indexes`

Optional. Specify which indexes should be applied to the locally accelerated table. Not supported by the in-memory Arrow acceleration engine.

The `indexes` field is a map where the key is the column reference and the value is the index type.

A column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.

See [Indexes](../../features/local-acceleration/indexes.md)

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      indexes:
        number: enabled # Index the `number` column
        "(hash, timestamp)": unique # Add a unique index with a multicolumn key comprised of the `hash` and `timestamp` columns
```

## `acceleration.primary_key`

Optional. Specify the primary key constraint on the locally accelerated table. Not supported by the in-memory Arrow acceleration engine.

The `primary_key` field is a string that represents the column reference that should be used as the primary key. The column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.

See [Constraints](../../features/local-acceleration/constraints.md)

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      primary_key: hash # Define a primary key on the `hash` column
```

## `acceleration.on_conflict`

Optional. Specify what should happen when a constraint is violated. Not supported by the in-memory Arrow acceleration engine.

The `on_conflict` field is a map where the key is the column reference and the value is the conflict resolution strategy.

A column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.

Only a single `on_conflict` target can be specified, unless all `on_conflict` targets are specified with `drop`.

The possible conflict resolution strategies are:
- `upsert` - Upsert the incoming data when the primary key constraint is violated.
- `drop` - Drop the data when the primary key constraint is violated.
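
The single-target rule can be expressed as a short check. `is_valid_on_conflict` below is a hypothetical helper that only encodes the rule as stated in this section; it is not Spice's own validation code, and the runtime may apply further checks.

```python
def is_valid_on_conflict(on_conflict: dict[str, str]) -> bool:
    """Return True if an `on_conflict` map follows the single-target rule.

    Hypothetical check: keys are column references, values are "upsert" or "drop".
    """
    behaviors = list(on_conflict.values())
    if any(behavior not in ("upsert", "drop") for behavior in behaviors):
        return False
    # A single target may use either behavior.
    if len(behaviors) <= 1:
        return True
    # Multiple targets are allowed only when every one of them is `drop`.
    return all(behavior == "drop" for behavior in behaviors)


print(is_valid_on_conflict({"hash": "upsert"}))                               # True
print(is_valid_on_conflict({"hash": "drop", "(number, timestamp)": "drop"}))  # True
print(is_valid_on_conflict({"hash": "upsert", "(number, timestamp)": "drop"}))  # False
```

Running a check like this before starting the runtime would catch a mixed `upsert`/`drop` configuration early.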

See [Constraints](../../features/local-acceleration/constraints.md)

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      primary_key: hash
      indexes:
        "(number, timestamp)": unique
      on_conflict:
        # Upsert the incoming data when the primary key constraint on "hash" is violated,
        # alternatively "drop" can be used instead of "upsert" to drop the data update.
        hash: upsert
```
14 changes: 7 additions & 7 deletions spiceaidocs/package-lock.json
