Add chunking documentation (#424)
* add chunking documentation

* Update datasets.md

* search feature docs

* linking

* Apply suggestions from code review

* Improvements

---------

Co-authored-by: Luke Kim <[email protected]>
Jeadie and lukekim authored Oct 1, 2024
1 parent cbd9e60 commit 6184418
Showing 3 changed files with 179 additions and 3 deletions.
8 changes: 5 additions & 3 deletions spiceaidocs/docs/api/http/search.md
@@ -7,15 +7,15 @@ pagination_prev: null
pagination_next: null
---

Performs a basic vector similarity search from one or more dataset(s).
Performs a basic vector similarity search across one or more datasets.

Request Body
- `datasets` (array of strings): Dataset component names to perform similarity search against. Each dataset is expected to have one and only one column augmented with an embedding.
- `datasets` (array of strings): Names of the dataset components to perform the similarity search on. Each dataset must have exactly one column augmented with an embedding.
- `text` (string): Query plaintext used to retrieve similar rows from the underlying datasets listed in the `datasets` request key.
- `limit` (integer): The number of rows to return per dataset in `datasets`. Default: 3.
- `where` (string): An SQL filter predicate to apply within the search.
- `additional_columns` (array of strings): Additional columns, from the datasets, to return in the response (under `.matches[*].metadata`).
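
A request body combining these fields might look like the following (the filter and column values are illustrative):

```json
{
  "datasets": ["spiceai.issues"],
  "text": "cutting edge AI",
  "where": "state='Open'",
  "limit": 3,
  "additional_columns": ["title"]
}
```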

#### Example

Spicepod
@@ -67,3 +67,5 @@ Response
"duration_ms": 42,
}
```

The `/v1/search` endpoint supports [chunked](/features/search/index.md#chunking-support) embedding columns.
138 changes: 138 additions & 0 deletions spiceaidocs/docs/features/search/index.md
@@ -0,0 +1,138 @@
---
title: 'Search Functionality'
sidebar_label: 'Search'
description: 'Learn how Spice can search across datasets using database-native and vector-search methods.'
sidebar_position: 8
pagination_prev: null
pagination_next: null
---

Spice provides advanced search capabilities that go beyond standard SQL queries, offering both traditional SQL search patterns and vector-based search functionality.

## SQL-Based Search

Spice supports basic search patterns directly through SQL. For example, you can perform a text search within a table using the `LIKE` clause:

```sql
SELECT id, text_column
FROM my_table
WHERE
  LOWER(text_column) LIKE '%search_term%'
  AND
  date_published > '2021-01-01'
```

## Vector Search

In addition to SQL, Spice provides advanced vector-based search capabilities, enabling more nuanced and intelligent searches. The runtime supports both:

1. Local embedding models, e.g. [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
2. Remote embedding providers, e.g. [OpenAI](https://platform.openai.com/docs/api-reference/embeddings/create).

Embedding models are defined in the `spicepod.yaml` file as top-level components.

```yaml
embeddings:
  - from: openai
    name: remote_service
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }

  - name: local_embedding_model
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
```
Datasets can be augmented with embeddings targeting specific columns to enable similarity search.
```yaml
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    acceleration:
      enabled: true
    embeddings:
      - column: body # The text column in the `spiceai.issues` dataset
        use: local_embedding_model # Embedding model used for this column
```
By defining embeddings on the `body` column, Spice is now configured to execute similarity searches on the dataset.

```shell
curl -XPOST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["spiceai.issues"],
    "text": "cutting edge AI",
    "where": "author=\"jeadie\"",
    "additional_columns": ["title", "state"],
    "limit": 2
  }'
```
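
An illustrative response for this request (placeholder values, not real output) contains one entry per match, with the requested `additional_columns` returned under `metadata`:

```json
{
  "matches": [
    {
      "value": "<the most similar text from the body column>",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "<title of the matching issue>",
        "state": "<state of the matching issue>"
      }
    }
  ],
  "duration_ms": 42
}
```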

For more details, see the [API reference for /v1/search](/api/http/search).

### Chunking Support

Spice also supports chunking of content before embedding, which is useful for large text columns such as those found in [Document Tables](/components/data-connectors/index.md#document-support). Chunking ensures that only the most relevant portions of text are returned during search queries. Chunking is configured as part of the embedding configuration.

```yaml
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    acceleration:
      enabled: true
    embeddings:
      - column: body
        use: local_embedding_model
        chunking:
          enabled: true
          target_chunk_size: 512
```

The `body` column will be divided into chunks of approximately 512 tokens, while maintaining structural and semantic integrity (e.g. not splitting sentences).
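
Conceptually, the chunking step resembles the following illustrative Python sketch (not Spice's actual implementation; tokens are approximated by whitespace-separated words, and `target_chunk_size` matches the value configured above):

```python
def chunk_text(text: str, target_chunk_size: int = 512) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly `target_chunk_size` tokens."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        sentence_len = len(sentence.split())  # crude stand-in for a real tokenizer
        # Close the current chunk if adding this sentence would exceed the target size.
        if current and current_len + sentence_len > target_chunk_size:
            chunks.append(". ".join(current) + ".")
            current, current_len = [], 0
        current.append(sentence)
        current_len += sentence_len
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks
```

Each chunk is embedded separately, which is why a search can return the most relevant portion of a column rather than its entire value.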

### Document Retrieval

When performing searches on datasets with chunking enabled, Spice returns the most relevant chunk for each match. To retrieve the full content of a column, include the embedding column in the `additional_columns` list.

For example:

```shell
curl -XPOST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["spiceai.issues"],
    "text": "cutting edge AI",
    "where": "array_has(assignees, \"jeadie\")",
    "additional_columns": ["title", "state", "body"],
    "limit": 2
  }'
```

Response:

```json
{
  "matches": [
    {
      "value": "implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "Improve scalar UDF array_distance",
        "state": "Closed",
        "body": "## Overview\n- Previous PR https://github.com/spiceai/spiceai/pull/1601 implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])\narray_distance(FixedSizeList[Float32], List[Float64])\n```\n\n### Changes\n - Improve using Native arrow function, e.g. `arrow_cast`, [`sub_checked`](https://arrow.apache.org/rust/arrow/array/trait.ArrowNativeTypeOp.html#tymethod.sub_checked)\n - Support a greater range of array types and numeric types\n - Possibly create a sub operator and UDF, e.g.\n\t- `FixedSizeList[Float32] - FixedSizeList[Float32]`\n\t- `Norm(FixedSizeList[Float32])`"
      }
    },
    {
      "value": "est external tools being returned for toolusing models",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "Automatic NSQL retries in /v1/nsql ",
        "state": "Open",
        "body": "To mimic our ability for LLMs to repeatedly retry tools based on errors, the `/v1/nsql`, which does not use this same paradigm, should retry internally.\n\nIf possible, improve the structured output to increase the likelihood of valid SQL in the response. Currently we just inforce JSON like this\n```json\n{\n \"sql\": \"SELECT ...\"\n}\n```"
      }
    }
  ],
  "duration_ms": 45
}
```
36 changes: 36 additions & 0 deletions spiceaidocs/docs/reference/spicepod/datasets.md
@@ -358,3 +358,39 @@ The embedding model to use, specified by the component name `embeddings[*].name`.
## `embeddings[*].column_pk`

Optional. For datasets without a primary key, explicitly specify column(s) that uniquely identify a row.

## `embeddings[*].chunking`

Optional. The configuration to enable and define the chunking strategy for the embedding column.

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    embeddings:
      - column: extra_data
        use: hf_minilm
        chunking:
          enabled: true
          target_chunk_size: 512
          overlap_size: 128
          trim_whitespace: false
```

## `embeddings[*].chunking.enabled`

Optional. Enable or disable chunking for the embedding column. Defaults to `false`.

## `embeddings[*].chunking.target_chunk_size`

The desired size of each chunk, in tokens.

If the desired chunk size is larger than the maximum size of the embedding model, the maximum size will be used.

## `embeddings[*].chunking.overlap_size`

Optional. The number of tokens to overlap between chunks. Defaults to `0`.

## `embeddings[*].chunking.trim_whitespace`

Optional. If enabled, the content of each chunk will be trimmed to remove leading and trailing whitespace. Defaults to `true`.
