Documentation improvements for RC.1
lukekim committed Nov 28, 2024
1 parent 6f46ca7 commit a9451e6
Showing 32 changed files with 218 additions and 161 deletions.
5 changes: 5 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -6,13 +6,18 @@ Remember to be concise, but do not omit useful information. Pay attention to det

Use plain, clear, simple, easy-to-understand language. Do not use hyperbole or hype.

Avoid "allows" to describe functionality.

Always provide references and citations with links.

Adhere to the instructions in CONTRIBUTING.md.

Never use the words:

- delve
- seamlessly
- empower / empowering
- supercharge
- countless
- enhance / enhancing
- allow / allowing
10 changes: 7 additions & 3 deletions spiceaidocs/docs/components/embeddings/huggingface.md
@@ -4,19 +4,23 @@ sidebar_label: 'HuggingFace'
sidebar_position: 2
---

To run an embedding model from HuggingFace, specify the `huggingface` path in `from`. This will handle downloading and running the embedding model locally.
To use an embedding model from HuggingFace with Spice, specify the `huggingface` path in the `from` field of your configuration. The model and its related files will be automatically downloaded, loaded, and served locally by Spice.

Here is an example configuration in `spicepod.yaml`:

```yaml
embeddings:
- from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
name: all_minilm_l6_v2
```
Supported models include:
- All models tagged as [text-embeddings-inference](https://huggingface.co/models?other=text-embeddings-inference) on Huggingface
- Any Huggingface repository with the correct files to be loaded as a [local embedding model](/components/embeddings/local.md).
- All models tagged as [text-embeddings-inference](https://huggingface.co/models?other=text-embeddings-inference) on HuggingFace
- Any HuggingFace repository with the correct files to be loaded as a [local embedding model](/components/embeddings/local.md).
With the same semantics as [language models](/components/models/huggingface#access-tokens), `spice` can run private HuggingFace embedding models:

```yaml
embeddings:
- from: huggingface:huggingface.co/secret-company/awesome-embedding-model
11 changes: 6 additions & 5 deletions spiceaidocs/docs/components/embeddings/index.md
@@ -7,12 +7,12 @@ pagination_prev: null
pagination_next: null
---

Embedding models are used to convert raw text into a numerical representation that can be used by machine learning models.

Spice supports running embedding models locally, or use remote services such as OpenAI, or [la Plateforme](https://console.mistral.ai/).
Embedding models convert raw text into numerical representations that can be used by machine learning models. Spice supports running embedding models locally or using remote services such as OpenAI or [la Plateforme](https://console.mistral.ai/).

Embedding models are defined in the `spicepod.yaml` file as top-level components.

Example configuration in `spicepod.yaml`:

```yaml
embeddings:
- from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
@@ -31,5 +31,6 @@ embeddings:
```
Embedding models can be used either by:
- An OpenAI-compatible [endpoint](/api/http/embeddings.md)
- By augmenting a dataset with column-level [embeddings](/reference/spicepod/datasets.md#embeddings), to provide vector-based [search functionality](/features/search/index.md#vector-search).
- Querying an OpenAI-compatible [endpoint](/api/http/embeddings.md)
- Augmenting a dataset with column-level [embeddings](/reference/spicepod/datasets.md#embeddings), to provide vector-based [search functionality](/features/search/index.md#vector-search).
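Column-level embeddings are configured on the dataset itself. A minimal sketch, assuming an embedding component named `all_minilm_l6_v2` as defined earlier (the dataset and column names are illustrative):

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks  # illustrative dataset
    name: eth.recent_blocks
    embeddings:
      - column: hash                  # column to embed (illustrative)
        use: all_minilm_l6_v2         # references the embedding component by name
```

Once a column is embedded this way, it can back vector-based search over the dataset.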
13 changes: 9 additions & 4 deletions spiceaidocs/docs/components/embeddings/local.md
@@ -4,7 +4,11 @@ sidebar_label: 'Local'
sidebar_position: 3
---

Embedding models can be run with files stored locally.
Embedding models can be run with files stored locally. This method is useful for using models that are not hosted on remote services.

### Example Configuration

To configure an embedding model using local files, you can specify the details in the `spicepod.yaml` file as shown below:

```yaml
embeddings:
@@ -16,6 +20,7 @@ embeddings:
```
## Required Files
- Model file, one of: `model.safetensors`, `pytorch_model.bin`.
- A tokenizer file with the filename `tokenizer.json`.
- A config file with the filename `config.json`.
- Model file, one of: `model.safetensors`, `pytorch_model.bin`.
- A tokenizer file with the filename `tokenizer.json`.
- A config file with the filename `config.json`.
22 changes: 10 additions & 12 deletions spiceaidocs/docs/components/embeddings/openai.md
@@ -4,20 +4,18 @@ sidebar_label: 'OpenAI'
sidebar_position: 1
---

To use a hosted OpenAI (or compatible) embedding model, specify the `openai` path in `from`.
To use a hosted OpenAI (or compatible) embedding model, specify the `openai` path in the `from` field of your configuration. If you want to use a specific model, include its model ID in the `from` field. If no model ID is specified, it defaults to `"text-embedding-3-small"`.

For a specific model, include it as the model ID in `from` (see example below). Defaults to `"text-embedding-3-small"`.
These parameters are specific to OpenAI models:
The following parameters are specific to OpenAI models:

| Parameter | Description | Default |
| ----- | ----------- | ------- |
| `openai_api_key` | The OpenAI API key. | - |
| `openai_org_id` | The OpenAI organization id. | - |
| `openai_project_id` | The OpenAI project id. | - |
| `endpoint` | The OpenAI API base endpoint. | `https://api.openai.com/v1` |
| Parameter | Description | Default |
| ------------------- | ------------------------------------- | --------------------------- |
| `openai_api_key` | The API key for accessing OpenAI. | - |
| `openai_org_id` | The organization ID for OpenAI. | - |
| `openai_project_id` | The project ID for OpenAI. | - |
| `endpoint` | The base endpoint for the OpenAI API. | `https://api.openai.com/v1` |


Example:
Below is an example configuration in `spicepod.yaml`:

```yaml
models:
@@ -31,4 +29,4 @@ models:
params:
endpoint: https://api.mistral.ai/v1
api_key: ${ secrets:SPICE_MISTRAL_API_KEY }
```
```
13 changes: 8 additions & 5 deletions spiceaidocs/docs/components/models/filesystem.md
@@ -5,7 +5,7 @@ sidebar_label: 'Filesystem'
sidebar_position: 3
---

To use a model hosted on a filesystem, specify the path to the model file in `from`.
To use a model hosted on a filesystem, specify the path to the model file in the `from` field.

Supported formats include ONNX for traditional machine learning models and GGUF, GGML, and SafeTensor for large language models (LLMs).

@@ -50,15 +50,17 @@ models:
```
### Example: Loading from a directory
```yaml
models:
- name: hello
from: file:models/llms/llama3.2-1b-instruct/
```
Note: The folder provided should contain all the expected files (see examples above) to load a model in the base level.
Note: The provided folder should contain all the expected files (see examples above) at its base level to load a model.
### Example: Overriding the Chat Template
Chat templates convert the OpenAI compatible chat messages (see [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages)) and other components of a request
into a stream of characters for the language model. It follows Jinja3 templating [syntax](https://jinja.palletsprojects.com/en/3.1.x/templates/).
@@ -81,6 +83,7 @@ models:
```
#### Templating Variables
- `messages`: List of chat messages, in the OpenAI [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages).
- `add_generation_prompt`: Boolean flag whether to add a [generation prompt](https://huggingface.co/docs/transformers/main/chat_templating#what-are-generation-prompts).
- `tools`: List of callable tools, in the OpenAI [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools).
- `messages`: List of chat messages, in the OpenAI [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages).
- `add_generation_prompt`: Boolean flag whether to add a [generation prompt](https://huggingface.co/docs/transformers/main/chat_templating#what-are-generation-prompts).
- `tools`: List of callable tools, in the OpenAI [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools).
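Putting the variables together, a minimal template sketch (assuming the override parameter is named `chat_template`; the `<|...|>` control tokens are illustrative and must match the model's expected format):

```yaml
models:
  - name: local_llm                        # illustrative name
    from: file:models/llms/llama3.2-1b-instruct/
    params:
      chat_template: |
        {% for message in messages %}
        <|{{ message.role }}|>{{ message.content }}<|end|>
        {% endfor %}
        {% if add_generation_prompt %}<|assistant|>{% endif %}
```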
9 changes: 5 additions & 4 deletions spiceaidocs/docs/components/models/openai.md
@@ -5,16 +5,17 @@ sidebar_label: 'OpenAI'
sidebar_position: 4
---

To use a language model hosted on OpenAI (or compatible), specify the `openai` path in `from`.
To use a language model hosted on OpenAI (or compatible), specify the `openai` path in the `from` field.

For a specific model, include it as the model ID in the `from` field (see example below). The default model is `"gpt-3.5-turbo"`.

For a specific model, include it as the model ID in `from` (see example below). Defaults to `"gpt-3.5-turbo"`.
These parameters are specific to OpenAI models:

| Param | Description | Default |
| ------------------- | ----------------------------- | --------------------------- |
| `openai_api_key` | The OpenAI API key. | - |
| `openai_org_id` | The OpenAI organization id. | - |
| `openai_project_id` | The OpenAI project id. | - |
| `openai_org_id` | The OpenAI organization ID. | - |
| `openai_project_id` | The OpenAI project ID. | - |
| `endpoint` | The OpenAI API base endpoint. | `https://api.openai.com/v1` |

Example:
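A minimal configuration sketch using the parameters above (the model ID and secret name are illustrative):

```yaml
models:
  - name: chat_model
    from: openai:gpt-4o                    # model ID is illustrative
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
```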
2 changes: 1 addition & 1 deletion spiceaidocs/docs/components/models/spiceai.md
@@ -5,7 +5,7 @@ sidebar_label: 'Spice Cloud Platform'
sidebar_position: 2
---

To use a model hosted on the [Spice Cloud Platform](https://docs.spice.ai/building-blocks/spice-models), specify the `spice.ai` path in `from`.
To use a model hosted on the [Spice Cloud Platform](https://docs.spice.ai/building-blocks/spice-models), specify the `spice.ai` path in the `from` field.

Example:

8 changes: 5 additions & 3 deletions spiceaidocs/docs/components/views/index.md
@@ -1,18 +1,20 @@
---
title: 'Views'
sidebar_label: 'Views'
description: 'Documentation for defining Views'
description: 'Documentation for defining Views in Spice'
sidebar_position: 7
---

Views in Spice are virtual tables defined by SQL queries. They simplify complex queries and support reuse across applications.
Views in Spice are virtual tables defined by SQL queries. They help simplify complex queries and promote reuse across different applications by encapsulating query logic in a single, reusable entity.

## Defining a View

To define a view in `spicepod.yaml`, specify the `views` section. Each view requires a `name` and a `sql` field.
To define a view in the `spicepod.yaml` configuration file, specify the `views` section. Each view definition must include a `name` and a `sql` field.

### Example

The following example demonstrates how to define a view named `rankings` that lists the top five products based on the total count of orders:

```yaml
views:
- name: rankings
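A complete definition along these lines might look like the following sketch (the `orders` source table and the SQL are illustrative):

```yaml
views:
  - name: rankings
    sql: |
      SELECT product_id, COUNT(*) AS order_count
      FROM orders                          -- illustrative source table
      GROUP BY product_id
      ORDER BY order_count DESC
      LIMIT 5
```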
14 changes: 7 additions & 7 deletions spiceaidocs/docs/features/cdc/index.md
@@ -7,33 +7,33 @@ pagination_prev: null
pagination_next: null
---

Change Data Capture (CDC) is a technique that captures changed rows from a database's transaction log and delivers them to consumers with low latency. Leveraging this technique enables Spice to keep [locally accelerated](../data-acceleration/index.md) datasets up-to-date in real time with the source data, and it is highly efficient as it only transfers the changed rows instead of re-fetching the entire dataset on refresh.
Change Data Capture (CDC) captures changed rows from a database's transaction log and delivers them to consumers with low latency. This technique enables Spice to keep [locally accelerated](../data-acceleration/index.md) datasets up-to-date in real time with the source data. It is efficient because it only transfers the changed rows instead of re-fetching the entire dataset.

## Benefits

Leveraging locally accelerated datasets configured with CDC enables Spice to provide a solution that combines high-performance accelerated queries and efficient real-time delta updates.
Using locally accelerated datasets configured with CDC enables Spice to provide high-performance accelerated queries and efficient real-time updates.

## Example Use Case

Consider a fraud detection application that needs to determine whether a pending transaction is likely fraudulent. The application queries a Spice-accelerated real-time updated table of recent transactions to check if a pending transaction resembles known fraudulent ones. Using CDC, the table is kept up-to-date, allowing the application to quickly identify potential fraud.
Consider a fraud detection application that needs to determine whether a pending transaction is likely fraudulent. The application queries a Spice-accelerated, real-time updated table of recent transactions to check if a pending transaction resembles known fraudulent ones. With CDC, the table is kept up-to-date, allowing the application to quickly identify potential fraud.

## Considerations

When configuring datasets to be accelerated with CDC, ensure that the [data connector](/components/data-connectors) supports CDC and can return a stream of row-level changes. See the [Supported Data Connectors](#supported-data-connectors) section for more information.

The startup time for CDC-accelerated datasets may be longer than that for non-CDC-accelerated datasets due to the initial synchronization of the dataset.
The startup time for CDC-accelerated datasets may be longer than for non-CDC-accelerated datasets due to the initial synchronization.

:::tip

It's recommended to use CDC-accelerated datasets with persistent data accelerator configurations (i.e. `file` mode for [`DuckDB`](/components/data-accelerators/duckdb.md)/[`SQLite`](/components/data-accelerators/sqlite.md) or [`PostgreSQL`](/components/data-accelerators/postgres/index.md)). This ensures that when Spice restarts, it can resume from the last known state of the dataset instead of re-fetching the entire dataset.
It is recommended to use CDC-accelerated datasets with persistent data accelerator configurations (i.e., `file` mode for [`DuckDB`](/components/data-accelerators/duckdb.md)/[`SQLite`](/components/data-accelerators/sqlite.md) or [`PostgreSQL`](/components/data-accelerators/postgres/index.md)). This ensures that when Spice restarts, it can resume from the last known state of the dataset instead of re-fetching the entire dataset.

:::

## Supported Data Connectors

Enabling CDC via setting `refresh_mode: changes` in the acceleration settings requires support from the data connector to provide a stream of row-level changes.
Enabling CDC by setting `refresh_mode: changes` in the acceleration settings requires support from the data connector to provide a stream of row-level changes.

At present, the only supported data connector is [Debezium](/components/data-connectors/debezium.md)..
Currently, the only supported data connector is [Debezium](/components/data-connectors/debezium.md).
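A sketch of enabling CDC on an accelerated dataset, assuming a Debezium-backed dataset and a persistent accelerator as recommended above (the dataset name and topic are illustrative):

```yaml
datasets:
  - from: debezium:orders_topic        # illustrative Kafka topic
    name: orders
    acceleration:
      enabled: true
      engine: duckdb
      mode: file                       # persistent, so Spice can resume after restart
      refresh_mode: changes            # consume row-level changes instead of full refreshes
```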

## Example

36 changes: 17 additions & 19 deletions spiceaidocs/docs/features/data-acceleration/constraints.md
@@ -5,11 +5,11 @@
description: 'Learn how to add/configure constraints on local acceleration tables in Spice.'
---

Constraints are rules that enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality, as well as configuring the behavior for inserting data updates that violate constraints.
Constraints enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality and configure behavior for data updates that violate constraints.

Constraints are specified using [column references](#column-references) in the Spicepod via the `primary_key` field in the acceleration configuration. Additional unique constraints are specified via the [`indexes`](./indexes.md) field with the value `unique`. Data that violates these constraints will result in a [conflict](#handling-conflicts).

If there are multiple rows in the incoming data that violate any constraint, the entire incoming batch of data will be dropped.
If multiple rows in the incoming data violate any constraint, the entire incoming batch of data will be dropped.

Example Spicepod:

Expand Down Expand Up @@ -72,64 +72,62 @@ datasets:

:::danger[Invalid]

```yaml
datasets:
```yaml
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
primary_key: hash
indexes:
"(number, timestamp)": unique
'(number, timestamp)': unique
on_conflict:
hash: upsert
"(number, timestamp)": upsert
```
'(number, timestamp)': upsert
```

:::

The following Spicepod is valid because it specifies multiple `on_conflict` targets with `drop`, which is allowed:

:::tip[Valid]

```yaml
datasets:
```yaml
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
primary_key: hash
indexes:
"(number, timestamp)": unique
'(number, timestamp)': unique
on_conflict:
hash: drop
"(number, timestamp)": drop
```
'(number, timestamp)': drop
```

:::

The following Spicepod is invalid because it specifies multiple `on_conflict` targets with `upsert` and `drop`:

:::danger[Invalid]
```yaml
datasets:

```yaml
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
primary_key: hash
indexes:
"(number, timestamp)": unique
'(number, timestamp)': unique
on_conflict:
hash: upsert
"(number, timestamp)": drop
```
'(number, timestamp)': drop
```

:::
