Documentation improvements for RC.1
lukekim committed Nov 28, 2024
1 parent 6f46ca7 commit a9451e6
Showing 32 changed files with 218 additions and 161 deletions.
5 changes: 5 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -6,13 +6,18 @@ Remember to be concise, but do not omit useful information. Pay attention to det

Use plain, clear, simple, easy-to-understand language. Do not use hyperbole or hype.

Avoid "allows" to describe functionality.

Always provide references and citations with links.

Adhere to the instructions in CONTRIBUTING.md.

Never use the words:

- delve
- seamlessly
- empower / empowering
- supercharge
- countless
- enhance / enhancing
- allow / allowing
10 changes: 7 additions & 3 deletions spiceaidocs/docs/components/embeddings/huggingface.md
@@ -4,19 +4,23 @@ sidebar_label: 'HuggingFace'
sidebar_position: 2
---

To run an embedding model from HuggingFace, specify the `huggingface` path in `from`. This will handle downloading and running the embedding model locally.
To use an embedding model from HuggingFace with Spice, specify the `huggingface` path in the `from` field of your configuration. The model and its related files will be automatically downloaded, loaded, and served locally by Spice.

Here is an example configuration in `spicepod.yaml`:

```yaml
embeddings:
- from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
name: all_minilm_l6_v2
```
Supported models include:
- All models tagged as [text-embeddings-inference](https://huggingface.co/models?other=text-embeddings-inference) on Huggingface
- Any Huggingface repository with the correct files to be loaded as a [local embedding model](/components/embeddings/local.md).
- All models tagged as [text-embeddings-inference](https://huggingface.co/models?other=text-embeddings-inference) on HuggingFace
- Any HuggingFace repository with the correct files to be loaded as a [local embedding model](/components/embeddings/local.md).
With the same semantics as [language models](/components/models/huggingface#access-tokens), `spice` can run private HuggingFace embedding models:

```yaml
embeddings:
- from: huggingface:huggingface.co/secret-company/awesome-embedding-model
11 changes: 6 additions & 5 deletions spiceaidocs/docs/components/embeddings/index.md
@@ -7,12 +7,12 @@ pagination_prev: null
pagination_next: null
---

Embedding models are used to convert raw text into a numerical representation that can be used by machine learning models.

Spice supports running embedding models locally, or use remote services such as OpenAI, or [la Plateforme](https://console.mistral.ai/).
Embedding models convert raw text into numerical representations that can be used by machine learning models. Spice supports running embedding models locally or using remote services such as OpenAI or [la Plateforme](https://console.mistral.ai/).

Embedding models are defined in the `spicepod.yaml` file as top-level components.

Example configuration in `spicepod.yaml`:

```yaml
embeddings:
- from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
@@ -31,5 +31,6 @@ embeddings:
```
Embedding models can be used either by:
- An OpenAI-compatible [endpoint](/api/http/embeddings.md)
- By augmenting a dataset with column-level [embeddings](/reference/spicepod/datasets.md#embeddings), to provide vector-based [search functionality](/features/search/index.md#vector-search).
- Querying an OpenAI-compatible [endpoint](/api/http/embeddings.md)
- Augmenting a dataset with column-level [embeddings](/reference/spicepod/datasets.md#embeddings), to provide vector-based [search functionality](/features/search/index.md#vector-search).
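Column-level embeddings are configured on the dataset itself. A minimal sketch, assuming an embedding component named `all_minilm_l6_v2` as defined earlier (the dataset and column names are illustrative):

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks  # illustrative dataset
    name: eth.recent_blocks
    embeddings:
      - column: hash                  # column to embed (illustrative)
        use: all_minilm_l6_v2         # references the embedding component by name
```

Once a column is embedded this way, it can back vector-based search over the dataset.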
13 changes: 9 additions & 4 deletions spiceaidocs/docs/components/embeddings/local.md
@@ -4,7 +4,11 @@ sidebar_label: 'Local'
sidebar_position: 3
---

Embedding models can be run with files stored locally.
Embedding models can be run with files stored locally. This method is useful for using models that are not hosted on remote services.

### Example Configuration

To configure an embedding model using local files, you can specify the details in the `spicepod.yaml` file as shown below:

```yaml
embeddings:
@@ -16,6 +20,7 @@ embeddings:
```
## Required Files
- Model file, one of: `model.safetensors`, `pytorch_model.bin`.
- A tokenizer file with the filename `tokenizer.json`.
- A config file with the filename `config.json`.
- Model file, one of: `model.safetensors`, `pytorch_model.bin`.
- A tokenizer file with the filename `tokenizer.json`.
- A config file with the filename `config.json`.
22 changes: 10 additions & 12 deletions spiceaidocs/docs/components/embeddings/openai.md
@@ -4,20 +4,18 @@ sidebar_label: 'OpenAI'
sidebar_position: 1
---

To use a hosted OpenAI (or compatible) embedding model, specify the `openai` path in `from`.
To use a hosted OpenAI (or compatible) embedding model, specify the `openai` path in the `from` field of your configuration. If you want to use a specific model, include its model ID in the `from` field. If no model ID is specified, it defaults to `"text-embedding-3-small"`.

For a specific model, include it as the model ID in `from` (see example below). Defaults to `"text-embedding-3-small"`.
These parameters are specific to OpenAI models:
The following parameters are specific to OpenAI models:

| Parameter | Description | Default |
| ----- | ----------- | ------- |
| `openai_api_key` | The OpenAI API key. | - |
| `openai_org_id` | The OpenAI organization id. | - |
| `openai_project_id` | The OpenAI project id. | - |
| `endpoint` | The OpenAI API base endpoint. | `https://api.openai.com/v1` |
| Parameter | Description | Default |
| ------------------- | ------------------------------------- | --------------------------- |
| `openai_api_key` | The API key for accessing OpenAI. | - |
| `openai_org_id` | The organization ID for OpenAI. | - |
| `openai_project_id` | The project ID for OpenAI. | - |
| `endpoint` | The base endpoint for the OpenAI API. | `https://api.openai.com/v1` |


Example:
Below is an example configuration in `spicepod.yaml`:

```yaml
models:
@@ -31,4 +29,4 @@ models:
params:
endpoint: https://api.mistral.ai/v1
api_key: ${ secrets:SPICE_MISTRAL_API_KEY }
```
```
13 changes: 8 additions & 5 deletions spiceaidocs/docs/components/models/filesystem.md
@@ -5,7 +5,7 @@ sidebar_label: 'Filesystem'
sidebar_position: 3
---

To use a model hosted on a filesystem, specify the path to the model file in `from`.
To use a model hosted on a filesystem, specify the path to the model file in the `from` field.

Supported formats include ONNX for traditional machine learning models and GGUF, GGML, and SafeTensor for large language models (LLMs).

@@ -50,15 +50,17 @@ models:
```
### Example: Loading from a directory
```yaml
models:
- name: hello
from: file:models/llms/llama3.2-1b-instruct/
```
Note: The folder provided should contain all the expected files (see examples above) to load a model in the base level.
Note: The provided folder should contain all the expected files (see examples above) at its base level to load a model.
### Example: Overriding the Chat Template
Chat templates convert the OpenAI compatible chat messages (see [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages)) and other components of a request
into a stream of characters for the language model. It follows Jinja3 templating [syntax](https://jinja.palletsprojects.com/en/3.1.x/templates/).
@@ -81,6 +83,7 @@ models:
```
#### Templating Variables
- `messages`: List of chat messages, in the OpenAI [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages).
- `add_generation_prompt`: Boolean flag whether to add a [generation prompt](https://huggingface.co/docs/transformers/main/chat_templating#what-are-generation-prompts).
- `tools`: List of callable tools, in the OpenAI [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools).
- `messages`: List of chat messages, in the OpenAI [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages).
- `add_generation_prompt`: Boolean flag whether to add a [generation prompt](https://huggingface.co/docs/transformers/main/chat_templating#what-are-generation-prompts).
- `tools`: List of callable tools, in the OpenAI [format](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools).
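Putting the variables together, a minimal template sketch (assuming the override parameter is named `chat_template`; the `<|...|>` control tokens are illustrative and must match the model's expected format):

```yaml
models:
  - name: local_llm                        # illustrative name
    from: file:models/llms/llama3.2-1b-instruct/
    params:
      chat_template: |
        {% for message in messages %}
        <|{{ message.role }}|>{{ message.content }}<|end|>
        {% endfor %}
        {% if add_generation_prompt %}<|assistant|>{% endif %}
```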
9 changes: 5 additions & 4 deletions spiceaidocs/docs/components/models/openai.md
@@ -5,16 +5,17 @@ sidebar_label: 'OpenAI'
sidebar_position: 4
---

To use a language model hosted on OpenAI (or compatible), specify the `openai` path in `from`.
To use a language model hosted on OpenAI (or compatible), specify the `openai` path in the `from` field.

For a specific model, include it as the model ID in the `from` field (see example below). The default model is `"gpt-3.5-turbo"`.

For a specific model, include it as the model ID in `from` (see example below). Defaults to `"gpt-3.5-turbo"`.
These parameters are specific to OpenAI models:

| Param | Description | Default |
| ------------------- | ----------------------------- | --------------------------- |
| `openai_api_key` | The OpenAI API key. | - |
| `openai_org_id` | The OpenAI organization id. | - |
| `openai_project_id` | The OpenAI project id. | - |
| `openai_org_id` | The OpenAI organization ID. | - |
| `openai_project_id` | The OpenAI project ID. | - |
| `endpoint` | The OpenAI API base endpoint. | `https://api.openai.com/v1` |

Example:
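A minimal configuration sketch using the parameters above (the model ID and secret name are illustrative):

```yaml
models:
  - name: chat_model
    from: openai:gpt-4o                    # model ID is illustrative
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
```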
2 changes: 1 addition & 1 deletion spiceaidocs/docs/components/models/spiceai.md
@@ -5,7 +5,7 @@ sidebar_label: 'Spice Cloud Platform'
sidebar_position: 2
---

To use a model hosted on the [Spice Cloud Platform](https://docs.spice.ai/building-blocks/spice-models), specify the `spice.ai` path in `from`.
To use a model hosted on the [Spice Cloud Platform](https://docs.spice.ai/building-blocks/spice-models), specify the `spice.ai` path in the `from` field.

Example:

8 changes: 5 additions & 3 deletions spiceaidocs/docs/components/views/index.md
@@ -1,18 +1,20 @@
---
title: 'Views'
sidebar_label: 'Views'
description: 'Documentation for defining Views'
description: 'Documentation for defining Views in Spice'
sidebar_position: 7
---

Views in Spice are virtual tables defined by SQL queries. They simplify complex queries and support reuse across applications.
Views in Spice are virtual tables defined by SQL queries. They help simplify complex queries and promote reuse across different applications by encapsulating query logic in a single, reusable entity.

## Defining a View

To define a view in `spicepod.yaml`, specify the `views` section. Each view requires a `name` and a `sql` field.
To define a view in the `spicepod.yaml` configuration file, specify the `views` section. Each view definition must include a `name` and a `sql` field.

### Example

The following example demonstrates how to define a view named `rankings` that lists the top five products based on the total count of orders:

```yaml
views:
- name: rankings
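A complete definition along these lines might look like the following sketch (the `orders` source table and the SQL are illustrative):

```yaml
views:
  - name: rankings
    sql: |
      SELECT product_id, COUNT(*) AS order_count
      FROM orders                          -- illustrative source table
      GROUP BY product_id
      ORDER BY order_count DESC
      LIMIT 5
```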
14 changes: 7 additions & 7 deletions spiceaidocs/docs/features/cdc/index.md
@@ -7,33 +7,33 @@ pagination_prev: null
pagination_next: null
---

Change Data Capture (CDC) is a technique that captures changed rows from a database's transaction log and delivers them to consumers with low latency. Leveraging this technique enables Spice to keep [locally accelerated](../data-acceleration/index.md) datasets up-to-date in real time with the source data, and it is highly efficient as it only transfers the changed rows instead of re-fetching the entire dataset on refresh.
Change Data Capture (CDC) captures changed rows from a database's transaction log and delivers them to consumers with low latency. This technique enables Spice to keep [locally accelerated](../data-acceleration/index.md) datasets up-to-date in real time with the source data. It is efficient because it only transfers the changed rows instead of re-fetching the entire dataset.

## Benefits

Leveraging locally accelerated datasets configured with CDC enables Spice to provide a solution that combines high-performance accelerated queries and efficient real-time delta updates.
Using locally accelerated datasets configured with CDC enables Spice to provide high-performance accelerated queries and efficient real-time updates.

## Example Use Case

Consider a fraud detection application that needs to determine whether a pending transaction is likely fraudulent. The application queries a Spice-accelerated real-time updated table of recent transactions to check if a pending transaction resembles known fraudulent ones. Using CDC, the table is kept up-to-date, allowing the application to quickly identify potential fraud.
Consider a fraud detection application that needs to determine whether a pending transaction is likely fraudulent. The application queries a Spice-accelerated, real-time updated table of recent transactions to check if a pending transaction resembles known fraudulent ones. With CDC, the table is kept up-to-date, allowing the application to quickly identify potential fraud.

## Considerations

When configuring datasets to be accelerated with CDC, ensure that the [data connector](/components/data-connectors) supports CDC and can return a stream of row-level changes. See the [Supported Data Connectors](#supported-data-connectors) section for more information.

The startup time for CDC-accelerated datasets may be longer than that for non-CDC-accelerated datasets due to the initial synchronization of the dataset.
The startup time for CDC-accelerated datasets may be longer than for non-CDC-accelerated datasets due to the initial synchronization.

:::tip

It's recommended to use CDC-accelerated datasets with persistent data accelerator configurations (i.e. `file` mode for [`DuckDB`](/components/data-accelerators/duckdb.md)/[`SQLite`](/components/data-accelerators/sqlite.md) or [`PostgreSQL`](/components/data-accelerators/postgres/index.md)). This ensures that when Spice restarts, it can resume from the last known state of the dataset instead of re-fetching the entire dataset.
It is recommended to use CDC-accelerated datasets with persistent data accelerator configurations (i.e., `file` mode for [`DuckDB`](/components/data-accelerators/duckdb.md)/[`SQLite`](/components/data-accelerators/sqlite.md) or [`PostgreSQL`](/components/data-accelerators/postgres/index.md)). This ensures that when Spice restarts, it can resume from the last known state of the dataset instead of re-fetching the entire dataset.

:::

## Supported Data Connectors

Enabling CDC via setting `refresh_mode: changes` in the acceleration settings requires support from the data connector to provide a stream of row-level changes.
Enabling CDC by setting `refresh_mode: changes` in the acceleration settings requires support from the data connector to provide a stream of row-level changes.

At present, the only supported data connector is [Debezium](/components/data-connectors/debezium.md)..
Currently, the only supported data connector is [Debezium](/components/data-connectors/debezium.md).
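A sketch of enabling CDC on an accelerated dataset, assuming a Debezium-backed dataset and a persistent accelerator as recommended above (the dataset name and topic are illustrative):

```yaml
datasets:
  - from: debezium:orders_topic        # illustrative Kafka topic
    name: orders
    acceleration:
      enabled: true
      engine: duckdb
      mode: file                       # persistent, so Spice can resume after restart
      refresh_mode: changes            # consume row-level changes instead of full refreshes
```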

## Example

36 changes: 17 additions & 19 deletions spiceaidocs/docs/features/data-acceleration/constraints.md
@@ -5,11 +5,11 @@
description: 'Learn how to add/configure constraints on local acceleration tables in Spice.'
---

Constraints are rules that enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality, as well as configuring the behavior for inserting data updates that violate constraints.
Constraints enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality and configure behavior for data updates that violate constraints.

Constraints are specified using [column references](#column-references) in the Spicepod via the `primary_key` field in the acceleration configuration. Additional unique constraints are specified via the [`indexes`](./indexes.md) field with the value `unique`. Data that violates these constraints will result in a [conflict](#handling-conflicts).

If there are multiple rows in the incoming data that violate any constraint, the entire incoming batch of data will be dropped.
If multiple rows in the incoming data violate any constraint, the entire incoming batch of data will be dropped.

Example Spicepod:

Expand Down Expand Up @@ -72,64 +72,62 @@ datasets:

:::danger[Invalid]

```yaml
datasets:
```yaml
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
primary_key: hash
indexes:
"(number, timestamp)": unique
'(number, timestamp)': unique
on_conflict:
hash: upsert
"(number, timestamp)": upsert
```
'(number, timestamp)': upsert
```

:::

The following Spicepod is valid because it specifies multiple `on_conflict` targets with `drop`, which is allowed:

:::tip[Valid]

```yaml
datasets:
```yaml
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
primary_key: hash
indexes:
"(number, timestamp)": unique
'(number, timestamp)': unique
on_conflict:
hash: drop
"(number, timestamp)": drop
```
'(number, timestamp)': drop
```

:::

The following Spicepod is invalid because it specifies multiple `on_conflict` targets with `upsert` and `drop`:

:::danger[Invalid]
```yaml
datasets:

```yaml
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
primary_key: hash
indexes:
"(number, timestamp)": unique
'(number, timestamp)': unique
on_conflict:
hash: upsert
"(number, timestamp)": drop
```
'(number, timestamp)': drop
```

:::
