Update dataset spicepod reference (#102)

* Update dataset spicepod reference
* Update datasets.md
spiceai · Feb 23, 2024 · 14607f6
1 parent 29edece

Showing 10 changed files with 138 additions and 530 deletions.
diff --git a/spiceaidocs/content/en/concepts/_index.md b/spiceaidocs/content/en/concepts/_index.md
@@ -25,35 +25,3 @@ A `Pod` is a package of configuration and data used to train and deploy Spice.ai
 A `Pod manifest` is a YAML file that describes how to connect data with a learning environment.
 
 A Pod is constructed from the following components:
-
-### Dataspace
-
-A [dataspace]({{<ref "concepts/dataspaces">}}) is a specification on how the Spice.ai runtime and AI engine loads, processes and interacts with data from a single source. A dataspace may contain a single data connector and data processor. There may be multiple dataspace definitions within a pod. The fields specified in the union of dataspaces are used as inputs to the neural networks that Spice.ai trains.
-
-A dataspace that doesn't contain a data connector/processor means that the observation data for this dataspace will be provided by calling [POST /pods/{pod}/observations]({{<ref api>}}).
-
-### Data Connector
-
-A [data connector]({{<ref "reference/pod#data-connector">}}) is a reuseable component that contains logic to fetch or ingest data from an external source. Spice.ai provides a general interface that anyone can implement to create a data connector, see the [data-components-contrib](https://github.com/spiceai/data-components-contrib/tree/trunk/dataconnectors) repo for more information.
-
-### Data Processor
-
-A [data processor]({{<ref "reference/pod#data-processor">}}) is a reusable component, composable with a data connector that contains logic to process raw connector data into [observations]({{<ref "api#observations">}}) and state Spice.ai can use.
-
-Spice.ai provides a general interface that anyone can implement to create a data processor, see the [data-components-contrib](https://github.com/spiceai/data-components-contrib/tree/trunk/dataprocessors) repo for more information.
-
-### Actions
-
-[Actions]({{<ref "reference/pod#actions">}}) are the set of actions the Spice.ai runtime can recommend for a pod.
-
-### Recommendations
-
-To intelligently adapt its behavior, an application should query the Spice.ai runtime for which [action]({{<ref "reference/pod#actions">}}) it recommends to take given a specified time. The result of this query is a [recommendation]({{<ref "concepts/recommendations">}}).
-
-If a time is not specified, the resulting recommendation query time will default to the time of the most recently ingested observation.
-
-### Training Rewards
-
-[Training Rewards]({{<ref "reference/pod#rewards">}}) are code definitions in Python that tell the Spice.ai AI Engine how to train the neural networks to achieve the desired goal. A reward is defined for each action specified in the pod.
-
-In the future we will expand the languages we support for writing the reward functions in. [Let us know](mailto:[email protected]) which language you want to be able to write your reward functions in!
diff --git a/spiceaidocs/content/en/concepts/rewards/_index.md b/spiceaidocs/content/en/concepts/rewards/_index.md
diff --git a/spiceaidocs/content/en/concepts/rewards/external.md b/spiceaidocs/content/en/concepts/rewards/external.md
diff --git a/spiceaidocs/content/en/concepts/time/_index.md b/spiceaidocs/content/en/concepts/time/_index.md
@@ -39,8 +39,6 @@ params:
 
 If not provided in the manifest, Spicepods will default to a period of **3 days**, intervals of **1 min**, and granularity of **10 seconds**. The period epoch will default to a dynamic epoch of the current time minus the period. In this mode, the period becomes a sliding window over time.
 
-See reference documentation for [Spicepod params]({{<ref "reference/pod#params">}}).
-
 ### Period
 
 The `period` defines the entire timespan the Spicepod will use for learning and decision-making.

diff --git a/spiceaidocs/content/en/reference/Spicepod/_index.md b/spiceaidocs/content/en/reference/Spicepod/_index.md
@@ -41,7 +41,7 @@ metadata:
 
 ## `datasets`
 
-A Spicepod can contain one or more [datasets](https://docs.spice.ai/reference/specifications/dataset-and-view-yaml-specification) referenced by relative path.
+A Spicepod can contain one or more [datasets]({{<ref "reference/Spicepod/datasets">}}) referenced by relative path.
 
 **Example**
 
@@ -60,6 +60,18 @@ datasets:
     dependsOn: datasets/uniswap_eth_usdc
 ```
 
+A dataset can also be defined inline:
+
+```yaml
+datasets:
+  - name: spiceai.uniswap_v2_eth_usdc
+    type: overwrite
+    source: spice.ai
+    acceleration:
+      enabled: true
+      refresh: 1h
+```
+
 ## `functions`
 
 A Spicepod can contain one or more [functions](https://docs.spice.ai/reference/specifications/spice-functions-yaml-specification) referenced by relative path.

diff --git a/spiceaidocs/content/en/reference/Spicepod/datasets.md b/spiceaidocs/content/en/reference/Spicepod/datasets.md
@@ -0,0 +1,125 @@
+---
+type: docs
+title: "Datasets"
+linkTitle: "Datasets"
+description: 'Datasets YAML reference'
+weight: 80
+---
+
+A Spicepod can contain one or more datasets referenced by relative path, or defined inline.
+
+# `datasets`
+
+Inline example:
+
+`spicepod.yaml`
+```yaml
+datasets:
+  - from: spice.ai/eth/beacon/eigenlayer
+    name: strategy_manager_deposits
+    params:
+      app: goerli-app
+    acceleration:
+      enabled: true
+      mode: inmemory # / file
+      engine: arrow # / duckdb
+      refresh_interval: 1h
+      refresh_mode: full # / append # update / incremental
+      retention: 30m
+```
+
+`spicepod.yaml`
+```yaml
+datasets:
+  - from: databricks.com/spiceai/datasets
+    name: uniswap_eth_usd
+    params:
+      environment: prod
+    acceleration:
+      enabled: true
+      mode: inmemory # / file
+      engine: arrow # / duckdb
+      refresh_interval: 1h
+      refresh_mode: full # / append # update / incremental
+      retention: 30m
+```
+
+`spicepod.yaml`
+```yaml
+datasets:
+  - from: local/Users/phillip/data/test.parquet
+    name: test
+    acceleration:
+      enabled: true
+      mode: inmemory # / file
+      engine: arrow # / duckdb
+      refresh_interval: 1h
+      refresh_mode: full # / append # update / incremental
+      retention: 30m
+```
+
+Relative path example:
+
+`spicepod.yaml`
+```yaml
+datasets:
+  - from: datasets/uniswap_v2_eth_usdc
+```
+
+`datasets/uniswap_v2_eth_usdc/dataset.yaml`
+```yaml
+name: spiceai.uniswap_v2_eth_usdc
+type: overwrite
+source: spice.ai
+auth: spice.ai
+acceleration:
+  enabled: true
+  refresh: 1h
+```
+
+## `name`
+
+The name of the dataset. This is used to reference the dataset in the pod manifest, as well as in external data sources.
+
+## `type`
+
+The type of dataset. The following types are supported:
+
+- `overwrite` - Overwrites the dataset with the contents of the dataset source.
+- `append` - Appends new data from the dataset source to the dataset.
+
+## `source`
+
+The source of the dataset. The following sources are supported:
+
+- `spice.ai`
+- `dremio` (coming soon)
+- `databricks` (coming soon)
+
+## `auth`
+
+Optional. The authentication profile to use to connect to the dataset source. Use `spice login` to create a new authentication profile.
+
+If not specified, the default profile for the data source is used.
+
+## `acceleration`
+
+Optional. Accelerate queries to the dataset by caching data locally.
+
+## `acceleration.enabled`
+
+Optional. Enable or disable acceleration.
+
+## `acceleration.refresh`
+
+Optional. The interval at which the dataset's data is refreshed when the dataset type is `overwrite`. Specified as a [duration literal]({{<ref "reference/duration">}}).
+
+For example, `1h` is 1 hour, `1m` is 1 minute, and `1s` is 1 second.
+
+For `append` datasets, the refresh interval is not used.
+
+## `acceleration.retention`
+
+Optional. Only supported for `append` datasets. Specifies how long to retain data updates from the data source before they are deleted. Specified as a [duration literal]({{<ref "reference/duration">}}).
+
+If not specified, the default retention is to keep all data.
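
The field rules in the new `datasets.md` reference (supported `type` values, `acceleration.refresh` unused for `append`, `acceleration.retention` supported only for `append`) can be sketched as a small check. This is a rough illustration only: `check_dataset` and `SUPPORTED_TYPES` are hypothetical names, not part of Spice.ai, and the sketch assumes the dataset definition has already been loaded from YAML into a plain dict.

```python
# Hypothetical helper, not part of Spice.ai. It only encodes the rules
# stated in the datasets.md reference added by this commit.
SUPPORTED_TYPES = {"overwrite", "append"}


def check_dataset(dataset: dict) -> list[str]:
    """Return a list of human-readable problems found in a dataset definition."""
    problems = []
    dataset_type = dataset.get("type")
    if dataset_type not in SUPPORTED_TYPES:
        problems.append(f"unsupported type: {dataset_type!r}")

    acceleration = dataset.get("acceleration") or {}
    # `retention` is documented as supported only for `append` datasets.
    if "retention" in acceleration and dataset_type != "append":
        problems.append("acceleration.retention is only supported for append datasets")
    # `refresh` is documented as not used for `append` datasets.
    if "refresh" in acceleration and dataset_type == "append":
        problems.append("acceleration.refresh is not used for append datasets")
    return problems


# The dataset.yaml from the relative path example, written as a plain dict.
example = {
    "name": "spiceai.uniswap_v2_eth_usdc",
    "type": "overwrite",
    "source": "spice.ai",
    "auth": "spice.ai",
    "acceleration": {"enabled": True, "refresh": "1h"},
}
print(check_dataset(example))  # prints []
```

The check returns a list rather than raising, so every problem in a definition can be reported at once. Whether the runtime itself rejects or silently ignores these combinations is not stated in the reference.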