Improve spice dataset documentation #734

Merged
merged 10 commits into from
Jan 14, 2025
64 changes: 61 additions & 3 deletions website/docs/cli/reference/dataset.md
@@ -1,11 +1,13 @@
---

title: "dataset"
sidebar_label: "dataset"
pagination_prev: null
pagination_next: null

---

Dataset operations
Configure a Spice dataset.

### Usage

@@ -15,8 +17,64 @@ spice dataset [command]

Available `command`s:

- `configure`: Create or configure a dataset interactively from the command line, including options such as whether to enable acceleration for the dataset.

**Note**: To run `spice dataset configure`, there _must_ be a `spicepod.yaml` file in the root of your project directory. To create this file, see [`spice init`](/docs/cli/reference/init).
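
A script that wraps `spice dataset configure` can check this precondition first. The following is a minimal sketch of such a check; the helper name `has_spicepod` is hypothetical and is not part of the Spice CLI.

```shell
# Hypothetical helper (not part of the Spice CLI): succeeds only if a
# spicepod.yaml file exists in the given directory (default: current directory).
has_spicepod() {
  [ -f "${1:-.}/spicepod.yaml" ]
}

if has_spicepod; then
  echo "spicepod.yaml found"
else
  echo "spicepod.yaml missing; run 'spice init' first"
fi
```

The check only tests for the file's presence; it does not validate the file's contents.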

#### Flags

- `-h`, `--help` Print this help message

### Example

When running `spice dataset configure`, Spice will prompt for four inputs:

1. The name of the dataset, labelled by `(1)` below.
2. The description of the dataset, labelled by `(2)` below.
3. The source of the dataset, labelled by `(3)` below. Consult [Spice's supported data connectors](/docs/components/data-connectors) to see possible values for this field. Note: Spice may prompt for a file format if necessary, as shown in the example below.
4. Whether or not to enable acceleration for this dataset, labelled by `(4)`. The default value for this input is `y`, enabling acceleration for this dataset. Learn more about acceleration in the [dataset acceleration reference](/docs/components/data-accelerators).

```shell
> spice dataset configure

dataset name: (spiceai) taxi-trips # (1)
description: Taxi Trips in S3 # (2)
from: s3://spiceai-demo-datasets/taxi_trips/2024/ # (3)
file_format (parquet/csv) (parquet) parquet
locally accelerate (y/n)? (y) y # (4)
2025/01/10 14:07:46 INFO Saved datasets/taxi-trips/dataset.yaml
```

After the command completes, the project directory for the example above looks like this:

```
├── datasets
│   └── taxi-trips
│       └── dataset.yaml
├── spicepod.yaml
└── ...
```

The `datasets` folder contains one subdirectory per dataset in your project, whether created with `spice dataset configure` or added manually.

The `dataset.yaml` file in `./datasets/taxi-trips` is configured as defined by the inputs provided to `spice dataset configure`. For this example, the `dataset.yaml` file looks as follows:

```yaml
from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: taxi-trips
description: Taxi Trips in S3
acceleration:
  enabled: true
```

The command additionally updates the root `spicepod.yaml` file to include the configured dataset as a reference (`ref`). For this example, `spicepod.yaml` would include the following:

```yaml
version: v1
kind: Spicepod
name: Taxi Trips with Spice
datasets:
- ref: datasets/taxi-trips
```
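
To confirm that each reference in `spicepod.yaml` resolves to a dataset on disk, a small shell check along the following lines may help. This is a sketch under stated assumptions: the helper name `check_dataset_refs` is hypothetical (not a Spice command), and it assumes every reference is written on its own line in the form `- ref: <path>`, as in the example above.

```shell
# Hypothetical helper (not part of the Spice CLI): for every `ref:` entry in
# <root>/spicepod.yaml, report whether <root>/<ref>/dataset.yaml exists.
# Assumes each reference sits on its own line as `- ref: <path>`.
check_dataset_refs() {
  root="${1:-.}"
  # Strip the leading "- ref: " and print only the lines where it matched.
  sed -n 's/^[[:space:]]*-[[:space:]]*ref:[[:space:]]*//p' "$root/spicepod.yaml" |
  while IFS= read -r ref; do
    if [ -f "$root/$ref/dataset.yaml" ]; then
      echo "ok: $ref"
    else
      echo "missing: $ref"
    fi
  done
}
```

Calling `check_dataset_refs .` from the project root prints one `ok:` or `missing:` line per reference. The line-based match is a simplification and does not handle every valid YAML layout.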

To learn more about Spice datasets and Spicepods, visit the [Spice dataset reference](/docs/reference/spicepod/datasets) and [Spicepod reference](/docs/reference/spicepod).