Add docs for Catalog Connectors #324

Merged
merged 5 commits into from
Jul 15, 2024
80 changes: 80 additions & 0 deletions spiceaidocs/docs/components/catalogs/databricks.md
@@ -0,0 +1,80 @@
---
title: 'Databricks Catalog Connector'
sidebar_label: 'Databricks'
description: 'Connect to a Databricks Unity Catalog provider.'
sidebar_position: 1
pagination_prev: null
pagination_next: null
---

Connect to a [Databricks Unity Catalog](https://www.databricks.com/product/unity-catalog) as a catalog provider for federated SQL query using [Spark Connect](https://www.databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html) or directly from [Delta Lake](https://delta.io/) tables.

## Configuration

```yaml
catalogs:
  - from: databricks:my_uc_catalog
    name: uc_catalog # tables from this catalog will be available in the "uc_catalog" catalog in Spice
    include:
      - "*.my_table_name" # include only the "my_table_name" tables
    params:
      endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
      mode: delta_lake # or spark_connect
    dataset_params:
      # delta_lake S3 parameters
      aws_region: us-west-2
      aws_access_key_id: <aws-access-key-id>
      aws_secret_access_key: <aws-secret>
      aws_endpoint: s3.us-west-2.amazonaws.com
      # spark_connect parameters
      databricks_cluster_id: 1234-567890-abcde123
```

## `from`
The `from` field is used to specify the catalog provider. For Databricks, use `databricks:<catalog_name>`. The `catalog_name` is the name of the catalog in the Databricks Unity Catalog you want to connect to.

## `name`
The `name` field specifies the name of the catalog in Spice. Tables from the Databricks catalog are available under the catalog with this name in Spice. The schema hierarchy of the external catalog is preserved in Spice.

## `include`
Use the `include` field to specify which tables to include from the catalog. The field supports glob patterns to match multiple tables. For example, `*.my_table_name` includes every table named `my_table_name` from any schema in the catalog. Multiple `include` patterns are OR'ed together, so a table is included if it matches any one of them.

## `params`
The `params` field is used to configure the connection to the Databricks Unity Catalog. The following parameters are supported:

- `endpoint`: The Databricks workspace endpoint, e.g. `dbc-a12cd3e4-56f7.cloud.databricks.com`.
- `token`: The Databricks API token to authenticate with the Unity Catalog API. Can also be specified in the `databricks` secret. See the [Databricks Data Connector](/components/data-connectors/databricks.md) for more information on configuring the secret.
- `mode`: The execution mode for querying against Databricks. The default is `spark_connect`. Possible values:
  - `spark_connect`: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.
  - `delta_lake`: Query directly from Delta Tables. Requires the object store credentials to be provided, either as a secret or inline in the params.
- `databricks_use_ssl`: If true, use a TLS connection to connect to the Databricks endpoint. Default is `true`.

## `dataset_params`
The `dataset_params` field is used to configure the dataset-specific parameters for the catalog. The following parameters are supported:

### Spark Connect parameters
- `databricks_cluster_id`: The ID of the compute cluster in Databricks to use for the query, e.g. `1234-567890-abcde123`.
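
As a minimal sketch, a catalog that queries through Spark Connect could combine the `params` and the Spark Connect `dataset_params` described above as follows. The `<databricks-api-token>` value is a placeholder (an assumption for illustration), and the token may instead come from the `databricks` secret.

```yaml
catalogs:
  - from: databricks:my_uc_catalog
    name: uc_catalog
    params:
      endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
      token: <databricks-api-token> # placeholder; can instead be set in the `databricks` secret
      mode: spark_connect # the default mode
    dataset_params:
      databricks_cluster_id: 1234-567890-abcde123
```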

### Delta Lake parameters
These settings can also be configured in the `databricks` secret. See the [Databricks Data Connector](/components/data-connectors/databricks.md) for more information on configuring the secret.

#### AWS S3

- `aws_region`: The AWS region for the S3 object store.
- `aws_access_key_id`: The access key ID for the S3 object store.
- `aws_secret_access_key`: The secret access key for the S3 object store.
- `aws_endpoint`: The endpoint for the S3 object store.

#### Azure Blob
Note: One of the following must be provided: `azure_storage_account_key`, `azure_storage_client_id` and `azure_storage_client_secret`, or `azure_storage_sas_key`.

- `azure_storage_account_name`: The Azure Storage account name.
- `azure_storage_account_key`: The Azure Storage master key for accessing the storage account.
- `azure_storage_client_id`: The service principal client ID for accessing the storage account.
- `azure_storage_client_secret`: The service principal client secret for accessing the storage account.
- `azure_storage_sas_key`: The shared access signature key for accessing the storage account.
- `azure_storage_endpoint`: The endpoint for the Azure Blob storage account.
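
A sketch of a `delta_lake` mode catalog backed by Azure Blob storage, assuming authentication with the storage account key (one of the three options in the note above). The angle-bracket values are placeholders, not real credentials.

```yaml
catalogs:
  - from: databricks:my_uc_catalog
    name: uc_catalog
    params:
      endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
      mode: delta_lake
    dataset_params:
      azure_storage_account_name: <storage-account-name> # placeholder
      azure_storage_account_key: <storage-account-key> # placeholder; a SAS key or client ID/secret pair could be used instead
```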

#### Google Storage (GCS)

- `google_service_account`: Filesystem path to the Google service account JSON key file.
44 changes: 44 additions & 0 deletions spiceaidocs/docs/components/catalogs/index.md
@@ -0,0 +1,44 @@
---
title: 'Catalog Connectors'
sidebar_label: 'Catalog Connectors'
description: ''
sidebar_position: 5
pagination_prev: null
pagination_next: null
---

In Spice, datasets are organized hierarchically with catalogs, schemas, and tables. A catalog, at the top level, contains multiple schemas. Each schema, in turn, contains multiple tables where the actual data is stored. By default, a catalog named `spice` is created with all of the datasets defined in the `datasets` section of the Spicepod.

<img src="/img/catalog-schema-table.png" />

Schemas and tables within the `spice` catalog are created based on the `name` field in the dataset configuration. A name with a period (`.`) creates a schema. For example, a dataset defined with `name: foo.bar` would have a full path of `spice.foo.bar`. If the name does not contain a period, the dataset is created in the `public` schema of the `spice` catalog. For example, a dataset defined with `name: foo` would have a full path of `spice.public.foo`. Attempting to create a dataset with a name that contains a catalog name will result in an error. Adding catalogs to Spice is done via Catalog Connectors.
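
As a minimal sketch of these naming rules, the following `datasets` entries would resolve to the full paths shown in the comments. The `from` values are placeholders for illustration only, not real sources.

```yaml
datasets:
  - from: <data_connector>:<path> # placeholder source
    name: foo.bar # full path: spice.foo.bar
  - from: <data_connector>:<path> # placeholder source
    name: foo # full path: spice.public.foo
```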

Catalog Connectors connect to external catalog providers and make their tables available for federated SQL query in Spice. Configuring accelerations for tables in external catalogs is not supported. The schema hierarchy of the external catalog is preserved in Spice.

Currently supported Catalog Connectors include:

| Name | Description | Status | Protocol/Format |
| --------------- | ----------- | ------ | ----------------------------------- |
| `databricks` | Databricks | Alpha | Spark Connect <br/> S3 / Delta Lake |
| `unity_catalog` | Unity Catalog | Alpha | Delta Lake |
| `spice.ai` | Spice.ai Cloud Platform | Alpha | Arrow Flight |

## Catalog Connector Docs

Catalogs are configured using a Catalog Connector in the `catalogs` section of the Spicepod. See the specific Catalog Connector documentation for configuration details.

### `include`
Use the `include` field to specify which tables to include from the catalog. The field supports glob patterns to match multiple tables. For example, `*.my_table_name` includes every table named `my_table_name` from any schema in the catalog. Multiple `include` patterns are OR'ed together, so a table is included if it matches any one of them.

Example:
```yaml
catalogs:
  - from: spice.ai
    name: spiceai
    include:
      - "tpch.*" # Include only the "tpch" tables.
```

import DocCardList from '@theme/DocCardList';

<DocCardList />
32 changes: 32 additions & 0 deletions spiceaidocs/docs/components/catalogs/spiceai.md
@@ -0,0 +1,32 @@
---
title: 'Spice.ai Catalog Connector'
sidebar_label: 'Spice.ai'
description: 'Connect to the Spice.ai built-in catalog.'
sidebar_position: 3
pagination_prev: null
pagination_next: null
---

Query all of the datasets provided by the [Spice.ai Cloud Platform](https://spice.ai).

## Configuration

Create a [Spice.ai Cloud Platform](https://spice.ai) account and log in with the CLI using `spice login`.

Example:
```yaml
catalogs:
  - from: spice.ai
    name: spicey # tables from the Spice.ai platform will be available in the "spicey" catalog in Spice
    include:
      - "tpch.*" # include only the tables from the "tpch" schema
```

## `from`
The `from` field is used to specify the catalog provider. For the Spice.ai catalog connector, use `spice.ai`.

## `name`
The `name` field specifies the name of the catalog in Spice. Tables from the Spice.ai built-in catalog are available under the catalog with this name in Spice.

## `include`
Use the `include` field to specify which tables to include from the catalog. The field supports glob patterns to match multiple tables. For example, `*.my_table_name` includes every table named `my_table_name` from any schema in the catalog. Multiple `include` patterns are OR'ed together, so a table is included if it matches any one of them.
61 changes: 61 additions & 0 deletions spiceaidocs/docs/components/catalogs/unity-catalog.md
@@ -0,0 +1,61 @@
---
title: 'Unity Catalog Catalog Connector'
sidebar_label: 'Unity Catalog'
description: 'Connect to a Unity Catalog provider.'
sidebar_position: 2
pagination_prev: null
pagination_next: null
---

Connect to a [Unity Catalog](https://www.unitycatalog.io/) as a catalog provider for federated SQL query against [Delta Lake](https://delta.io/) tables.

## Configuration

```yaml
catalogs:
  - from: unity_catalog:https://my_unity_catalog_host.com/api/2.1/unity-catalog/catalogs/my_catalog
    name: uc
    include:
      - "*.my_table"
    dataset_params:
      # delta_lake S3 parameters
      aws_region: us-west-2
      aws_access_key_id: <aws-access-key-id>
      aws_secret_access_key: <aws-secret>
      aws_endpoint: s3.us-west-2.amazonaws.com
```

## `from`
The `from` field is used to specify the catalog provider. For Unity Catalog, use `unity_catalog:<catalog_path>`. The `catalog_path` is the URL to the [`getCatalog`](https://github.com/unitycatalog/unitycatalog/blob/main/api/Apis/CatalogsApi.md) endpoint of the Unity Catalog API. It should be formatted as `https://<unity_catalog_host>/api/2.1/unity-catalog/catalogs/<catalog_name>`.

## `name`
The `name` field is used to specify the name of the catalog in Spice. The schema hierarchy of the external catalog is preserved in Spice.

## `include`
Use the `include` field to specify which tables to include from the catalog. The field supports glob patterns to match multiple tables. For example, `*.my_table_name` includes every table named `my_table_name` from any schema in the catalog. Multiple `include` patterns are OR'ed together, so a table is included if it matches any one of them.

## `dataset_params`
The `dataset_params` field is used to configure the dataset-specific parameters for the catalog.

These settings can also be configured in the `delta_lake` secret. See the [Delta Lake Data Connector](/components/data-connectors/delta-lake.md) for more information on configuring the secret.

#### AWS S3

- `aws_region`: The AWS region for the S3 object store.
- `aws_access_key_id`: The access key ID for the S3 object store.
- `aws_secret_access_key`: The secret access key for the S3 object store.
- `aws_endpoint`: The endpoint for the S3 object store.

#### Azure Blob
Note: One of the following must be provided: `azure_storage_account_key`, `azure_storage_client_id` and `azure_storage_client_secret`, or `azure_storage_sas_key`.

- `azure_storage_account_name`: The Azure Storage account name.
- `azure_storage_account_key`: The Azure Storage master key for accessing the storage account.
- `azure_storage_client_id`: The service principal client ID for accessing the storage account.
- `azure_storage_client_secret`: The service principal client secret for accessing the storage account.
- `azure_storage_sas_key`: The shared access signature key for accessing the storage account.
- `azure_storage_endpoint`: The endpoint for the Azure Blob storage account.

#### Google Storage (GCS)

- `google_service_account`: Filesystem path to the Google service account JSON key file.
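
A sketch of a Unity Catalog configuration whose Delta Lake tables live in Google Cloud Storage, using the single GCS parameter listed above. The key file path is hypothetical and only illustrates the expected kind of value.

```yaml
catalogs:
  - from: unity_catalog:https://my_unity_catalog_host.com/api/2.1/unity-catalog/catalogs/my_catalog
    name: uc
    dataset_params:
      google_service_account: /path/to/service-account-key.json # hypothetical filesystem path
```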
84 changes: 84 additions & 0 deletions spiceaidocs/docs/reference/spicepod/catalogs.md
@@ -0,0 +1,84 @@
---
title: 'Catalogs'
sidebar_label: 'Catalogs'
description: 'Catalogs YAML reference'
---

A Spicepod can contain one or more catalogs.

# `catalogs`

Example:

`spicepod.yaml`

```yaml
catalogs:
  - from: spiceai
    name: spiceai
    include:
      - "tpch.*" # Include only the "tpch" tables.
```

## `from`

The `from` field is a string that represents the Uniform Resource Identifier (URI) for the catalog provider. This URI is composed of two parts: a prefix indicating the Catalog Connector to use, and the catalog path within the source.

The syntax for the `from` field is as follows:

```yaml
from: <catalog_connector>:<path>
```

Where:

- `<catalog_connector>`: The Catalog Connector to use to connect to the catalog provider.

  Currently supported catalog connectors:

  - [`spiceai`](/components/catalogs/spiceai.md)
  - [`databricks`](/components/catalogs/databricks.md)
  - [`unity_catalog`](/components/catalogs/unity-catalog.md)

  If the Catalog Connector is not explicitly specified, it defaults to `spiceai`.

- `<path>`: The path to the catalog within the provider.

## `ref`

An alternative to adding the catalog definition inline in the `spicepod.yaml` file. `ref` can be used to point to a directory with a catalog defined in a `catalog.yaml` file. For example, a catalog configured in a `catalog.yaml` in the `catalogs/sample` directory can be referenced with the following:

**catalogs/sample/catalog.yaml**

```yaml
from: spiceai
name: spiceai
include:
- "tpch.*" # Include only the "tpch" tables.
```

**ref used in spicepod.yaml**

```yaml
version: v1beta1
kind: Spicepod
name: duckdb
catalogs:
- ref: catalogs/sample
```

## `name`

The name of the catalog to register in Spice. It doesn't need to match the name of the catalog in the external provider. The schema hierarchy of the external catalog is preserved in Spice.

## `include`

Optional. The `include` field specifies which tables to include from the catalog. The field supports glob patterns to match multiple tables. For example, `*.my_table_name` includes every table named `my_table_name` from any schema in the catalog. Multiple `include` patterns are OR'ed together, so a table is included if it matches any one of them.
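
As an illustrative sketch of combining patterns, the following catalog would include every table in the `tpch` schema plus any table named `my_table_name` in any schema. The table name `my_table_name` is a placeholder.

```yaml
catalogs:
  - from: spiceai
    name: spiceai
    include:
      - "tpch.*" # every table in the "tpch" schema
      - "*.my_table_name" # plus "my_table_name" from any schema
```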

## `params`

Optional. Parameters to pass to the catalog connector for retrieving the metadata on the schemas and tables to be included. The parameters are specific to the connector used.

## `dataset_params`

Optional. Parameters used when constructing the individual datasets that are registered in Spice from the catalog. The parameters are specific to the connector used.
Binary file added spiceaidocs/static/img/catalog-schema-table.png