Add docs for Catalog Connectors (#324)
* wip on docs

* WIP on catalog connector docs

* Finish docs

* Add Spicepod reference

* Apply suggestions from code review

Co-authored-by: Luke Kim <[email protected]>

---------

Co-authored-by: Luke Kim <[email protected]>
phillipleblanc and lukekim authored Jul 15, 2024
1 parent b750f88 commit be3c779
Showing 6 changed files with 301 additions and 0 deletions.
80 changes: 80 additions & 0 deletions spiceaidocs/docs/components/catalogs/databricks.md
@@ -0,0 +1,80 @@
---
title: 'Databricks Catalog Connector'
sidebar_label: 'Databricks'
description: 'Connect to a Databricks Unity Catalog provider.'
sidebar_position: 1
pagination_prev: null
pagination_next: null
---

Connect to a [Databricks Unity Catalog](https://www.databricks.com/product/unity-catalog) as a catalog provider for federated SQL query using [Spark Connect](https://www.databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html) or directly from [Delta Lake](https://delta.io/) tables.

## Configuration

```yaml
catalogs:
  - from: databricks:my_uc_catalog
    name: uc_catalog # tables from this catalog will be available in the "uc_catalog" catalog in Spice
    include:
      - "*.my_table_name" # include only the "my_table_name" tables
    params:
      endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
      mode: delta_lake # or spark_connect
    dataset_params:
      # delta_lake S3 parameters
      aws_region: us-west-2
      aws_access_key_id: <aws-access-key-id>
      aws_secret_access_key: <aws-secret>
      aws_endpoint: s3.us-west-2.amazonaws.com
      # spark_connect parameters
      databricks_cluster_id: 1234-567890-abcde123
```

## `from`

The `from` field specifies the catalog provider. For Databricks, use `databricks:<catalog_name>`, where `<catalog_name>` is the name of the catalog in the Databricks Unity Catalog to connect to.

## `name`

The `name` field specifies the name of the catalog in Spice. Tables from the Databricks catalog will be available in the catalog with this name in Spice. The schema hierarchy of the external catalog is preserved in Spice.

## `include`

Use the `include` field to specify which tables to include from the catalog. The `include` field supports glob patterns to match multiple tables. For example, `*.my_table_name` would include every table named `my_table_name` in any schema of the catalog. Multiple `include` patterns can be specified; they are OR'ed together, so a table matching any pattern is included.
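
A sketch combining several patterns (the schema and table names here are hypothetical):

```yaml
include:
  - "sales.*" # every table in the "sales" schema
  - "*.customers" # any table named "customers", in any schema
  - "reporting.daily_*" # tables in "reporting" whose names start with "daily_"
```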

## `params`

The `params` field configures the connection to the Databricks Unity Catalog. The following parameters are supported (see the example after this list):

- `endpoint`: The Databricks workspace endpoint, e.g. `dbc-a12cd3e4-56f7.cloud.databricks.com`.
- `token`: The Databricks API token to authenticate with the Unity Catalog API. Can also be specified in the `databricks` secret. See the [Databricks Data Connector](/components/data-connectors/databricks.md) for more information on configuring the secret.
- `mode`: The execution mode for querying against Databricks. The default is `spark_connect`. Possible values:
- `spark_connect`: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.
- `delta_lake`: Query directly from Delta Tables. Requires the object store credentials to be provided, either as a secret or inline in the params.
- `databricks_use_ssl`: If true, use a TLS connection to connect to the Databricks endpoint. Default is `true`.
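
For example, a minimal `spark_connect` configuration might look like the following sketch, reusing the sample endpoint and cluster ID from above; the token is shown as a placeholder and could instead come from the `databricks` secret:

```yaml
catalogs:
  - from: databricks:my_uc_catalog
    name: uc_catalog
    params:
      endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
      mode: spark_connect
      databricks_use_ssl: true # the default, shown for clarity
      # token: <databricks-api-token> # placeholder; may also be set via the `databricks` secret
    dataset_params:
      databricks_cluster_id: 1234-567890-abcde123
```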

## `dataset_params`

The `dataset_params` field configures dataset-specific parameters for the catalog. The following parameters are supported:

### Spark Connect parameters
- `databricks_cluster_id`: The ID of the Databricks compute cluster to use for the query, e.g. `1234-567890-abcde123`.

### Delta Lake parameters
These settings can also be configured in the `databricks` secret. See the [Databricks Data Connector](/components/data-connectors/databricks.md) for more information on configuring the secret.

#### AWS S3

- `aws_region`: The AWS region for the S3 object store.
- `aws_access_key_id`: The access key ID for the S3 object store.
- `aws_secret_access_key`: The secret access key for the S3 object store.
- `aws_endpoint`: The endpoint for the S3 object store.

#### Azure Blob

Note: one of the following must be provided: `azure_storage_account_key`; `azure_storage_client_id` and `azure_storage_client_secret` (together); or `azure_storage_sas_key`.

- `azure_storage_account_name`: The Azure Storage account name.
- `azure_storage_account_key`: The Azure Storage master key for accessing the storage account.
- `azure_storage_client_id`: The service principal client id for accessing the storage account.
- `azure_storage_client_secret`: The service principal client secret for accessing the storage account.
- `azure_storage_sas_key`: The shared access signature key for accessing the storage account.
- `azure_storage_endpoint`: The endpoint for the Azure Blob storage account.

#### Google Storage (GCS)

- `google_service_account`: Filesystem path to the Google service account JSON key file.
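
Putting `delta_lake` mode together with, for example, Azure Blob parameters, a catalog might look like the following sketch (the storage account name and key are placeholders):

```yaml
catalogs:
  - from: databricks:my_uc_catalog
    name: uc_catalog
    params:
      endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
      mode: delta_lake
    dataset_params:
      azure_storage_account_name: myaccount # placeholder
      azure_storage_account_key: <azure-storage-key> # placeholder
```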
44 changes: 44 additions & 0 deletions spiceaidocs/docs/components/catalogs/index.md
@@ -0,0 +1,44 @@
---
title: 'Catalog Connectors'
sidebar_label: 'Catalog Connectors'
description: ''
sidebar_position: 5
pagination_prev: null
pagination_next: null
---

In Spice, datasets are organized hierarchically with catalogs, schemas, and tables. A catalog, at the top level, contains multiple schemas. Each schema, in turn, contains multiple tables where the actual data is stored. By default, a catalog named `spice` is created containing all of the datasets defined in the `datasets` section of the Spicepod.

<img src="/img/catalog-schema-table.png" alt="Catalog, schema, and table hierarchy in Spice" />

Creating schemas and tables within the `spice` catalog is controlled by the `name` field in the dataset configuration. A name containing a period (`.`) creates a schema: for example, a dataset defined with `name: foo.bar` would have a full path of `spice.foo.bar`. If the name does not contain a period, the dataset is created in the `public` schema of the `spice` catalog, so a dataset defined with `name: foo` would have a full path of `spice.public.foo`. Attempting to create a dataset with a name that contains a catalog name will result in an error. Adding catalogs to Spice is done via Catalog Connectors.
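
A short sketch of how `name` maps onto the hierarchy (the sources are placeholders):

```yaml
datasets:
  - from: s3://my-bucket/foo.parquet # placeholder source
    name: foo # queryable as spice.public.foo
  - from: s3://my-bucket/bar.parquet # placeholder source
    name: sales.bar # queryable as spice.sales.bar
```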

Catalog Connectors connect to external catalog providers and make their tables available for federated SQL query in Spice. Configuring accelerations for tables in external catalogs is not supported. The schema hierarchy of the external catalog is preserved in Spice.

Currently supported Catalog Connectors include:

| Name | Description | Status | Protocol/Format |
| --------------- | ----------- | ------ | ----------------------------------- |
| `databricks` | Databricks | Alpha | Spark Connect <br/> S3 / Delta Lake |
| `unity_catalog` | Unity Catalog | Alpha | Delta Lake |
| `spice.ai` | Spice.ai Cloud Platform | Alpha | Arrow Flight |

## Catalog Connector Docs

Catalogs are configured using a Catalog Connector in the `catalogs` section of the Spicepod. See the specific Catalog Connector documentation for configuration details.

### `include`

Use the `include` field to specify which tables to include from the catalog. The `include` field supports glob patterns to match multiple tables. For example, `*.my_table_name` would include every table named `my_table_name` in any schema of the catalog. Multiple `include` patterns can be specified; they are OR'ed together, so a table matching any pattern is included.

Example:
```yaml
catalogs:
  - from: spiceai
    name: spiceai
    include:
      - "tpch.*" # Include only the "tpch" tables.
```

import DocCardList from '@theme/DocCardList';
<DocCardList />
32 changes: 32 additions & 0 deletions spiceaidocs/docs/components/catalogs/spiceai.md
@@ -0,0 +1,32 @@
---
title: 'Spice.ai Catalog Connector'
sidebar_label: 'Spice.ai'
description: 'Connect to the Spice.ai built-in catalog.'
sidebar_position: 3
pagination_prev: null
pagination_next: null
---

Query all of the datasets provided by the [Spice.ai Cloud Platform](https://spice.ai).

## Configuration

Create a [Spice.ai Cloud Platform](https://spice.ai) account and log in with the CLI using `spice login`.

Example:
```yaml
catalogs:
  - from: spiceai
    name: spicey # tables from the Spice.ai platform will be available in the "spicey" catalog in Spice
    include:
      - "tpch.*" # include only the tables from the "tpch" schema
```

## `from`

The `from` field specifies the catalog provider. For the Spice.ai catalog connector, use `spiceai`.

## `name`

The `name` field specifies the name of the catalog in Spice. Tables from the Spice.ai built-in catalog will be available in the catalog with this name in Spice.

## `include`

Use the `include` field to specify which tables to include from the catalog. The `include` field supports glob patterns to match multiple tables. For example, `*.my_table_name` would include every table named `my_table_name` in any schema of the catalog. Multiple `include` patterns can be specified; they are OR'ed together, so a table matching any pattern is included.
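
For instance, to expose only specific TPC-H tables rather than the whole schema, exact names can be listed instead of globs (a sketch):

```yaml
include:
  - "tpch.lineitem"
  - "tpch.orders"
```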
61 changes: 61 additions & 0 deletions spiceaidocs/docs/components/catalogs/unity-catalog.md
@@ -0,0 +1,61 @@
---
title: 'Unity Catalog Catalog Connector'
sidebar_label: 'Unity Catalog'
description: 'Connect to a Unity Catalog provider.'
sidebar_position: 2
pagination_prev: null
pagination_next: null
---

Connect to a [Unity Catalog](https://www.unitycatalog.io/) as a catalog provider for federated SQL query against [Delta Lake](https://delta.io/) tables.

## Configuration

```yaml
catalogs:
  - from: unity_catalog:https://my_unity_catalog_host.com/api/2.1/unity-catalog/catalogs/my_catalog
    name: uc
    include:
      - "*.my_table"
    dataset_params:
      # delta_lake S3 parameters
      aws_region: us-west-2
      aws_access_key_id: <aws-access-key-id>
      aws_secret_access_key: <aws-secret>
      aws_endpoint: s3.us-west-2.amazonaws.com
```

## `from`

The `from` field specifies the catalog provider. For Unity Catalog, use `unity_catalog:<catalog_path>`, where `<catalog_path>` is the URL of the [`getCatalog`](https://github.com/unitycatalog/unitycatalog/blob/main/api/Apis/CatalogsApi.md) endpoint of the Unity Catalog API, formatted as `https://<unity_catalog_host>/api/2.1/unity-catalog/catalogs/<catalog_name>`.

## `name`

The `name` field specifies the name of the catalog in Spice. The schema hierarchy of the external catalog is preserved in Spice.

## `include`

Use the `include` field to specify which tables to include from the catalog. The `include` field supports glob patterns to match multiple tables. For example, `*.my_table_name` would include every table named `my_table_name` in any schema of the catalog. Multiple `include` patterns can be specified; they are OR'ed together, so a table matching any pattern is included.

## `dataset_params`

The `dataset_params` field configures dataset-specific parameters for the catalog.

These settings can also be configured in the `delta_lake` secret. See the [Delta Lake Data Connector](/components/data-connectors/delta-lake.md) for more information on configuring the secret.

#### AWS S3

- `aws_region`: The AWS region for the S3 object store.
- `aws_access_key_id`: The access key ID for the S3 object store.
- `aws_secret_access_key`: The secret access key for the S3 object store.
- `aws_endpoint`: The endpoint for the S3 object store.

#### Azure Blob

Note: one of the following must be provided: `azure_storage_account_key`; `azure_storage_client_id` and `azure_storage_client_secret` (together); or `azure_storage_sas_key`.

- `azure_storage_account_name`: The Azure Storage account name.
- `azure_storage_account_key`: The Azure Storage master key for accessing the storage account.
- `azure_storage_client_id`: The service principal client id for accessing the storage account.
- `azure_storage_client_secret`: The service principal client secret for accessing the storage account.
- `azure_storage_sas_key`: The shared access signature key for accessing the storage account.
- `azure_storage_endpoint`: The endpoint for the Azure Blob storage account.

#### Google Storage (GCS)

- `google_service_account`: Filesystem path to the Google service account JSON key file.
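
For example, a Unity Catalog connection reading Delta tables from GCS might look like the following sketch (the key-file path is a placeholder):

```yaml
catalogs:
  - from: unity_catalog:https://my_unity_catalog_host.com/api/2.1/unity-catalog/catalogs/my_catalog
    name: uc
    dataset_params:
      google_service_account: /path/to/service-account.json # placeholder path
```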
84 changes: 84 additions & 0 deletions spiceaidocs/docs/reference/spicepod/catalogs.md
@@ -0,0 +1,84 @@
---
title: 'Catalogs'
sidebar_label: 'Catalogs'
description: 'Catalogs YAML reference'
---

A Spicepod can contain one or more catalogs.

# `catalogs`

Example:

`spicepod.yaml`

```yaml
catalogs:
  - from: spiceai
    name: spiceai
    include:
      - "tpch.*" # Include only the "tpch" tables.
```

## `from`

The `from` field is a string that represents the Uniform Resource Identifier (URI) for the catalog provider. This URI is composed of two parts: a prefix indicating the Catalog Connector to use, and the catalog path within the source.

The syntax for the `from` field is as follows:

```yaml
from: <catalog_connector>:<path>
```

Where:

- `<catalog_connector>`: The Catalog Connector to use to connect to the catalog provider.

Currently supported catalog connectors:

- [`spiceai`](/components/catalogs/spiceai.md)
- [`databricks`](/components/catalogs/databricks.md)
- [`unity_catalog`](/components/catalogs/unity-catalog.md)

If the Catalog Connector is not explicitly specified, it defaults to `spiceai`.

- `<path>`: The path to the catalog within the provider.
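
For example, reusing the sample values from the Databricks connector docs above (the catalog name is hypothetical):

```yaml
from: databricks:my_uc_catalog # connector "databricks", path "my_uc_catalog"
```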

## `ref`

An alternative to adding the catalog definition inline in the `spicepod.yaml` file. `ref` can be used to point to a directory with a catalog defined in a `catalog.yaml` file. For example, a catalog configured in a `catalog.yaml` in the `catalogs/sample` directory can be referenced with the following:

**catalogs/sample/catalog.yaml**

```yaml
from: spiceai
name: spiceai
include:
  - "tpch.*" # Include only the "tpch" tables.
```

**ref used in spicepod.yaml**

```yaml
version: v1beta1
kind: Spicepod
name: duckdb
catalogs:
  - ref: catalogs/sample
```

## `name`

The name of the catalog to register in Spice. The schema hierarchy of the external catalog is preserved in Spice. It doesn't need to match the name of the catalog in the external provider.

## `include`

Optional. The `include` field is used to specify which tables to include from the catalog. It supports glob patterns to match multiple tables. For example, `*.my_table_name` would include every table named `my_table_name` in any schema of the catalog. Multiple `include` patterns can be specified; they are OR'ed together, so a table matching any pattern is included.

## `params`

Optional. Parameters to pass to the catalog connector for retrieving the metadata on the schemas and tables to be included. The parameters are specific to the connector used.

## `dataset_params`

Optional. Parameters used when constructing the individual datasets that are registered in Spice from the catalog. The parameters are specific to the connector used.
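
A sketch combining both, reusing the Databricks connector's sample values (connection metadata in `params`, per-dataset object-store credentials in `dataset_params`):

```yaml
catalogs:
  - from: databricks:my_uc_catalog
    name: uc_catalog
    params:
      endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
      mode: delta_lake
    dataset_params:
      aws_region: us-west-2
      aws_access_key_id: <aws-access-key-id>
      aws_secret_access_key: <aws-secret>
```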
Binary file added spiceaidocs/static/img/catalog-schema-table.png
