Add type mapping documentation for databricks delta_lake mode #669

Merged 2 commits on Dec 6, 2024 (changes from all commits).
137 changes: 81 additions & 56 deletions spiceaidocs/docs/components/data-connectors/databricks.md
Databricks as a connector for federated SQL query against Databricks using Spark Connect or directly from Delta Tables.

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table # A reference to a table in the Databricks Unity Catalog
    name: my_delta_lake_table
    params:
      mode: delta_lake
```
The `from` field for the Databricks connector takes the form `databricks:catalog.schema.table`.
The dataset name. This will be used as the table name within Spice.

Example:

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: cool_dataset
    params: ...
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

Use the [secret replacement syntax](../secret-stores/index.md) to reference a secret, e.g. `${secrets:my_token}`.

| Parameter Name          | Description                                                                                                                                                                                                                                                                                                                                            |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `mode`                  | The execution mode for querying against Databricks. The default is `spark_connect`. Possible values:<br /> <ul><li>`spark_connect`: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.</li><li>`delta_lake`: Query directly from Delta Tables. Requires the object store credentials to be provided.</li></ul> |
| `databricks_endpoint`   | The endpoint of the Databricks instance. Required for both modes.                                                                                                                                                                                                                                                                                      |
| `databricks_cluster_id` | The ID of the compute cluster in Databricks to use for the query. Only valid when `mode` is `spark_connect`.                                                                                                                                                                                                                                           |
| `databricks_use_ssl`    | If true, use a TLS connection to connect to the Databricks endpoint. Default is `true`.                                                                                                                                                                                                                                                                |
| `client_timeout`        | Optional. Applicable only in `delta_lake` mode. Specifies the timeout for object store operations. Default is `30s`. E.g. `client_timeout: 60s`.                                                                                                                                                                                                       |
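
For example, a minimal sketch of a `delta_lake` dataset that raises the object store timeout to 60 seconds (the endpoint, table reference, and secret names are the placeholder values used throughout this page):

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_delta_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
      client_timeout: 60s # allow object store operations up to 60 seconds
      # ...plus the object store parameters for your storage provider (see below)
```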

## Delta Lake object store parameters

Configure the connection to the object store when using `mode: delta_lake`. Use the [secret replacement syntax](../secret-stores/index.md) to reference secrets.
:::info Note
**One** of the following auth values must be provided for Azure Blob:

- `databricks_azure_storage_account_key`,
- `databricks_azure_storage_client_id` and `databricks_azure_storage_client_secret`, or
- `databricks_azure_storage_sas_key`.
:::

| Parameter Name                            | Description                                                             |
| ----------------------------------------- | ----------------------------------------------------------------------- |
| `databricks_azure_storage_account_name`   | The Azure Storage account name.                                         |
| `databricks_azure_storage_account_key`    | The Azure Storage key for accessing the storage account.                |
| `databricks_azure_storage_client_id`      | The Service Principal client ID for accessing the storage account.     |
| `databricks_azure_storage_client_secret`  | The Service Principal client secret for accessing the storage account. |
| `databricks_azure_storage_sas_key`        | The shared access signature key for accessing the storage account.     |
### Spark Connect

```yaml
- from: databricks:spiceai.datasets.my_spark_table # A reference to a table in the Databricks Unity Catalog
  name: my_delta_lake_table
  params:
    mode: spark_connect
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_cluster_id: 1234-567890-abcde123
    databricks_token: ${secrets:my_token}
```

### Delta Lake (S3)

```yaml
- from: databricks:spiceai.datasets.my_delta_table # A reference to a table in the Databricks Unity Catalog
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}
    databricks_aws_region: us-west-2 # Optional
    databricks_aws_access_key_id: ${secrets:aws_access_key_id}
    databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
    databricks_aws_endpoint: s3.us-west-2.amazonaws.com # Optional
```

### Delta Lake (Azure Blobs)

```yaml
- from: databricks:spiceai.datasets.my_adls_table # A reference to a table in the Databricks Unity Catalog
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}

    # Account Name + Key
    databricks_azure_storage_account_name: my_account
    databricks_azure_storage_account_key: ${secrets:my_key}

    # OR Service Principal + Secret
    databricks_azure_storage_client_id: my_client_id
    databricks_azure_storage_client_secret: ${secrets:my_secret}

    # OR SAS Key
    databricks_azure_storage_sas_key: my_sas_key
```

### Delta Lake (GCP)

```yaml
- from: databricks:spiceai.datasets.my_gcp_table # A reference to a table in the Databricks Unity Catalog
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}
    databricks_google_service_account_path: /path/to/service-account.json
```

## Types

### mode: delta_lake

The table below shows the supported Databricks (`mode: delta_lake`) data types and how each maps to an Apache Arrow type in Spice.

| Databricks SQL Type | Arrow Type |
| ------------------- | ------------------------------------- |
| `STRING` | `Utf8` |
| `BIGINT` | `Int64` |
| `INT` | `Int32` |
| `SMALLINT` | `Int16` |
| `TINYINT` | `Int8` |
| `FLOAT` | `Float32` |
| `DOUBLE` | `Float64` |
| `BOOLEAN` | `Boolean` |
| `BINARY` | `Binary` |
| `DATE` | `Date32` |
| `TIMESTAMP` | `Timestamp(Microsecond, Some("UTC"))` |
| `TIMESTAMP_NTZ` | `Timestamp(Microsecond, None)` |
| `DECIMAL` | `Decimal128` |
| `ARRAY` | `List` |
| `STRUCT` | `Struct` |
| `MAP` | `Map` |

## Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).
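
As a sketch of how a referenced secret might be wired up, assuming the environment-variable (`env`) secret store described in that documentation (the variable name here is a placeholder):

```yaml
# Sketch only: assumes the `env` secret store is available and that
# DATABRICKS_TOKEN is set in the environment before starting Spice.
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_token: ${env:DATABRICKS_TOKEN} # resolved from the environment at load time
```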