
Updates and clarity to data connectors (#453)
* Updates and clarity to data connectors

* Update spiceaidocs/docs/components/data-connectors/index.md

* Update spiceaidocs/docs/components/data-connectors/index.md

* Update spiceaidocs/docs/components/data-connectors/index.md

* Update spiceaidocs/docs/components/data-connectors/index.md
lukekim authored Oct 14, 2024
1 parent 5512ea5 commit 3c084e8
Showing 1 changed file with 47 additions and 40 deletions.
87 changes: 47 additions & 40 deletions spiceaidocs/docs/components/data-connectors/index.md
@@ -1,7 +1,7 @@
---
title: 'Data Connectors'
sidebar_label: 'Data Connectors'
-description: ''
+description: 'Learn how to use Data Connectors to query external data.'
sidebar_position: 1
pagination_prev: null
pagination_next: null
@@ -11,62 +11,66 @@ Data Connectors provide connections to databases, data warehouses, and data lake

Currently supported Data Connectors include:

-| Name | Description | Status | Protocol/Format | Refresh Modes | Supports Inserts | Supports Documents |
-| --------------- | --------------| ------ | ----------------------------------- | ---------------- | ---------------- | ------------------ |
-| `clickhouse` | Clickhouse | Alpha | | `full` | | |
-| `databricks` | Databricks | Alpha | Spark Connect <br/> S3 / Delta Lake | `full` |||
-| `delta_lake` | Delta Lake | Alpha | Delta Lake | `full` | ||
-| `dremio` | Dremio | Alpha | Arrow Flight SQL | `full` |||
-| `file` | File | Alpha | Parquet, CSV | `full` | | |
-| `flightsql` | FlightSQL | Alpha | Arrow Flight SQL | `full` |||
-| `ftp`, `sftp` | FTP/SFTP | Alpha | Parquet, CSV | `full` | ||
-| `graphql` | GraphQL | Alpha | GraphQL | `full` |||
-| `github` | GitHub | Alpha | GraphQL, REST | `full` || |
-| `http`, `https` | HTTP(s) | Alpha | Parquet, CSV | `full` |||
-| `mssql` | MS SQL Server | Alpha | Tabular Data Stream (TDS) | `full` |||
-| `mysql` | MySQL | Alpha | | `full` |||
-| `odbc` | ODBC | Alpha | ODBC | `full` |||
-| `postgres` | PostgreSQL | Alpha | | `full` | ||
-| `sharepoint` | SharePoint | Alpha | | `full` || |
-| `snowflake` | Snowflake | Alpha | Arrow | `full` | ||
-| `spiceai` | Spice.ai | Alpha | Arrow Flight | `append`, `full` | | |
-| `s3` | S3 | Alpha | Parquet, CSV | `full` |||
-| `abfs` | Azure BlobFS | Alpha | Parquet, CSV | `full` | | |
-| `sharepoint` | SharePoint | Alpha | | `full` | | |
-| `spark` | Spark | Alpha | Spark Connect | `full` |||
+| Name | Description | Status | Protocol/Format | Refresh Modes | Supports [Ingestion](https://docs.spiceai.org/features/data-ingestion) | Supports Documents |
+| --------------- | ------------- | ------ | ----------------------------------- | --------------------------- | ------------------ | ------------------ |
+| `abfs` | Azure BlobFS | Alpha | Parquet, CSV | `append`, `full` | Roadmap | |
+| `clickhouse` | Clickhouse | Alpha | | `append`, `full` | | |
+| `databricks` | Databricks | Beta | Spark Connect <br/> S3 / Delta Lake | `append`, `full` | Roadmap | |
+| `debezium` | Debezium | Alpha | CDC, Kafka | `append`, `full`, `changes` | | |
+| `delta_lake` | Delta Lake | Beta | Delta Lake | `append`, `full` | Roadmap | |
+| `dremio` | Dremio | Alpha | Arrow Flight SQL | `append`, `full` | | |
+| `file` | File | Alpha | Parquet, CSV | `append`, `full` | Roadmap | |
+| `flightsql` | FlightSQL | Beta | Arrow Flight SQL | `append`, `full` | | |
+| `ftp`, `sftp` | FTP/SFTP | Alpha | Parquet, CSV | `append`, `full` | | |
+| `github` | GitHub | Alpha | GraphQL, REST | `append`, `full` | | |
+| `graphql` | GraphQL | Alpha | GraphQL | `append`, `full` | | |
+| `http`, `https` | HTTP(s) | Alpha | Parquet, CSV | `append`, `full` | | |
+| `mssql` | MS SQL Server | Alpha | Tabular Data Stream (TDS) | `append`, `full` | | |
+| `mysql` | MySQL | Beta | | `append`, `full` | Roadmap | |
+| `odbc` | ODBC | Beta | | `append`, `full` | | |
+| `postgres` | PostgreSQL | Beta | | `append`, `full` | Roadmap | |
+| `s3` | S3 | Beta | Parquet, CSV | `append`, `full` | Roadmap | |
+| `sharepoint` | SharePoint | Alpha | | `append`, `full` | | |
+| `snowflake` | Snowflake | Alpha | Arrow | `append`, `full` | Roadmap | |
+| `spiceai` | Spice.ai | Beta | Arrow Flight | `append`, `full` | | |
+| `spark` | Spark | Alpha | Spark Connect | `append`, `full` | | |
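
As a sketch of how a connector from this table is wired into a spicepod, the example below declares a dataset backed by the `postgres` connector with a `full` refresh acceleration. The connection details and the `orders` table are illustrative assumptions, not a definitive configuration:

```yaml
datasets:
  - name: orders
    from: postgres:public.orders # <connector>:<path> selects the data connector
    params:
      pg_host: localhost # hypothetical connection details
      pg_db: shop
      pg_user: spice
      pg_pass: ${secrets:pg_pass} # resolved from the configured secret store
    acceleration:
      enabled: true
      refresh_mode: full # one of the refresh modes listed in the table above
```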

## Object Store File Formats

For data connectors that are object-store compatible, if a folder path is provided, the file format must be specified with `params.file_format`.

If a file path is provided, the file format is inferred, and `params.file_format` is not required.
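
For example, a folder-backed dataset needs `file_format`, while a single-file dataset does not (the bucket and paths here are hypothetical):

```yaml
datasets:
  # Folder path: the format cannot be inferred, so file_format is required
  - name: reports
    from: s3://my-company-bucket/reports/
    params:
      file_format: parquet
  # Single file: the format is inferred from the file itself
  - name: sales
    from: s3://my-company-bucket/sales.csv
```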

File formats currently supported are:

-| Name | Parameter | Supported | Is Document Format |
-| --------------------------------------------- | ----------------------- | --------- | ------------------ |
-| [Apache Parquet](https://parquet.apache.org/) | `file_format: parquet` | ✅ | ❌ |
-| [CSV](/reference/file_format.md#csv) | `file_format: csv` | ✅ | ❌ |
-| [Apache Iceberg](https://iceberg.apache.org/) | `file_format: iceberg` | Roadmap | ❌ |
-| JSON | `file_format: json` | Roadmap | ❌ |
-| Microsoft Excel | `file_format: xlsx` | Roadmap | ❌ |
-| Markdown | `file_format: md` | ✅ | ✅ |
-| Text | `file_format: txt` | ✅ | ✅ |
-| PDF | `file_format: pdf` | Alpha | ✅ |
-| Microsoft Word | `file_format: docx` | Alpha | ✅ |
+| Name | Parameter | Supported | Is Document Format |
+| --------------------------------------------- | ---------------------- | --------- | ------------------ |
+| [Apache Parquet](https://parquet.apache.org/) | `file_format: parquet` | ✅ | ❌ |
+| [CSV](/reference/file_format.md#csv) | `file_format: csv` | ✅ | ❌ |
+| [Apache Iceberg](https://iceberg.apache.org/) | `file_format: iceberg` | Roadmap | ❌ |
+| JSON | `file_format: json` | Roadmap | ❌ |
+| Microsoft Excel | `file_format: xlsx` | Roadmap | ❌ |
+| Markdown | `file_format: md` | ✅ | ✅ |
+| Text | `file_format: txt` | ✅ | ✅ |
+| PDF | `file_format: pdf` | Alpha | ✅ |
+| Microsoft Word | `file_format: docx` | Alpha | ✅ |

File formats support additional parameters in `params` (like `csv_has_header`), described in [File Formats](/reference/file_format).
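
For instance, a CSV-backed dataset might combine `file_format` with a format-specific option like `csv_has_header` (the path and value shown are illustrative):

```yaml
datasets:
  - name: trips
    from: file:data/trips.csv
    params:
      file_format: csv
      csv_has_header: true # see the File Formats reference for other CSV options
```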

If a format is a document format, each file will be treated as a document, as per [document support](#document-support) below.

:::warning[Note]
Document formats in Alpha (e.g. pdf, docx) may not parse all structure or text from the underlying documents correctly.
:::

### Document Support

If a Data Connector supports documents, then when the appropriate file format is specified (see [above](#object-store-file-formats)), each file is treated as a row in the table, with the contents of the file in the `content` column. Additional columns vary by data connector.

#### Example

Consider a local filesystem:

```shell
>>> ls -la
total 232
```

@@ -78,14 +82,17 @@

```shell
drwxr-sr-x@ 18 jeadie staff 576 30 Jul 13:12 ..
```

And the spicepod:

```yaml
datasets:
  - name: my_documents
    from: file:docs/decisions/
    params:
      file_format: md
```
A Document table will be created:
```shell
>>> SELECT * FROM my_documents LIMIT 3
+----------------------------------------------------+--------------------------------------------------+
```
