Change default of hive_infer_partitions to false #551

Merged 1 commit on Oct 20, 2024
2 changes: 1 addition & 1 deletion spiceaidocs/docs/components/data-connectors/abfs.md
@@ -75,7 +75,7 @@ SELECT COUNT(*) FROM cool_dataset
| `abfs_proxy_ca_certificate` | A trusted CA certificate for the proxy |
| `abfs_proxy_exludes` | A list of hosts to exclude from proxy connections |
| `abfs_disable_tagging` | Ignore any tags provided to `put_opts` |
-| `hive_infer_partitions` | Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true |
+| `hive_infer_partitions` | Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false` |

#### Authentication parameters

2 changes: 1 addition & 1 deletion spiceaidocs/docs/components/data-connectors/file.md
@@ -32,7 +32,7 @@ datasets:
| Parameter name | Description |
|------------------------|-------------------------------------------------------------------------------------------------------|
| `file_format` | Specifies the data file format. Required if the format cannot be inferred from the `from` path. |
-| `hive_infer_partitions`| Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true. |
+| `hive_infer_partitions`| Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false` |

For CSV-specific parameters, see [CSV Parameters](/reference/file_format.md#csv).

4 changes: 2 additions & 2 deletions spiceaidocs/docs/components/data-connectors/ftp.md
@@ -24,7 +24,7 @@ If a folder is provided, all child Parquet/CSV files will be loaded.
- `ftp_user`: The username for the FTP server. E.g. `ftp_user: my-ftp-user`
- `ftp_pass`: The password for the FTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_ftp_pass}`.
- `client_timeout`: Optional. Specifies timeout for FTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for FTP client.
-- `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
+- `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`

More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)

@@ -52,7 +52,7 @@ If a folder is provided, all child Parquet/CSV files will be loaded.
- `sftp_user`: The username for the SFTP server. E.g. `sftp_user: my-sftp-user`
- `sftp_pass`: The password for the SFTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_sftp_pass}`.
- `client_timeout`: Optional. Specifies timeout for SFTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for SFTP client.
-- `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
+- `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`

More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)

13 changes: 12 additions & 1 deletion spiceaidocs/docs/components/data-connectors/s3.md
@@ -60,7 +60,7 @@ Example: `name: cool_dataset`
- `s3_endpoint`: The S3 endpoint, or equivalent (e.g. MinIO endpoint), for the S3-compatible storage. Defaults to region endpoint. E.g. `s3_endpoint: https://my.minio.server`
- `s3_region`: Region of the S3 bucket, if region specific. Default value is `us-east-1` E.g. `s3_region: us-east-1`
- `client_timeout`: Specifies timeout for S3 operations. Default value is `30s` E.g. `client_timeout: 60s`
-- `hive_infer_partitions`: Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
+- `hive_infer_partitions`: Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`

More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)

@@ -207,6 +207,17 @@ Create a dataset named `taxi_trips` from a public S3 folder.

### Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:

```plaintext
s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
```
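
To make the `key=value` directory convention concrete, here is a minimal Python sketch of how partition columns could be read from paths like the ones above. It is an illustration only, not Spice's implementation; the function name `infer_hive_partitions` is hypothetical, and values are kept as strings because any type inference is outside what this sketch assumes.

```python
def infer_hive_partitions(path: str) -> dict[str, str]:
    """Collect partition columns from key=value directory segments.

    Hypothetical helper for illustration; not part of Spice.
    """
    segments = path.split("/")
    partitions: dict[str, str] = {}
    # The last segment is the file name, so only directory segments are checked.
    for segment in segments[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:  # keep only segments shaped like key=value
            partitions[key] = value
    return partitions


paths = [
    "s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet",
    "s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet",
]
for p in paths:
    print(infer_hive_partitions(p))
# {'year': '2024', 'month': '03', 'day': '15'}
# {'year': '2024', 'month': '03', 'day': '16'}
```

Because the partition values live in the path, a filter such as `day = '15'` can be answered by listing only the matching directories, which is the scan-skipping benefit described above.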

Spice can automatically infer these partition columns from the directory structure when `hive_infer_partitions` is set to `true`.

```yaml
version: v1beta1
kind: Spicepod