Change default of hive_infer_partitions to false (#551)
phillipleblanc authored Oct 20, 2024
1 parent a6315c2 commit 7e8d9c4
Showing 4 changed files with 16 additions and 5 deletions.
2 changes: 1 addition & 1 deletion spiceaidocs/docs/components/data-connectors/abfs.md
@@ -75,7 +75,7 @@ SELECT COUNT(*) FROM cool_dataset
| `abfs_proxy_ca_certificate` | A trusted CA certificate for the proxy |
| `abfs_proxy_excludes` | A list of hosts to exclude from proxy connections |
| `abfs_disable_tagging` | Ignore any tags provided to `put_opts` |
| `hive_infer_partitions` | Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true |
| `hive_infer_partitions` | Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false` |

#### Authentication parameters

2 changes: 1 addition & 1 deletion spiceaidocs/docs/components/data-connectors/file.md
@@ -32,7 +32,7 @@ datasets:
| Parameter name | Description |
|------------------------|-------------------------------------------------------------------------------------------------------|
| `file_format` | Specifies the data file format. Required if the format cannot be inferred from the `from` path. |
| `hive_infer_partitions`| Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true. |
| `hive_infer_partitions`| Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false` |

For CSV-specific parameters, see [CSV Parameters](/reference/file_format.md#csv).

4 changes: 2 additions & 2 deletions spiceaidocs/docs/components/data-connectors/ftp.md
@@ -24,7 +24,7 @@ If a folder is provided, all child Parquet/CSV files will be loaded.
- `ftp_user`: The username for the FTP server. E.g. `ftp_user: my-ftp-user`
- `ftp_pass`: The password for the FTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_ftp_pass}`.
- `client_timeout`: Optional. Specifies timeout for FTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for FTP client.
- `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
- `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`

More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)

@@ -52,7 +52,7 @@ If a folder is provided, all child Parquet/CSV files will be loaded.
- `sftp_user`: The username for the SFTP server. E.g. `sftp_user: my-sftp-user`
- `sftp_pass`: The password for the SFTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_sftp_pass}`.
- `client_timeout`: Optional. Specifies timeout for SFTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for SFTP client.
- `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
- `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`

More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)

13 changes: 12 additions & 1 deletion spiceaidocs/docs/components/data-connectors/s3.md
@@ -60,7 +60,7 @@ Example: `name: cool_dataset`
- `s3_endpoint`: The S3 endpoint, or equivalent (e.g. MinIO endpoint), for the S3-compatible storage. Defaults to region endpoint. E.g. `s3_endpoint: https://my.minio.server`
- `s3_region`: Region of the S3 bucket, if region specific. Default value is `us-east-1`. E.g. `s3_region: us-east-1`
- `client_timeout`: Specifies timeout for S3 operations. Default value is `30s`. E.g. `client_timeout: 60s`
- `hive_infer_partitions`: Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
- `hive_infer_partitions`: Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`

More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)

@@ -207,6 +207,17 @@ Create a dataset named `taxi_trips` from a public S3 folder.

### Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:

```plaintext
s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
```
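
As a rough illustration of what inferring partition columns from such paths involves, the following Python sketch extracts `key=value` directory segments from an object path. It is not Spice's implementation; the function name `infer_hive_partitions` is hypothetical, and values are kept as plain strings rather than being type-inferred.

```python
from urllib.parse import urlparse


def infer_hive_partitions(path: str) -> dict[str, str]:
    """Return partition columns parsed from `key=value` directory segments.

    Hypothetical helper for illustration only; not part of Spice.
    """
    # Drop the scheme and bucket, keeping only the object key.
    key = urlparse(path).path
    # The last segment is the file name, so only directories are inspected.
    directories = key.strip("/").split("/")[:-1]
    partitions: dict[str, str] = {}
    for segment in directories:
        name, sep, value = segment.partition("=")
        # Only segments of the form `name=value` count as partition columns.
        if sep and name:
            partitions[name] = value
    return partitions


print(infer_hive_partitions(
    "s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet"
))
```

For the first path above this yields `year`, `month`, and `day` columns; the non-partition `dataset` directory is ignored because it contains no `=`.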

Spice can automatically infer these partition columns from the directory structure when `hive_infer_partitions` is set to `true`.

```yaml
version: v1beta1
kind: Spicepod
```