Change default of hive_infer_partitions to false (#551)

spiceai · Oct 20, 2024 · 7e8d9c4
1 parent a6315c2

Showing 4 changed files with 16 additions and 5 deletions.
diff --git a/spiceaidocs/docs/components/data-connectors/abfs.md b/spiceaidocs/docs/components/data-connectors/abfs.md
@@ -75,7 +75,7 @@ SELECT COUNT(*) FROM cool_dataset
 | `abfs_proxy_ca_certificate` | A trusted CA certificate for the proxy                                                  |
 | `abfs_proxy_exludes`        | A list of hosts to exclude from proxy connections                                       |
 | `abfs_disable_tagging`      | Ignore any tags provided to `put_opts`                                                  |
-| `hive_infer_partitions`     | Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true |
+| `hive_infer_partitions`     | Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false` |
 
 #### Authentication parameters
 

diff --git a/spiceaidocs/docs/components/data-connectors/file.md b/spiceaidocs/docs/components/data-connectors/file.md
@@ -32,7 +32,7 @@ datasets:
 | Parameter name         | Description                                                                                           |
 |------------------------|-------------------------------------------------------------------------------------------------------|
 | `file_format`          | Specifies the data file format. Required if the format cannot be inferred from the `from` path.       |
-| `hive_infer_partitions`| Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.  |
+| `hive_infer_partitions`| Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`. |
 
 For CSV-specific parameters, see [CSV Parameters](/reference/file_format.md#csv).
 

diff --git a/spiceaidocs/docs/components/data-connectors/ftp.md b/spiceaidocs/docs/components/data-connectors/ftp.md
@@ -24,7 +24,7 @@ If a folder is provided, all child Parquet/CSV files will be loaded.
     - `ftp_user`: The username for the FTP server. E.g. `ftp_user: my-ftp-user`
     - `ftp_pass`: The password for the FTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_ftp_pass}`.
     - `client_timeout`: Optional. Specifies timeout for FTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for FTP client.
-    - `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
+    - `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`.
 
     More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)
 
@@ -52,7 +52,7 @@ If a folder is provided, all child Parquet/CSV files will be loaded.
     - `sftp_user`: The username for the SFTP server. E.g. `sftp_user: my-sftp-user`
     - `sftp_pass`: The password for the SFTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_sftp_pass}`.
     - `client_timeout`: Optional. Specifies timeout for SFTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for SFTP client.
-    - `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
+    - `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`.
 
     More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)
 

diff --git a/spiceaidocs/docs/components/data-connectors/s3.md b/spiceaidocs/docs/components/data-connectors/s3.md
@@ -60,7 +60,7 @@ Example: `name: cool_dataset`
 - `s3_endpoint`: The S3 endpoint, or equivalent (e.g. MinIO endpoint), for the S3-compatible storage. Defaults to region endpoint. E.g. `s3_endpoint: https://my.minio.server`
 - `s3_region`: Region of the S3 bucket, if region specific. Default value is `us-east-1` E.g. `s3_region: us-east-1`
 - `client_timeout`: Specifies timeout for S3 operations. Default value is `30s` E.g. `client_timeout: 60s`
-- `hive_infer_partitions`: Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
+- `hive_infer_partitions`: Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`.
 
 More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)
 
@@ -207,6 +207,17 @@ Create a dataset named `taxi_trips` from a public S3 folder.
 
 ### Hive Partitioning Example
 
+Hive partitioning is a data organization technique that stores data in a hierarchical directory structure, with one directory level per partition column and each directory named `column=value`. Queries that filter on a partition column can then skip every directory whose value does not match, which reduces the amount of data scanned.
+
+For example, a dataset partitioned by year, month, and day might have a directory structure like:
+
+```plaintext
+s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
+s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
+```
+
+Spice can automatically infer these partition columns from the directory structure when `hive_infer_partitions` is set to `true`.
+
 ```yaml
 version: v1beta1
 kind: Spicepod