From b5095ec150b3a5458a52c3ca3180ea38bde09d39 Mon Sep 17 00:00:00 2001
From: Phillip LeBlanc
Date: Sun, 20 Oct 2024 00:33:11 +0900
Subject: [PATCH] Change default of `hive_infer_partitions` to false

---
 spiceaidocs/docs/components/data-connectors/abfs.md |  2 +-
 spiceaidocs/docs/components/data-connectors/file.md |  2 +-
 spiceaidocs/docs/components/data-connectors/ftp.md  |  4 ++--
 spiceaidocs/docs/components/data-connectors/s3.md   | 13 ++++++++++++-
 4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/spiceaidocs/docs/components/data-connectors/abfs.md b/spiceaidocs/docs/components/data-connectors/abfs.md
index 4db01804..8971ada2 100644
--- a/spiceaidocs/docs/components/data-connectors/abfs.md
+++ b/spiceaidocs/docs/components/data-connectors/abfs.md
@@ -75,7 +75,7 @@ SELECT COUNT(*) FROM cool_dataset
 | `abfs_proxy_ca_certificate` | A trusted CA certificate for the proxy |
 | `abfs_proxy_exludes` | A list of hosts to exclude from proxy connections |
 | `abfs_disable_tagging` | Ignore any tags provided to `put_opts` |
-| `hive_infer_partitions` | Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true |
+| `hive_infer_partitions` | Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false` |
 
 #### Authentication parameters
 
diff --git a/spiceaidocs/docs/components/data-connectors/file.md b/spiceaidocs/docs/components/data-connectors/file.md
index b503d27b..509d6d9b 100644
--- a/spiceaidocs/docs/components/data-connectors/file.md
+++ b/spiceaidocs/docs/components/data-connectors/file.md
@@ -32,7 +32,7 @@ datasets:
 | Parameter name | Description |
 |------------------------|-------------------------------------------------------------------------------------------------------|
 | `file_format` | Specifies the data file format. Required if the format cannot be inferred from the `from` path. |
-| `hive_infer_partitions`| Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true. |
+| `hive_infer_partitions`| Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false` |
 
 For CSV-specific parameters, see [CSV Parameters](/reference/file_format.md#csv).
 
diff --git a/spiceaidocs/docs/components/data-connectors/ftp.md b/spiceaidocs/docs/components/data-connectors/ftp.md
index 47fa2710..94ce044d 100644
--- a/spiceaidocs/docs/components/data-connectors/ftp.md
+++ b/spiceaidocs/docs/components/data-connectors/ftp.md
@@ -24,7 +24,7 @@ If a folder is provided, all child Parquet/CSV files will be loaded.
  - `ftp_user`: The username for the FTP server. E.g. `ftp_user: my-ftp-user`
  - `ftp_pass`: The password for the FTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_ftp_pass}`.
  - `client_timeout`: Optional. Specifies timeout for FTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for FTP client.
- - `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
+ - `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`
 
 More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)
 
@@ -52,7 +52,7 @@ If a folder is provided, all child Parquet/CSV files will be loaded.
  - `sftp_user`: The username for the SFTP server. E.g. `sftp_user: my-sftp-user`
  - `sftp_pass`: The password for the SFTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_sftp_pass}`.
  - `client_timeout`: Optional. Specifies timeout for SFTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for SFTP client.
- - `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
+ - `hive_infer_partitions`: Optional. Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`
 
 More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)
 
diff --git a/spiceaidocs/docs/components/data-connectors/s3.md b/spiceaidocs/docs/components/data-connectors/s3.md
index 17462800..99e7671b 100644
--- a/spiceaidocs/docs/components/data-connectors/s3.md
+++ b/spiceaidocs/docs/components/data-connectors/s3.md
@@ -60,7 +60,7 @@ Example: `name: cool_dataset`
 - `s3_endpoint`: The S3 endpoint, or equivalent (e.g. MinIO endpoint), for the S3-compatible storage. Defaults to region endpoint. E.g. `s3_endpoint: https://my.minio.server`
 - `s3_region`: Region of the S3 bucket, if region specific. Default value is `us-east-1` E.g. `s3_region: us-east-1`
 - `client_timeout`: Specifies timeout for S3 operations. Default value is `30s` E.g. `client_timeout: 60s`
-- `hive_infer_partitions`: Infer the partition columns for hive-style partitioning from the folder structure. Defaults to true.
+- `hive_infer_partitions`: Infer the partition columns for hive-style partitioning from the folder structure. Defaults to `false`
 
 More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)
 
@@ -207,6 +207,17 @@ Create a dataset named `taxi_trips` from a public S3 folder.
 
 ### Hive Partitioning Example
 
+Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.
+
+For example, a dataset partitioned by year, month, and day might have a directory structure like:
+
+```plaintext
+s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
+s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
+```
+
+Spice can automatically infer these partition columns from the directory structure when `hive_infer_partitions` is set to `true`.
+
 ```yaml
 version: v1beta1
 kind: Spicepod