-
Notifications
You must be signed in to change notification settings - Fork 333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: customize copy to parquet parameter #4328
feat: customize copy to parquet parameter #4328
Conversation
Enhance Parquet Writer with Column-wise Configuration Summary: • Introduced column_wise_config function to customize per-column properties in Parquet writer.
Enhance Parquet File Format Handling for Specific Data Types Summary: • Added ConcreteDataType import to support specific data type handling.
WalkthroughThe recent modifications encompass enhancements to Parquet file writing, adjustments to default configuration values for compaction processes, and updates to test scenarios for better alignment with new settings. Specifically, Parquet writing now benefits from custom column properties, while TWCS compaction has an increased default window run count, and test cases have been updated for comprehensive configuration coverage. Changes
Sequence Diagram(s)No sequence diagrams needed for these changes as they primarily involve configuration updates and test enhancements. Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- src/common/datasource/src/file_format/parquet.rs (4 hunks)
- src/operator/src/statement/copy_table_to.rs (1 hunks)
Additional comments not posted (4)
src/operator/src/statement/copy_table_to.rs (1)
78-89
: LGTM! Ensures schema is utilized correctly for Parquet writing.The changes correctly extract the schema from the stream before calling
stream_to_parquet
, ensuring the schema information is utilized during the Parquet file writing process.src/common/datasource/src/file_format/parquet.rs (3)
30-31
: Approved: Necessary imports for column-wise configuration.The imports of
ConcreteDataType
andSchemaRef
are necessary for the new column-wise configuration functionality.
190-199
: LGTM! Enhances flexibility with column-wise configuration.The addition of the call to
column_wise_config
allows for customized column properties, enhancing the flexibility of Parquet file writing.
224-247
: LGTM! Provides granular control over column properties.The
column_wise_config
function customizes per-column properties, such as disabling dictionary encoding for the timestamp column and using delta binary packed encoding for int32/int64 columns, improving the efficiency and customization of Parquet file writing.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4328 +/- ##
==========================================
- Coverage 85.18% 84.90% -0.29%
==========================================
Files 1061 1060 -1
Lines 188392 189069 +677
==========================================
+ Hits 160485 160520 +35
- Misses 27907 28549 +642 |
Refactor Parquet file format configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/common/datasource/src/file_format/parquet.rs (4 hunks)
Files skipped from review as they are similar to previous changes (1)
- src/common/datasource/src/file_format/parquet.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGM
Enhance Parquet file format handling for timestamp columns - Added logic to disable dictionary encoding and set DELTA_BINARY_PACKED encoding for timestamp columns in the Parquet file format configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/common/datasource/src/file_format/parquet.rs (4 hunks)
Files skipped from review as they are similar to previous changes (1)
- src/common/datasource/src/file_format/parquet.rs
Disable dictionary encoding for timestamp columns in Parquet writer and update default max_active_window_runs in TwcsOptions - Modified Parquet writer to disable dictionary encoding for timestamp columns to optimize for increasing timestamp data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- src/common/datasource/src/file_format/parquet.rs (4 hunks)
- src/mito2/src/region/options.rs (1 hunks)
Additional comments not posted (5)
src/common/datasource/src/file_format/parquet.rs (4)
30-37
: LGTM! Imports are necessary for added functionality.The new imports from
datatypes
module andparquet::schema::types::ColumnPath
are appropriate and necessary for the added functionality.
223-239
: LGTM! But consider improving the logic for other data types.The
column_wise_config
function correctly customizes per-column properties, especially for timestamp columns.However, consider improving the logic for other data types based on previous review comments.
for col in schema.column_schemas() { if col.data_type.is_timestamp() { let path = ColumnPath::new(vec![col.name.clone()]); props = props .set_column_dictionary_enabled(path.clone(), false) .set_column_encoding(path, Encoding::DELTA_BINARY_PACKED) } }
Line range hint
78-80
: Skip generating a review comment.The TODO comment is a reminder for future improvements and does not affect the current functionality.
189-198
: LGTM! But verify the function usage in the codebase.The changes to incorporate the new column-wise configuration in the
stream_to_parquet
function are approved.However, ensure that all function calls to
stream_to_parquet
are updated to handle the new configuration.src/mito2/src/region/options.rs (1)
219-219
: LGTM! But verify the usage ofTwcsOptions
in the codebase.The change to increase the default value for the
max_active_window_runs
field from 1 to 4 is approved.However, ensure that this change does not introduce any issues by verifying the usage of
TwcsOptions
in the codebase.
Update compaction settings in tests - Modified `test_compaction_region` to include new compaction options: `compaction.type`, `compaction.twcs.max_active_window_runs`, and `compaction.twcs.max_inactive_window_runs`. - Updated `test_merge_mode_compaction` to use `compaction.twcs.max_active_window_runs` and `compaction.twcs.max_inactive_window_runs` instead of `max_active_window_files` and `max_inactive_window_files`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- src/mito2/src/engine/compaction_test.rs (1 hunks)
- src/mito2/src/engine/merge_mode_test.rs (1 hunks)
Files skipped from review due to trivial changes (1)
- src/mito2/src/engine/merge_mode_test.rs
Additional comments not posted (1)
src/mito2/src/engine/compaction_test.rs (1)
115-119
: Verify the compaction options.The added options for
compaction.type
andcompaction.twcs.*
settings should be verified to ensure they are correctly implemented and used in the codebase.Verification successful
The compaction options are correctly implemented and used.
The added options for
compaction.type
andcompaction.twcs.*
settings are correctly implemented and referenced in multiple places in the codebase, including relevant tests and configurations.
src/store-api/src/mito_engine_options.rs
src/sql/src/statements/create.rs
src/sql/src/parsers/create_parser.rs
src/mito2/src/region/options.rs
src/mito2/src/engine/append_mode_test.rs
src/mito2/src/engine/filter_deleted_test.rs
src/mito2/src/engine/compaction_test.rs
src/mito2/src/engine/merge_mode_test.rs
src/common/function/src/table/flush_compact_table.rs
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of the new compaction options in the codebase. # Test: Search for the usage of compaction options. Expect: Correct usage in relevant files. rg --type rust -A 5 $'compaction.type' rg --type rust -A 5 $'compaction.twcs.max_active_window_runs' rg --type rust -A 5 $'compaction.twcs.max_inactive_window_runs'Length of output: 20278
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
What's changed and what's your intention?
This PR adds customized column config for parquet writer while copying table data to parquet files.
733->524
Checklist
Summary by CodeRabbit
New Features
Improvements
max_active_window_runs
from 1 to 4, improving performance in specific scenarios.Tests