[SDK-parquet] add parquet version tracker #609

yuunlimm · 2024-11-14T17:53:34Z

Description

1. Added Parquet Version Tracker Functionality

Updated steps to include set_backfill_table_flag logic for selective table processing.
Added a processor status saver for parquet.

2. Schema and Table Handling Updates

Updated the logic for handling backfill tables:
- Renamed the tables field to backfill_table in ParquetDefaultProcessorConfig.
- Adjusted validations and logic to ensure only valid tables are processed.
ParquetTypeEnum improvements:
- Added mappings and validations for table names.
- Enhanced schema initialization and writer creation.
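
For reviewers who want a concrete picture of the table-name validation described above, here is a minimal sketch. It is not the code in this PR: the `ParquetTable` enum, its three variants, the table-name strings, and `validate_backfill_tables` are all hypothetical stand-ins for ParquetTypeEnum and the backfill_table check.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for ParquetTypeEnum. The variants and the
/// table-name strings below are assumptions for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ParquetTable {
    Transactions,
    WriteSetChanges,
    MoveResources,
}

impl ParquetTable {
    /// Every known table, used to resolve a name back to a variant.
    pub const ALL: [ParquetTable; 3] = [
        ParquetTable::Transactions,
        ParquetTable::WriteSetChanges,
        ParquetTable::MoveResources,
    ];

    /// Mapping from variant to table name.
    pub fn table_name(self) -> &'static str {
        match self {
            ParquetTable::Transactions => "transactions",
            ParquetTable::WriteSetChanges => "write_set_changes",
            ParquetTable::MoveResources => "move_resources",
        }
    }

    /// Reverse mapping; returns None for an unknown table name.
    pub fn from_table_name(name: &str) -> Option<ParquetTable> {
        Self::ALL
            .iter()
            .copied()
            .find(|table| table.table_name() == name)
    }
}

/// Resolves the configured backfill_table names, rejecting the whole set
/// if any name does not map to a known table.
pub fn validate_backfill_tables(
    backfill_table: &HashSet<String>,
) -> Result<HashSet<ParquetTable>, String> {
    let mut valid = HashSet::new();
    let mut unknown: Vec<&str> = Vec::new();
    for name in backfill_table {
        match ParquetTable::from_table_name(name) {
            Some(table) => {
                valid.insert(table);
            }
            None => unknown.push(name.as_str()),
        }
    }
    if unknown.is_empty() {
        Ok(valid)
    } else {
        // Sort so the error message is deterministic.
        unknown.sort();
        Err(format!("invalid backfill tables: {}", unknown.join(", ")))
    }
}
```

Failing on the first bad name would also work; collecting every unknown name just makes a misconfigured backfill_table easier to fix in one pass.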

3. Tests updated

Modified tests to reflect changes in backfill_table handling and validation.
Updated table name checks to ensure compatibility with the new logic.
Added test coverage for invalid backfill tables.

4. General Code Improvements

Removed redundant logic in ParquetDefaultProcessor.
Moved shared functionality (e.g., writer creation) to reusable helper functions:
- initialize_database_pool centralizes database pool setup for Postgres and handles error cases cleanly.
- initialize_gcs_client abstracts GCS client setup using provided credentials.
Consolidated initialization of schemas, writers, and GCS uploaders into modular functions.

Enhanced comments for better readability and maintainability.

Test Plan

yuunlimm · 2024-11-14T17:53:52Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

[SDK-parquet] add parquet version tracker #609 👈 (View in Graphite)
[SDK-parquet] parquet sized buffer and gcs handler #602
[SDK-parquet] parquet default processor extractor step #601
[parquet-sdk-migration] add a logic to determine the starting version for parquet processor #587: 1 other dependent PR (#591)
main

This stack of pull requests is managed by Graphite.

dermanyang · 2024-11-15T22:40:15Z

rust/sdk-processor/src/steps/common/parquet_processor_status_saver.rs

what's the difference between this and the non-parquet processor_status_saver?

only difference is it takes in a table_name: &str arg, b/c it should be able to update a row per table. wasn't sure how we could reduce the duplicated code here :(

i'm wondering if the parquet version tracker could also just use the postgres processor_status_saver?

hmm if we could add another function (save_parquet_processor_status) to processor_status_saver that takes the table name arg and calls the existing function in processor_status_saver, that would also work

mmmm right. would it work if we added parquet_table_name as an optional param?
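
To make the optional-param idea in this thread concrete, here is a rough sketch of one saver serving both cases. Everything in it is an assumption for illustration: `StatusStore` is an in-memory stand-in for the processor_status table, and the `processor.table` key format is made up, not taken from the PR.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the processor_status table (assumption):
/// maps a row key to the last successfully processed version.
#[derive(Default)]
pub struct StatusStore {
    rows: HashMap<String, u64>,
}

/// Builds the row key. Postgres processors pass None and get one row per
/// processor; parquet processors pass Some(table) and get one row per table.
/// The "processor.table" format is hypothetical.
pub fn status_key(processor_name: &str, parquet_table_name: Option<&str>) -> String {
    match parquet_table_name {
        Some(table) => format!("{}.{}", processor_name, table),
        None => processor_name.to_string(),
    }
}

impl StatusStore {
    /// Single save path for both cases; the optional table name is the only
    /// difference between the parquet and non-parquet callers.
    pub fn save_processor_status(
        &mut self,
        processor_name: &str,
        last_success_version: u64,
        parquet_table_name: Option<&str>,
    ) {
        let key = status_key(processor_name, parquet_table_name);
        self.rows.insert(key, last_success_version);
    }

    /// Reads back the stored version for a processor (and table, if any).
    pub fn last_success_version(
        &self,
        processor_name: &str,
        parquet_table_name: Option<&str>,
    ) -> Option<u64> {
        self.rows
            .get(&status_key(processor_name, parquet_table_name))
            .copied()
    }
}
```

With this shape the parquet path calls `save_processor_status(name, version, Some(table))` once per table and the existing Postgres path passes `None`, so the two savers would not need to be duplicated.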

dermanyang · 2024-11-16T02:15:43Z

rust/sdk-processor/src/steps/common/processor_status_saver.rs

@@ -19,6 +20,7 @@ use processor::schema::{backfill_processor_status, processor_status};
 pub fn get_processor_status_saver(
    conn_pool: ArcDbPool,
    config: IndexerProcessorConfig,
+    is_parquet: bool,


discussed offline: consider using a more descriptive enum instead of this bool flag. Ideally we are able to determine what type of processor_status_saver to return just from the config. It's weird/inconsistent that we have this function determine which enum to return while relying on a user input to help with that decision
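
One possible shape for the enum suggested here, as a sketch only: `ProcessorConfig` below is a trimmed-down, hypothetical version of the real config, and `ProcessorStatusSaverKind` is an illustrative name rather than the type in this PR. The point is that the saver kind is derived from the config itself, so callers no longer pass a flag.

```rust
/// Hypothetical, trimmed-down processor config. The real config has more
/// variants and fields; these exist only to illustrate the idea.
pub enum ProcessorConfig {
    Default,
    Events,
    ParquetDefault { backfill_table: Vec<String> },
}

/// Descriptive replacement for the `is_parquet: bool` flag (illustrative name).
#[derive(Debug, PartialEq, Eq)]
pub enum ProcessorStatusSaverKind {
    Postgres,
    Parquet,
}

impl ProcessorStatusSaverKind {
    /// Derives the saver kind from the config alone, so the caller cannot
    /// pass a flag that disagrees with the configured processor.
    pub fn from_config(config: &ProcessorConfig) -> Self {
        match config {
            ProcessorConfig::ParquetDefault { .. } => ProcessorStatusSaverKind::Parquet,
            ProcessorConfig::Default | ProcessorConfig::Events => {
                ProcessorStatusSaverKind::Postgres
            }
        }
    }
}
```

Because the match is exhaustive, adding a new processor variant forces a decision about which saver it uses at compile time.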

yuunlimm mentioned this pull request Nov 14, 2024

[SDK-parquet] parquet default processor extractor step #601

Open

yuunlimm mentioned this pull request Nov 14, 2024

[SDK-parquet] parquet sized buffer and gcs handler #602

Open

yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from f1ca3f9 to 4d56db7 Compare November 14, 2024 17:59

yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch 2 times, most recently from 5a1f116 to b0c3839 Compare November 14, 2024 19:34

yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 4d56db7 to d0a5380 Compare November 15, 2024 17:34

yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch from b0c3839 to 5376225 Compare November 15, 2024 17:34

yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from d0a5380 to 946b726 Compare November 15, 2024 18:21

yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch from 5376225 to 11a65f0 Compare November 15, 2024 18:21

[SDK-parquet] add parquet version tracker

823215e

yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch 4 times, most recently from 1a4cc67 to f321d33 Compare November 15, 2024 20:29

fix conflict

b276784

yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch from f321d33 to b276784 Compare November 15, 2024 21:43

yuunlimm marked this pull request as ready for review November 15, 2024 21:44

yuunlimm requested review from rtso and dermanyang November 15, 2024 22:27

dermanyang reviewed Nov 15, 2024

refactor parquet processor status saver into processor status saver

9d249df

dermanyang reviewed Nov 16, 2024