Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDK-parquet] add parquet version tracker #609

Open
wants to merge 3 commits into
base: 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler
Choose a base branch
from

Conversation

yuunlimm
Copy link
Contributor

@yuunlimm yuunlimm commented Nov 14, 2024

Description

1. Added Parquet Version Tracker Functionality

  • Updated steps to include set_backfill_table_flag logic for selective table processing.
  • added a processor status saver for parquet

2. Schema and Table Handling Updates

  • Updated the logic for handling backfill tables:
  • Renamed the tables field with backfill_table in - ParquetDefaultProcessorConfig.
  • Adjusted validations and logic to ensure only valid tables are processed.
    ParquetTypeEnum improvements:
    • Added mappings and validations for table names.
    • Enhanced schema initialization and writer creation.

3. Tests updated

  • Modified tests to reflect changes in backfill_table handling and validation.
  • Updated table name checks to ensure compatibility with the new logic.
  • Added test coverage for: Invalid backfill tables.

4. General Code Improvements

  • Removed redundant logic in ParquetDefaultProcessor.
  • Moved shared functionality (e.g., writer creation) to reusable helper functions.
  • initialize_database_pool centralizes database pool setup for Postgres.
  • Handles error cases cleanly.
  • initialize_gcs_client abstracts GCS client setup using provided credentials.
  • Consolidated initialization of schemas, writers, and GCS uploaders into modular functions.

Enhanced comments for better readability and maintainability.

Test Plan
Screenshot 2024-11-15 at 12 35 53 PM

Screenshot 2024-11-15 at 11.38.50 AM.png

Copy link
Contributor Author

yuunlimm commented Nov 14, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from f1ca3f9 to 4d56db7 Compare November 14, 2024 17:59
@yuunlimm yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch 2 times, most recently from 5a1f116 to b0c3839 Compare November 14, 2024 19:34
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from 4d56db7 to d0a5380 Compare November 15, 2024 17:34
@yuunlimm yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch from b0c3839 to 5376225 Compare November 15, 2024 17:34
@yuunlimm yuunlimm force-pushed the 11-12-_sdk-parquet_parquet_sized_buffer_and_gcs_handler branch from d0a5380 to 946b726 Compare November 15, 2024 18:21
@yuunlimm yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch from 5376225 to 11a65f0 Compare November 15, 2024 18:21
@yuunlimm yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch 4 times, most recently from 1a4cc67 to f321d33 Compare November 15, 2024 20:29
@yuunlimm yuunlimm force-pushed the 11-13-_sdk-parquet_add_parquet_version_tracker branch from f321d33 to b276784 Compare November 15, 2024 21:43
@yuunlimm yuunlimm marked this pull request as ready for review November 15, 2024 21:44
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between this and the non-parquet processor_status_saver?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only difference is it takes in table_name: &str, arg b/c it should be able to to update row per table. wasn't sure how we could reduce the duplicated code here :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm wondering if the parquet version tracker could also just use the postgres processor_status_saver?

Copy link
Contributor Author

@yuunlimm yuunlimm Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm if we could add another function(save_parquet_processor_status) to processor_status_saver with arg and let that call the existing function in the processor_status_saver, that would also work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmm right. would it work if we added parquet_table_name as an optional param?

@@ -19,6 +20,7 @@ use processor::schema::{backfill_processor_status, processor_status};
pub fn get_processor_status_saver(
conn_pool: ArcDbPool,
config: IndexerProcessorConfig,
is_parquet: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline: consider using a more descriptive enum instead of this bool flag. Ideally we are able to determine what type of processor_status_saver to return just from the config. It's weird/inconsistent that we have this function determine which enum to return while relying on a user input to help with that decision

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants