Add parquet processor #390

Merged
merged 4 commits into from
Jun 24, 2024

Conversation

Contributor

@yuunlimm yuunlimm commented May 28, 2024

This migrates some of the tables in AlloyDB to BigQuery so that we can fully deprecate the tables that are only used for analytics.

Components added:

  1. generic_parquet_processor:
    • stores structs passed in from parquet_handler in an in-memory buffer; whenever the buffer reaches the max size, it writes the contents to a parquet file and calls gcs_handler to upload the file to the GCS bucket.
  2. parquet_default_processor:
    • essentially a replica of default_processor that is responsible for
      • spawning a task that receives parquet structs from the kanal channel and invokes parquet_manager.handle
      • converting TransactionPB into structs
      • sending parquet structs to its channel; structs currently handled:
        • move_resources, table_items, transactions, write_set_changes.
  3. parquet_handler:
    • util with a function that spawns a task for parquet_receiver to poll structs from the channel and call the parquet_manager.handle function.
  4. gcs_handler:
    • responsible for reading local parquet files and uploading them to the GCS bucket. It retries 3 times with exponential backoff.
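
For readers new to the design, here is a minimal sketch of the buffer-then-upload flow described in the list above. It is not the code in this PR: `Uploader`, `SizeBoundedBuffer`, `FlakyUploader`, and `upload_with_backoff` are hypothetical names, byte concatenation stands in for writing a parquet file, and "3 times" is read as three attempts in total, which the description does not spell out.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the upload step (the real gcs_handler talks to GCS).
pub trait Uploader {
    fn upload(&mut self, bytes: &[u8]) -> Result<(), String>;
}

/// Try `upload` up to `max_attempts` times, doubling the delay after each failure.
/// Returns the attempt number that succeeded, or the last error.
pub fn upload_with_backoff<U: Uploader>(
    uploader: &mut U,
    bytes: &[u8],
    max_attempts: u32,
    initial_delay: Duration,
) -> Result<u32, String> {
    let mut delay = initial_delay;
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match uploader.upload(bytes) {
            Ok(()) => return Ok(attempt),
            Err(e) => last_err = e,
        }
        if attempt < max_attempts {
            sleep(delay);
            delay *= 2; // exponential backoff
        }
    }
    Err(last_err)
}

/// In-memory buffer that flushes once the buffered byte count reaches `max_bytes`.
pub struct SizeBoundedBuffer<U: Uploader> {
    rows: Vec<Vec<u8>>,
    buffered_bytes: usize,
    max_bytes: usize,
    uploader: U,
}

impl<U: Uploader> SizeBoundedBuffer<U> {
    pub fn new(max_bytes: usize, uploader: U) -> Self {
        Self { rows: Vec::new(), buffered_bytes: 0, max_bytes, uploader }
    }

    /// Buffer one encoded row; returns Ok(true) if this push triggered a flush.
    pub fn push(&mut self, row: Vec<u8>) -> Result<bool, String> {
        self.buffered_bytes += row.len();
        self.rows.push(row);
        if self.buffered_bytes >= self.max_bytes {
            self.flush()?;
            return Ok(true);
        }
        Ok(false)
    }

    /// Concatenate the buffered rows (placeholder for a parquet write) and upload them.
    pub fn flush(&mut self) -> Result<(), String> {
        if self.rows.is_empty() {
            return Ok(());
        }
        let payload: Vec<u8> = self.rows.concat();
        upload_with_backoff(&mut self.uploader, &payload, 3, Duration::from_millis(10))?;
        self.rows.clear();
        self.buffered_bytes = 0;
        Ok(())
    }

    pub fn uploader(&self) -> &U {
        &self.uploader
    }
}

/// Test double: fails the first `failures_left` calls, then records each upload.
pub struct FlakyUploader {
    pub failures_left: u32,
    pub uploaded: Vec<Vec<u8>>,
}

impl Uploader for FlakyUploader {
    fn upload(&mut self, bytes: &[u8]) -> Result<(), String> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            return Err("transient failure".to_string());
        }
        self.uploaded.push(bytes.to_vec());
        Ok(())
    }
}
```

In this sketch the buffer is cleared only after a successful upload, so a failed flush leaves the rows in memory for a later retry. Whether the processor in this PR behaves the same way is not stated in the description.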

Test Plan
Tested in the testing environment and by running locally.

@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-refactoring branch 4 times, most recently from dd2c400 to 52653dd Compare May 28, 2024 23:40
@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-refactoring branch from c3c0829 to f9aad63 Compare May 30, 2024 04:45
@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-refactoring branch 5 times, most recently from 583aabb to bf3e60d Compare May 31, 2024 19:02
@yuunlimm yuunlimm requested a review from CapCap May 31, 2024 20:24
@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-refactoring branch 4 times, most recently from 568e60e to 8c529e8 Compare May 31, 2024 20:59
Contributor

@ying-w ying-w left a comment

would be really nice to have but prob out of scope:

  • having tx_version of previous state in move_resources and write_set_changes

@yuunlimm yuunlimm marked this pull request as draft May 31, 2024 21:26
@yuunlimm
Contributor Author

Gap Detector isn't working, but I checked that all txns exist in BigQuery, so this is less of a concern for now. I am prioritizing shipping this processor so that we can start backfilling mainnet.
cc. @CapCap

@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-refactoring branch 3 times, most recently from edc7f16 to 2d0b9c1 Compare June 4, 2024 01:41
@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-refactoring branch from 2d0b9c1 to 77143d3 Compare June 12, 2024 19:02
@yuunlimm yuunlimm changed the title [DO NOT COMMIT] Yuunlimm/parquet refactoring Yuunlimm/parquet refactoring Jun 14, 2024
@yuunlimm yuunlimm marked this pull request as ready for review June 14, 2024 17:17
rust/Cargo.toml Outdated Show resolved Hide resolved
@CapCap
Contributor

CapCap commented Jun 14, 2024

@yuunlimm please rebase this on main, will make the review easier for us :-)

@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-refactoring branch from 20c0959 to 3da4377 Compare June 19, 2024 15:30
banool previously requested changes Jun 20, 2024
Collaborator

@banool banool left a comment

This is a refactor PR but it's my first time seeing lots of this code, so I left comments all over the place anyway. For any comment that's not relevant to the refactoring directly, please open an issue / task somewhere to track it for later!

I only reviewed the first bit for now. Since this is the refactoring PR I see there is now a mountain of new code hahah. I'm happy to do a full audit of the code later, but it's probably helpful for me to leave reviews in chunks rather than like 50 comments all at once.

rust/Cargo.toml Show resolved Hide resolved
rust/Cargo.toml Outdated Show resolved Hide resolved
rust/Cargo.toml Outdated Show resolved Hide resolved
rust/processor/Cargo.toml Show resolved Hide resolved
rust/processor/src/config.rs Outdated Show resolved Hide resolved
}
}

pub fn get_outer_type_from_resource(write_resource: &WriteResource) -> String {
Collaborator

This method and convert_move_struct_tag don't have anything to do with MoveResource. Can you move them to the relevant struct or just make them freestanding?

As a general rule, if a method doesn't take self as a param or return Self, there is no reason for it to be a method on a struct.
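
To illustrate the rule, a minimal sketch with hypothetical types: this `MoveResource` has a single made-up `type_str` field, and `outer_type` is not the real `get_outer_type_from_resource`.

```rust
/// Hypothetical, pared-down struct for illustration only.
pub struct MoveResource {
    pub type_str: String,
}

impl MoveResource {
    /// Reads `self`, so it belongs on the struct.
    pub fn module_name(&self) -> Option<&str> {
        self.type_str.split("::").nth(1)
    }
}

/// Touches no `MoveResource` state, so it can be a freestanding function.
/// Returns the part of a type string before any generic arguments.
pub fn outer_type(type_str: &str) -> &str {
    type_str.split('<').next().unwrap_or(type_str)
}
```

Callers of `outer_type` then need only a string slice, not a `MoveResource` value or its type in scope.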

Contributor Author

synced offline, will follow up in the next pr.

}
}

impl Default for Transaction {
Collaborator

How come we have to implement Default? There is no "default transaction" so I'm not sure it makes sense to implement Default for Transaction. Same comment for many of these. If some trait / library is forcing us to implement Default, we should at least use unreachable so we never use it, or if something does use it, for the fields we should use default, like txn_version: Default::default(), event_root_hash: Default::default(), etc.
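
A minimal sketch of the two options mentioned above. The struct is pared down and hypothetical: `txn_version` and `event_root_hash` come from the comment, while `num_events` and `NoMeaningfulDefault` are made up for illustration.

```rust
/// Hypothetical, pared-down struct; the real `Transaction` has many more fields.
#[derive(Debug, PartialEq)]
pub struct Transaction {
    pub txn_version: i64,
    pub event_root_hash: String,
    pub num_events: i64, // assumed field, for illustration only
}

/// Option 1: delegate every field to its own default instead of hand-picking values.
/// For a struct like this, `#[derive(Default)]` would generate the same impl.
impl Default for Transaction {
    fn default() -> Self {
        Self {
            txn_version: Default::default(),
            event_root_hash: Default::default(),
            num_events: Default::default(),
        }
    }
}

/// Option 2: satisfy a `Default` bound while making any real use fail loudly.
pub struct NoMeaningfulDefault {
    pub txn_version: i64,
}

impl Default for NoMeaningfulDefault {
    fn default() -> Self {
        unreachable!("Default exists only to satisfy a trait bound and must not be called")
    }
}
```

Option 2 only works if nothing actually calls `default()` at runtime, so it would not fit a caller that builds a default instance on purpose.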

Contributor Author

@yuunlimm yuunlimm Jun 21, 2024

There is a use case: we derive the parquet schema from an instance, and we need to know this schema in advance. Please take a look at the hasParquetSchema trait in generic_parquet_processor.rs; we use its schema() to construct a parquetHandler for each struct type. That's the only use of Default, and I am not sure whether there is another workaround.
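
A rough sketch of the pattern being described, under stated assumptions: `HasSchema`, `columns`, `TableItem`, and `Handler` are hypothetical names, and a plain list of column names stands in for whatever schema type the real trait returns.

```rust
/// Hypothetical trait: the schema is read from an instance, so getting it
/// before any real row exists requires a throwaway `Default` value.
pub trait HasSchema: Default {
    /// Column names in declaration order, read from an instance.
    fn columns(&self) -> Vec<&'static str>;

    /// Schema obtained up front from a default instance.
    fn schema() -> Vec<&'static str> {
        Self::default().columns()
    }
}

/// Hypothetical row type with made-up fields.
#[derive(Default)]
pub struct TableItem {
    pub txn_version: i64,
    pub key: String,
}

impl HasSchema for TableItem {
    fn columns(&self) -> Vec<&'static str> {
        vec!["txn_version", "key"]
    }
}

/// Hypothetical handler that needs the schema at construction time.
pub struct Handler<T: HasSchema> {
    pub schema: Vec<&'static str>,
    pub buffer: Vec<T>,
}

impl<T: HasSchema> Handler<T> {
    pub fn new() -> Self {
        Self { schema: T::schema(), buffer: Vec::new() }
    }
}
```

If the schema could instead be produced by an associated function that needs no instance, the `Default` bound would not be required. Whether the library used in this PR allows that is not something this thread settles.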

Collaborator

@banool for the non-Parquet processor, we use Default for transaction types where we can skip indexing most of its contents


impl Transaction {
fn from_transaction_info(
info: &TransactionInfo,
Collaborator

Not 100% sure of the use case, but if possible it'd be good to just take in the owned object; it'd be more efficient. No worries if it doesn't make sense for this use case.
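
A small sketch of the difference, using hypothetical pared-down types: the real `TransactionInfo` is a protobuf type with different fields, and `Row`, `row_from_ref`, and `row_from_owned` are made-up names.

```rust
/// Hypothetical stand-in for the protobuf type.
#[derive(Debug, Clone, PartialEq)]
pub struct TransactionInfo {
    pub hash: String,
    pub version: u64,
}

/// Hypothetical output row.
pub struct Row {
    pub hash: String,
    pub version: u64,
}

/// Borrowing forces a clone of every heap-allocated field that the row keeps.
pub fn row_from_ref(info: &TransactionInfo) -> Row {
    Row { hash: info.hash.clone(), version: info.version }
}

/// Taking ownership lets the fields be moved into the row without new allocations.
pub fn row_from_owned(info: TransactionInfo) -> Row {
    Row { hash: info.hash, version: info.version }
}
```

The owned version only pays off when the caller no longer needs `info` afterwards; if the caller would have to clone it first, the cost just moves to the call site.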

Contributor Author

synced offline, will follow up in the next pr.

@yuunlimm yuunlimm changed the title Yuunlimm/parquet refactoring Add parquet processor Jun 20, 2024
@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-refactoring branch 4 times, most recently from 25a5706 to e99e37d Compare June 21, 2024 17:53
@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-refactoring branch from 4d57a60 to 3cf9386 Compare June 24, 2024 16:39
@yuunlimm yuunlimm dismissed banool’s stale review June 24, 2024 17:22

will address in the follow-up pr

@yuunlimm yuunlimm requested a review from banool June 24, 2024 17:23
Collaborator

@banool banool left a comment

Major items have been addressed. I'll leave a full audit here: #420.

@yuunlimm yuunlimm merged commit 41785eb into main Jun 24, 2024
7 checks passed
@yuunlimm yuunlimm deleted the yuunlimm/parquet-refactoring branch June 24, 2024 17:42
6 participants