Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use option to replace expect with context and avoid creating two gap detectors #428

Merged
merged 5 commits into from
Jul 9, 2024

Conversation

yuunlimm
Copy link
Contributor

Summary

  • made bucket root to be configurable for each stage.
  • fix gap detector issue where the parquet gap detector wasn't updating its processor status to db.
  • move unrelated functions in MoveResource to outside of the struct impl
  • removed unnecessary fields in the parquetProcessingResult

Test Plan

Tested locally and verified status being updated
Screenshot 2024-06-25 at 4 54 51 PM

@yuunlimm yuunlimm requested a review from banool June 27, 2024 18:14
@rtso rtso requested a review from a team June 27, 2024 19:40
@@ -132,24 +144,20 @@ pub async fn create_gap_detector_status_tracker_loop(
// We don't panic as everything downstream will panic if it doesn't work/receive
}

if let Some(res_last_success_batch) = res.last_success_batch {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided to remove the last_success_batch field from the DefaultGapDetectorResult due to the complexity and limited utility it offered in tracking progress within the parquet processor. Given the processor's architecture, where 10 tasks each manage a buffer containing distinct transaction ranges, the end_version of a batch isn't necessarily sequential. This non-sequential nature made it challenging to effectively use end_version and last_transaction_timestamp for status updates.

Instead, I'm proposing that we update the processor status using next_version_process - 1. This approach aligns better with the purpose of the status ensuring data integrity.

But as I am writing this, we are running 10 tasks only for backfilling, but I tuned to 1 task for regular traffic, which doesn't apply to the case above. let me come back with a better one that is applicable to both cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, maybe keep this behaviorfor now since backfilling is our priority, and then improve later to support both cases as a follow-up?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm bouncing back and forth between the new and old code and can't quite figure out what changed. At least here it seems like we're still logging the same stuff as before, but more reliably now, since the fields are always set. With the new code the structs seem less deep / cleaner. tldr I need to read the code further to figure out what's going on here, but don't let it block you, I'll catch up later.

@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-gap-detector-fix branch 4 times, most recently from 39b295a to 0488e8f Compare July 1, 2024 20:54
@yuunlimm yuunlimm requested a review from rtso July 1, 2024 22:12
rust/processor/src/bq_analytics/gcs_handler.rs Outdated Show resolved Hide resolved
rust/processor/src/gap_detectors/mod.rs Outdated Show resolved Hide resolved
@@ -132,24 +144,20 @@ pub async fn create_gap_detector_status_tracker_loop(
// We don't panic as everything downstream will panic if it doesn't work/receive
}

if let Some(res_last_success_batch) = res.last_success_batch {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm bouncing back and forth between the new and old code and can't quite figure out what changed. At least here it seems like we're still logging the same stuff as before, but more reliably now, since the fields are always set. With the new code the structs seem less deep / cleaner. tldr I need to read the code further to figure out what's going on here, but don't let it block you, I'll catch up later.

rust/processor/src/worker.rs Outdated Show resolved Hide resolved
rust/processor/src/worker.rs Outdated Show resolved Hide resolved
rust/processor/src/worker.rs Show resolved Hide resolved
@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-gap-detector-fix branch from 86cfcb9 to 171609e Compare July 8, 2024 17:28
…d encoding beforehand to call this function.

- reduced the duplicated lines of code
- Use context instead of unwraping with unsafe assumptions
- make gap detector trait fn to return Result type without requiring a unnecessary error type
@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-gap-detector-fix branch from c1f9e04 to 5f3d48a Compare July 8, 2024 17:36
@yuunlimm yuunlimm force-pushed the yuunlimm/parquet-gap-detector-fix branch from 5f3d48a to e4593aa Compare July 8, 2024 18:01
@yuunlimm yuunlimm requested review from banool and a team July 8, 2024 18:02
Copy link
Collaborator

@banool banool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@yuunlimm yuunlimm merged commit c6c7305 into main Jul 9, 2024
7 checks passed
@yuunlimm yuunlimm deleted the yuunlimm/parquet-gap-detector-fix branch July 9, 2024 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants