[HUDI-7645] Optimize BQ sync tool for MDT #11065

Merged

Conversation

wombatu-kun
Contributor

Change Logs

The BigQuery (BQ) sync tool currently polls the file system view for the latest files sequentially, one partition at a time.
When the metadata table (MDT) is enabled, all partitions can be loaded in a single call instead.

Impact

none

Risk level (write none, low, medium or high below)

none

Documentation Update

none

  • The config description must be updated if new configs are added or the default values of configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Apr 22, 2024
Contributor

@bvaradar bvaradar left a comment

@wombatu-kun : Please take a look at the comment and let me know your thoughts.

Stream<HoodieBaseFile> allLatestBaseFiles;
if (useFileListingFromMetadata) {
  LOG.info("Fetching all base files from MDT.");
  allLatestBaseFiles = fsView.getLatestBaseFiles();
Contributor

It looks like fsView.getLatestBaseFiles() only returns file groups that are already loaded in the view, so some partitions may not be loaded at all. Can you check whether this is the intended behavior?

Contributor Author

Yes, you are right: fsView.getLatestBaseFiles() only returns file groups that are already loaded in the view. But I don't see any other way to load all the latest files in one call to HoodieMetadataFileSystemView/HoodieTableMetadata. It would be great if you or @nsivabalan (as reporter of this task) could give me some advice.

Contributor Author

I've found fsView.loadAllPartitions(), which loads all partitions in one call. Now all file groups are loaded into the view before the latest base files are fetched with fsView.getLatestBaseFiles().
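
To illustrate why the load order matters, here is a minimal toy sketch. It is not Hudi code: ToyFsView, sampleStorage and the file names are hypothetical, and only the method names getLatestBaseFiles() and loadAllPartitions() are borrowed from this thread. Their behavior here is assumed from the comments above, not taken from the Hudi implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyViewSketch {

    /** Hypothetical stand-in for a lazily populated file-system view; not Hudi's class. */
    static class ToyFsView {
        // Partition path -> latest base files that exist "on storage".
        private final Map<String, List<String>> storage;
        // Partitions that have actually been loaded into the view so far.
        private final Map<String, List<String>> loaded = new LinkedHashMap<>();

        ToyFsView(Map<String, List<String>> storage) {
            this.storage = storage;
        }

        /** Per-partition lookup: loads that single partition on first access. */
        Stream<String> getLatestBaseFiles(String partition) {
            return loaded
                .computeIfAbsent(partition, p -> storage.getOrDefault(p, List.of()))
                .stream();
        }

        /** No-argument lookup: only sees partitions that are already loaded. */
        Stream<String> getLatestBaseFiles() {
            return loaded.values().stream().flatMap(List::stream);
        }

        /** Loads every partition in one call (assumed semantics). */
        void loadAllPartitions() {
            loaded.putAll(storage);
        }
    }

    /** Hypothetical sample data: two partitions, three base files. */
    static Map<String, List<String>> sampleStorage() {
        Map<String, List<String>> storage = new LinkedHashMap<>();
        storage.put("2024/04/22", List.of("a.parquet", "b.parquet"));
        storage.put("2024/04/23", List.of("c.parquet"));
        return storage;
    }

    public static void main(String[] args) {
        ToyFsView view = new ToyFsView(sampleStorage());

        // Nothing has been loaded yet, so the no-argument call returns no files.
        System.out.println(view.getLatestBaseFiles().count()); // prints 0

        // After loading all partitions, the same call sees every base file.
        view.loadAllPartitions();
        List<String> all = view.getLatestBaseFiles().collect(Collectors.toList());
        System.out.println(all); // prints [a.parquet, b.parquet, c.parquet]
    }
}
```

In this toy model, calling the no-argument getLatestBaseFiles() before loadAllPartitions() silently skips partitions, which is the concern raised in the review comment.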

Contributor

Makes sense. LGTM

@bvaradar bvaradar self-assigned this Apr 23, 2024
@wombatu-kun wombatu-kun force-pushed the HUDI-7645_Optimize_BQ_sync_tool_for_MDT branch from d417a41 to 9464340 Compare April 24, 2024 11:26
@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@bvaradar bvaradar merged commit 835d473 into apache:master Apr 25, 2024
40 checks passed