
[SUPPORT] PartitionedFile's size estimation in FileSourceScanExec#createReadRDD when enabling NewHoodieParquetFileFormat #12139

Open
TheR1sing3un opened this issue Oct 22, 2024 · 4 comments
Labels: feature-enquiry, performance, spark


TheR1sing3un (Member):

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

I have a question after reading the source code of NewHoodieParquetFileFormat and HoodieFileIndex.

When we enable hoodie.datasource.read.use.new.parquet.file.format, Hudi provides a HadoopFsRelation with NewHoodieParquetFileFormat.
FileSourceScanExec#createReadRDD then queries the needed partitions.
In this case, relation.location.listFiles is redirected to HoodieFileIndex#listFiles.
I found that for each file slice (a simplified sketch follows below):

  1. If the base file is present, a PartitionedFile is returned with the base file's status.
  2. Otherwise, if the log files are non-empty, a PartitionedFile is returned with the status of an arbitrary (possibly random) log file.
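
A minimal, self-contained sketch of this selection, using simplified stand-in types (SimpleFileStatus and SimpleFileSlice are made up for illustration and are not Hudi's real classes):

```scala
// Simplified stand-ins for illustration only; not Hudi's real file/slice classes.
case class SimpleFileStatus(path: String, length: Long)
case class SimpleFileSlice(baseFile: Option[SimpleFileStatus], logFiles: Seq[SimpleFileStatus])

// Status used to build the PartitionedFile for one file slice, as described above.
def statusForSlice(slice: SimpleFileSlice): SimpleFileStatus =
  slice.baseFile match {
    case Some(base) => base                 // 1. base file present: its status (and size) is used
    case None       => slice.logFiles.head  // 2. otherwise: an arbitrary log file's status
  }
```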

My question is: should we choose the median-sized log file (among the log files sorted by size) for the PartitionedFile rather than an arbitrary log file?
Spark merges multiple PartitionedFiles into a FilePartition based on the size of each PartitionedFile.
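
Here is a rough, self-contained sketch of that size-based packing; it approximates what Spark's FilePartition.getFilePartitions does, while the real implementation handles a few more details:

```scala
import scala.collection.mutable.ArrayBuffer

// Rough approximation of Spark's packing of PartitionedFiles into FilePartitions.
// A size that misrepresents a file slice skews this bin packing.
case class FileToPack(path: String, length: Long)

def packIntoPartitions(files: Seq[FileToPack],
                       maxSplitBytes: Long,
                       openCostInBytes: Long): Seq[Seq[FileToPack]] = {
  val partitions = ArrayBuffer.empty[Seq[FileToPack]]
  val current = ArrayBuffer.empty[FileToPack]
  var currentSize = 0L

  def closePartition(): Unit = {
    if (current.nonEmpty) partitions += current.toSeq
    current.clear()
    currentSize = 0L
  }

  // Files are packed largest-first; a new partition starts when the next file
  // would push the current one past maxSplitBytes.
  files.sortBy(-_.length).foreach { f =>
    if (currentSize + f.length > maxSplitBytes) closePartition()
    current += f
    currentSize += f.length + openCostInBytes
  }
  closePartition()
  partitions.toSeq
}
```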


I think that for a PartitionedFile which actually represents a whole FileSlice, we should choose a more representative file size.
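
A sketch of the alternative suggested above, reusing the simplified types from the first sketch: prefer the base file, otherwise take the median-sized log file as the representative status.

```scala
// Alternative selection: base file if present, otherwise the median-sized log file,
// which is usually more representative of the slice than an arbitrary one.
def statusForSliceMedian(slice: SimpleFileSlice): SimpleFileStatus =
  slice.baseFile.getOrElse {
    val bySize = slice.logFiles.sortBy(_.length)
    bySize(bySize.size / 2)   // median-sized log file
  }
```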

To Reproduce

Steps to reproduce the behavior:

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version :

  • Spark version :

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) :

  • Running on Docker? (yes/no) :

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

danny0405 (Contributor) commented:

Reasonable. What algorithm do you suggest for the size estimation? Did you go through the commit history to see why the initial size estimation looks like that?

TheR1sing3un (Member, Author) commented Oct 30, 2024:

Reasonable. What algorithm do you suggest for the size estimation?

Consider this case:

  • No bucket in the partition has data skew.
  • Every bucket has the same layout, for example:
    1. only a base file, or
    2. an optional base file plus the same number of log files.

The performance factors that affect a MoR snapshot read in the above case are:

  • base file size
  • total size of the log files
  • the repetition rate of the same primary key across log files (this affects the merge and causes random seeks in the external spillable map)

So it is hard to find a single size estimation algorithm that fits all cases. I think we can introduce read options for an estimation fraction, giving users a way to tune the parameters and improve performance in specific scenarios (a hypothetical sketch follows below).
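
A hypothetical illustration of such a fraction-based estimate; the option name and formula are made up for this example and are not an existing Hudi configuration:

```scala
// Hypothetical read option, e.g. "hoodie.read.logfile.size.estimation.fraction" (made up).
// The estimate weights the total log file size by a user-tunable fraction.
def estimatedSliceSize(baseFileBytes: Long,
                       logFileBytes: Seq[Long],
                       logFileFraction: Double): Long =
  baseFileBytes + (logFileBytes.sum * logFileFraction).toLong
```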

Did you go through the commit history to see why the initial size estimation looks like that?

From the commit history, it looks like it has been estimated this way from the start.

danny0405 (Contributor) commented:

We can get the accurate size of a file slice, right? Can that be utilized for the optimization?
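
For reference, under the simplified types from the first sketch, the accurate on-storage size of a slice would simply be:

```scala
// Exact size of a slice: base file length plus the sum of all log file lengths.
def accurateSliceSize(slice: SimpleFileSlice): Long =
  slice.baseFile.map(_.length).getOrElse(0L) + slice.logFiles.map(_.length).sum
```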

TheR1sing3un (Member, Author) commented:

We can get the accurate size of a file slice, right? Can that be utilized for the optimization?

Of course. However, the execution efficiency of a task is affected not only by the total size of the files in the slice, but also by the data repetition rate, the slice layout, and other factors. We can determine empirical values for common scenarios through testing and use them as the default optimization.
