
[SUPPORT] PartitionedFile's size estimation in FileSourceScanExec#createReadRDD when enabling NewHoodieParquetFileFormat #12139

Open
TheR1sing3un opened this issue Oct 22, 2024 · 4 comments
Labels: feature-enquiry, performance, spark


TheR1sing3un (Member):

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

I have a question after reading the source code of NewHoodieParquetFileFormat and HoodieFileIndex.

When we enable hoodie.datasource.read.use.new.parquet.file.format, Hudi provides a HadoopFsRelation with NewHoodieParquetFileFormat.
FileSourceScanExec#createReadRDD then queries the needed partitions.
In this case, relation.location.listFiles is redirected to HoodieFileIndex#listFiles.
I found that for each file slice (a simplified sketch follows below):

  1. If the base file is present, a PartitionedFile is returned with the base file's status.
  2. Otherwise, if the log files are non-empty, a PartitionedFile is returned with the status of an arbitrary (possibly random) log file.
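
A minimal, self-contained sketch of this selection, using simplified stand-in types (SimpleFileStatus and SimpleFileSlice are made up for illustration and are not Hudi's real classes):

```scala
// Simplified stand-ins for illustration only; not Hudi's real file/slice classes.
case class SimpleFileStatus(path: String, length: Long)
case class SimpleFileSlice(baseFile: Option[SimpleFileStatus], logFiles: Seq[SimpleFileStatus])

// Status used to build the PartitionedFile for one file slice, as described above.
def statusForSlice(slice: SimpleFileSlice): SimpleFileStatus =
  slice.baseFile match {
    case Some(base) => base                 // 1. base file present: its status (and size) is used
    case None       => slice.logFiles.head  // 2. otherwise: an arbitrary log file's status
  }
```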

My question is: should we choose the median-sized log file (among the log files sorted by size) for the PartitionedFile rather than an arbitrary log file?
Spark merges multiple PartitionedFiles into a FilePartition based on the size of each PartitionedFile.
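
Here is a rough, self-contained sketch of that size-based packing; it approximates what Spark's FilePartition.getFilePartitions does, while the real implementation handles a few more details:

```scala
import scala.collection.mutable.ArrayBuffer

// Rough approximation of Spark's packing of PartitionedFiles into FilePartitions.
// A size that misrepresents a file slice skews this bin packing.
case class FileToPack(path: String, length: Long)

def packIntoPartitions(files: Seq[FileToPack],
                       maxSplitBytes: Long,
                       openCostInBytes: Long): Seq[Seq[FileToPack]] = {
  val partitions = ArrayBuffer.empty[Seq[FileToPack]]
  val current = ArrayBuffer.empty[FileToPack]
  var currentSize = 0L

  def closePartition(): Unit = {
    if (current.nonEmpty) partitions += current.toSeq
    current.clear()
    currentSize = 0L
  }

  // Files are packed largest-first; a new partition starts when the next file
  // would push the current one past maxSplitBytes.
  files.sortBy(-_.length).foreach { f =>
    if (currentSize + f.length > maxSplitBytes) closePartition()
    current += f
    currentSize += f.length + openCostInBytes
  }
  closePartition()
  partitions.toSeq
}
```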


I think that for a PartitionedFile which actually represents a whole FileSlice, we should choose a more representative file size.
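
A sketch of the alternative suggested above, reusing the simplified types from the first sketch: prefer the base file, otherwise take the median-sized log file as the representative status.

```scala
// Alternative selection: base file if present, otherwise the median-sized log file,
// which is usually more representative of the slice than an arbitrary one.
def statusForSliceMedian(slice: SimpleFileSlice): SimpleFileStatus =
  slice.baseFile.getOrElse {
    val bySize = slice.logFiles.sortBy(_.length)
    bySize(bySize.size / 2)   // median-sized log file
  }
```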

To Reproduce

Steps to reproduce the behavior:

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version :

  • Spark version :

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) :

  • Running on Docker? (yes/no) :

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

danny0405 (Contributor) commented:

Reasonable. What algorithm do you suggest for the size estimation? Did you go through the commit history to see why the initial size estimation looks like that?

TheR1sing3un (Member, Author) commented Oct 30, 2024:

Reasonable. What algorithm do you suggest for the size estimation?

Consider this case:

  • No bucket in the partition has data skew.
  • Every bucket has the same layout, for example:
    1. only a base file, or
    2. an optional base file plus the same number of log files.

The performance factors that affect a MoR snapshot read in the above case are:

  • base file size
  • total size of the log files
  • the repetition rate of the same primary key across log files (this affects the merge and causes random seeks in the external spillable map)

So it is hard to find a single size estimation algorithm that fits all cases. I think we can introduce read options for an estimation fraction, giving users a way to tune the parameters and improve performance in specific scenarios (a hypothetical sketch follows below).
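
A hypothetical illustration of such a fraction-based estimate; the option name and formula are made up for this example and are not an existing Hudi configuration:

```scala
// Hypothetical read option, e.g. "hoodie.read.logfile.size.estimation.fraction" (made up).
// The estimate weights the total log file size by a user-tunable fraction.
def estimatedSliceSize(baseFileBytes: Long,
                       logFileBytes: Seq[Long],
                       logFileFraction: Double): Long =
  baseFileBytes + (logFileBytes.sum * logFileFraction).toLong
```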

Did you go through the commit history to see why the initial size estimation looks like that?

From the commit history, it looks like it has been estimated this way from the start.

danny0405 (Contributor) commented:

We can get the accurate size of a file slice, right? Can that be utilized for the optimization?
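
For reference, under the simplified types from the first sketch, the accurate on-storage size of a slice would simply be:

```scala
// Exact size of a slice: base file length plus the sum of all log file lengths.
def accurateSliceSize(slice: SimpleFileSlice): Long =
  slice.baseFile.map(_.length).getOrElse(0L) + slice.logFiles.map(_.length).sum
```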

TheR1sing3un (Member, Author) commented:

We can get the accurate size of a file slice, right? Can that be utilized for the optimization?

Of course. However, the execution efficiency of a task is affected not only by the total size of the files in the slice, but also by the data repetition rate, the slice layout, and other factors. We can determine empirical values for common scenarios through testing and use them as the default optimization.
