[SUPPORT] PartitionedFile's size estimation in FileSourceScanExec#createReadRDD when enabling NewHoodieParquetFileFormat #12139
Labels
feature-enquiry (issue contains feature enquiries/requests or great improvement ideas)
performance
spark (issues related to Spark)
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at [email protected].
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
I have a question from reading the source code of NewHoodieParquetFileFormat and HoodieFileIndex. When we enable hoodie.datasource.read.use.new.parquet.file.format, Hudi provides a HadoopFsRelation backed by NewHoodieParquetFileFormat. FileSourceScanExec#createReadRDD then queries the needed partitions, and relation.location.listFiles in this case is redirected to HoodieFileIndex#listFiles. There I find that, for each file slice, one log file is picked to build the PartitionedFile.
My question is: should we choose the log file with the median size (of the slice's log files sorted by size) as the PartitionedFile, rather than a random log file? Spark merges multiple PartitionedFiles into a FilePartition based on the size of each PartitionedFile. So for a PartitionedFile that actually represents a FileSlice, I think we should choose a more representative file size.
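To make the concern concrete, here is a small, self-contained sketch (Python for brevity; the real code paths are in Scala). The helper names `median_size_log_file` and `pack_into_partitions` are mine, not Hudi or Spark APIs, and the packing helper only loosely mirrors the greedy bin-packing Spark does in `FilePartition.getFilePartitions`:

```python
# Sketch: choosing a representative log file per file slice, and why
# the chosen size matters for Spark's partition packing.
# Names below are illustrative only, not actual Hudi/Spark APIs.

def median_size_log_file(log_files):
    """Pick the log file whose size is the median of the slice's log files."""
    ordered = sorted(log_files, key=lambda f: f["size"])
    return ordered[len(ordered) // 2]

def pack_into_partitions(files, max_partition_bytes):
    """Greedy packing, loosely mirroring Spark's FilePartition.getFilePartitions:
    iterate files sorted by size descending, open a new partition when the
    current one would overflow max_partition_bytes."""
    partitions, current, current_bytes = [], [], 0
    for f in sorted(files, key=lambda f: f["size"], reverse=True):
        if current and current_bytes + f["size"] > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(f)
        current_bytes += f["size"]
    if current:
        partitions.append(current)
    return partitions

# One file slice with skewed log-file sizes (bytes): a randomly picked log
# file could report 1 KB or 2 MB; the median is far more representative.
slice_logs = [{"name": "l1", "size": 1_000},
              {"name": "l2", "size": 50_000},
              {"name": "l3", "size": 2_000_000}]
print(median_size_log_file(slice_logs)["name"])  # l2
```

If the reported size is the smallest log file's, Spark may pack too many heavy slices into one FilePartition; if it is the largest, partitions end up underfilled. The median avoids both extremes without scanning all log files' contents.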
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
Hudi version :
Spark version :
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) :
Running on Docker? (yes/no) :
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.