[KYUUBI #6830] Allow indicate advisory shuffle partition size when me… #6831

yabola · 2024-12-02T04:53:04Z

Why are the changes needed?

When merging small files (with spark.sql.optimizer.insertRepartitionBeforeWrite.enabled=true), the session's default advisory partition size (64MB) is used as the target. This default can still produce small files, because data written in columnar file formats compresses well: the files are usually 1/4 or less of the shuffle exchange size, so the result is often around 15MB.
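As a rough illustration of the arithmetic above, here is a minimal sketch that assumes the written file is a fixed fraction of the shuffle partition size. The 1/4 ratio is the figure quoted above; the function name is hypothetical and is not part of Kyuubi or Spark.

```python
MB = 1024 * 1024


def estimated_file_size(advisory_bytes: int, compression_ratio: float) -> int:
    """Estimate the on-disk size of the file written from one shuffle partition.

    compression_ratio is written size divided by shuffle size (assumed constant).
    """
    return int(advisory_bytes * compression_ratio)


# With the default 64MB advisory size and a 1/4 ratio, each file is about 16MB.
print(estimated_file_size(64 * MB, 0.25) // MB)  # 16
```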

Spark now supports configuring the advisory size of the rebalance expression (apache/spark#40421), so we can add a separate configuration for the merge size.

Was this patch authored or co-authored using generative AI tooling?

no

pan3793 · 2024-12-02T05:03:13Z

The use case is already covered by:

spark.sql.optimizer.finalStageConfigIsolation.enabled=true
spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=512m

https://kyuubi.readthedocs.io/en/master/extensions/engines/spark/rules.html#additional-configurations
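For readers who want to try the two settings above, one possible way to pass them is as repeated --conf arguments on the Spark command line. The sketch below only renders those arguments. It assumes the Kyuubi Spark SQL extension is already enabled for the session, and the helper name to_conf_args is hypothetical.

```python
# The two settings quoted in the comment above, with the same values.
final_stage_conf = {
    "spark.sql.optimizer.finalStageConfigIsolation.enabled": "true",
    "spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes": "512m",
}


def to_conf_args(conf: dict) -> list:
    """Render settings as repeated `--conf key=value` command-line arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Prints a string that can be appended to a spark-submit or spark-sql command.
print(" ".join(to_conf_args(final_stage_conf)))
```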

codecov-commenter · 2024-12-02T06:01:02Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (c391d16) to head (345b58e).
Report is 9 commits behind head on master.

Additional details and impacted files

@@          Coverage Diff           @@
##           master   #6831   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         687     687           
  Lines       42442   42439    -3     
  Branches     5793    5792    -1     
======================================
+ Misses      42442   42439    -3


yabola · 2024-12-02T06:16:09Z

@pan3793 Yes, I hadn't noticed that before.
Another question: could we add per-format compression ratios, so that a different advisoryPartitionSizeInBytes is used for each file format (Parquet, ORC, Avro, text, etc.)? That would make the sizing more automatic. Iceberg has similar functionality. If you think it's okay, I can work on it.
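To make the idea concrete, here is a minimal sketch of how a per-format ratio could be turned into an advisory size. The ratios are made-up placeholders rather than measured values, and the names are hypothetical; this is neither an existing Kyuubi API nor how Iceberg implements it.

```python
MB = 1024 * 1024

# Hypothetical ratios of written file size to shuffle size (illustrative only).
ASSUMED_COMPRESSION_RATIO = {
    "parquet": 0.25,
    "orc": 0.25,
    "avro": 0.5,
    "text": 1.0,
}


def advisory_size_for(file_format: str, target_file_bytes: int) -> int:
    """Return the shuffle advisory size expected to yield files near the target size."""
    # Unknown formats fall back to a ratio of 1.0, i.e. no adjustment.
    ratio = ASSUMED_COMPRESSION_RATIO.get(file_format.lower(), 1.0)
    return int(target_file_bytes / ratio)


for fmt in ("parquet", "avro", "text"):
    print(fmt, advisory_size_for(fmt, 128 * MB) // MB)
```

Under these assumed ratios, a 128MB target file corresponds to a 512MB advisory size for Parquet, 256MB for Avro, and 128MB for text.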

[KYUUBI apache#6830] Allow indicate advisory shuffle partition size w…

345b58e

…hen merge small files

github-actions bot added module:spark module:extensions labels Dec 2, 2024
