[KYUUBI #6830] Allow indicate advisory shuffle partition size when me… #6831

Open
wants to merge 1 commit into
base: master

Conversation

yabola
Contributor

@yabola yabola commented Dec 2, 2024

Why are the changes needed?

When merging small files (with spark.sql.optimizer.insertRepartitionBeforeWrite.enabled=true), the session's default advisory partition size (64MB) is used as the target. This default can still produce small files, because data written in columnar file formats compresses well (usually to 1/4 of the shuffle exchange size or less, so the resulting files are often around 15MB).
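To make the arithmetic above concrete, here is a minimal sketch. The helper name and the fixed 0.25 ratio are illustrative assumptions, not part of Kyuubi or Spark; real compression ratios depend on the data and the codec.

```python
def estimated_file_size_mb(advisory_partition_mb: float, compression_ratio: float) -> float:
    """Rough on-disk size of the file written from one shuffle partition.

    compression_ratio is (written size / shuffle size); 0.25 is an assumed
    value matching the "1/4 of the shuffle exchange size" rule of thumb.
    """
    if not 0 < compression_ratio <= 1:
        raise ValueError("compression_ratio must be in (0, 1]")
    return advisory_partition_mb * compression_ratio


# Default 64MB advisory size with an assumed 1/4 ratio.
print(estimated_file_size_mb(64, 0.25))  # 16.0
```

With the 64MB default this gives roughly 16MB per file, in line with the ~15MB observed above.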

Spark now supports configuring the advisory size of the rebalance expression (apache/spark#40421), so we can add a configuration that sets the merge size separately.

Was this patch authored or co-authored using generative AI tooling?

no

@pan3793
Member

pan3793 commented Dec 2, 2024

The use case is already covered by:

spark.sql.optimizer.finalStageConfigIsolation.enabled=true
spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=512m

https://kyuubi.readthedocs.io/en/master/extensions/engines/spark/rules.html#additional-configurations
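For anyone wiring this up, a small sketch of passing these settings as spark-submit style flags. The config keys are the ones quoted in this thread; the helper name is hypothetical, and it assumes the Kyuubi Spark SQL extension is already loaded in your deployment.

```python
# Keys and values are taken from this thread; adjust 512m to your target size.
FINAL_STAGE_CONF = {
    "spark.sql.optimizer.insertRepartitionBeforeWrite.enabled": "true",
    "spark.sql.optimizer.finalStageConfigIsolation.enabled": "true",
    "spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes": "512m",
}


def as_conf_args(conf: dict) -> list:
    """Render a config dict as a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


print(" ".join(as_conf_args(FINAL_STAGE_CONF)))
```

Whether these are better set per session or in the engine defaults depends on how the deployment is managed.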

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (c391d16) to head (345b58e).
Report is 9 commits behind head on master.

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #6831   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         687     687           
  Lines       42442   42439    -3     
  Branches     5793    5792    -1     
======================================
+ Misses      42442   42439    -3     


@yabola
Contributor Author

yabola commented Dec 2, 2024

@pan3793 Yes, I hadn't noticed that before.
Another question: can we add compression ratios so that advisoryPartitionSizeInBytes can differ per file format (parquet, orc, avro, text, etc.)? That would make it more automated. Iceberg has similar functionality. If you think it's okay, I can work on it.
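A rough sketch of what I have in mind: look up a per-format ratio and scale the advisory size so the written files land near a target size. The ratios below are placeholder assumptions for illustration only, and the names are hypothetical; this is not how Iceberg or Kyuubi implements it.

```python
# Assumed (written size / shuffle size) ratios per format; placeholders only.
ASSUMED_COMPRESSION_RATIO = {
    "parquet": 0.25,
    "orc": 0.25,
    "avro": 0.5,
    "text": 1.0,
}


def advisory_size_mb(file_format: str, target_file_mb: float,
                     default_ratio: float = 1.0) -> float:
    """Advisory shuffle partition size needed to hit target_file_mb on disk."""
    ratio = ASSUMED_COMPRESSION_RATIO.get(file_format.lower(), default_ratio)
    return target_file_mb / ratio


for fmt in ("parquet", "avro", "text"):
    print(fmt, advisory_size_mb(fmt, 128))
```

Unknown formats fall back to a ratio of 1.0, i.e. no scaling, which seems like the safest default.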

3 participants