[KYUUBI #6691] A new Spark SQL command to merge small files #6695
2. parser tests pass
cc @ulysses-you
...yuubi-extension-spark-3-5/src/main/scala/org/apache/kyuubi/sql/KyuubiSparkSQLExtension.scala
...ion-spark-3-5/src/main/scala/org/apache/kyuubi/sql/compact/CachePerformanceViewCommand.scala
@cxzl25 @ulysses-you all tests pass
more unit test
thank you @gabrywu , about the syntax, is there any reason to introduce
The answer is the same as the answer to the question: why use a procedure?
...nsion-spark-3-5/src/main/scala/org/apache/kyuubi/sql/compact/merge/PlainFileLikeMerger.scala
...extension-spark-3-5/src/main/scala/org/apache/kyuubi/sql/compact/CompressionCodecsUtil.scala
...ion-spark-3-5/src/main/scala/org/apache/kyuubi/sql/compact/CachePerformanceViewCommand.scala
@cxzl25 what do you think of the
@AngersZhuuuu can you help review this PR?
For the syntax part, given Delta and Iceberg's dominance in the lakehouse market, I suggest following either Delta's Additional information:
We'd better discuss the syntax and make a final decision on the dev mailing list ([email protected]); otherwise, the upcoming PR will still not use
An email thread to decide which one should be used: a command or a call procedure
Do you support compacting a single partition of a partitioned table?
Hi @turboFei, I removed this feature from this PR. It only supports partitioned tables internally.
Closing this PR; I will create a new one if apache/spark/pull/47190 is released in the next Spark version, v4.0.0.
🔍 Description
Issue References 🔗
This pull request closes #6691
Describe Your Solution 🔧
There are many cases in which a SQL statement generates small files, and we MUST merge them into bigger ones.
I created a new Spark SQL command to merge small files. It does not read and rewrite all of the records of a table; it merges files at the binary level. Taking a CSV table as an example, it only appends the byte array of one file to another, without reading or writing records.
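The binary-level idea for CSV can be illustrated with a minimal sketch. This is not the PR's Scala implementation: it is written in Python for brevity, the helper name `merge_plain_files` and the file names are hypothetical, and it assumes header-less, uncompressed part files that each end with a newline.

```python
import shutil
import tempfile
from pathlib import Path


def merge_plain_files(sources, target):
    """Append the raw bytes of each source file to target, in order.

    Hypothetical helper, not the PR's code. Assumes header-less,
    uncompressed CSV part files that each end with a newline, so plain
    concatenation yields a valid CSV file.
    """
    with open(target, "wb") as out:
        for src in sources:
            with open(src, "rb") as part:
                # Stream bytes in chunks; no record is parsed or re-encoded.
                shutil.copyfileobj(part, out)
    return target


# Usage: three small part files become one larger file.
workdir = Path(tempfile.mkdtemp())
parts = []
for i, rows in enumerate([b"1,a\n", b"2,b\n3,c\n", b"4,d\n"]):
    part = workdir / f"part-{i:05d}.csv"
    part.write_bytes(rows)
    parts.append(part)

merged = merge_plain_files(parts, workdir / "merged.csv")
merged_bytes = merged.read_bytes()
```

Because no row is deserialized, the cost is mostly I/O. The same shortcut does not carry over unchanged to formats that keep per-file footer metadata, such as Parquet, where simple concatenation would not produce a valid file.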
Syntax here
Types of changes 🔖
Test Plan 🧪
Behavior Without This Pull Request ⚰️
Behavior With This Pull Request 🎉
Related Unit Tests
Checklist 📝
Be nice. Be informative.