Core, Spark: Refactor RewriteFileGroup planner to core #11513

Conversation
    return new RewriteFileGroup(info, Lists.newArrayList(tasks));
  }

  public static class RewritePlanResult {
Removed Serializable, as it was not required by the tests.
@RussellSpitzer, @szehon-ho: Could you please check if this is ok?
I would just hesitate to remove something that was there before. I think originally we wanted it because we were considering shipping this class around but I don't see any evidence that it is currently serialized.
Now that these are public, we need Javadocs describing their purpose.
I will add the javadoc.
Since we are making this public, I would remove Serializable. We can always add it back if needed, but I would restrict a public API to only what is strictly necessary.
Added the javadoc.
I think shipping around a stream is a non-trivial task, so I would keep it non-Serializable for now.
@@ -191,7 +191,7 @@ protected long inputSize(List<T> group) {
   * of output files. The final split size is adjusted to be at least as big as the target file size
   * but less than the max write file size.
   */
-  protected long splitSize(long inputSize) {
+  public long splitSize(long inputSize) {
This is a somewhat unrelated change. Flink will need to extract the split size, since in the Flink compaction implementation the planner (which needs its own rewriter) and the executor (where the rewriter is not used) run on different instances.
We could put it into the FileGroupInfo, but as a first run I opted for the minimal change, especially on public APIs.
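For context, the behavior of the `splitSize` method being made public can be sketched roughly as follows. This is a loose illustration of the clamping described in the javadoc (at least the target file size, below the max write file size), with hypothetical constants; it is not the actual Iceberg implementation.

```java
// Loose sketch of the split-size adjustment: split the input evenly,
// then clamp the result to [target file size, max write file size).
// Constants below are hypothetical defaults, not Iceberg's.
public class SplitSizeSketch {
  public static final long TARGET_FILE_SIZE = 512L * 1024 * 1024;
  public static final long MAX_WRITE_FILE_SIZE = 640L * 1024 * 1024;

  public static long splitSize(long inputSize) {
    long numOutputFiles = Math.max(1, inputSize / TARGET_FILE_SIZE);
    long evenSplit = inputSize / numOutputFiles;
    // at least as big as the target file size, but less than the max write file size
    return Math.min(Math.max(evenSplit, TARGET_FILE_SIZE), MAX_WRITE_FILE_SIZE - 1);
  }

  public static void main(String[] args) {
    System.out.println(splitSize(1600L * 1024 * 1024));
  }
}
```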
For easier review it would be great if you could highlight the changes made. I see the note on RewritePlanResult, but I'm not sure where this class came from. It is unclear what has been extracted and renamed and what is new code.
import org.slf4j.LoggerFactory;

/**
 * Checks the files in the table, and using the {@link FileRewriter} plans the groups for
I'm a little confused about the inheritance structure here. If we want to make this common, maybe we should just have this belong to the FileRewriter as a base class?
If I can think through this correctly in this draft we have
RewriteFileGroupPlanner is responsible for:
- Scanning the table
- Grouping tasks by partition
- Further grouping and filtering in FileRewriter

FileRewriter is responsible for:
- Filtering tasks based on rewriter config
- Grouping tasks within partitions
Feels like all of that could just be in the FileRewriter without having a separate GroupPlanner? Do we have an argument against that? I think we could even just have some new default methods on the FileRewriter interface for planning?
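A minimal sketch of what such default planning methods might look like, assuming hypothetical names and a simplified type parameter; this is illustrative only, not the actual Iceberg interface:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: fold planning into the rewriter interface via
// default methods instead of a separate planner class. All names and
// shapes here are stand-ins, not the real Iceberg API.
public class RewriterDefaultsSketch {
  interface FileRewriter<T> {
    // existing responsibility: filter tasks based on rewriter config
    boolean shouldRewrite(T task);

    // proposed default: group the surviving tasks by partition
    default Map<String, List<T>> planGroups(List<T> tasks, Function<T, String> partitionOf) {
      return tasks.stream()
          .filter(this::shouldRewrite)
          .collect(Collectors.groupingBy(partitionOf));
    }
  }

  public static void main(String[] args) {
    FileRewriter<String> rewriter = task -> !task.isEmpty(); // toy filter
    Map<String, List<String>> groups =
        rewriter.planGroups(List.of("a1", "a2", "b1", ""), t -> t.substring(0, 1));
    System.out.println(groups.size()); // two partitions remain: "a" and "b"
  }
}
```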
Let me check that. I was hesitant to change existing classes, but if we have a consensus, I'm happy to do that.
The `FileRewriter` is a generic class with `<T extends ContentScanTask<F>, F extends ContentFile<F>>`. The `RewriteFileGroup`, which is returned by the planning, is specific to `FileScanTask`.
I'm not sure that we use the `FileRewriter` with any other parametrization ATM, but to put everything into a single class we would need to do one of the following:
- Change to `FileRewriter<T extends FileScanTask>` instead of the current `FileRewriter<T extends ContentScanTask<F>, F extends ContentFile<F>>`, or
- Change `public List<FileScanTask> RewriteFileGroup.fileScans()` to return `public <T extends ContentScanTask<F>, F extends ContentFile<F>> List<T> fileScans`

Both of these would be breaking changes in the public API, so we should be careful about them.
We still could "duplicate" and create a new FileRewriter and deprecate the old one - but I would only do this if we think that the original API needs a serious refactor anyway.
WDYT?
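To make the two options concrete, here is a compilable sketch using stand-in stub types (simplified, not the real Iceberg classes):

```java
import java.util.Collections;
import java.util.List;

// Stub types illustrating the two breaking-change options discussed
// above. These are simplified stand-ins, not the actual Iceberg classes.
public class GenericsSketch {
  interface ContentFile<F> {}
  interface ContentScanTask<F> {}
  interface DataFile extends ContentFile<DataFile> {}
  interface FileScanTask extends ContentScanTask<DataFile> {}

  // Option 1: narrow the rewriter's type parameter to FileScanTask only.
  interface NarrowFileRewriter<T extends FileScanTask> {}

  // Option 2: widen RewriteFileGroup.fileScans() to any content task type.
  static class RewriteFileGroup {
    <T extends ContentScanTask<F>, F extends ContentFile<F>> List<T> fileScans() {
      return Collections.emptyList();
    }
  }

  public static void main(String[] args) {
    // with option 2, callers pick the task type explicitly
    List<FileScanTask> scans =
        new RewriteFileGroup().<FileScanTask, DataFile>fileScans();
    System.out.println(scans.size()); // 0
  }
}
```

Either way, existing callers that rely on the current signatures would stop compiling, which is why both count as breaking changes.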
I've been thinking a lot about the refactor. Trying to figure out what we should do.
we could possibly do

Rewriter[T] has a Planner[T] // responsible for generating tasks (move actual filtering into the rewriter)

Maybe we should also have

Planner[T]
- DeletePlanner[DeleteScanTask]
- FilePlanner[FileScanTask]

So then you would have

SizeBasedDataRewriter<FileScanTask, DataFile>
  private planner = new FilePlanner<FileScanTask>
Would that end up being more complicated? I feel like we are getting some interdependencies here I'm not comfortable with extending. @aokolnychyi Probably has some smart things to say here as well
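Sketching that composition with illustrative stand-in types (the generics, names, and the trivial grouping below are assumptions for illustration, not a proposal-level design):

```java
import java.util.List;

// Illustrative sketch of the composition above: a Rewriter<T> owning a
// Planner<T>, with task-type-specific planner implementations.
public class PlannerCompositionSketch {
  interface Planner<T> {
    List<List<T>> plan(List<T> tasks);
  }

  // stand-in for FilePlanner[FileScanTask]; trivially puts all tasks in one group
  static class FilePlanner<T> implements Planner<T> {
    @Override
    public List<List<T>> plan(List<T> tasks) {
      return List.of(tasks);
    }
  }

  // the rewriter delegates group planning to its planner
  static class Rewriter<T> {
    private final Planner<T> planner;

    Rewriter(Planner<T> planner) {
      this.planner = planner;
    }

    List<List<T>> planGroups(List<T> tasks) {
      return planner.plan(tasks);
    }
  }

  public static void main(String[] args) {
    Rewriter<String> rewriter = new Rewriter<>(new FilePlanner<>());
    System.out.println(rewriter.planGroups(List.of("a", "b")).size()); // 1 group
  }
}
```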
// which belong to multiple partitions in the current spec. Treating all such files as
// un-partitioned and grouping them together helps to minimize new files made.
StructLike taskPartition =
    task.file().specId() == table.spec().specId() ? task.file().partition() : emptyStruct;
Minor fix; we can drop this now and use EmptyStructLike.get()
Do we want to make `EmptyStructLike` and `EmptyStructLike.get` public?
Currently it is package private.
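For reference, an empty struct behaves roughly like this sketch (a simplified stand-in interface, not the actual `org.apache.iceberg.StructLike`): it reports zero fields, and any positional access is an error. The test failure shown later in this thread stems from exactly this property, since code that hashes a struct by position cannot handle a struct whose `get` always throws.

```java
// Simplified sketch of an empty struct: size zero, positional access
// rejected. The interface shape here is a stand-in for StructLike.
public class EmptyStructSketch {
  interface StructLike {
    int size();
    <T> T get(int pos, Class<T> javaClass);
  }

  static final StructLike EMPTY = new StructLike() {
    @Override
    public int size() {
      return 0;
    }

    @Override
    public <T> T get(int pos, Class<T> javaClass) {
      throw new UnsupportedOperationException("Can't retrieve values from an empty struct");
    }
  };

  public static void main(String[] args) {
    System.out.println(EMPTY.size()); // 0
  }
}
```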
Made `EmptyStructLike` public for now, but if you think this change isn't worth making it public, I can revert this part.
Got this exception:

TestRewriteDataFilesAction > testBinPackRewriterWithSpecificOutputSpec() > formatVersion = 3 FAILED
    java.lang.UnsupportedOperationException: Can't retrieve values from an empty struct
        at org.apache.iceberg.EmptyStructLike.get(EmptyStructLike.java:40)
        at org.apache.iceberg.types.JavaHashes$StructLikeHash.hash(JavaHashes.java:96)
        at org.apache.iceberg.types.JavaHashes$StructLikeHash.hash(JavaHashes.java:75)
        at org.apache.iceberg.util.StructLikeWrapper.hashCode(StructLikeWrapper.java:96)
        at java.base/java.util.HashMap.hash(HashMap.java:338)

So reverted the `EmptyStructLike` changes.
Blargh
Let me amend this now, and thanks for taking a look anyway! So there was no significant new code in the refactor; the methods and the class got moved.
@RussellSpitzer: This is ready for another round if you have time.
Looks reasonable to me. Some minor nits.
@pvary and I were talking this over a bit. I think we really want to get a stronger division between "Planning" and "Execution", since the two are very intertwined right now. Ideally we end up in a situation where core has base classes that are only responsible for planning out exactly which files should be rewritten and in what logical groups they should be rewritten, while another class is responsible for the physical implementation of how that actually occurs. Currently I think our structure is: Action contains Rewriter, which extends a Planning Class with an Implementation. Because of this, the Action and rewriter talk back and forth, causing planning to occur both in the action and in parts of the rewriter code.
Force-pushed from 6e9370f to f236df9.
….3 tests. Disabling Spark 3.4, 3.3 compilation as well.
This is just a WIP for the change, which uses git file renames. I hope it is easier to review this way. After the initial reviews, I still need to do a few more steps.
}

@Override
public void initPlan(FileRewritePlan<I, T, F, G> plan) {
Right now this method is just used to set config from the plan in the rewrite executor. I'm wondering if we should just be setting those parameters directly on the rewriter rather than inheriting them from the plan?
If we are fairly confident that only `writeMaxFileSize` and `outputSpecId` come from the plan, then we can add `setWriteMaxFileSize` and `setOutputSpecId` to the `FileRewriteExecutor` API instead of `initPlan(FileRewritePlan)`.
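A sketch of that alternative: explicit setters on the executor instead of inheriting config through the plan. The names mirror the comment, but the interface shape is an assumption, not the merged API.

```java
// Hypothetical executor API with explicit config setters, replacing
// initPlan(FileRewritePlan) for the two values that come from the plan.
public class ExecutorConfigSketch {
  interface FileRewriteExecutor {
    void setWriteMaxFileSize(long bytes);
    void setOutputSpecId(int specId);
  }

  // simple implementation that records the configured values
  static class RecordingExecutor implements FileRewriteExecutor {
    long writeMaxFileSize;
    int outputSpecId;

    @Override
    public void setWriteMaxFileSize(long bytes) {
      this.writeMaxFileSize = bytes;
    }

    @Override
    public void setOutputSpecId(int specId) {
      this.outputSpecId = specId;
    }
  }

  public static void main(String[] args) {
    RecordingExecutor executor = new RecordingExecutor();
    executor.setWriteMaxFileSize(640L * 1024 * 1024);
    executor.setOutputSpecId(1);
    System.out.println(executor.outputSpecId); // 1
  }
}
```

The trade-off is that the executor's public surface grows by one method per config value, whereas `initPlan` keeps a single entry point but couples the executor to the plan type.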
As requested in #11497, extracted the Spark and Core related changes.
The change refactors the Spark RewriteFileGroup related code to core (`RewriteFileGroupPlanner`), so Flink could later reuse it.