
Core: Optimize computing user-facing state in data task #8346

Merged: 1 commit merged into apache:master on Aug 24, 2023

Conversation

@aokolnychyi (Contributor, author) commented Aug 17, 2023:

This PR optimizes computing the user-facing state in data tasks to reduce garbage and improve performance.

It is related to the work in #8336.

The github-actions bot added the core label on Aug 17, 2023.
@@ -45,31 +50,67 @@ protected FileScanTask self() {

  @Override
  protected FileScanTask newSplitTask(FileScanTask parentTask, long offset, long length) {
-    return new SplitScanTask(offset, length, parentTask);
+    return new SplitScanTask(offset, length, deletesSizeBytes(), parentTask);
@aokolnychyi (author):
This ensures the size of deletes is only computed once for all split tasks generated from the same file task.


@Override
public int filesCount() {
  return 1 + deletes.length;
}
@aokolnychyi (author):
Override this to avoid materializing deletesAsList just to compute the files count.

Contributor:

Nit: if we add more methods to the parent class, how can we make sure the new methods are also overridden here? Otherwise, they would probably accidentally materialize deletesAsList.

I don't think the above question is a blocker, but it would be great if we had some way/tests to detect that.

@aokolnychyi (author):

I would consider that separately from this PR. It would be unfortunate to materialize the list, but it would not be the end of the world.

Contributor:

> It would be unfortunate to materialize the list but it would not be the end of the world.

Of course. The current approach works for me.

Member:

Why not initialize the deletes list in the constructor? Then there would be no need for these overrides.

That would mean the BaseFileScanTask constructor needs to make a copy, but that's probably a good thing, since it would make the class more immutable. It will have to make the copy anyway, since List<DeleteFile> deletes() is the only way to get the deletes information out of the class, so it will be called sooner or later.
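
A minimal sketch of what this suggestion could look like (illustrative only; it is not the approach the PR takes), assuming the Guava ImmutableList already used elsewhere in the class:

// Copy the deletes into an immutable list once at construction time;
// deletes() then returns the cached list with no per-call copy.
private final List<DeleteFile> deletesAsList;

public BaseFileScanTask(DeleteFile[] deletes /* other args omitted */) {
  this.deletesAsList = ImmutableList.copyOf(deletes); // single defensive copy
}

@Override
public List<DeleteFile> deletes() {
  return deletesAsList; // no lazy overrides needed
}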


@Override
public int filesCount() {
  return fileScanTask.filesCount();
}
@aokolnychyi (author):

Delegate to the parent task so that the split task also avoids materializing the list of deletes just to compute the files count.


@advancedxy (Contributor) left a comment:

Thanks for pinging me.

Another meta question: I see the lazy-cache pattern in a lot of places in the code base. Is there any simple construct that behaves like Scala's lazy val xxx = ...? I did some quick research and didn't find one. It would be nice if we could add such a building block to a utility class (not in this PR's scope).
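
A minimal sketch of what such a building block could look like (hypothetical; no such utility exists in Iceberg as of this PR), mirroring the racy-but-benign volatile caching style used elsewhere in the code base:

import java.util.function.Supplier;

// Roughly equivalent to Scala's `lazy val`. The null check is racy, but
// benign for pure suppliers: concurrent callers may compute the value
// twice, yet each always observes a fully constructed result.
public final class Lazy<T> {
  private final Supplier<T> supplier;
  private volatile T value = null;

  public Lazy(Supplier<T> supplier) {
    this.supplier = supplier;
  }

  public T get() {
    if (value == null) {
      this.value = supplier.get();
    }
    return value;
  }
}

Usage would then be something like private final Lazy<List<DeleteFile>> deletesAsList = new Lazy<>(() -> ImmutableList.copyOf(deletes));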

@@ -28,6 +28,10 @@ public class BaseFileScanTask extends BaseContentScanTask<FileScanTask, DataFile
      implements FileScanTask {
    private final DeleteFile[] deletes;

+   // lazy variables
+   private transient volatile List<DeleteFile> deletesAsList = null;
+   private transient volatile Long deletesSizeBytes = null;
Contributor:

Same as the long vs. Long discussion in BaseContentScanTask; I would prefer long instead.

Java's Long adds at least 16/24 bytes of object header overhead compared to long. There is also boxing/unboxing overhead when returning it as long.

@aokolnychyi (author):

I thought about that but was not sure about readability. I switched.

@aokolnychyi (author) commented Aug 19, 2023:

Using primitive values initialized to a custom value is not safe with custom serialization. Some Flink tests started to fail because variables were initialized to 0 after deserialization. Switched back to Long. Yes, we would box, but I am not sure computing this value in the constructor would be a good idea for BaseFileScanTask, which is constructed by planFiles and may not be used for split planning.
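
To illustrate the failure mode (a hypothetical minimal class, not Iceberg code): Java deserialization does not run field initializers, so a transient primitive comes back as 0 and silently defeats any non-zero sentinel:

import java.io.Serializable;

class CachedSize implements Serializable {
  private transient long cached = -1L; // -1 is meant to mean "not computed"

  long value() {
    if (cached == -1L) {
      this.cached = expensiveCompute();
    }
    // After deserialization, cached == 0 rather than -1, so the branch
    // above is skipped and a bogus 0 is returned instead of the real value.
    return cached;
  }

  private long expensiveCompute() {
    return 42L;
  }
}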

Contributor:

Thanks for the detailed explanation.

On second thought, how about declaring it as a normal transient long, such as:

private transient volatile long deletesSizeBytes = 0;

private long deletesSizeBytes() {
  if (deletesSizeBytes == 0) { // deletesSizeBytes might not be initialized yet
    long size = 0L;
    for (DeleteFile deleteFile : deletes) {
      size += deleteFile.fileSizeInBytes();
    }
    this.deletesSizeBytes = size;
  }

  return deletesSizeBytes;
}

We just need to pay for one small additional check in the no-delete-files case: iterating over an empty array.

@aokolnychyi (author):

This should probably work and would avoid boxing and the extra serialization overhead. I switched, but I use an extra check on the size of the deletes array; I think that's more obvious to the reader.
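
A sketch of the resulting pattern (field and method names follow this thread; the exact final code may differ). Here 0 doubles as the "not computed" marker, which is also the post-deserialization default, and the array-length guard makes the no-deletes case explicit:

private transient volatile long deletesSizeBytes = 0L;

private long deletesSizeBytes() {
  // 0 means "not computed yet" and is also what Java deserialization
  // leaves in a transient long, so the value is safely recomputed after
  // a round trip through a serializer.
  if (deletesSizeBytes == 0L && deletes.length > 0) {
    long size = 0L;
    for (DeleteFile deleteFile : deletes) {
      size += deleteFile.fileSizeInBytes();
    }
    this.deletesSizeBytes = size;
  }

  return deletesSizeBytes;
}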

  }

  @Override
  public Schema schema() {
    return super.schema();
  }

  private long deletesSizeBytes() {
    if (deletesSizeBytes == null) {
Contributor:

I took another look at this: how about we calculate the deletesSizeBytes field in the constructor directly?

Then there's no need to special-case a -1 or Long.MIN_VALUE any more...

@aokolnychyi (author) commented Aug 18, 2023:

That would mean potentially serializing an extra field (while sending tasks to executors). Given that there can be many millions of such objects and fields, I'd probably not do it.

Contributor:

8 bytes (size of long) * 1,000,000 (1 million) = ~8 MB. I wouldn't care too much about this, especially since the tasks are serialized to multiple executors in multiple rounds (in the Spark query engine).

However, it does add unnecessary overhead for a ScanTask without delete files, so a transient long and lazy calculation would be nice.

@aokolnychyi (author):

It is not the size but rather the need to serialize extra values. If there are 1M files and each of them has 4 row groups, that is 4M values to serialize on the driver. If we read 10M files, that's around 40M extra values.

The new approach should avoid both the serialization and the boxing overhead.

Contributor:

Just checked the Spark code (Flink probably couldn't handle that scale); it seems I was misunderstanding how tasks are serialized. I used to think only the specific task was serialized and sent to the executor for one partition.

It seems the RDD is serialized as a whole task binary for each task and sent to the executor. That definitely adds a lot of overhead, even for one extra field.


public BaseScanTaskGroup(StructLike groupingKey, Collection<T> tasks) {
  Preconditions.checkNotNull(tasks, "tasks cannot be null");
  this.groupingKey = groupingKey;
  this.tasks = tasks.toArray();
  this.taskCollection = Collections.unmodifiableCollection(tasks);
}
@aokolnychyi (author) commented Aug 18, 2023:

There is no need to create a list of tasks unless this task group has been serialized. We pass a collection here that gets immediately converted to an array for serialization purposes. This task group is then still accessed on the driver via the public tasks() method, which would create another collection when we could have used the one that was passed to the constructor. That is not worth it if we have millions of files.
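
A sketch of the accessor this implies (assuming the transient taskCollection field from the snippet above, plus Guava's Lists and java.util.Collections; the final code may differ): on the driver the constructor-supplied collection is served directly, and the list is rebuilt from the array only after serialization has nulled the transient field out:

@Override
@SuppressWarnings("unchecked")
public Collection<T> tasks() {
  if (taskCollection == null) { // only null after deserialization
    List<T> list = Lists.newArrayListWithCapacity(tasks.length);
    for (Object task : tasks) {
      list.add((T) task);
    }
    this.taskCollection = Collections.unmodifiableCollection(list);
  }

  return taskCollection;
}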

}

@Override
public long estimatedRowsCount() {
@aokolnychyi (author) commented Aug 18, 2023:

Caching isn't my primary goal. When profiling distributed planning, I noticed we generate tons of garbage while planning task groups, and it sometimes takes up to 2/3 of the planning time just to plan groups for full table scans with millions of files. My primary motivation is to iterate over the array of tasks, instead of using the parent implementation with LongStream (which is slow and generates many unnecessary objects) or using an iterator-based approach (which still has unnecessary overhead). For scans with 10+ million files, this overhead adds up, especially when we are running low on memory.

Internally, we did have a cache of task groups that were reused in multiple Spark scans. These metrics are used for reporting stats to engines, so while caching isn't the primary goal, it seems simple enough to do and may be helpful if we also decide to cache task groups in the future.
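
A sketch of the loop-based approach described above (assuming the Object[] tasks field from the constructor snippet; the final code may differ slightly):

@Override
public long estimatedRowsCount() {
  // Plain loop over the backing array: no LongStream pipeline objects,
  // no boxing, no per-call iterator allocation.
  long estimatedRowsCount = 0L;
  for (Object task : tasks) {
    estimatedRowsCount += ((ScanTask) task).estimatedRowsCount();
  }
  return estimatedRowsCount;
}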

@aokolnychyi (author):

I decided to drop the caching step for now. We can add it later.

Member:

Why not just calculate these values in the constructor? I feel like we pass over this array a bunch of times, and we could just figure out these values at the beginning.

@aokolnychyi (author):

If we compute it in the constructor, it means we have to store it in variables and then serialize them. We discussed this a bit above. Also, it may not always be needed. Given our current use cases, we don't benefit from caching, so I decided to just optimize the computation itself for now.

@Override
public long sizeBytes() {
  if (sizeBytes == Long.MIN_VALUE) {
    this.sizeBytes = FileScanTask.super.sizeBytes();
@aokolnychyi (author):

This would only be called after serialization. We init sizeBytes in the constructor otherwise.

aokolnychyi changed the title from "Core: Lazily cache user-facing state in BaseFileScanTask" to "Core: Optimize computing user-facing state in data task" on Aug 19, 2023.
aokolnychyi force-pushed the improve-file-scan-task branch 2 times, most recently from afd584a to 5f007b6, on August 19, 2023 at 01:03.
@advancedxy (Contributor) left a comment:

LGTM



@@ -123,7 +170,19 @@ public boolean canMerge(ScanTask other) {

  @Override
  public SplitScanTask merge(ScanTask other) {
    SplitScanTask that = (SplitScanTask) other;
-    return new SplitScanTask(offset, len + that.length(), fileScanTask);
+    return new SplitScanTask(offset, len + that.length(), fileScanTask, deletesSizeBytes);
@jerqi (Contributor) commented Aug 22, 2023:

Should we use the method deletesSizeBytes() instead of the variable deletesSizeBytes? Should this place be consistent with the other places? We use the method deletesSizeBytes() at BaseFileScanTask.java line 50.

@aokolnychyi (author):

We use the variable on purpose to avoid triggering the computation: only use it if it has already been computed.

@aokolnychyi (author):

I should have added a comment, will do tomorrow.

Contributor:

SplitScanTask one = xxxx;
SplitScanTask two = xxx;
SplitScanTask three = one.merge(two); // now the variable deletesSizeBytes is zero
three.deletesSizeBytes(); // we will compute deletesSizeBytes
one.deletesSizeBytes();   // we will compute deletesSizeBytes again

It will recompute deletesSizeBytes repeatedly in this situation. Could you correct me if I am wrong?

@aokolnychyi (author) commented Aug 22, 2023:

When a SplitScanTask is created from a BaseFileScanTask during planning, we use deletesSizeBytes() to trigger the computation once per BaseFileScanTask instead of once per SplitScanTask (the number of split tasks usually matches the number of row groups). This works because each SplitScanTask has the same set of deletes as the parent task it was created from. After planning, adjacent SplitScanTasks in the same bin are merged together. While merging, we use the deletesSizeBytes variable (not the method), which should already be populated if the split tasks were created from a parent task. If the variable is not populated, it means the tasks were created via a separate process (like parsing). In that case, we don't know whether it is beneficial to compute the size of deletes, so we skip computing it unless requested.
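
The two call sites described above, side by side (taken from the diffs in this PR, with comments added here for illustration):

// In newSplitTask: call the method, so the parent computes the deletes
// size once and shares it with every split created from this task.
return new SplitScanTask(offset, length, parentTask, deletesSizeBytes());

// In merge: read the field, forwarding whatever is already cached
// (possibly unset) without forcing a computation for tasks that were
// created by other means, such as parsing.
return new SplitScanTask(offset, len + that.length(), fileScanTask, deletesSizeBytes);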

@aokolnychyi (author):

@jerqi, the use case above won't happen in practice as TableScanUtil would only use the merged task.

Contributor:

> When a SplitScanTask is created from a BaseFileScanTask during planning, we use deletesSizeBytes() to trigger the computation once per BaseFileScanTask instead of once per SplitScanTask [...]

Thanks for your explanation, I get your point. Makes sense.

@jerqi (Contributor) commented Aug 22, 2023:

> @jerqi, the use case above won't happen in practice as TableScanUtil would only use the merged task.

Thanks, I got it. I'm just a little worried that people will misuse them in the future. It's OK for me.

@aokolnychyi (author):

Here are some numbers with #8336 and this change.

After the changes:

Benchmark                                                                   Mode  Cnt           Score            Error   Units
TaskGroupPlanningBenchmark.planTaskGroups                                     ss    5           1.989 ±          0.097    s/op
TaskGroupPlanningBenchmark.planTaskGroups:·async                              ss                  NaN                      ---
TaskGroupPlanningBenchmark.planTaskGroups:·gc.alloc.rate                      ss    5         569.803 ±         30.506  MB/sec
TaskGroupPlanningBenchmark.planTaskGroups:·gc.alloc.rate.norm                 ss    5  1499818088.000 ±    3617083.698    B/op
TaskGroupPlanningBenchmark.planTaskGroups:·gc.churn.PS_Eden_Space             ss    5         603.209 ±       3180.736  MB/sec
TaskGroupPlanningBenchmark.planTaskGroups:·gc.churn.PS_Eden_Space.norm        ss    5  1596666675.200 ± 8422157507.465    B/op
TaskGroupPlanningBenchmark.planTaskGroups:·gc.churn.PS_Survivor_Space         ss    5          25.930 ±        149.334  MB/sec
TaskGroupPlanningBenchmark.planTaskGroups:·gc.churn.PS_Survivor_Space.norm    ss    5    68311168.000 ±  391184976.025    B/op
TaskGroupPlanningBenchmark.planTaskGroups:·gc.count                           ss    5           2.000                   counts
TaskGroupPlanningBenchmark.planTaskGroups:·gc.time                            ss    5          63.000                       ms

Before the changes:

Benchmark                                                                   Mode  Cnt            Score            Error   Units
TaskGroupPlanningBenchmark.planTaskGroups                                     ss    5            6.659 ±          0.214    s/op
TaskGroupPlanningBenchmark.planTaskGroups:·async                              ss                   NaN                      ---
TaskGroupPlanningBenchmark.planTaskGroups:·gc.alloc.rate                      ss    5         2529.137 ±         70.078  MB/sec
TaskGroupPlanningBenchmark.planTaskGroups:·gc.alloc.rate.norm                 ss    5  19052489617.600 ±    3591565.223    B/op
TaskGroupPlanningBenchmark.planTaskGroups:·gc.churn.PS_Eden_Space             ss    5         2559.777 ±       1165.034  MB/sec
TaskGroupPlanningBenchmark.planTaskGroups:·gc.churn.PS_Eden_Space.norm        ss    5  19285409792.000 ± 8829460580.255    B/op
TaskGroupPlanningBenchmark.planTaskGroups:·gc.churn.PS_Survivor_Space         ss    5           38.440 ±         29.535  MB/sec
TaskGroupPlanningBenchmark.planTaskGroups:·gc.churn.PS_Survivor_Space.norm    ss    5    289638648.000 ±  225080948.753    B/op
TaskGroupPlanningBenchmark.planTaskGroups:·gc.count                           ss    5           22.000                   counts
TaskGroupPlanningBenchmark.planTaskGroups:·gc.time                            ss    5          908.000                       ms

Apart from the time, there is a 10x+ reduction in allocation rate, and only 2 garbage collections vs. 22, which would make a big difference if we are running low on memory.

@@ -45,31 +47,66 @@ protected FileScanTask self() {

  @Override
  protected FileScanTask newSplitTask(FileScanTask parentTask, long offset, long length) {
-    return new SplitScanTask(offset, length, parentTask);
+    return new SplitScanTask(offset, length, parentTask, deletesSizeBytes());
Contributor:

Why do we use the method deletesSizeBytes() here? I can't see how it differs from line 178.

@aokolnychyi (author):

I gave a bit of an explanation above; let me know if that makes sense.

Contributor:

Makes sense. I got it.

@RussellSpitzer (Member) left a comment:

This looks fine to me in general, but do we have any tests making sure these methods work correctly? I may be wrong here, but I feel like we are changing some of these behaviors, and I'd normally expect that to break some tests.

public int filesCount() {
  int filesCount = 0;
  for (FileScanTask task : tasks) {
    filesCount += task.filesCount();
  }
  return filesCount;
}
Member:

This is counting delete files that may be read for multiple data files multiple times; is that OK?

Member:

Same issue for bytes too, right?

@aokolnychyi (author):

Yeah, that's the same behavior we have today. For example, two data tasks that reference the same delete file each count it once, so the group-level count includes it twice. I agree it is questionable, but it is also useful for estimating the actual bytes read for delete files and how things overlap.

@aokolnychyi (author):

The behavior should be exactly as before. There were some tests for planning as well as for Spark and Flink serialization. I'll find and post them tomorrow.

@advancedxy (Contributor) left a comment:

lgtm

@aokolnychyi (author):

There are tests in TestDataTableScan, TestBatchScans, and TestTableScanUtil, plus each engine has serialization tests. I feel like they cover most of what we need.

aokolnychyi merged commit 181d3e2 into apache:master on Aug 24, 2023.
42 checks passed
@aokolnychyi (author):

Thanks for reviewing, @advancedxy @jerqi @RussellSpitzer!

  }

  @Override
  public List<DeleteFile> deletes() {
    return ImmutableList.copyOf(deletes);
  }
Member:

Just noticed that in the Iceberg version used in Trino, deletes() does a copy on every invocation. Thank you for fixing this!


