Don't rewrite single small file per partition during Optimize #18938

homar · 2023-09-05T21:53:57Z

Description

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.

findinpath · 2023-09-06T19:09:00Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeTableHandle.java

@@ -62,6 +63,7 @@ public enum WriteType

    // OPTIMIZE only. Coordinator-only
    private final boolean recordScannedFiles;
+    private final Optional<DeltaLakeTableExecuteHandle> executeHandle;


Reduce the dependency here. From reading the code, DeltaLakeSplitManager needs to know from the DeltaLakeTableHandle whether we're in an OPTIMIZE situation. If so, let's just add a boolean flag to indicate this.
Do reflect this in the equals() and hashCode() methods as well.
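
For illustration, here is a minimal sketch of what such a flag could look like. `TableHandleSketch` and its fields are hypothetical stand-ins, not the real DeltaLakeTableHandle; the point is only that the new boolean takes part in both equals() and hashCode().

```java
import java.util.Objects;

// Hypothetical, heavily simplified stand-in for a table handle (assumption, not Trino code)
final class TableHandleSketch
{
    private final String tableName;
    private final boolean recordScannedFiles;
    private final boolean isOptimize; // assumed name for the new flag

    TableHandleSketch(String tableName, boolean recordScannedFiles, boolean isOptimize)
    {
        this.tableName = Objects.requireNonNull(tableName, "tableName is null");
        this.recordScannedFiles = recordScannedFiles;
        this.isOptimize = isOptimize;
    }

    boolean isOptimize()
    {
        return isOptimize;
    }

    @Override
    public boolean equals(Object o)
    {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        TableHandleSketch that = (TableHandleSketch) o;
        // The flag is compared together with the other fields
        return recordScannedFiles == that.recordScannedFiles
                && isOptimize == that.isOptimize
                && tableName.equals(that.tableName);
    }

    @Override
    public int hashCode()
    {
        // The flag is hashed as well, keeping hashCode() consistent with equals()
        return Objects.hash(tableName, recordScannedFiles, isOptimize);
    }
}
```

With this shape, two handles that differ only in the flag are not equal, so a handle prepared for OPTIMIZE cannot be confused with a plain scan handle.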

findinpath · 2023-09-06T19:11:20Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplitManager.java

+            ImmutableList<String> partitionColumns = metadataEntry.getOriginalPartitionColumns().stream()
+                    .map(partitionColumnMapping::get).collect(toImmutableList());


Suggested change

-            ImmutableList<String> partitionColumns = metadataEntry.getOriginalPartitionColumns().stream()
-                    .map(partitionColumnMapping::get).collect(toImmutableList());
+            ImmutableList<String> partitionColumns = metadataEntry.getOriginalPartitionColumns().stream()
+                    .map(partitionColumnMapping::get)
+                    .collect(toImmutableList());

Gentle reminder.

findinpath · 2023-09-06T19:37:18Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplitManager.java

@@ -198,6 +221,11 @@ private Stream<DeltaLakeSplit> getSplits(
                        return Stream.empty();
                    }

+                    // no need to rewrite small file that is the only one in its partition
+                    if (isOptimize && filesPerBucket.get(getPartitionKey(originalPartitionColumns, addAction)) <= 1 && maxScannedFileSizeInBytes.isPresent() && addAction.getSize() < maxScannedFileSizeInBytes.get()) {


maxScannedFileSizeInBytes.isPresent() && addAction.getSize() < maxScannedFileSizeInBytes.get()

If we have only one file in the partition, there is nothing to optimize, right? The above highlighted condition is not necessary.

If it is bigger than maxScannedFileSizeInBytes, then it should be optimized and split into 2 files.
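
A minimal sketch of the condition under discussion, assuming the reading in the reply above: a lone file is skipped only when it is also below the size threshold. The class, method, and parameter names are hypothetical; only the shape of the boolean expression follows the diff.

```java
import java.util.Optional;

final class SkipRuleSketch
{
    private SkipRuleSketch() {}

    // Returns true when OPTIMIZE can leave the file untouched (hypothetical helper)
    static boolean canSkipRewrite(
            boolean isOptimize,
            boolean onlyFileInPartition,
            long fileSizeInBytes,
            Optional<Long> maxScannedFileSizeInBytes)
    {
        return isOptimize
                && onlyFileInPartition
                && maxScannedFileSizeInBytes.isPresent()
                && fileSizeInBytes < maxScannedFileSizeInBytes.get();
    }

    public static void main(String[] args)
    {
        // A small lone file is skipped; a lone file above the threshold is not
        System.out.println(canSkipRewrite(true, true, 10, Optional.of(100L)));
        System.out.println(canSkipRewrite(true, true, 200, Optional.of(100L)));
    }
}
```

Dropping the size check, as the reviewer proposes, would make the second call return true as well.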

findinpath · 2023-09-06T19:41:10Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplitManager.java

+        return counters;
+    }
+
+    private String getPartitionKey(List<String> partitionColumns, AddFileEntry addFileEntry)


Instead of this rather ad-hoc mechanism, wouldn't it be enough to make use of io.trino.plugin.deltalake.transactionlog.AddFileEntry#getCanonicalPartitionValues as a key? I'm guessing that this map can be used for equality.

An alternative would be a list of deserialized partition values io.trino.plugin.deltalake.transactionlog.TransactionLogParser#deserializePartitionValue (probably not necessary though).

I was kind of afraid that using a map as a key in another map might not be reliable.

The map is just an object with a hashcode - I don't think it matters whether you use a string or the canonicalPartitionValues as the key in the map.

So if it doesn't matter, why do you insist? ;) I am kind of afraid of things like Set<Map<String, Optional<String>>>. I can do it, but I'm not a big fan.

I'm insisting because I don't see the point of having special logic for getting the stringified "partition key".
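
To make the point about map keys concrete, here is a small sketch. It assumes partition values shaped like `Map<String, Optional<String>>`, as in the thread; the column names and the `partitionValues` helper are made up. Java's `Map.equals` and `Map.hashCode` are defined over the entries, so two separately built maps with the same content behave as the same key, provided they are not mutated while in use as keys.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class PartitionKeySketch
{
    private PartitionKeySketch() {}

    // Hypothetical helper: partition values for two made-up columns.
    // A null argument stands for a null partition value.
    static Map<String, Optional<String>> partitionValues(String region, String day)
    {
        // Map.of returns an immutable map, which is safe to use as a key
        return Map.of(
                "region", Optional.ofNullable(region),
                "day", Optional.ofNullable(day));
    }

    // Counts distinct partitions by using the maps themselves as set elements
    static int countDistinctKeys(List<Map<String, Optional<String>>> keys)
    {
        return new HashSet<>(keys).size();
    }

    public static void main(String[] args)
    {
        Map<String, Optional<String>> first = partitionValues("eu", null);
        Map<String, Optional<String>> second = partitionValues("eu", null);
        Map<String, Optional<String>> other = partitionValues("us", "2023-09-06");

        // Equal content means equal keys, even for distinct instances
        System.out.println(first.equals(second));
        System.out.println(countDistinctKeys(List.of(first, second, other)));
    }
}
```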

findinpath · 2023-09-06T20:20:14Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplitManager.java

@@ -236,6 +264,33 @@ private Stream<DeltaLakeSplit> getSplits(
                });
    }

+    private Map<String, Long> countFilesPerPartition(List<String> partitionColumns, List<AddFileEntry> addFileEntries)


IIUC we want to keep (in case we're doing OPTIMIZE) only the AddFileEntries with duplicate canonical partition values per partition/table (in case the table is not partitioned).

We don't necessarily need to do counting in order to keep only the entries which have duplicate canonical partition values.

Here's a chatgpt proof of concept

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class Main
    {
        public static void main(String[] args)
        {
            // Create a list of elements (strings)
            List<String> elements = Arrays.asList("apple", "banana", "cherry", "banana", "date", "banana", "apple", "fig");

            // Create a set to store elements seen before
            Set<String> seenElements = new HashSet<>();

            // Create a set to store elements that occur more than once
            Set<String> multipleOccurrenceElements = new HashSet<>();

            // Iterate through the list to find elements that occur more than once
            for (String element : elements) {
                if (!seenElements.add(element)) {
                    // If the element was already in the set, it occurs more than once
                    multipleOccurrenceElements.add(element);
                }
            }

            // Print elements that occur more than once
            for (String element : elements) {
                if (multipleOccurrenceElements.contains(element)) {
                    System.out.println(element);
                }
            }
        }
    }

yep, great idea, thank you
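
The same idea can be sketched against arbitrary items and a key extractor, which is roughly what grouping data files by partition would need. This is an illustration only: `keysWithMoreThanOneItem` and the path-based key in `main` are hypothetical, and the PR's actual helper may look different.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class DuplicateKeySketch
{
    private DuplicateKeySketch() {}

    // Returns the keys shared by at least two items, in a single pass and without counting
    static <T, K> Set<K> keysWithMoreThanOneItem(List<T> items, Function<T, K> keyOf)
    {
        Set<K> seen = new HashSet<>();
        Set<K> repeated = new HashSet<>();
        for (T item : items) {
            K key = keyOf.apply(item);
            // Set.add returns false when the key was already present
            if (!seen.add(key)) {
                repeated.add(key);
            }
        }
        return repeated;
    }

    public static void main(String[] args)
    {
        // Made-up file paths; the directory part plays the role of the partition key
        List<String> files = List.of("day=1/a.parquet", "day=1/b.parquet", "day=2/c.parquet");
        Set<String> crowded = keysWithMoreThanOneItem(files, path -> path.substring(0, path.indexOf('/')));
        System.out.println(crowded);
    }
}
```

A file whose key is absent from the returned set is the only file in its partition, which is the case the size check then decides on.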

findinpath · 2023-09-12T07:59:48Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeTableHandle.java

                maxScannedFileSize,
                readVersion);
    }

-    public DeltaLakeTableHandle forOptimize(boolean recordScannedFiles, DataSize maxScannedFileSize)
+    public DeltaLakeTableHandle forOptimize(boolean recordScannedFiles, DataSize maxScannedFileSize, boolean isOptimize)


nit: isOptimize is redundant because the method name is forOptimize

Did you remove the method parameter?

findinpath · 2023-09-12T08:07:22Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplitManager.java

@@ -198,6 +219,11 @@ private Stream<DeltaLakeSplit> getSplits(
                        return Stream.empty();
                    }

+                    // no need to rewrite small file that is the only one in its partition
+                    if (isOptimize && !partitionKeysWithMoreThanOneFile.contains(getPartitionKey(originalPartitionColumns, addAction)) && maxScannedFileSizeInBytes.isPresent() && addAction.getSize() < maxScannedFileSizeInBytes.get()) {


Suggested change

-                    if (isOptimize && !partitionKeysWithMoreThanOneFile.contains(getPartitionKey(originalPartitionColumns, addAction)) && maxScannedFileSizeInBytes.isPresent() && addAction.getSize() < maxScannedFileSizeInBytes.get()) {
+                    if (isOptimize && partitionKeysWithSingleFile.contains(getPartitionKey(originalPartitionColumns, addAction)) && maxScannedFileSizeInBytes.isPresent() && addAction.getSize() < maxScannedFileSizeInBytes.get()) {

~~The positive tense (instead of !) is a bit easier to follow from maintainer perspective.~~
Scratch this in case it doesn't fit with the optimized logic for retrieving partitions with more than one file.

findinpath · 2023-09-15T10:08:03Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplitManager.java

+        ImmutableSet.Builder<Map<String, Optional<String>>> partitionsWithAtMostOneFileBuilder = ImmutableSet.builder();
+        if (isOptimize) {
+            partitionsWithAtMostOneFileBuilder.addAll(findPartitionsWithAtMostOneFile(validDataFiles));
+        }
+        Set<Map<String, Optional<String>>> partitionsWithAtMostOneFile = partitionsWithAtMostOneFileBuilder.build();


Suggested change

-        ImmutableSet.Builder<Map<String, Optional<String>>> partitionsWithAtMostOneFileBuilder = ImmutableSet.builder();
-        if (isOptimize) {
-            partitionsWithAtMostOneFileBuilder.addAll(findPartitionsWithAtMostOneFile(validDataFiles));
-        }
-        Set<Map<String, Optional<String>>> partitionsWithAtMostOneFile = partitionsWithAtMostOneFileBuilder.build();
+        Set<Map<String, Optional<String>>> partitionsWithOneFileForOptimize = isOptimize ? findPartitionsWithAtMostOneFile(validDataFiles) : Set.of();

ebyhr · 2023-09-19T09:36:27Z

/test-with-secrets sha=c65e54edd7972f08a28bd9d46a525cfe58b7895a

github-actions · 2023-09-19T11:00:59Z

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/6234333370

cla-bot bot added the cla-signed label Sep 5, 2023

github-actions bot added the delta-lake Delta Lake connector label Sep 6, 2023

homar force-pushed the homar/prevent_optimize_for_rewriting_one_small_file branch 2 times, most recently from 8349345 to f8a05d9 Compare September 6, 2023 07:56

homar marked this pull request as ready for review September 6, 2023 10:30

homar force-pushed the homar/prevent_optimize_for_rewriting_one_small_file branch from f8a05d9 to 1c9aaa9 Compare September 6, 2023 10:32

homar requested a review from findinpath September 6, 2023 10:40

findinpath reviewed Sep 6, 2023

homar force-pushed the homar/prevent_optimize_for_rewriting_one_small_file branch from 1c9aaa9 to 8a577c1 Compare September 10, 2023 07:58

findinpath reviewed Sep 12, 2023

homar force-pushed the homar/prevent_optimize_for_rewriting_one_small_file branch 2 times, most recently from 1812d35 to d0ed6dd Compare September 14, 2023 15:34

Reformat DeltaLakeSplitManager

31634cf

homar force-pushed the homar/prevent_optimize_for_rewriting_one_small_file branch from d0ed6dd to a9f6fa0 Compare September 15, 2023 10:05

findinpath reviewed Sep 15, 2023

findinpath approved these changes Sep 15, 2023

homar force-pushed the homar/prevent_optimize_for_rewriting_one_small_file branch 2 times, most recently from ddcb629 to c65e54e Compare September 15, 2023 12:10

ebyhr reviewed Sep 19, 2023

Don't rewrite single small file per partition during Optimize

92a3941

homar force-pushed the homar/prevent_optimize_for_rewriting_one_small_file branch from c65e54e to 92a3941 Compare September 19, 2023 16:11

homar requested a review from ebyhr September 20, 2023 07:15

ebyhr approved these changes Sep 20, 2023

ebyhr merged commit 40335ae into trinodb:master Sep 20, 2023
23 checks passed

github-actions bot added this to the 427 milestone Sep 20, 2023

colebow mentioned this pull request Sep 25, 2023

Add Trino 427 release notes #19023

Merged

findinpath mentioned this pull request Dec 10, 2023

Convert the retrieval of the active data files to streaming #20054

Merged

findinpath mentioned this pull request Dec 13, 2023

OPTIMIZE of Delta tables copies small files unchanged #20088

Closed


