Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: omit unmodified files during merge write #1969

Merged
merged 14 commits into from
Dec 30, 2023

Conversation

Blajda
Copy link
Collaborator

@Blajda Blajda commented Dec 14, 2023

Description

Implements a new Datafusion node called MergeBarrier that determines which files have modifications. For files that do not have modifications a remove action is no longer created.

Related Issue(s)

@github-actions github-actions bot added binding/rust Issues for the Rust crate crate/core labels Dec 14, 2023
@Blajda
Copy link
Collaborator Author

Blajda commented Dec 14, 2023

Benchmarks
Before is main and After is this PR

+-------------------------------------------------------------------------------------+---------------------+--------------------+------------------------------------------+
| name                                                                                | before_duration_avg | after_duration_avg | before_duration_avg / after_duration_avg |
+-------------------------------------------------------------------------------------+---------------------+--------------------+------------------------------------------+
| delete_only_fileMatchedFraction_0.05_rowMatchedFraction_0.05                        | 6160.0              | 2928.0             | 2.1038251366120218                       |
| multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05            | 5416.0              | 2362.0             | 2.2929720575783237                       |
| multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.50            | 5510.0              | 2516.0             | 2.1899841017488075                       |
| multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0             | 5424.0              | 2527.0             | 2.146418678274634                        |
| upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.01_rowNotMatchedFraction_0.1   | 4938.0              | 2594.0             | 1.9036237471087125                       |
| upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.0_rowNotMatchedFraction_0.1    | 4368.0              | 2162.0             | 2.020351526364477                        |
| upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.0    | 5299.0              | 2582.0             | 2.052285050348567                        |
| upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.01   | 4606.0              | 2647.0             | 1.7400831129580658                       |
| upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.5_rowNotMatchedFraction_0.001  | 4835.0              | 2628.0             | 1.8398021308980212                       |
| upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.99_rowNotMatchedFraction_0.001 | 4866.0              | 2661.0             | 1.8286358511837655                       |
| upsert_fileMatchedFraction_0.05_rowMatchedFraction_1.0_rowNotMatchedFraction_0.001  | 4690.0              | 2679.0             | 1.7506532288167227                       |
| upsert_fileMatchedFraction_0.5_rowMatchedFraction_0.001_rowNotMatchedFraction_0.001 | 4878.0              | 2569.0             | 1.8987933047878551                       |
| upsert_fileMatchedFraction_1.0_rowMatchedFraction_0.001_rowNotMatchedFraction_0.001 | 4729.0              | 2630.0             | 1.7980988593155893                       |
+-------------------------------------------------------------------------------------+---------------------+--------------------+------------------------------------------+

Logs from benchmarks
Shows that the entire delta table is not rewritten during the operation.

Test: delete_only_fileMatchedFraction_0.05_rowMatchedFraction_0.05 Sample: 0
Total File count: 2114
File sample count: 127
MergeMetrics { num_source_rows: 225, num_target_rows_inserted: 0, num_target_rows_updated: 0, num_target_rows_deleted: 225, num_target_rows_copied: 3440, num_output_rows: 3440, num_target_files_added: 99, num_target_files_removed: 100, execution_time_ms: 2870, scan_time_ms: 0, rewrite_time_ms: 1940 }
Seconds: 2.928915
Test: multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05 Sample: 0
Total File count: 2114
File sample count: 107
MergeMetrics { num_source_rows: 159, num_target_rows_inserted: 159, num_target_rows_updated: 0, num_target_rows_deleted: 0, num_target_rows_copied: 0, num_output_rows: 159, num_target_files_added: 77, num_target_files_removed: 0, execution_time_ms: 2307, scan_time_ms: 0, rewrite_time_ms: 1447 }
Seconds: 2.362089
Test: multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.50 Sample: 0
Total File count: 2114
File sample count: 92
MergeMetrics { num_source_rows: 1533, num_target_rows_inserted: 1533, num_target_rows_updated: 0, num_target_rows_deleted: 0, num_target_rows_copied: 0, num_output_rows: 1533, num_target_files_added: 91, num_target_files_removed: 0, execution_time_ms: 2465, scan_time_ms: 0, rewrite_time_ms: 1590 }
Seconds: 2.516183
Test: multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0 Sample: 0
Total File count: 2114
File sample count: 116
MergeMetrics { num_source_rows: 3748, num_target_rows_inserted: 3748, num_target_rows_updated: 0, num_target_rows_deleted: 0, num_target_rows_copied: 0, num_output_rows: 3748, num_target_files_added: 116, num_target_files_removed: 0, execution_time_ms: 2475, scan_time_ms: 0, rewrite_time_ms: 1613 }
Seconds: 2.5279317
Test: upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.01_rowNotMatchedFraction_0.1 Sample: 0
Total File count: 2114
File sample count: 104
MergeMetrics { num_source_rows: 386, num_target_rows_inserted: 359, num_target_rows_updated: 27, num_target_rows_deleted: 0, num_target_rows_copied: 970, num_output_rows: 1356, num_target_files_added: 91, num_target_files_removed: 23, execution_time_ms: 2540, scan_time_ms: 0, rewrite_time_ms: 1817 }
Seconds: 2.5945745
Test: upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.0_rowNotMatchedFraction_0.1 Sample: 0
Total File count: 2114
File sample count: 91
MergeMetrics { num_source_rows: 273, num_target_rows_inserted: 273, num_target_rows_updated: 0, num_target_rows_deleted: 0, num_target_rows_copied: 0, num_output_rows: 273, num_target_files_added: 85, num_target_files_removed: 0, execution_time_ms: 2109, scan_time_ms: 0, rewrite_time_ms: 1409 }
Seconds: 2.162729
Test: upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.0 Sample: 0
Total File count: 2114
File sample count: 109
MergeMetrics { num_source_rows: 354, num_target_rows_inserted: 0, num_target_rows_updated: 354, num_target_rows_deleted: 0, num_target_rows_copied: 2918, num_output_rows: 3272, num_target_files_added: 95, num_target_files_removed: 95, execution_time_ms: 2528, scan_time_ms: 0, rewrite_time_ms: 1826 }
Seconds: 2.5823224
Test: upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.01 Sample: 0
Total File count: 2114
File sample count: 110
MergeMetrics { num_source_rows: 370, num_target_rows_inserted: 36, num_target_rows_updated: 334, num_target_rows_deleted: 0, num_target_rows_copied: 2954, num_output_rows: 3324, num_target_files_added: 101, num_target_files_removed: 99, execution_time_ms: 2592, scan_time_ms: 0, rewrite_time_ms: 1866 }
Seconds: 2.6472437
Test: upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.5_rowNotMatchedFraction_0.001 Sample: 0
Total File count: 2114
File sample count: 109
MergeMetrics { num_source_rows: 1956, num_target_rows_inserted: 6, num_target_rows_updated: 1950, num_target_rows_deleted: 0, num_target_rows_copied: 1920, num_output_rows: 3876, num_target_files_added: 108, num_target_files_removed: 108, execution_time_ms: 2574, scan_time_ms: 0, rewrite_time_ms: 1856 }
Seconds: 2.628581
Test: upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.99_rowNotMatchedFraction_0.001 Sample: 0
Total File count: 2114
File sample count: 101
MergeMetrics { num_source_rows: 3150, num_target_rows_inserted: 5, num_target_rows_updated: 3145, num_target_rows_deleted: 0, num_target_rows_copied: 40, num_output_rows: 3190, num_target_files_added: 101, num_target_files_removed: 101, execution_time_ms: 2606, scan_time_ms: 0, rewrite_time_ms: 1837 }
Seconds: 2.661284
Test: upsert_fileMatchedFraction_0.05_rowMatchedFraction_1.0_rowNotMatchedFraction_0.001 Sample: 0
Total File count: 2114
File sample count: 120
MergeMetrics { num_source_rows: 3785, num_target_rows_inserted: 2, num_target_rows_updated: 3783, num_target_rows_deleted: 0, num_target_rows_copied: 0, num_output_rows: 3785, num_target_files_added: 120, num_target_files_removed: 120, execution_time_ms: 2620, scan_time_ms: 0, rewrite_time_ms: 1895 }
Seconds: 2.679585
Test: upsert_fileMatchedFraction_0.5_rowMatchedFraction_0.001_rowNotMatchedFraction_0.001 Sample: 0
Total File count: 2114
File sample count: 1036
MergeMetrics { num_source_rows: 72, num_target_rows_inserted: 39, num_target_rows_updated: 33, num_target_rows_deleted: 0, num_target_rows_copied: 1167, num_output_rows: 1239, num_target_files_added: 72, num_target_files_removed: 33, execution_time_ms: 2514, scan_time_ms: 0, rewrite_time_ms: 1785 }
Seconds: 2.5690942
Test: upsert_fileMatchedFraction_1.0_rowMatchedFraction_0.001_rowNotMatchedFraction_0.001 Sample: 0
Total File count: 2114
File sample count: 2114
MergeMetrics { num_source_rows: 152, num_target_rows_inserted: 84, num_target_rows_updated: 68, num_target_rows_deleted: 0, num_target_rows_copied: 5570, num_output_rows: 5722, num_target_files_added: 142, num_target_files_removed: 67, execution_time_ms: 2576, scan_time_ms: 0, rewrite_time_ms: 1873 }
Seconds: 2.630923

@thomasfrederikhoeck
Copy link
Contributor

@Blajda does this perform both partition pruning and additional pruning based on source statistics as you mentioed in #1846 ? That would be great :-D

@Blajda
Copy link
Collaborator Author

Blajda commented Dec 17, 2023

@thomasfrederikhoeck Predicates can be pushed down to the delta scan now if a full scan is not required. The optimizer will attempt to prune files based on the predicate. E.G If you have a predicate like s.id = t.id and t.partition in ("2023-12-17") then it push the predicate down to the scan.

#1958 takes it to the next level by determining distinct partition values that occur in the source and then prunes from the scan.

@Blajda Blajda marked this pull request as ready for review December 17, 2023 21:41
@thomasfrederikhoeck
Copy link
Contributor

@Blajda ah, ok. Thank you for explaing! I'm looking forward to trying them out!

Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!! Left two small comments

@@ -1,5 +1,7 @@
//! Logical Operations for DataFusion

use std::collections::HashSet;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use hashbrown here instead?

I read somewhere it's the default in a newer version in the std library but not sure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah as you called out they are changing the underlying implementation to be hashbrown

@Blajda Blajda merged commit 74f9d33 into delta-io:main Dec 30, 2023
21 checks passed
@Blajda Blajda deleted the merge-barrier branch December 30, 2023 01:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate crate/core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants