feat(scantask-2): Implement new module for splitting Parquet ScanTask #3628

Merged
jaychia merged 3 commits into main from jay/split-all-files-iter-2 on Jan 7, 2025

jaychia (Contributor) commented Dec 20, 2024

Implements a new module, daft-scan/src/scan_task_iters/split_parquet, which contains all the functionality for splitting an iterator of ScanTasks into smaller ScanTasks when they read Parquet files.

The public interface is SplitParquetScanTasks, which under the hood uses private functionality in the form of DecideSplitIterator -> RetrieveParquetMetadataIterator -> flatten to produce a new (split) iterator of ScanTasks.
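
As a rough illustration of the decide -> fetch metadata -> flatten shape described above, here is a minimal sketch. It is not the code from this PR: the `ScanTask` struct, the `Decision` enum, the size threshold, and the `row_group_count` closure are all hypothetical stand-ins, and the real module reads actual Parquet metadata rather than taking a count from a closure.

```rust
/// Hypothetical, simplified stand-in for Daft's ScanTask (not the real type).
#[derive(Debug, Clone, PartialEq)]
struct ScanTask {
    path: String,
    is_parquet: bool,
    size_bytes: u64,
    /// Row-group indices this task reads; `None` means "the whole file".
    row_groups: Option<Vec<usize>>,
}

/// Stage 1 output: whether a task is a candidate for splitting.
enum Decision {
    Split(ScanTask),
    Keep(ScanTask),
}

/// Stage 1: tag each task. The size threshold is an assumed heuristic.
fn decide_split(
    tasks: impl Iterator<Item = ScanTask>,
    min_split_bytes: u64,
) -> impl Iterator<Item = Decision> {
    tasks.map(move |t| {
        if t.is_parquet && t.size_bytes >= min_split_bytes {
            Decision::Split(t)
        } else {
            Decision::Keep(t)
        }
    })
}

/// Stage 2: for tasks marked `Split`, look up metadata (here only a
/// row-group count supplied by a closure) and emit one sub-task per row group.
fn split_by_row_group<F>(
    decisions: impl Iterator<Item = Decision>,
    row_group_count: F,
) -> impl Iterator<Item = Vec<ScanTask>>
where
    F: Fn(&ScanTask) -> usize,
{
    decisions.map(move |d| match d {
        Decision::Keep(t) => vec![t],
        Decision::Split(t) => {
            let n = row_group_count(&t);
            if n <= 1 {
                return vec![t]; // nothing to split
            }
            (0..n)
                .map(|i| ScanTask {
                    row_groups: Some(vec![i]),
                    size_bytes: t.size_bytes / n as u64,
                    ..t.clone()
                })
                .collect()
        }
    })
}

/// Public entry point: chain both stages, then flatten into one iterator.
fn split_parquet_scan_tasks<F>(
    tasks: impl Iterator<Item = ScanTask>,
    min_split_bytes: u64,
    row_group_count: F,
) -> impl Iterator<Item = ScanTask>
where
    F: Fn(&ScanTask) -> usize,
{
    split_by_row_group(decide_split(tasks, min_split_bytes), row_group_count).flatten()
}

fn main() {
    let tasks = vec![
        ScanTask { path: "big.parquet".into(), is_parquet: true, size_bytes: 300, row_groups: None },
        ScanTask { path: "small.csv".into(), is_parquet: false, size_bytes: 10, row_groups: None },
    ];
    // Pretend every Parquet file has 3 row groups.
    for t in split_parquet_scan_tasks(tasks.into_iter(), 100, |_| 3) {
        println!("{:?}", t);
    }
}
```

Because every stage is a lazy iterator adapter, nothing is decided or fetched until the consumer pulls the next ScanTask, which is one plausible reason to structure the module this way.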

@jaychia jaychia requested a review from samster25 December 20, 2024 07:30
@jaychia jaychia force-pushed the jay/split-all-files-iter-2 branch from 2188e09 to 326ff1d Compare December 20, 2024 08:10
Base automatically changed from jay/split-all-files-iter to main January 7, 2025 00:02
@github-actions github-actions bot added the feat label Jan 7, 2025
@jaychia jaychia force-pushed the jay/split-all-files-iter-2 branch from 326ff1d to 18dc97b Compare January 7, 2025 00:05
@jaychia jaychia enabled auto-merge (squash) January 7, 2025 00:07

codspeed-hq bot commented Jan 7, 2025

CodSpeed Performance Report

Merging #3628 will degrade performances by 41.5%

Comparing jay/split-all-files-iter-2 (18dc97b) with main (836297a)

Summary

❌ 2 regressions
✅ 25 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | jay/split-all-files-iter-2 | Change |
|---|---|---|---|
| test_iter_rows_first_row[100 Small Files] | 110.4 ms | 188.7 ms | -41.5% |
| test_show[100 Small Files] | 15.7 ms | 18.9 ms | -16.95% |
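
The Change column is not a plain "time increased by X%" figure. The numbers above appear consistent with a relative speed change, main / branch - 1, which is an inference from the reported values rather than a documented CodSpeed formula. A small sketch under that assumption:

```rust
/// Relative speed change of the branch versus main, in percent.
/// Negative means the branch is slower.
/// Assumed formula (inferred from the table): (main / branch - 1) * 100.
fn speed_change_percent(main_ms: f64, branch_ms: f64) -> f64 {
    (main_ms / branch_ms - 1.0) * 100.0
}

fn main() {
    for (name, main_ms, branch_ms) in [
        ("test_iter_rows_first_row[100 Small Files]", 110.4, 188.7),
        ("test_show[100 Small Files]", 15.7, 18.9),
    ] {
        println!("{name}: {:.2}%", speed_change_percent(main_ms, branch_ms));
    }
}
```

The rounded millisecond values reproduce the reported changes to within a few hundredths of a percentage point; the small gap on test_show is presumably due to the timings being rounded in the table.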

@jaychia jaychia merged commit e2d4c86 into main Jan 7, 2025
40 of 41 checks passed
@jaychia jaychia deleted the jay/split-all-files-iter-2 branch January 7, 2025 00:26

codecov bot commented Jan 7, 2025

Codecov Report

Attention: Patch coverage is 11.32075% with 47 lines in your changes missing coverage. Please review.

Project coverage is 78.04%. Comparing base (ff2619b) to head (18dc97b).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...task_iters/split_parquet/fetch_parquet_metadata.rs | 0.00% | 17 Missing ⚠️ |
| ...daft-scan/src/scan_task_iters/split_parquet/mod.rs | 0.00% | 10 Missing ⚠️ |
| src/daft-scan/src/scan_task_iters/mod.rs | 40.00% | 9 Missing ⚠️ |
| ...task_iters/split_parquet/split_parquet_decision.rs | 0.00% | 9 Missing ⚠️ |
| ...can_task_iters/split_parquet/split_parquet_file.rs | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3628      +/-   ##
==========================================
+ Coverage   77.28%   78.04%   +0.76%     
==========================================
  Files         721      725       +4     
  Lines       90549    89217    -1332     
==========================================
- Hits        69978    69633     -345     
+ Misses      20571    19584     -987     

| Files with missing lines | Coverage Δ |
|---|---|
| ...can_task_iters/split_parquet/split_parquet_file.rs | 0.00% <0.00%> (ø) |
| src/daft-scan/src/scan_task_iters/mod.rs | 91.28% <40.00%> (-2.68%) ⬇️ |
| ...task_iters/split_parquet/split_parquet_decision.rs | 0.00% <0.00%> (ø) |
| ...daft-scan/src/scan_task_iters/split_parquet/mod.rs | 0.00% <0.00%> (ø) |
| ...task_iters/split_parquet/fetch_parquet_metadata.rs | 0.00% <0.00%> (ø) |

... and 19 files with indirect coverage changes
