feat(connect): distinct + sort #3677

Merged
merged 30 commits into from
Jan 15, 2025

@universalmind303 universalmind303 commented Jan 13, 2025

codspeed-hq bot commented Jan 13, 2025

CodSpeed Performance Report

Merging #3677 will degrade performances by 32.02%

Comparing universalmind303:connect_distinct (a189caf) with main (809e411)

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 25 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | universalmind303:connect_distinct | Change |
|---|---|---|---|
| test_iter_rows_first_row[100 Small Files] | 256.6 ms | 218.6 ms | +17.41% |
| test_show[100 Small Files] | 16.3 ms | 24 ms | -32.02% |

codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 53.22581% with 29 lines in your changes missing coverage. Please review.

Project coverage is 75.92%. Comparing base (809e411) to head (a189caf).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-connect/src/translation/logical_plan.rs | 51.72% | 28 Missing ⚠️ |
| ...onnect/src/translation/expr/unresolved_function.rs | 75.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3677      +/-   ##
==========================================
- Coverage   77.85%   75.92%   -1.94%     
==========================================
  Files         729      728       -1     
  Lines       89983    91440    +1457     
==========================================
- Hits        70057    69425     -632     
- Misses      19926    22015    +2089     

| Files with missing lines | Coverage Δ |
|---|---|
| ...onnect/src/translation/expr/unresolved_function.rs | 83.15% <75.00%> (+8.96%) ⬆️ |
| src/daft-connect/src/translation/logical_plan.rs | 37.02% <51.72%> (+4.81%) ⬆️ |

... and 20 files with indirect coverage changes

Comment on lines +101 to +102
let arg = if arg.as_literal().and_then(|lit| lit.as_i32()) == Some(1i32) {
    col("*")

In Spark, df.count() is processed as count(1), which is equivalent to our count(*).
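
For readers outside the codebase, the rewrite described in this comment can be sketched in plain Rust. This is only an illustration: the `Expr` enum, `as_i32_literal`, and `normalize_count_arg` below are hypothetical stand-ins, not Daft's actual types or API (only the `col("*")` spelling is taken from the diff).

```rust
/// Hypothetical, heavily simplified expression type.
/// Daft's real `Expr` is much richer; this exists only for the sketch.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    LiteralI32(i32),
}

impl Expr {
    /// Returns the value if this expression is an i32 literal.
    fn as_i32_literal(&self) -> Option<i32> {
        match self {
            Expr::LiteralI32(v) => Some(*v),
            _ => None,
        }
    }
}

/// Builds a column reference; `col("*")` stands for "all columns".
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

/// Rewrites the argument of a `count` call: the literal `1` that Spark
/// sends for `df.count()` becomes `col("*")`; anything else is unchanged.
fn normalize_count_arg(arg: Expr) -> Expr {
    if arg.as_i32_literal() == Some(1) {
        col("*")
    } else {
        arg
    }
}
```

Under these assumptions, `normalize_count_arg(Expr::LiteralI32(1))` yields the `*` column, while a named column or any other literal passes through untouched.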

    arg
};

let count = arg.count(CountMode::All).cast(&DataType::Int64);

Spark doesn't support u64, so we need to cast the count to i64.
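
Spark SQL's widest integer type is the signed 64-bit LongType, which is why an unsigned count has to be narrowed. The snippet below is a minimal sketch of that conversion in plain Rust, not the code in this PR: `count_to_spark_long` is a hypothetical helper, and how Daft's `.cast(&DataType::Int64)` treats out-of-range values is not stated in this thread.

```rust
/// Converts an unsigned row count into the signed 64-bit range.
/// Returns `None` when the count is larger than `i64::MAX`.
/// Hypothetical helper for illustration only.
fn count_to_spark_long(count: u64) -> Option<i64> {
    // `try_from` is a checked conversion: it fails instead of wrapping.
    i64::try_from(count).ok()
}
```

Any realistic row count is far below `i64::MAX`, so the `None` branch is a guard rather than an expected path.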

@kevinzwang kevinzwang left a comment

LGTM. Could we add a test for count(*) as well?

@universalmind303

> LGTM. Could we add a test for count(*) as well?

AFAIK, you can't do df.count("*") in Spark.

@universalmind303 universalmind303 merged commit 34d2036 into Eventual-Inc:main Jan 15, 2025
38 of 41 checks passed