Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Add sample function for Dataframe #1770

Merged
merged 6 commits into from
Jan 11, 2024
Merged

[FEAT] Add sample function for Dataframe #1770

merged 6 commits into from
Jan 11, 2024

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Jan 9, 2024

Closes #1759

Added new sample function based on https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sample.html

Changes:

  • Modified existing sampling logic in table and micropartition to instead accept fraction, with_replacement, and seed.
  • Added logical + physical ops for sample
  • Added end to end tests

@github-actions github-actions bot added the enhancement New feature or request label Jan 9, 2024
Copy link

codecov bot commented Jan 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8f33256) 84.69% compared to head (c48958e) 84.84%.
Report is 1 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1770      +/-   ##
==========================================
+ Coverage   84.69%   84.84%   +0.15%     
==========================================
  Files          55       55              
  Lines        5554     5584      +30     
==========================================
+ Hits         4704     4738      +34     
+ Misses        850      846       -4     
Files Coverage Δ
daft/dataframe/dataframe.py 87.42% <100.00%> (+0.14%) ⬆️
daft/execution/execution_step.py 93.19% <100.00%> (+0.08%) ⬆️
daft/execution/physical_plan.py 93.43% <100.00%> (+0.03%) ⬆️
daft/execution/rust_physical_plan_shim.py 93.25% <100.00%> (+1.30%) ⬆️
daft/logical/builder.py 89.56% <100.00%> (+0.27%) ⬆️
daft/table/micropartition.py 90.04% <100.00%> (+0.30%) ⬆️
daft/table/table.py 84.53% <100.00%> (+0.31%) ⬆️

... and 1 file with indirect coverage changes

@@ -777,7 +777,7 @@ def sort(
partial_metadatas=None,
)
.add_instruction(
instruction=execution_step.Sample(sort_by=sort_by),
instruction=execution_step.Sample(fraction=0.05, sort_by=sort_by),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose 0.05 as the fraction for sampling during sort (used to be a default value of 20), lmk if this is ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sampling code for sort, it would be better to be able to set a number like 20 rather than take a fraction. For example, if our partition is only 10 elements, we rather take all the elements than just sample 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabled sample by fraction and sample by size for table and micropartition, so sort will use sample by size.

pub struct Sample {
// Upstream node.
pub input: Arc<LogicalPlan>,
pub fraction: String,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because of the trait bounds #[derive(Clone, Debug, PartialEq, Eq, Hash)] i couldn't store fraction as a f64, so i stored it as a string instead, then later on in physical plan I parse it as a f64. Is there a better way to do this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to impl Eq for Sample and set the behavior for comparing the f64.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, did the impl for Eq and Hash

Copy link
Member

@samster25 samster25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean! Nice work! Just a few comments :)

@@ -777,7 +777,7 @@ def sort(
partial_metadatas=None,
)
.add_instruction(
instruction=execution_step.Sample(sort_by=sort_by),
instruction=execution_step.Sample(size=20, sort_by=sort_by),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have some centralized configs, might be a good place to put this constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the constant: sample_size_for_sort in DaftExecutionConfig here https://github.com/Eventual-Inc/Daft/pull/1770/files#diff-9feb783dcfde34fbbab50546b081d87cfb98ec050b89cfbe6ff9fe64e1404029R37 . I didn't add this as a parameter in `set_execution_config' though, I assume that this isn't something that the user would need to change. but lmk if I'm wrong!

Returns:
DataFrame: DataFrame with a fraction of rows.
"""
builder = self._builder.sample(fraction, with_replacement, seed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should ensure that 0.0 <= fraction <= 1.0 somewhere on the dataframe side as well. Right now I only see that check on the execution side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this check in the public method for the Dataframe API as well

@@ -186,7 +188,8 @@ impl PhysicalPlan {
// TODO(Clark): Estimate row/column pruning to get a better size approximation.
Self::Filter(Filter { input, .. })
| Self::Limit(Limit { input, .. })
| Self::Project(Project { input, .. }) => input.approximate_size_bytes(),
| Self::Project(Project { input, .. })
| Self::Sample(Sample { input, .. }) => input.approximate_size_bytes(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sample you can probably do input.approximate_size_bytes() * fraction


# no arguments
with pytest.raises(ValueError, match="Must specify either `fraction` or `size`"):
daft_table.sample()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check for when fraction < 0.0 and when > 1.0

def test_sample_over_sample(make_df, valid_data: list[dict[str, float]]) -> None:
df = make_df(valid_data)

df = df.sample(fraction=2.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should throw an error if the fraction is over 1.0

@colin-ho colin-ho requested a review from samster25 January 10, 2024 22:50
Copy link
Member

@samster25 samster25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Nice work :)

@colin-ho colin-ho merged commit c4bd1b3 into main Jan 11, 2024
42 checks passed
@colin-ho colin-ho deleted the colin/sample branch January 11, 2024 00:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for sample
2 participants