perf: update df.corr, df.cov to be used with more than 30 columns case. #1161
Merged
Changes from 12 of 17 commits

Commits (all by Genesis929):
fc80e17  perf: update df.corr, df.cov to be used with more than 30 columns case.
52aaa38  add large test
359aea9  remove print
9c2b997  fix_index
dcb1eb2  fix index
b657bbb  test fix
9515803  fix test
493e80e  fix test
f62a626  Merge branch 'main' into corr_update_huanc
1abd51e  slightly improve multi_apply_unary_op to avoid RecursionError
0ffa778  update recursion limit for nox session
52f0dfe  skip the test in e2e/python 3.12
19764ad  simplify code
bd46366  simplify code
d9177ad  Merge branch 'main' into corr_update_huanc
4842060  Merge branch 'main' into corr_update_huanc
000ab9d  Merge branch 'main' into corr_update_huanc
@@ -1197,15 +1197,203 @@ def corr(self, method="pearson", min_periods=None, numeric_only=False) -> DataFr
         else:
             frame = self._drop_non_numeric()

-        return DataFrame(frame._block.calculate_pairwise_metric(op=agg_ops.CorrOp()))
+        orig_columns = frame.columns
+        # Replace column names with 0 to n - 1 to keep order
+        # and avoid the influence of duplicated column name
+        frame.columns = pandas.Index(range(len(orig_columns)))
+        frame = frame.astype(bigframes.dtypes.FLOAT_DTYPE)
+        block = frame._block
+
+        # A new column that uniquely identifies each row
+        block, ordering_col = frame._block.promote_offsets(label="_bigframes_idx")
+
+        val_col_ids = [
+            col_id for col_id in block.value_columns if col_id != ordering_col
+        ]
+
+        block = block.melt(
+            [ordering_col], val_col_ids, ["_bigframes_variable"], "_bigframes_value"
+        )
+
+        block = block.merge(
+            block,
+            left_join_ids=[ordering_col],
+            right_join_ids=[ordering_col],
+            how="inner",
+            sort=False,
+        )
+
+        frame = DataFrame(block)
+        frame = frame.dropna(subset=["_bigframes_value_x", "_bigframes_value_y"])
+
+        paired_mean_frame = (
+            frame.groupby(["_bigframes_variable_x", "_bigframes_variable_y"])
+            .agg(
+                _bigframes_paired_mean_x=bigframes.pandas.NamedAgg(
+                    column="_bigframes_value_x", aggfunc="mean"
+                ),
+                _bigframes_paired_mean_y=bigframes.pandas.NamedAgg(
+                    column="_bigframes_value_y", aggfunc="mean"
+                ),
+            )
+            .reset_index()
+        )
+
+        frame = frame.merge(
+            paired_mean_frame, on=["_bigframes_variable_x", "_bigframes_variable_y"]
+        )
+        frame["_bigframes_value_x"] -= frame["_bigframes_paired_mean_x"]
+        frame["_bigframes_value_y"] -= frame["_bigframes_paired_mean_y"]
+
+        frame["_bigframes_dividend"] = (
+            frame["_bigframes_value_x"] * frame["_bigframes_value_y"]
+        )
+        frame["_bigframes_x_square"] = (
+            frame["_bigframes_value_x"] * frame["_bigframes_value_x"]
+        )
+        frame["_bigframes_y_square"] = (
+            frame["_bigframes_value_y"] * frame["_bigframes_value_y"]
+        )
+
+        result = (
+            frame.groupby(["_bigframes_variable_x", "_bigframes_variable_y"])
+            .agg(
+                _bigframes_dividend_sum=bigframes.pandas.NamedAgg(
+                    column="_bigframes_dividend", aggfunc="sum"
+                ),
+                _bigframes_x_square_sum=bigframes.pandas.NamedAgg(
+                    column="_bigframes_x_square", aggfunc="sum"
+                ),
+                _bigframes_y_square_sum=bigframes.pandas.NamedAgg(
+                    column="_bigframes_y_square", aggfunc="sum"
+                ),
+            )
+            .reset_index()
+        )
+        result["_bigframes_corr"] = result["_bigframes_dividend_sum"] / (
+            (
+                result["_bigframes_x_square_sum"] * result["_bigframes_y_square_sum"]
+            )._apply_unary_op(ops.sqrt_op)
+        )
+        result = result._pivot(
+            index="_bigframes_variable_x",
+            columns="_bigframes_variable_y",
+            values="_bigframes_corr",
+        )
+
+        map_data = {
+            f"_bigframes_level_{i}": orig_columns.get_level_values(i)
+            for i in range(orig_columns.nlevels)
+        }
+        map_data["_bigframes_keys"] = range(len(orig_columns))
+        map_df = bigframes.dataframe.DataFrame(
+            map_data,
+            session=self._get_block().expr.session,
+        ).set_index("_bigframes_keys")
+        result = result.join(map_df)
Review comment: nit: collapse with the line below?
+        result = result.sort_index()
+        index_columns = [f"_bigframes_level_{i}" for i in range(orig_columns.nlevels)]
+        result = result.set_index(index_columns)
+        result.index.names = orig_columns.names
+        result.columns = orig_columns
+
+        return result

     def cov(self, *, numeric_only: bool = False) -> DataFrame:
         if not numeric_only:
             frame = self._raise_on_non_numeric("corr")
         else:
             frame = self._drop_non_numeric()

-        return DataFrame(frame._block.calculate_pairwise_metric(agg_ops.CovOp()))
+        orig_columns = frame.columns
+        # Replace column names with 0 to n - 1 to keep order
+        # and avoid the influence of duplicated column name
+        frame.columns = pandas.Index(range(len(orig_columns)))
+        frame = frame.astype(bigframes.dtypes.FLOAT_DTYPE)
+        block = frame._block
+
+        # A new column that uniquely identifies each row
+        block, ordering_col = frame._block.promote_offsets(label="_bigframes_idx")
+
+        val_col_ids = [
+            col_id for col_id in block.value_columns if col_id != ordering_col
+        ]
+
+        block = block.melt(
+            [ordering_col], val_col_ids, ["_bigframes_variable"], "_bigframes_value"
+        )
+        block = block.merge(
+            block,
+            left_join_ids=[ordering_col],
+            right_join_ids=[ordering_col],
+            how="inner",
+            sort=False,
+        )
+
+        frame = DataFrame(block)
+        frame = frame.dropna(subset=["_bigframes_value_x", "_bigframes_value_y"])
+
+        paired_mean_frame = (
+            frame.groupby(["_bigframes_variable_x", "_bigframes_variable_y"])
+            .agg(
+                _bigframes_paired_mean_x=bigframes.pandas.NamedAgg(
+                    column="_bigframes_value_x", aggfunc="mean"
+                ),
+                _bigframes_paired_mean_y=bigframes.pandas.NamedAgg(
+                    column="_bigframes_value_y", aggfunc="mean"
+                ),
+            )
+            .reset_index()
+        )
+
+        frame = frame.merge(
+            paired_mean_frame, on=["_bigframes_variable_x", "_bigframes_variable_y"]
+        )
+        frame["_bigframes_value_x"] -= frame["_bigframes_paired_mean_x"]
+        frame["_bigframes_value_y"] -= frame["_bigframes_paired_mean_y"]
+
+        frame["_bigframes_dividend"] = (
+            frame["_bigframes_value_x"] * frame["_bigframes_value_y"]
+        )
+
+        result = (
+            frame.groupby(["_bigframes_variable_x", "_bigframes_variable_y"])
+            .agg(
+                _bigframes_dividend_sum=bigframes.pandas.NamedAgg(
+                    column="_bigframes_dividend", aggfunc="sum"
+                ),
+                _bigframes_dividend_count=bigframes.pandas.NamedAgg(
+                    column="_bigframes_dividend", aggfunc="count"
+                ),
+            )
+            .reset_index()
+        )
+        result["_bigframes_cov"] = result["_bigframes_dividend_sum"] / (
+            result["_bigframes_dividend_count"] - 1
+        )
+        result = result._pivot(
+            index="_bigframes_variable_x",
+            columns="_bigframes_variable_y",
+            values="_bigframes_cov",
+        )
+
+        map_data = {
+            f"_bigframes_level_{i}": orig_columns.get_level_values(i)
+            for i in range(orig_columns.nlevels)
+        }
+        map_data["_bigframes_keys"] = range(len(orig_columns))
+        map_df = bigframes.dataframe.DataFrame(
+            map_data,
+            session=self._get_block().expr.session,
+        ).set_index("_bigframes_keys")
+        result = result.join(map_df)
+        result = result.sort_index()
+        index_columns = [f"_bigframes_level_{i}" for i in range(orig_columns.nlevels)]
+        result = result.set_index(index_columns)
+        result.index.names = orig_columns.names
+        result.columns = orig_columns
+
+        return result

     def to_arrow(
         self,
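For readers following the diff above, here is a minimal sketch of the same melt, self-join, and aggregate idea in plain pandas rather than BigQuery DataFrames. The helper names (`_pairs`, `_pairwise_sums`, `_to_matrix`, `corr_via_self_join`, `cov_via_self_join`) are hypothetical and not part of the PR. The sketch also swaps the PR's merge against a paired-mean frame for `groupby(...).transform("mean")`, so treat it as an illustration of the technique, not the library's implementation.

```python
import numpy as np
import pandas as pd


def _pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Long-format self-join: one row per (row, column x, column y)."""
    wide = df.astype("float64").reset_index(drop=True)
    # Positional labels 0..n-1 keep column order and tolerate duplicate names.
    wide.columns = pd.RangeIndex(wide.shape[1])
    long = (
        wide.melt(var_name="variable", value_name="value", ignore_index=False)
        .rename_axis("row_id")
        .reset_index()
    )
    pairs = long.merge(long, on="row_id", how="inner", suffixes=("_x", "_y"))
    # Keep only rows where both sides of the pair are present.
    return pairs.dropna(subset=["value_x", "value_y"])


def _pairwise_sums(df: pd.DataFrame) -> pd.DataFrame:
    """Per column pair: sums of centered products, squares, and the pair count."""
    pairs = _pairs(df)
    keys = ["variable_x", "variable_y"]
    grouped = pairs.groupby(keys)
    # Center on the mean taken over the rows shared by the pair.
    dx = pairs["value_x"] - grouped["value_x"].transform("mean")
    dy = pairs["value_y"] - grouped["value_y"].transform("mean")
    terms = pairs[keys].assign(xy=dx * dy, xx=dx * dx, yy=dy * dy)
    return terms.groupby(keys).agg(
        xy=("xy", "sum"), xx=("xx", "sum"), yy=("yy", "sum"), n=("xy", "count")
    )


def _to_matrix(values: pd.Series, columns: pd.Index) -> pd.DataFrame:
    """Pivot the per-pair values to a square frame and restore the labels."""
    positions = range(len(columns))
    out = values.unstack("variable_y").reindex(index=positions, columns=positions)
    out.index = columns
    out.columns = columns
    return out


def corr_via_self_join(df: pd.DataFrame) -> pd.DataFrame:
    s = _pairwise_sums(df)
    return _to_matrix(s["xy"] / np.sqrt(s["xx"] * s["yy"]), df.columns)


def cov_via_self_join(df: pd.DataFrame) -> pd.DataFrame:
    s = _pairwise_sums(df)
    # Sample covariance: divide by (number of complete pairs - 1).
    return _to_matrix(s["xy"] / (s["n"] - 1), df.columns)
```

Note that the intermediate frame has one row per (row, column, column) triple, so it grows with the square of the column count; the sketch is meant for reading alongside the diff, not for large inputs.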
@@ -0,0 +1,42 @@
+import sys
+
+import pandas as pd
+import pytest
+
+
+@pytest.mark.skipif(
+    sys.version_info >= (3, 12),
+    # See: https://github.com/python/cpython/issues/112282
+    reason="setrecursionlimit has no effect on the Python C stack since Python 3.12.",
+)
+def test_corr_w_numeric_only(scalars_df_numeric_150_columns_maybe_ordered):
+    scalars_df, scalars_pandas_df = scalars_df_numeric_150_columns_maybe_ordered
+    bf_result = scalars_df.corr(numeric_only=True).to_pandas()
+    pd_result = scalars_pandas_df.corr(numeric_only=True)
+
+    pd.testing.assert_frame_equal(
+        bf_result,
+        pd_result,
+        check_dtype=False,
+        check_index_type=False,
+        check_column_type=False,
+    )
+
+
+@pytest.mark.skipif(
+    sys.version_info >= (3, 12),
+    # See: https://github.com/python/cpython/issues/112282
+    reason="setrecursionlimit has no effect on the Python C stack since Python 3.12.",
+)
+def test_cov_w_numeric_only(scalars_df_numeric_150_columns_maybe_ordered):
+    scalars_df, scalars_pandas_df = scalars_df_numeric_150_columns_maybe_ordered
+    bf_result = scalars_df.cov(numeric_only=True).to_pandas()
+    pd_result = scalars_pandas_df.cov(numeric_only=True)
+
+    pd.testing.assert_frame_equal(
+        bf_result,
+        pd_result,
+        check_dtype=False,
+        check_index_type=False,
+        check_column_type=False,
+    )
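The commit messages and the skip reason above refer to a RecursionError and a raised recursion limit. As a toy illustration only, the sketch below shows how a recursive walk over a deeply nested structure hits the interpreter's limit and how temporarily raising it with `sys.setrecursionlimit` lets the walk finish. The nested tuple, `build_chain`, `depth`, and `recursion_limit` are hypothetical stand-ins and say nothing about how bigframes builds or traverses its expressions. The sketch exercises pure-Python recursion only, not the C-level recursion the skip reason is about.

```python
import sys
from contextlib import contextmanager


def build_chain(n):
    """Build a nested tuple n levels deep, iteratively (no recursion needed)."""
    node = None
    for _ in range(n):
        node = (node,)
    return node


def depth(node):
    """Recursively count nesting levels; uses one Python frame per level."""
    return 0 if node is None else 1 + depth(node[0])


@contextmanager
def recursion_limit(limit):
    """Temporarily set the interpreter recursion limit, then restore it."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(old)


if __name__ == "__main__":
    chain = build_chain(3000)
    with recursion_limit(1000):
        try:
            depth(chain)
        except RecursionError:
            print("RecursionError at limit 1000")
    with recursion_limit(10000):
        print(depth(chain))
```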