Manage data_spark_columns to avoid creating very many Spark DataFrames. #1554
Conversation
also cc @itholic
Codecov Report

@@            Coverage Diff             @@
##           master    #1554      +/-   ##
==========================================
+ Coverage   94.15%   94.51%   +0.36%
==========================================
  Files          38       38
  Lines        8600     8717     +117
==========================================
+ Hits         8097     8239     +142
+ Misses        503      478      -25

Continue to review full report at Codecov.
Maybe
Seems good enough to me except for several questions.
@itholic ah, right. It's a typo. Updated the description.
Oh, sorry, I just noticed that I missed reviewing some files. Let me check them tonight.
@@ -408,10 +414,6 @@ def __init__(self, data=None, index=None, columns=None, dtype=None, copy=False):
         pdf = pd.DataFrame(data=data, index=index, columns=columns, dtype=dtype, copy=copy)
         super(DataFrame, self).__init__(InternalFrame.from_pandas(pdf))

-    @property
-    def _sdf(self) -> spark.DataFrame:
+1
databricks/koalas/internal.py (Outdated)
@@ -793,12 +793,26 @@ def to_pandas_frame(self) -> pd.DataFrame:
         ]
         return pdf

+    @lazy_property
+    def applied(self):
Should we maybe call it something like:
- resolved_copy
- applied_copy
- new_sdf_copy
- ...
?
The name `applied` doesn't make it clear that it's going to have a new Spark DataFrame internally that changes the anchor.
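For context, a minimal sketch (not the PR's actual code) of how a `lazy_property`-cached accessor like `applied`/`resolved_copy` could work; `InternalFrameSketch` and its fields are hypothetical names for illustration:

```python
import functools


def lazy_property(fn):
    """Compute the wrapped method once per instance, then cache the result."""
    attr_name = "_lazy_" + fn.__name__

    @property
    @functools.wraps(fn)
    def wrapper(self):
        if not hasattr(self, attr_name):
            setattr(self, attr_name, fn(self))
        return getattr(self, attr_name)

    return wrapper


class InternalFrameSketch:
    """Hypothetical stand-in for InternalFrame: a Spark DataFrame plus
    pending column expressions that have not been materialized yet."""

    def __init__(self, sdf, data_spark_columns):
        self._sdf = sdf
        self._data_spark_columns = data_spark_columns

    @lazy_property
    def applied(self):
        # Materialize the pending expressions into a NEW Spark DataFrame,
        # so the returned copy is anchored to a fresh query plan.
        new_sdf = self._sdf.select(self._data_spark_columns)
        return InternalFrameSketch(new_sdf, [new_sdf[c] for c in new_sdf.columns])
```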
@@ -698,7 +698,7 @@ def spark_columns(self) -> List[spark.Column]:
         index_spark_columns = self.index_spark_columns
Shall we add some docstrings to describe clearly when to use which?
For example, `spark_frame` should now always be used via `df._internal.applied.spark_frame` for Spark DataFrame APIs that internally create a new query execution plan with a different output length. For expressions and/or functions, `df._internal.spark_frame` should be used together with Spark column instances, in order to avoid operations on different DataFrames.
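To illustrate the suggested guidance, a hedged sketch of the two patterns; the toy data and the label-tuple argument to `spark_column_for` are assumptions, not from this PR:

```python
import databricks.koalas as ks
from pyspark.sql import functions as F

kdf = ks.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"]})  # toy data, assumed

# Expressions/functions: stay on the base frame, combining it only with
# Spark columns taken from that same frame (same anchor).
sdf = kdf._internal.spark_frame
scol = kdf._internal.spark_column_for(("a",))  # label-tuple argument is an assumption
sdf.select((scol + F.lit(1)).alias("a_plus_1")).show()

# DataFrame APIs that build a new query execution plan (pivots, UDFs, ...):
# go through the resolved copy, which has all pending changes applied.
applied_sdf = kdf._internal.applied.spark_frame
applied_sdf.groupBy("b").count().show()
```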
Maybe we could also document that `spark_frame` is just the base Spark DataFrame where the expressions and functions are not applied. We might have to consider mentioning:
- Spark expressions/functions, to create new Spark Columns against the same DataFrame.
- Spark DataFrame APIs that internally create query execution plans, to create a new DataFrame.
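A sketch of what such a docstring could look like (the wording and the `InternalFrameSketch` container are mine, not the docstring that was eventually merged):

```python
class InternalFrameSketch:  # hypothetical container, for illustration only
    def __init__(self, sdf):
        self._sdf = sdf

    @property
    def spark_frame(self):
        """The base Spark DataFrame; pending expressions/functions are NOT applied.

        Use it with Spark Columns from `spark_column_for` for expressions and
        functions against the same anchor. For Spark DataFrame APIs that
        internally create a new query execution plan, use `applied.spark_frame`
        instead, which has all pending changes applied.
        """
        return self._sdf
```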
Looks good. I can do the documentation (#1554 (comment)) in a separate PR if you're busy with something else.
LGTM
Um.. honestly, I think we can merge this for now since I have no better idea for naming it. Otherwise, LGTM.
Thanks! I'll merge this now.
This PR makes huge changes to the way `InternalFrame` is managed, to avoid creating very many Spark DataFrames. It will make a lot of DataFrame operations possible without enabling the "compute.ops_on_diff_frames" option.

The new way for functions to manage `InternalFrame` is (see the sketch after this list):

- If the underlying Spark DataFrame doesn't need to change, e.g., `InternalFrame.with_new_columns`, use the new columns without creating a new Spark DataFrame. Basically we can just use `with_new_columns` to create a new `InternalFrame`.
- Work with `_internal.spark_frame` together with columns from `_internal.spark_column_for`. Working with a Spark DataFrame from `_internal.spark_frame` and column names from `_internal.spark_column_name_for` will usually NOT work.
- If Spark DataFrame APIs are needed, e.g., `pivot_table` or functions with udfs, use `_internal.applied.spark_frame` instead. The `_internal.applied.spark_frame` has all the changes applied. Note that `_internal.applied.spark_frame` won't work with Spark columns from `_internal.spark_column_for`.
- `DataFrame._sdf` was removed, so callers must explicitly specify which `spark_frame` should be used: `_internal.spark_frame` or `_internal.applied.spark_frame`.
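A minimal sketch of the first and third rules above; the toy data, the label-tuple argument, and the exact `with_new_columns` argument form are assumptions for illustration:

```python
import databricks.koalas as ks

kdf = ks.DataFrame({"a": [1, 2, 3]})  # toy data, assumed
internal = kdf._internal

# Rule 1: a derived column needs no new Spark DataFrame; build a new Spark
# Column against the same anchor and create a new InternalFrame from it.
scol = internal.spark_column_for(("a",))  # label-tuple argument is an assumption
new_internal = internal.with_new_columns(
    [scol, (scol * 2).alias("a_doubled")]  # exact signature/argument form assumed
)

# Rule 3: only when a new query execution plan is unavoidable (e.g. pivots,
# udfs) do we create a new Spark DataFrame, via the resolved copy.
resolved_sdf = internal.applied.spark_frame
```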