Manage data_spark_columns to avoid creating very many Spark DataFrames. #1554

Merged · 3 commits · Jun 1, 2020

Conversation

ueshin (Collaborator) commented May 29, 2020

This PR makes major changes to the way InternalFrame is managed, to avoid creating very many Spark DataFrames.
It makes many DataFrame operations possible without enabling the "compute.ops_on_diff_frames" option, for example:

df + df + df
df['foo'] = df['bar']['baz']
df[['x', 'y']] = df[['x', 'y']].fillna(0)
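
For context, operations like these previously required opting in via the Koalas configuration API. A minimal sketch of the opt-in (ks.set_option is the existing config API; the option name comes from this PR's description):

import databricks.koalas as ks

# Before this change, operations that Koalas treated as being on
# different frames required explicitly enabling the option:
ks.set_option("compute.ops_on_diff_frames", True)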

The new way to manage InternalFrame in functions is (a rough sketch follows this list):

  • If the function only needs to handle Spark columns, e.g., functions using InternalFrame.with_new_columns, use the new columns without creating a new Spark DataFrame. Basically, we can just use with_new_columns to create a new InternalFrame.
  • If the function needs to create a new Spark DataFrame, e.g., functions that need filter, order, or groupby, use _internal.spark_frame with columns from _internal.spark_column_for. Working with a Spark DataFrame from _internal.spark_frame and column names from _internal.spark_column_name_for will usually NOT work.
  • If the function can only access columns by name, e.g., pivot_table or functions with UDFs, use _internal.applied.spark_frame instead. The _internal.applied.spark_frame has all the pending changes applied. Note that _internal.applied.spark_frame won't work with Spark columns from _internal.spark_column_for.
  • DataFrame._sdf was removed so that callers explicitly specify which spark_frame should be used: _internal.spark_frame or _internal.applied.spark_frame.
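
A rough sketch of the three access patterns above (illustrative only: kdf is an arbitrary Koalas DataFrame, and the tuple label passed to spark_column_for is an assumption about the internal label format):

import databricks.koalas as ks

kdf = ks.DataFrame({"x": [1.0, None], "y": [3.0, 4.0]})
internal = kdf._internal

# 1. Column-only change: keep the same anchor, no new Spark DataFrame.
new_internal = internal.with_new_columns(
    [scol * 2 for scol in internal.data_spark_columns]
)

# 2. A new Spark DataFrame is needed (filter/order/groupby): pair
#    _internal.spark_frame with Spark Column objects, not column names.
sdf = internal.spark_frame.filter(internal.spark_column_for(("x",)) > 0)

# 3. Name-only access (UDFs, pivot_table): use the resolved frame,
#    where all pending column expressions have been applied.
applied_sdf = internal.applied.spark_frame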

ueshin requested a review from HyukjinKwon May 29, 2020 18:26

ueshin (Collaborator, Author) commented May 29, 2020

also cc @itholic

codecov-commenter commented May 29, 2020

Codecov Report

Merging #1554 into master will increase coverage by 0.36%.
The diff coverage is 99.44%.

@@            Coverage Diff             @@
##           master    #1554      +/-   ##
==========================================
+ Coverage   94.15%   94.51%   +0.36%     
==========================================
  Files          38       38              
  Lines        8600     8717     +117     
==========================================
+ Hits         8097     8239     +142     
+ Misses        503      478      -25     
Impacted Files                        Coverage             Δ
databricks/koalas/config.py            99.00% <ø>         (+0.01%) ⬆️
databricks/koalas/generic.py           96.66% <ø>         (+0.06%) ⬆️
databricks/koalas/testing/utils.py     80.79% <ø>         (+1.80%) ⬆️
databricks/koalas/frame.py             96.79% <98.33%>    (+1.03%) ⬆️
databricks/conftest.py                100.00% <100.00%>   (+3.77%) ⬆️
databricks/koalas/base.py              97.36% <100.00%>   (+0.09%) ⬆️
databricks/koalas/groupby.py           90.52% <100.00%>   (+0.20%) ⬆️
databricks/koalas/indexes.py           96.82% <100.00%>   (+0.02%) ⬆️
databricks/koalas/indexing.py          92.25% <100.00%>   (+0.09%) ⬆️
databricks/koalas/internal.py          97.55% <100.00%>   (+0.83%) ⬆️
... and 26 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 35f3084...c2b0726.

itholic (Contributor) commented May 31, 2020

... Working with a Spark DataFrame from _internal.spark_frame and column names from InternalFrame.spark_column_name_for will usually NOT work.

Maybe "_internal.applied.spark_frame and column names from _internal.spark_column_name_for will usually NOT work"?
(Plus a nit: what is the difference between _internal and InternalFrame here? Might be a typo?)

itholic (Contributor) commented May 31, 2020

Seems good to me, except for several questions.

ueshin (Collaborator, Author) commented May 31, 2020

@itholic Ah, right, it's a typo. I've updated the description.

itholic (Contributor) commented May 31, 2020

Oh, sorry, I just noticed that I missed reviewing some files. Let me check them tonight.

@@ -408,10 +414,6 @@ def __init__(self, data=None, index=None, columns=None, dtype=None, copy=False):
pdf = pd.DataFrame(data=data, index=index, columns=columns, dtype=dtype, copy=copy)
super(DataFrame, self).__init__(InternalFrame.from_pandas(pdf))

@property
def _sdf(self) -> spark.DataFrame:

Member commented: +1

@@ -793,12 +793,26 @@ def to_pandas_frame(self) -> pd.DataFrame:
]
return pdf

@lazy_property
def applied(self):

Member commented:

Should we maybe call it something like:

  • resolved_copy
  • applied_copy
  • new_sdf_copy
  • ...

?

The name applied doesn't make it clear that it will internally hold a new Spark DataFrame that changes the anchor.
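
For context, lazy_property caches its result, so the resolved Spark DataFrame is only built on first access. A generic, self-contained sketch of that caching behavior (hypothetical class, not the koalas implementation; functools.cached_property stands in for koalas' lazy_property):

import functools

class Example:
    def __init__(self, values):
        self._values = values

    @functools.cached_property  # plays the role of koalas' lazy_property
    def applied(self):
        # Computed once, on first access; later accesses reuse the result.
        return [v + 1 for v in self._values]

e = Example([1, 2, 3])
print(e.applied)  # computes and caches
print(e.applied)  # served from cache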

@@ -698,7 +698,7 @@ def spark_columns(self) -> List[spark.Column]:
index_spark_columns = self.index_spark_columns

HyukjinKwon (Member) commented Jun 1, 2020

Shall we add some docstrings to describe clearly when to use which?

For example, spark_frame should now always be used via df._internal.applied.spark_frame for Spark DataFrame APIs that internally create a new query execution plan with a different output length.

For expressions and/or functions, df._internal.spark_frame should be used together with Spark Column instances, in order to avoid operations on different DataFrames.

Member commented:

Maybe we could also document that spark_frame is just the base Spark DataFrame, where the expressions and functions are not yet applied.

We might have to consider mentioning (a hypothetical docstring sketch follows this list):

  • Spark expressions/functions, which create new Spark Columns against the same DataFrame.
  • Spark DataFrame APIs that internally create query execution plans, which create a new DataFrame.
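
For illustration only, such a docstring might read as follows (hypothetical wording based on this thread, not the merged text; the stub class exists only to host the docstring):

from pyspark.sql import DataFrame as SparkDataFrame

class InternalFrame:  # stub for illustration
    @property
    def spark_frame(self) -> SparkDataFrame:
        """
        Return the base Spark DataFrame, where column expressions and
        functions have NOT yet been applied.

        Use this together with Spark Column instances for expressions
        and functions, to avoid operations on different DataFrames.
        For Spark DataFrame APIs that internally create a new query
        execution plan, possibly with a different output length, use
        `applied.spark_frame` instead.
        """
        ...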

HyukjinKwon (Member) commented:

Looks good. I can do the documentation (#1554 (comment)) in a separate PR if you're busy with something else.

HyukjinKwon (Member) left a review comment:

LGTM

itholic (Contributor) commented Jun 1, 2020

Um... honestly, resolved_copy seems more complicated to me than applied.

But I think we can merge this for now, since I have no better idea for the name.

Otherwise, LGTM.

ueshin (Collaborator, Author) commented Jun 1, 2020

Thanks! I'll merge this now.

ueshin merged commit ae57c2a into databricks:master Jun 1, 2020
ueshin deleted the update_sdf_lazily branch June 1, 2020 18:23
ueshin mentioned this pull request Jun 30, 2020