Manage data_spark_columns to avoid creating very many Spark DataFrames. #1554

Merged · 3 commits · Jun 1, 2020

Conversation

ueshin (Collaborator) commented May 29, 2020

This PR makes major changes to the way InternalFrame is managed, to avoid creating very many Spark DataFrames.
It makes many DataFrame operations possible without enabling the "compute.ops_on_diff_frames" option, for example:

df + df + df
df['foo'] = df['bar']['baz']
df[['x', 'y']] = df[['x', 'y']].fillna(0)
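
For context, operations like these previously required opting in via the Koalas configuration API. A minimal sketch of the opt-in (ks.set_option is the existing config API; the option name comes from this PR's description):

import databricks.koalas as ks

# Before this change, operations that Koalas treated as being on
# different frames required explicitly enabling the option:
ks.set_option("compute.ops_on_diff_frames", True)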

The new way to manage InternalFrame in functions is (a rough sketch follows this list):

  • If the function only needs to handle Spark columns, e.g., functions using InternalFrame.with_new_columns, use the new columns without creating a new Spark DataFrame. Basically, we can just use with_new_columns to create a new InternalFrame.
  • If the function needs to create a new Spark DataFrame, e.g., functions that need filter, order, or groupby, use _internal.spark_frame with columns from _internal.spark_column_for. Working with a Spark DataFrame from _internal.spark_frame and column names from _internal.spark_column_name_for will usually NOT work.
  • If the function can only access columns by name, e.g., pivot_table or functions with UDFs, use _internal.applied.spark_frame instead. The _internal.applied.spark_frame has all the pending changes applied. Note that _internal.applied.spark_frame won't work with Spark columns from _internal.spark_column_for.
  • DataFrame._sdf was removed so that callers explicitly specify which spark_frame should be used: _internal.spark_frame or _internal.applied.spark_frame.
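
A rough sketch of the three access patterns above (illustrative only: kdf is an arbitrary Koalas DataFrame, and the tuple label passed to spark_column_for is an assumption about the internal label format):

import databricks.koalas as ks

kdf = ks.DataFrame({"x": [1.0, None], "y": [3.0, 4.0]})
internal = kdf._internal

# 1. Column-only change: keep the same anchor, no new Spark DataFrame.
new_internal = internal.with_new_columns(
    [scol * 2 for scol in internal.data_spark_columns]
)

# 2. A new Spark DataFrame is needed (filter/order/groupby): pair
#    _internal.spark_frame with Spark Column objects, not column names.
sdf = internal.spark_frame.filter(internal.spark_column_for(("x",)) > 0)

# 3. Name-only access (UDFs, pivot_table): use the resolved frame,
#    where all pending column expressions have been applied.
applied_sdf = internal.applied.spark_frame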

ueshin requested a review from HyukjinKwon May 29, 2020 18:26

ueshin (Collaborator, Author) commented May 29, 2020

also cc @itholic

codecov-commenter commented May 29, 2020

Codecov Report

Merging #1554 into master will increase coverage by 0.36%.
The diff coverage is 99.44%.

@@            Coverage Diff             @@
##           master    #1554      +/-   ##
==========================================
+ Coverage   94.15%   94.51%   +0.36%     
==========================================
  Files          38       38              
  Lines        8600     8717     +117     
==========================================
+ Hits         8097     8239     +142     
+ Misses        503      478      -25     
Impacted Files                        Coverage             Δ
databricks/koalas/config.py            99.00% <ø>         (+0.01%) ⬆️
databricks/koalas/generic.py           96.66% <ø>         (+0.06%) ⬆️
databricks/koalas/testing/utils.py     80.79% <ø>         (+1.80%) ⬆️
databricks/koalas/frame.py             96.79% <98.33%>    (+1.03%) ⬆️
databricks/conftest.py                100.00% <100.00%>   (+3.77%) ⬆️
databricks/koalas/base.py              97.36% <100.00%>   (+0.09%) ⬆️
databricks/koalas/groupby.py           90.52% <100.00%>   (+0.20%) ⬆️
databricks/koalas/indexes.py           96.82% <100.00%>   (+0.02%) ⬆️
databricks/koalas/indexing.py          92.25% <100.00%>   (+0.09%) ⬆️
databricks/koalas/internal.py          97.55% <100.00%>   (+0.83%) ⬆️
... and 26 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 35f3084...c2b0726.

itholic (Contributor) commented May 31, 2020

... Working with a Spark DataFrame from _internal.spark_frame and column names from InternalFrame.spark_column_name_for will usually NOT work.

Maybe "_internal.applied.spark_frame and column names from _internal.spark_column_name_for will usually NOT work"?
(Plus a nit: what is the difference between _internal and InternalFrame here? Might be a typo?)

itholic (Contributor) commented May 31, 2020

Seems good to me, except for several questions.

ueshin (Collaborator, Author) commented May 31, 2020

@itholic Ah, right, it's a typo. I've updated the description.

itholic (Contributor) commented May 31, 2020

Oh, sorry, I just noticed that I missed reviewing some files. Let me check them tonight.

@@ -408,10 +414,6 @@ def __init__(self, data=None, index=None, columns=None, dtype=None, copy=False):
pdf = pd.DataFrame(data=data, index=index, columns=columns, dtype=dtype, copy=copy)
super(DataFrame, self).__init__(InternalFrame.from_pandas(pdf))

@property
def _sdf(self) -> spark.DataFrame:

Member commented: +1

@@ -793,12 +793,26 @@ def to_pandas_frame(self) -> pd.DataFrame:
]
return pdf

@lazy_property
def applied(self):

Member commented:

Should we maybe call it something like:

  • resolved_copy
  • applied_copy
  • new_sdf_copy
  • ...

?

The name applied doesn't make it clear that it will internally hold a new Spark DataFrame that changes the anchor.
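
For context, lazy_property caches its result, so the resolved Spark DataFrame is only built on first access. A generic, self-contained sketch of that caching behavior (hypothetical class, not the koalas implementation; functools.cached_property stands in for koalas' lazy_property):

import functools

class Example:
    def __init__(self, values):
        self._values = values

    @functools.cached_property  # plays the role of koalas' lazy_property
    def applied(self):
        # Computed once, on first access; later accesses reuse the result.
        return [v + 1 for v in self._values]

e = Example([1, 2, 3])
print(e.applied)  # computes and caches
print(e.applied)  # served from cache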

@@ -698,7 +698,7 @@ def spark_columns(self) -> List[spark.Column]:
index_spark_columns = self.index_spark_columns

HyukjinKwon (Member) commented Jun 1, 2020

Shall we add some docstrings to describe clearly when to use which?

For example, spark_frame should now always be used via df._internal.applied.spark_frame for Spark DataFrame APIs that internally create a new query execution plan with a different output length.

For expressions and/or functions, df._internal.spark_frame should be used together with Spark Column instances, in order to avoid operations on different DataFrames.

Member commented:

Maybe we could also document that spark_frame is just the base Spark DataFrame, where the expressions and functions are not yet applied.

We might have to consider mentioning (a hypothetical docstring sketch follows this list):

  • Spark expressions/functions, which create new Spark Columns against the same DataFrame.
  • Spark DataFrame APIs that internally create query execution plans, which create a new DataFrame.
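
For illustration only, such a docstring might read as follows (hypothetical wording based on this thread, not the merged text; the stub class exists only to host the docstring):

from pyspark.sql import DataFrame as SparkDataFrame

class InternalFrame:  # stub for illustration
    @property
    def spark_frame(self) -> SparkDataFrame:
        """
        Return the base Spark DataFrame, where column expressions and
        functions have NOT yet been applied.

        Use this together with Spark Column instances for expressions
        and functions, to avoid operations on different DataFrames.
        For Spark DataFrame APIs that internally create a new query
        execution plan, possibly with a different output length, use
        `applied.spark_frame` instead.
        """
        ...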

HyukjinKwon (Member) commented:

Looks good. I can do the documentation (#1554 (comment)) in a separate PR if you're busy with something else.

HyukjinKwon (Member) left a review comment:

LGTM

itholic (Contributor) commented Jun 1, 2020

Um... honestly, resolved_copy seems more complicated to me than applied.

But I think we can merge this for now, since I have no better idea for the name.

Otherwise, LGTM.

ueshin (Collaborator, Author) commented Jun 1, 2020

Thanks! I'll merge this now.

ueshin merged commit ae57c2a into databricks:master Jun 1, 2020
ueshin deleted the update_sdf_lazily branch June 1, 2020 18:23
ueshin mentioned this pull request Jun 30, 2020