Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index #655

HyukjinKwon · 2019-08-19T02:23:41Z

This PR is a followup of and proposes two things:

Exclude Index columns for exposed Spark DataFrame
Disallow Koalas DataFrame with no index

So, for instance, to_spark() now shows:

-       __index_level_0__  x
-    0                  0  0
-    1                  1  1
+       x
+    0  0
+    1  1

and index_map is never expected to be empty in a Koalas DataFrame anymore. The default index is now set explicitly, per #639.
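The two changes can be pictured with a small pure-Python toy model. This is only a sketch, not the Koalas implementation: `ToyFrame` and `to_spark_columns` are hypothetical names made up for illustration, and only the `__index_level_0__` column name and the `index_map` concept come from this PR.

```python
# Toy model only: ToyFrame and to_spark_columns are hypothetical names,
# not part of the Koalas API.
from typing import Dict, List, Optional, Tuple


class ToyFrame:
    def __init__(self, columns: Dict[str, List[int]],
                 index_map: List[Tuple[str, Optional[str]]]) -> None:
        # Mirrors the second change: a frame with no index is rejected.
        if not index_map:
            raise ValueError("index_map must not be empty")
        self.columns = columns
        self.index_map = index_map

    def to_spark_columns(self) -> List[str]:
        # Mirrors the first change: only data columns are exposed,
        # internal index columns stay hidden.
        index_columns = {name for name, _ in self.index_map}
        return [c for c in self.columns if c not in index_columns]


frame = ToyFrame(
    columns={"__index_level_0__": [0, 1], "x": [0, 1]},
    index_map=[("__index_level_0__", None)],
)
print(frame.to_spark_columns())
```

In this toy model the exposed column list contains only `x`, matching the `+` side of the diff above.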

HyukjinKwon · 2019-08-19T11:03:53Z

databricks/koalas/groupby.py

-            kdf = DataFrame(pdf)
-            return_schema = kdf._sdf.schema
-            if len(pdf) <= limit:
-                return kdf


Mainly just a refactoring to disallow _InternalDataFrame(sdf=sdf, index_map=index_map).

HyukjinKwon · 2019-08-19T12:35:56Z

databricks/koalas/frame.py

@@ -1915,10 +1910,27 @@ def rename(index):
                     index_name if index_name is not None else rename(index_name)))
                index_map.remove(info)

+        new_data_columns = [


This code is needed because reset_index previously removed the index. Now it sets the default index instead.
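As a rough illustration of that behavior change, here is a pure-Python sketch. It assumes a sequential 0..n-1 default index (as pandas uses), and `reset_index_toy` is a hypothetical helper, not the code in frame.py.

```python
# Toy model only: reset_index_toy is a hypothetical helper, not Koalas code.
from typing import Dict, List, Tuple


def reset_index_toy(index: List[str], data: Dict[str, List[int]],
                    index_name: str = "index") -> Tuple[List[int], Dict[str, list]]:
    # The old index becomes an ordinary data column, placed first.
    new_data: Dict[str, list] = {index_name: list(index), **data}
    # Instead of leaving the frame without an index, attach a fresh
    # sequential default index (an assumption of this sketch).
    default_index = list(range(len(index)))
    return default_index, new_data


new_index, new_data = reset_index_toy(["a", "b"], {"x": [10, 20]})
print(new_index)
print(new_data)
```

The point of the sketch is that the result always carries some index, so the list of data columns has to be rebuilt to include the former index column.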

codecov-io · 2019-08-19T12:56:10Z

Codecov Report

Merging #655 into master will increase coverage by 0.23%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #655      +/-   ##
==========================================
+ Coverage   93.19%   93.43%   +0.23%     
==========================================
  Files          32       32              
  Lines        5294     5314      +20     
==========================================
+ Hits         4934     4965      +31     
+ Misses        360      349      -11

| Impacted Files | Coverage Δ | |
|---|---|---|
| databricks/koalas/base.py | `93.67% <ø> (+1.59%)` | ⬆️ |
| databricks/koalas/namespace.py | `89.2% <ø> (+1.2%)` | ⬆️ |
| databricks/koalas/sql.py | `94.79% <ø> (+1.04%)` | ⬆️ |
| databricks/koalas/series.py | `93.33% <100%> (+0.31%)` | ⬆️ |
| databricks/koalas/indexing.py | `93.06% <100%> (+0.2%)` | ⬆️ |
| databricks/koalas/groupby.py | `86.12% <100%> (+0.37%)` | ⬆️ |
| databricks/koalas/internal.py | `96.19% <100%> (-0.24%)` | ⬇️ |
| databricks/koalas/frame.py | `94.63% <100%> (+0.18%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c87a849...c4bfbce.

softagram-bot · 2019-08-19T12:56:27Z

Softagram Impact Report for pull/655 (head commit: `c4bfbce`)

ueshin

LGTM.

HyukjinKwon · 2019-08-19T22:40:13Z

Thanks Takuya!!

thunterdb · 2019-08-20T08:13:12Z

Why disallow the KDFs with no index?

HyukjinKwon · 2019-08-20T10:14:13Z

The biggest reason: #633 introduced operations on different Koalas DataFrames, and those operations are based on the index.

So, if no index is set, operations cannot be performed across different Koalas DataFrames, because this feature relies on a join on the index column. Namely, previously the cases below

ks.range(10) + ks.range(10)

# This was an actual usecase reported
spark_df = spark.read.redshift("...")
ks.DataFrame(spark_df) + ks.DataFrame(spark_df)

were disallowed because the DataFrames had no index. Now they have a default index (#639).
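The join-on-index idea can be sketched in plain Python. This is a toy model under stated assumptions, not how Koalas implements it: `with_default_index` and `add_by_index` are hypothetical helpers, the default index is assumed to be sequential, and an inner join is used only for brevity.

```python
# Toy model only: with_default_index and add_by_index are hypothetical helpers.
from typing import Dict, List


def with_default_index(values: List[int]) -> Dict[int, int]:
    # Attach a sequential default index (0..n-1) to plain values.
    return dict(enumerate(values))


def add_by_index(left: Dict[int, int], right: Dict[int, int]) -> Dict[int, int]:
    # Align rows by index key (an inner join here, for brevity), then add.
    return {k: left[k] + right[k] for k in left if k in right}


a = with_default_index(list(range(3)))
b = with_default_index(list(range(3)))
print(add_by_index(a, b))
```

Without the keys produced by `with_default_index`, `add_by_index` would have nothing to align the two inputs on, which is the situation the no-index case was in.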

Secondly, it makes implementing other APIs much easier. Currently, many APIs such as cumulative sum, max, etc. don't work when no index exists, because their implementations depend on the index column.
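One plausible way to see the dependency (an assumption of this sketch, not something stated in the thread) is that a running total needs a well-defined row order, which the index can supply. `cumsum_by_index` below is a hypothetical pure-Python helper, not Koalas code.

```python
# Toy model only: cumsum_by_index is a hypothetical helper.
from itertools import accumulate
from typing import List, Tuple


def cumsum_by_index(rows: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    # rows are (index, value) pairs that may arrive in arbitrary order,
    # as rows of a distributed dataset can.
    ordered = sorted(rows, key=lambda row: row[0])
    # The running total is only meaningful once the order is fixed.
    totals = accumulate(value for _, value in ordered)
    return [(idx, total) for (idx, _), total in zip(ordered, totals)]


rows = [(2, 30), (0, 10), (1, 20)]
print(cumsum_by_index(rows))
```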

Lastly, since a Koalas DataFrame now always sets the default index (as pandas does), we can allow those cases, which makes more sense and is closer to pandas' behavior:

https://github.com/databricks/koalas/pull/655/files#diff-ebfd51978d1aa348373b958a1b27aa47R936
https://github.com/databricks/koalas/pull/655/files#diff-ebfd51978d1aa348373b958a1b27aa47R543
https://github.com/databricks/koalas/pull/655/files#diff-66320dc549688da8eecb81b165764a86R37-R40

HyukjinKwon · 2019-09-05T23:59:02Z

cc @rxin: BTW, this is the PR where we disabled the no-index case.

HyukjinKwon force-pushed the always-set-index branch 3 times, most recently from f361f81 to 4ca989e Compare August 19, 2019 02:50

HyukjinKwon requested review from ueshin and rxin August 19, 2019 03:23

HyukjinKwon force-pushed the always-set-index branch 6 times, most recently from cf533d2 to 9740e70 Compare August 19, 2019 03:49

HyukjinKwon changed the title ~~[WIP] Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index~~ Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index Aug 19, 2019

HyukjinKwon closed this Aug 19, 2019

HyukjinKwon reopened this Aug 19, 2019

HyukjinKwon force-pushed the always-set-index branch from 9740e70 to 6f3c98e Compare August 19, 2019 08:02

HyukjinKwon changed the title ~~Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index~~ [WIP] Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index Aug 19, 2019

HyukjinKwon changed the title ~~[WIP] Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index~~ Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index Aug 19, 2019

Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index

ad958b8

HyukjinKwon force-pushed the always-set-index branch from dc131cc to ad958b8 Compare August 19, 2019 11:09

Fix indexing stuff as well

c4bfbce

ueshin approved these changes Aug 19, 2019

HyukjinKwon merged commit f2a718d into databricks:master Aug 19, 2019

HyukjinKwon deleted the always-set-index branch November 6, 2019 02:21
