Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index #655

Merged
merged 2 commits into from
Aug 19, 2019

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Aug 19, 2019

This PR is a followup of and proposes two things:

  • Exclude Index columns for exposed Spark DataFrame
  • Disallow Koalas DataFrame with no index

So, for instance, to_spark() now shows:

-       __index_level_0__  x
-    0                  0  0
-    1                  1  1
+       x
+    0  0
+    1  1

In addition, index_map is now never expected to be empty in a Koalas DataFrame: the default index is always set explicitly, per #639.
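
The two rules above can be sketched with a toy model in plain Python. This is only an illustration, not the Koalas internals: `make_frame` and `to_exposed` are hypothetical helpers, and the dict-based "frame" merely stands in for the internal Spark DataFrame plus its index_map.

```python
# Toy model only: make_frame and to_exposed are hypothetical helpers,
# not part of Koalas or PySpark.

def make_frame(data, index_map=None):
    """Build a toy frame; attach a default sequential index if none is given."""
    n = len(next(iter(data.values()), []))
    columns = dict(data)
    if not index_map:
        # Rule 2: index_map is never left empty, so add a default index column.
        columns = {"__index_level_0__": list(range(n)), **columns}
        index_map = ["__index_level_0__"]
    return {"columns": columns, "index_map": list(index_map)}


def to_exposed(frame):
    """Rule 1: return only the data columns, hiding internal index columns."""
    hidden = set(frame["index_map"])
    return {name: values for name, values in frame["columns"].items()
            if name not in hidden}


frame = make_frame({"x": [0, 1]})
exposed = to_exposed(frame)
print(frame["index_map"])  # the internal index column is tracked here
print(exposed)             # only the data column "x" is exposed
```

In this sketch the index column still exists internally, so index-based operations keep working, while the exposed view matches what the user originally put in.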

@HyukjinKwon HyukjinKwon force-pushed the always-set-index branch 3 times, most recently from f361f81 to 4ca989e Compare August 19, 2019 02:50
@HyukjinKwon HyukjinKwon force-pushed the always-set-index branch 6 times, most recently from cf533d2 to 9740e70 Compare August 19, 2019 03:49
@HyukjinKwon HyukjinKwon changed the title [WIP] Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index Aug 19, 2019
@HyukjinKwon HyukjinKwon reopened this Aug 19, 2019
@HyukjinKwon HyukjinKwon changed the title Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index [WIP] Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index Aug 19, 2019
kdf = DataFrame(pdf)
return_schema = kdf._sdf.schema
if len(pdf) <= limit:
    return kdf
Member Author

@HyukjinKwon HyukjinKwon Aug 19, 2019

Mainly just a refactoring to disallow _InternalDataFrame(sdf=sdf, index_map=index_map).

@HyukjinKwon HyukjinKwon changed the title [WIP] Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index Aug 19, 2019
@@ -1915,10 +1910,27 @@ def rename(index):
index_name if index_name is not None else rename(index_name)))
index_map.remove(info)

new_data_columns = [
Member Author

This code is needed because reset_index previously removed the index entirely. Now it sets the default index instead.

@codecov-io

codecov-io commented Aug 19, 2019

Codecov Report

Merging #655 into master will increase coverage by 0.23%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #655      +/-   ##
==========================================
+ Coverage   93.19%   93.43%   +0.23%     
==========================================
  Files          32       32              
  Lines        5294     5314      +20     
==========================================
+ Hits         4934     4965      +31     
+ Misses        360      349      -11
Impacted Files                    Coverage Δ
databricks/koalas/base.py         93.67% <ø> (+1.59%) ⬆️
databricks/koalas/namespace.py    89.2% <ø> (+1.2%) ⬆️
databricks/koalas/sql.py          94.79% <ø> (+1.04%) ⬆️
databricks/koalas/series.py       93.33% <100%> (+0.31%) ⬆️
databricks/koalas/indexing.py     93.06% <100%> (+0.2%) ⬆️
databricks/koalas/groupby.py      86.12% <100%> (+0.37%) ⬆️
databricks/koalas/internal.py     96.19% <100%> (-0.24%) ⬇️
databricks/koalas/frame.py        94.63% <100%> (+0.18%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c87a849...c4bfbce.

@softagram-bot

Softagram Impact Report for pull/655 (head commit: c4bfbce)

Collaborator

@ueshin ueshin left a comment

LGTM.

@HyukjinKwon
Member Author

Thanks Takuya!!

@HyukjinKwon HyukjinKwon merged commit f2a718d into databricks:master Aug 19, 2019
@thunterdb
Contributor

Why disallow the KDFs with no index?

@HyukjinKwon
Member Author

HyukjinKwon commented Aug 20, 2019

The biggest reason: #633 introduced operations across different Koalas DataFrames, and that feature is based on the index.

So, if no index is set, operations cannot be performed across different Koalas DataFrames, because this feature is implemented as a join on the index column. Namely, previously the cases below

ks.range(10) + ks.range(10)
# This was an actual usecase reported
spark_df = spark.read.redshift("...")
ks.DataFrame(spark_df) + ks.DataFrame(spark_df)

were disallowed because the DataFrames had no index. Now they have a default index (#639).
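
To make the dependency on the index concrete, here is a minimal pure-Python sketch of adding two columns by joining on an index key. It is a toy model, not the Koalas implementation: `with_default_index` and `add_by_index` are hypothetical helpers, and an inner join is assumed for brevity.

```python
# Toy model only: these helpers are hypothetical and do not exist in Koalas.

def with_default_index(values):
    """Pair each value with a sequential key, like a default index."""
    return list(enumerate(values))


def add_by_index(left, right):
    """Add two indexed columns via an inner join on the index key."""
    right_by_index = dict(right)
    return sorted(
        (idx, value + right_by_index[idx])
        for idx, value in left
        if idx in right_by_index
    )


left = with_default_index(range(10))
right = with_default_index(range(10))
# Row order is not guaranteed in a distributed dataset, so simulate a reorder.
right.reverse()

result = add_by_index(left, right)
print(result)
```

Because rows are matched by key rather than by position, the reordering of `right` does not change the result. Without any index there is no key to join on, which is why such operations could not be supported.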

Secondly, it makes implementing other APIs much easier. Currently, many APIs such as cumulative sum, max, etc. don't work when no index exists, because their implementation depends on the index column.
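
As a rough illustration of why a cumulative operation might need an index, the sketch below assumes that the index is what supplies a well-defined row order. This is a toy model in plain Python, not how Koalas implements cumulative sum, and `cumsum_by_index` is a hypothetical helper.

```python
from itertools import accumulate

# Toy model only: cumsum_by_index is a hypothetical helper, not a Koalas API.

def cumsum_by_index(rows):
    """Compute a running total over (index, value) rows, ordered by index."""
    ordered = sorted(rows, key=lambda row: row[0])
    indexes = [idx for idx, _ in ordered]
    totals = accumulate(value for _, value in ordered)
    return list(zip(indexes, totals))


# Rows arrive in arbitrary order, as they may in a distributed dataset.
rows = [(2, 30), (0, 10), (1, 20)]
result = cumsum_by_index(rows)
print(result)
```

Without an index column there would be nothing to sort on, so the running total would depend on whatever order the rows happened to arrive in.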

Lastly, since a Koalas DataFrame, from now on, always sets the default index (as pandas does), we can now allow those cases, which makes more sense and is closer to pandas' behavior:

https://github.com/databricks/koalas/pull/655/files#diff-ebfd51978d1aa348373b958a1b27aa47R936
https://github.com/databricks/koalas/pull/655/files#diff-ebfd51978d1aa348373b958a1b27aa47R543
https://github.com/databricks/koalas/pull/655/files#diff-66320dc549688da8eecb81b165764a86R37-R40

@HyukjinKwon
Member Author

cc @rxin, BTW, this is the PR where we disabled the no-index case.

@HyukjinKwon HyukjinKwon deleted the always-set-index branch November 6, 2019 02:21