Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Index._with_new_scol to update its anchor. #1334

Merged
merged 1 commit into from
Mar 12, 2020

Conversation

ueshin
Copy link
Collaborator

@ueshin ueshin commented Mar 11, 2020

As discussed at #1328 (comment), we can consolidate the Index.repeat() and MultiIndex.repeat() by updating its anchor with a new scol.
Also resolves #1190 and other functions which were mistakenly using index_scols in Index, whose usage is valid now.

@ueshin ueshin requested a review from HyukjinKwon March 11, 2020 00:11
@codecov-io
Copy link

codecov-io commented Mar 11, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@f0af49c). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1334   +/-   ##
=========================================
  Coverage          ?   95.22%           
=========================================
  Files             ?       34           
  Lines             ?     7430           
  Branches          ?        0           
=========================================
  Hits              ?     7075           
  Misses            ?      355           
  Partials          ?        0
Impacted Files Coverage Δ
databricks/koalas/series.py 96.73% <ø> (ø)
databricks/koalas/indexing.py 95.29% <ø> (ø)
databricks/koalas/indexes.py 96.38% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0af49c...cbcecca. Read the comment docs.

internal = _InternalFrame(
spark_frame=sdf, index_map=OrderedDict(zip(sdf.columns, self._internal.index_names))
)
return DataFrame(internal).index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm .. I think it will trigger the operations on different DataFrames among the operations in Index .. e.g.)

import databricks.koalas as ks
index = ks.range(100).index
index + index + index

There looks another issue in compute.ops_on_diff_frames ... operations against the indexes from different frames are not supported ..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

The way we manage index Spark columns is different from Series, so we need further refactoring in _InternalFrame, e.g., we should manage them as Spark columns instead of Spark column names. The way in Series is not good enough yet, though.

Actually creating new DataFrame as an anchor is better than now compared to pandas behavior:

>>> pdf = pd.DataFrame({'a':[1,2,3]})
>>> pidx = pdf.index
>>> pidx.name = 'i'
>>> pdf
   a
i
0  1
1  2
2  3
>>> (pidx + pidx).name = 'ii'
>>> pdf
   a
i
0  1
1  2
2  3

When operating pidx + pidx, the connection to the original pdf seems dropped.

Whereas:

>>> kdf = ks.DataFrame({'a':[1,2,3]})
>>> kidx = kdf.index
>>> kidx.name = 'i'
>>> kdf
   a
i
0  1
1  2
2  3
>>> (kidx + kidx).name = 'ii'
>>> kdf
    a
ii
0   1
1   2
2   3

It updates index name even after kidx + kidx.

@HyukjinKwon HyukjinKwon merged commit a5b635f into databricks:master Mar 12, 2020
@ueshin ueshin deleted the index branch March 12, 2020 00:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix the index of Index.to_series().
3 participants