Fix Index._with_new_scol to update its anchor. #1334

ueshin · 2020-03-11T00:11:32Z

As discussed at #1328 (comment), we can consolidate the Index.repeat() and MultiIndex.repeat() by updating its anchor with a new scol.
Also resolves #1190 and other functions which were mistakenly using index_scols in Index, whose usage is valid now.

codecov-io · 2020-03-11T00:32:23Z

Codecov Report

❗ No coverage uploaded for pull request base (master@f0af49c). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #1334   +/-   ##
=========================================
  Coverage          ?   95.22%           
=========================================
  Files             ?       34           
  Lines             ?     7430           
  Branches          ?        0           
=========================================
  Hits              ?     7075           
  Misses            ?      355           
  Partials          ?        0

Impacted Files	Coverage Δ
databricks/koalas/series.py	`96.73% <ø> (ø)`
databricks/koalas/indexing.py	`95.29% <ø> (ø)`
databricks/koalas/indexes.py	`96.38% <100%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0af49c...cbcecca. Read the comment docs.

HyukjinKwon · 2020-03-11T09:59:33Z

databricks/koalas/indexes.py

+        internal = _InternalFrame(
+            spark_frame=sdf, index_map=OrderedDict(zip(sdf.columns, self._internal.index_names))
+        )
+        return DataFrame(internal).index


Hm .. I think it will trigger the operations on different DataFrames among the operations in Index .. e.g.)

import databricks.koalas as ks index = ks.range(100).index index + index + index

There looks another issue in compute.ops_on_diff_frames ... operations against the indexes from different frames are not supported ..

Good point.

The way we manage index Spark columns is different from Series, so we need further refactoring in _InternalFrame, e.g., we should manage them as Spark columns instead of Spark column names. The way in Series is not good enough yet, though.

Actually creating new DataFrame as an anchor is better than now compared to pandas behavior:

>>> pdf = pd.DataFrame({'a':[1,2,3]}) >>> pidx = pdf.index >>> pidx.name = 'i' >>> pdf a i 0 1 1 2 2 3 >>> (pidx + pidx).name = 'ii' >>> pdf a i 0 1 1 2 2 3

When operating pidx + pidx, the connection to the original pdf seems dropped.

Whereas:

>>> kdf = ks.DataFrame({'a':[1,2,3]}) >>> kidx = kdf.index >>> kidx.name = 'i' >>> kdf a i 0 1 1 2 2 3 >>> (kidx + kidx).name = 'ii' >>> kdf a ii 0 1 1 2 2 3

It updates index name even after kidx + kidx.

ueshin requested a review from HyukjinKwon March 11, 2020 00:11

HyukjinKwon force-pushed the index branch from 035cd87 to adca6e5 Compare March 11, 2020 09:48

Fix Index._with_new_scol to update its anchor.

cbcecca

HyukjinKwon force-pushed the index branch from adca6e5 to cbcecca Compare March 11, 2020 09:49

HyukjinKwon reviewed Mar 11, 2020

View reviewed changes

HyukjinKwon approved these changes Mar 12, 2020

View reviewed changes

HyukjinKwon merged commit a5b635f into databricks:master Mar 12, 2020

ueshin deleted the index branch March 12, 2020 00:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Index._with_new_scol to update its anchor. #1334

Fix Index._with_new_scol to update its anchor. #1334

ueshin commented Mar 11, 2020

codecov-io commented Mar 11, 2020 •

edited

Loading

HyukjinKwon Mar 11, 2020

ueshin Mar 11, 2020

Fix Index._with_new_scol to update its anchor. #1334

Fix Index._with_new_scol to update its anchor. #1334

Conversation

ueshin commented Mar 11, 2020

codecov-io commented Mar 11, 2020 • edited Loading

Codecov Report

HyukjinKwon Mar 11, 2020

Choose a reason for hiding this comment

ueshin Mar 11, 2020

Choose a reason for hiding this comment

codecov-io commented Mar 11, 2020 •

edited

Loading