-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Index._with_new_scol to update its anchor. #1334
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1334 +/- ##
=========================================
Coverage ? 95.22%
=========================================
Files ? 34
Lines ? 7430
Branches ? 0
=========================================
Hits ? 7075
Misses ? 355
Partials ? 0
Continue to review full report at Codecov.
|
internal = _InternalFrame( | ||
spark_frame=sdf, index_map=OrderedDict(zip(sdf.columns, self._internal.index_names)) | ||
) | ||
return DataFrame(internal).index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm .. I think it will trigger the operations on different DataFrames among the operations in Index .. e.g.)
import databricks.koalas as ks
index = ks.range(100).index
index + index + index
There looks another issue in compute.ops_on_diff_frames
... operations against the indexes from different frames are not supported ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
The way we manage index Spark columns is different from Series, so we need further refactoring in _InternalFrame
, e.g., we should manage them as Spark columns instead of Spark column names. The way in Series is not good enough yet, though.
Actually creating new DataFrame as an anchor is better than now compared to pandas behavior:
>>> pdf = pd.DataFrame({'a':[1,2,3]})
>>> pidx = pdf.index
>>> pidx.name = 'i'
>>> pdf
a
i
0 1
1 2
2 3
>>> (pidx + pidx).name = 'ii'
>>> pdf
a
i
0 1
1 2
2 3
When operating pidx + pidx
, the connection to the original pdf
seems dropped.
Whereas:
>>> kdf = ks.DataFrame({'a':[1,2,3]})
>>> kidx = kdf.index
>>> kidx.name = 'i'
>>> kdf
a
i
0 1
1 2
2 3
>>> (kidx + kidx).name = 'ii'
>>> kdf
a
ii
0 1
1 2
2 3
It updates index name even after kidx + kidx
.
As discussed at #1328 (comment), we can consolidate the
Index.repeat()
andMultiIndex.repeat()
by updating its anchor with a new scol.Also resolves #1190 and other functions which were mistakenly using
index_scols
inIndex
, whose usage is valid now.