Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate MultiIndex names from levels #27242

Merged
merged 5 commits into from
Oct 16, 2019

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Jul 5, 2019

In #27138 I proposed doing some changes to MultiIndex, so that the index type can have its data collected in _data as type List[Categorical],+ adding MultiIndex.arrays in order to access each full level as zero-copy Categorical.

This is the first part of that proposal, and drops setting the names on the levels[x].name attribute and instead sets the names on the MultiIndex._names attribute.

This PR is a minorly backward-breaking change (so would be good to get into 0.25), while the followup will not break anything.

@topper-123 topper-123 force-pushed the MultiIndex._names branch 3 times, most recently from 7d93c89 to ab2fdf5 Compare July 5, 2019 10:52
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in MultiIndex, no problem setting the level.name attribute, but if you are in frame.py or reshape.py I would avoid doing this and instead use the name= parameter in ._shallow_copy() or use .rename()? (on a level)

@@ -259,6 +259,7 @@ def __new__(
result._set_levels(levels, copy=copy, validate=False)
result._set_codes(codes, copy=copy, validate=False)

result._names = [None for _ in levels]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[None] * len(levels)

@@ -260,10 +260,13 @@ def get_new_values(self):
def get_new_columns(self):
if self.value_columns is None:
if self.lift == 0:
return self.removed_level
lev = self.removed_level._shallow_copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wouldn't you do
lev = self.removed_level._shallow_copy(name=self.removed_name) ?

Copy link
Contributor Author

@topper-123 topper-123 Jul 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_shallow_copy and rename and other indirect methods to set the .name all call ._set_names, which does a lot of checks. Those checks are not needed in these internal functionality, as the name has already been validated.

Perhaps have a fastpath parameter in _set_names?

@@ -658,7 +663,9 @@ def _convert_level_number(level_num, columns):
new_names = this.columns.names[:-1]
new_columns = MultiIndex.from_tuples(unique_groups, names=new_names)
else:
new_columns = unique_groups = this.columns.levels[0]
new_columns = this.columns.levels[0]._shallow_copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use name= here

@@ -302,7 +305,9 @@ def get_new_index(self):
lev, lab = self.new_index_levels[0], result_codes[0]
if (lab == -1).any():
lev = lev.insert(len(lev), lev._na_value)
return lev.take(lab)
new_index = lev.take(lab)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use .rename()


lev = self.removed_level
return lev.insert(0, lev._na_value)
lev = self.removed_level.insert(0, item=self.removed_level._na_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use .rename()

@@ -979,7 +979,7 @@ def test_reset_index(self, float_frame):
):
values = lev.take(level_codes)
name = names[i]
tm.assert_index_equal(values, Index(deleveled[name]))
tm.assert_index_equal(values, Index(deleveled[name]), check_names=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lev.take(level_codes) doesn't provide a name any more, while a rest index does provides its Series with a name (as it should.

I've added a test assert values.name is None to make this more explicit.

@topper-123
Copy link
Contributor Author

I’ve changed how the name is set. This is a bit slower (many checks that are not needed), but than could be fixed seperately in a later PR.

@@ -1609,12 +1607,12 @@ def test_constructor_with_tz(self):
)

result = MultiIndex.from_arrays([index, columns])
tm.assert_index_equal(result.levels[0], index)
tm.assert_index_equal(result.levels[1], columns)
tm.assert_index_equal(result.levels[0], index, check_names=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now find these tests very confusing that we lose the names on the levels themselves. (I know that's the point of this PR).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a .set_names(index.name) (for example) and remove the check_names arg (so its the default of True)

Copy link
Contributor Author

@topper-123 topper-123 Jul 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made changes so we avoid check_names=False (so is implicitly True).

@topper-123 topper-123 force-pushed the MultiIndex._names branch 5 times, most recently from 5bcf204 to efcfeac Compare July 8, 2019 22:16
@topper-123
Copy link
Contributor Author

Is this ok? I'd like to get this in 0.25, as this is a breaking change. I'll add a 0.25 label as a reminder.

The rest of #27138 will be non-breaking, so can go in later, if needed.

@topper-123 topper-123 added this to the 0.25.0 milestone Jul 8, 2019
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 9, 2019 via email

@topper-123
Copy link
Contributor Author

Yeah, I understand that, but the followups to this PR will in addition to the benefits mentioned in #27138 also allow some nice simplifications of MultiIndex (by delegating all single-level checks to Categorical) and I assume release of 0.25 will mean a stop to breaking changes for a while,, because next up will be 1.0?

@jreback
Copy link
Contributor

jreback commented Jul 9, 2019

@TomAugspurger I don't believe we have held off on merging even breaking changes to an RC. I don't see this as a big deal and would merge as is.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 10, 2019 via email

@jreback jreback removed this from the 0.25.0 milestone Jul 17, 2019
@topper-123
Copy link
Contributor Author

Ping.

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
@jreback jreback added this to the 1.0 milestone Jul 25, 2019
@topper-123
Copy link
Contributor Author

I've moved the whatsnew section to 1.0.

@jreback
Copy link
Contributor

jreback commented Jul 27, 2019

@topper-123 lgtm (tiny comment). @TomAugspurger any objections?

@TomAugspurger
Copy link
Contributor

Question on the broad goal of #27138: IIUC, the motivation is for MultiIndex to be backed by a _data: List[Categorical]. I don't necessarily agree that that goal requires API breaking changes here.

Why can .levels not return something like

@property
def levels(self):
    return FrozenList(pd.Index(self._data[i], name=self.names[i] for i in idx.nlevels))

@TomAugspurger
Copy link
Contributor

@topper-123 do you have thoughts on #27242 (comment)?

@topper-123
Copy link
Contributor Author

topper-123 commented Oct 16, 2019

As I see it, this will for practical purposes keep the names in two locations

No, they'll be stored in one place and accessible from two locations. This is the same as on master, only the source of truth and the referring location are swapped.

But if someone does

>>> mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])
>>> lev = mi.levels[0]
>>> mi.set_names('z', level=0)
# then
>>> mi.names[0], lev.name
'z', 'x'

So the names will be stored in two places (or users should not store individual levels seperately, which they can't be expected to know). So for this reason I think it's the most most practical to make a clean cut.

EDIT: Ok I got an idea: What if we deprecate levels and use the name categories instead? In that case we could do (after implementing #27138):

@property
def levels(self) -> FrozenList[Index]:
    warnings.warn(...)
    return FrozenList(pd.Index(lev.categories, name=name) for name, lev in zip(self.names, self._data))

@property
def categories(self) -> FrozenList[Index]:
    return FrozenList(lev.categories for lev in self._data)

This would also make the API for MultiIndex be more similar to CategoricalIndex.

@TomAugspurger
Copy link
Contributor

Ahh, a new name for levels is a good idea for getting around this problem.

@jreback
Copy link
Contributor

jreback commented Oct 16, 2019

+1 on @topper-123 new idea.

@TomAugspurger
Copy link
Contributor

On the new name, is .categories what we want?

I suspect that in the near-term we'll have a DictEncodedArray that's like a Categorical, except it won't have the same semantics around "unobserved" categories and new categories can be added implicitly. I'm not sure what we would call the set of unique values for that thing (perhaps categories too).

@topper-123
Copy link
Contributor Author

topper-123 commented Oct 16, 2019

I'm not set on exact name for this, but would like consistency. So maybe if you make a suggestion on the attribute name for that new array type?

I BTW don't know if I like the name DictEncodedArray . Can't it be just EncodedArray instead?

Edit: Or is the idea that what is now categories will in the new implementation be a dict instead of an Index? Then maybe call it mapping...?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 16, 2019 via email

@topper-123
Copy link
Contributor Author

Ok, if I can get this PR merged, I will start implementing MultiIndex._data and will deprecate levels and add categories.

@TomAugspurger
Copy link
Contributor

👍 I'm fine with merging this as long as we also do the .levels / .categories for 1.0.

@jreback jreback merged commit 46e89b0 into pandas-dev:master Oct 16, 2019
@jreback
Copy link
Contributor

jreback commented Oct 16, 2019

thanks @topper-123

no need to create an issue, you can just ref this PR.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 16, 2019 via email

@jorisvandenbossche
Copy link
Member

This is breaking pyarrow (https://issues.apache.org/jira/browse/ARROW-6922).

The above is a very long discussion (which I didn't really follow before, sorry for that), but what I somewhat understand is that for 1.0 we want to restore the .levels behaviour of being able to get the name? But why was this PR then merged?
Or why was the suggestion of Tom to wrap the categories in an Index with the name as the levels (#27242 (comment)) not implemented?

@jorisvandenbossche
Copy link
Member

If the idea is to deprecate the .levels attribute of a MultiIndex, I think that deserves first some separate focused discussion (as that is a big change).

Short-term, can we add back the names to .level? From the discussion above, it seems this does not need to conflict with the refactor of the MultiIndex internals (for getting the name)

@TomAugspurger
Copy link
Contributor

The above is a very long discussion (which I didn't really follow before, sorry for that), but what I somewhat understand is that for 1.0 we want to restore the .levels behaviour of being able to get the name? But why was this PR then merged?

Right. Restoring getting is relatively straightforward. I can put a PR up with that later today.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Oct 17, 2019
xref https://issues.apache.org/jira/browse/ARROW-6922 /
pandas-dev#27242 (comment)
/ pandas-dev#29032

No docs yet, since it isn't clear how this will eventually sort out. But
we at least want to preserve this behavior for 1.0
jreback pushed a commit that referenced this pull request Oct 18, 2019
* API: Restore getting name from MultiIndex level

xref https://issues.apache.org/jira/browse/ARROW-6922 /
#27242 (comment)
/ #29032

No docs yet, since it isn't clear how this will eventually sort out. But
we at least want to preserve this behavior for 1.0

* fixups
HawkinsBA pushed a commit to HawkinsBA/pandas that referenced this pull request Oct 29, 2019
* API: Restore getting name from MultiIndex level

xref https://issues.apache.org/jira/browse/ARROW-6922 /
pandas-dev#27242 (comment)
/ pandas-dev#29032

No docs yet, since it isn't clear how this will eventually sort out. But
we at least want to preserve this behavior for 1.0

* fixups
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
* API: Restore getting name from MultiIndex level

xref https://issues.apache.org/jira/browse/ARROW-6922 /
pandas-dev#27242 (comment)
/ pandas-dev#29032

No docs yet, since it isn't clear how this will eventually sort out. But
we at least want to preserve this behavior for 1.0

* fixups
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
* API: Restore getting name from MultiIndex level

xref https://issues.apache.org/jira/browse/ARROW-6922 /
pandas-dev#27242 (comment)
/ pandas-dev#29032

No docs yet, since it isn't clear how this will eventually sort out. But
we at least want to preserve this behavior for 1.0

* fixups
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
* API: Restore getting name from MultiIndex level

xref https://issues.apache.org/jira/browse/ARROW-6922 /
pandas-dev#27242 (comment)
/ pandas-dev#29032

No docs yet, since it isn't clear how this will eventually sort out. But
we at least want to preserve this behavior for 1.0

* fixups
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants