API: Remove nan-likes from MultiIndex levels #29111

topper-123 · 2019-10-20T09:54:58Z

Working on #27138 I've found that MultiIndex keeps nan-likes in the levels, but encodes them all to -1:

>>> levels, codes = [[np.nan, None, pd.NaT, 128, 2]], [[0, -1, 1, 2, 3, 4]]
>>> mi = pd.MultiIndex(levels, codes)
>>> mi.codes[0]
[-1, -1, -1, -1, 3, 4]
>>> mi.levels[0]
Index([nan, None, NaT, 128, 2], dtype='object')

All the MultiIndex nan-likes are encoded to -1, so it's not possible to decode them back to their original values. As a result, only one kind of nan-like value can come out of the MultiIndex; in this case None and NaT disappear when converting:

>>> mi.to_frame()[0].array
<PandasArray>
[nan, nan, nan, nan, 128, 2]
Length: 6, dtype: object
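
To make the information loss concrete, here is a minimal sketch (not part of the original report) using `pd.factorize`, which also uses -1 as its missing-value sentinel. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Three different nan-likes plus two real values, kept as Python objects.
values = np.array([np.nan, None, pd.NaT, 128, 2], dtype=object)

# factorize encodes every missing value with the sentinel -1 and leaves
# the missing values out of the uniques.
codes, uniques = pd.factorize(values)

# Decoding from the codes alone: a -1 carries no record of which nan-like
# it came from, so a single placeholder has to stand in for all of them.
decoded = [np.nan if code == -1 else uniques[code] for code in codes]
```

Once the values are reduced to codes, `None` and `NaT` cannot be told apart from `nan`, which matches the `to_frame()` output above.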

I think if nan-likes are all encoded to -1, it'd be more consistent to not have them in the levels, similarly to how Categorical does it already.

>>> c = pd.Categorical(levels[0])
>>> c.codes
array([-1, -1, -1,  1,  0], dtype=int8)
>>> c.categories
Int64Index([2, 128], dtype='int64')

Is there acceptance to change the MultiIndex API so that nan-likes are removed from the levels? That would make its API more similar to Categorical.
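
A rough sketch of what removing nan-likes from a single level could involve. `strip_na_from_level` is a hypothetical helper written for this illustration, not an existing pandas function, and it assumes a non-empty level where every code is either -1 or a valid position in that level.

```python
import numpy as np
import pandas as pd


def strip_na_from_level(level, codes):
    """Hypothetical helper: drop missing values from one level and remap its codes."""
    level = np.asarray(level, dtype=object)
    codes = np.asarray(codes, dtype=np.intp)
    is_na = pd.isna(level)
    # New position of each old level entry; missing entries map to -1.
    new_pos = np.where(is_na, -1, np.cumsum(~is_na) - 1)
    # Codes that are already -1 stay -1; the rest are looked up in new_pos.
    new_codes = np.where(codes == -1, -1, new_pos[codes])
    return level[~is_na], new_codes


new_level, new_codes = strip_na_from_level(
    [np.nan, None, pd.NaT, 128, 2], [0, -1, 1, 2, 3, 4]
)
```

Codes that pointed at a nan-like entry become -1, and the remaining codes shift down to match the shortened level, which is the same shape of result that Categorical produces.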

@pandas-dev/pandas-core.


topper-123 · 2019-10-23T16:32:31Z

@jreback @TomAugspurger , @jorisvandenbossche, any comments?

All of the nan-likes are encoded to -1 already, so no information will really be lost from this, and this will unite Categorical.categories and MultiIndex.levels.

jorisvandenbossche · 2019-10-23T18:28:19Z

In principle, it would certainly be nice to clean that up I think, as that doesn't look good.

Can we think of potential changes that can impact users?
Eg what is the impact on indexing a series/dataframe with such an index? Does that change how those different missing indicators are treated?

Is this also only applicable to object dtype levels?

topper-123 · 2019-10-23T18:52:35Z

Yes, I think this should only affect object dtype levels, because other dtypes can only have one nan-like value.
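
A small illustration of that point, as a sketch with made-up values: typed arrays coerce every missing marker to their single NA value, while object arrays keep the original Python objects.

```python
import numpy as np
import pandas as pd

# float64 has a single missing marker: None is coerced to nan.
floats = np.array([1.5, None, np.nan], dtype="float64")

# datetime64 likewise collapses every missing marker to NaT.
dates = pd.to_datetime([None, np.nan, "2019-10-20"])

# object dtype keeps nan, None and NaT as distinct Python objects.
objects = np.array([np.nan, None, pd.NaT], dtype=object)
```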

I can't think how it could affect indexing, because indexing works using the codes, so all of NaN, NaT, None etc. already translate to the same code (-1).

I could start working on it, and if I'm missing some effect that has unexpected implications, we could discuss it again.

jorisvandenbossche · 2019-10-23T18:54:37Z

I can't think how it could affect indexing, because indexing works using the codes, so all of NaN, NaT, None etc. already translate to the same code (-1).

Can you try some examples with the index you show above?

jorisvandenbossche · 2019-10-23T18:55:35Z

Eg pd.Series(range(len(mi)), index=mi)[np.nan] does actually not even work for me. But I suppose there is some way to index with missing values into a MultiIndex?

topper-123 · 2019-10-23T19:08:06Z

Yes, I agree, seems like indexing MultiIndex by nans doesn't work currently.

Also, given that all the nan-likes are encoded to -1 in .codes, even if indexing with nans did work, there AFAICT couldn't be any way to differentiate between nan and None anyway, so it wouldn't be useful.

Also, object dtype seems to be an anomaly, as other dtypes actually don't keep nan-likes in the level:

>>> mi = pd.MultiIndex.from_product([[10, np.nan]])
>>> mi.levels[0]
Int64Index([10], dtype='int64')
>>> mi.codes[0]
FrozenNDArray([0, -1], dtype='int8')

TomAugspurger · 2019-10-24T12:32:12Z

Your proposal sounds reasonable.


jorisvandenbossche · 2019-10-25T11:24:02Z

cc @toobaz in case you have any experience with NaNs in object MultiIndexes

Also, given that all the nan-likes are encoded to -1 in .codes, even if indexing with nans did work, there AFAICT couldn't be any way to differentiate between nan and None anyway,

If it had worked, we might need to ensure it keeps working (even when that specific indicator would not be in the values anymore), or have some deprecation for it. But ok, since it seems to not work, no need to argue for this ;)

jorisvandenbossche · 2019-10-25T11:42:27Z

Some random other observations / thoughts:

I was wondering what happens if you convert such a MI into a normal index by eg dropping index levels:

In [54]: levels, codes = [[np.nan, None, pd.NaT, 128, 2], [0]], [[0, -1, 1, 2, 3, 4], [0]*6] 
    ...: mi = pd.MultiIndex(levels, codes)  

In [55]: df = pd.DataFrame({'a': range(len(mi))}, index=mi).reset_index(level=1, drop=True)  

In [56]: df.index       
Out[56]: Index([nan, nan, nan, nan, 128, 2], dtype='object')

In [57]: levels, codes = [[pd.NaT, None, np.nan, 128, 2], [0]], [[0, -1, 1, 2, 3, 4], [0]*6] 
    ...: mi = pd.MultiIndex(levels, codes)  

In [58]: df = pd.DataFrame({'a': range(len(mi))}, index=mi).reset_index(level=1, drop=True)

In [59]: df.index   
Out[59]: Index([nan, nan, nan, nan, 128, 2], dtype='object')

So it seems you only get NaNs, and that also does not depend on the order of the missing value indicators in the levels (so if NaN is not the first, you still get NaN).
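
One way to picture why the order does not matter, as a sketch that assumes decoding behaves like `Index.take` with `allow_fill=True` (the variable names are illustrative): a -1 code is never looked up in the level, it is filled with the index's own NA value.

```python
import numpy as np
import pandas as pd

# Two object-dtype levels holding the same entries, with the nan-likes
# in a different order.
level_a = pd.Index([np.nan, None, pd.NaT, 128, 2], dtype=object)
level_b = pd.Index([pd.NaT, None, np.nan, 128, 2], dtype=object)

# Codes as they look after the constructor has rewritten them.
codes = np.array([-1, -1, -1, -1, 3, 4])

# With allow_fill=True, -1 means "missing" and is filled with the index's
# NA value instead of being treated as a position in the level.
decoded_a = level_a.take(codes, allow_fill=True, fill_value=np.nan)
decoded_b = level_b.take(codes, allow_fill=True, fill_value=np.nan)
```

Both results hold nan in the first four positions, regardless of which nan-like comes first in the level.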

If you create a MultiIndex in a more typical way (not by constructing it with the MI constructor from levels and codes, but eg by setting columns as the index), you get "properly" constructed MIs:

In [67]: df = pd.DataFrame({'l1': [1, 'a', np.nan, None, pd.NaT], 'l2': range(5), 'a': range(5)})  

In [68]: df  
Out[68]: 
     l1  l2  a
0     1   0  0
1     a   1  1
2   NaN   2  2
3  None   3  3
4   NaT   4  4

In [69]: df.set_index(['l1', 'l2']).index 
Out[69]: 
MultiIndex([(  1, 0),
            ('a', 1),
            (nan, 2),
            (nan, 3),
            (nan, 4)],
           names=['l1', 'l2'])

In [70]: df.set_index(['l1', 'l2']).index.levels 
Out[70]: FrozenList([[1, 'a'], [0, 1, 2, 3, 4]])

In [71]: df.set_index(['l1', 'l2']).index.codes  
Out[71]: FrozenList([[0, 1, -1, -1, -1], [0, 1, 2, 3, 4]])

The same is true for pd.MultiIndex.from_tuples or pd.MultiIndex.from_product:

In [77]: pd.MultiIndex.from_product([pd.Index([1, 'a', None, pd.NaT, np.nan], dtype=object), [1, 2]])  
Out[77]: 
MultiIndex([(  1, 1),
            (  1, 2),
            ('a', 1),
            ('a', 2),
            (nan, 1),
            (nan, 2),
            (nan, 1),
            (nan, 2),
            (nan, 1),
            (nan, 2)],
           )

In [78]: pd.MultiIndex.from_product([pd.Index([1, 'a', None, pd.NaT, np.nan], dtype=object), [1, 2]]).levels
Out[78]: FrozenList([[1, 'a'], [1, 2]])

All more reasons to fix this inconsistency. However, on:

Also, object dtype seems to be an anomaly as other dtypes actually don't keep nan-likes in the level

This does not seem to be unique to object dtype (you used from_product instead of the main constructor, and from_product also sanitizes the missing values for object dtype):

In [85]: levels, codes = [[np.nan, 128, 2]], [[0, -1, 1, 2]] 

In [86]: mi = pd.MultiIndex(levels, codes)

In [87]: mi  
Out[87]: 
MultiIndex([(  nan,),
            (  nan,),
            (128.0,),
            (  2.0,)],
           )

In [88]: mi.levels 
Out[88]: FrozenList([[nan, 128.0, 2.0]])

In [89]: mi.codes
Out[89]: FrozenList([[-1, -1, 1, 2]])

So it seems this is a general issue with the MultiIndex constructor.
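
A quick way to compare the two construction paths side by side. This is only a sketch: `levels_contain_na` is a hypothetical helper, not a pandas API, and the result for the explicit constructor may depend on the pandas version.

```python
import numpy as np
import pandas as pd


def levels_contain_na(mi):
    """Hypothetical check: does any level of `mi` hold a missing value?"""
    return bool(any(level.isna().any() for level in mi.levels))


# Built from raw values: the missing entries only show up as -1 codes.
from_values = pd.MultiIndex.from_arrays([[np.nan, np.nan, 128, 2]])

# Built from explicit levels and codes, as in the session above.
from_parts = pd.MultiIndex(levels=[[np.nan, 128, 2]], codes=[[0, -1, 1, 2]])

print(levels_contain_na(from_values))
print(levels_contain_na(from_parts))
```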

mvashishtha · 2024-02-29T08:51:44Z

Can we mark this as a bug? It seems that there is agreement that levels should consistently exclude nan-like values.

topper-123 changed the title ~~API: MultiIndex keeps nan-likes in levels~~ API: Remove nan-likes from MultiIndex levels Oct 20, 2019

topper-123 added MultiIndex API Design labels Oct 20, 2019

jorisvandenbossche mentioned this issue Jan 7, 2020

NA is not included in MultiIndex.levels if we construct MI with nan #30750

Open

mroeschke added Enhancement and removed API Design labels Jul 21, 2021

coroa mentioned this issue Jun 25, 2023

isna does not work with explicit MultiIndex nan-representation coroa/pandas-indexing#25

Open
