Using .loc with MultiIndex containing np.nan unexpected behavior #43814

Open
1 of 3 tasks
deponovo opened this issue Sep 30, 2021 · 15 comments

Comments

@deponovo
Contributor

deponovo commented Sep 30, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "temp_playlist": [0, 0, 0, 0],
        "objId": ["o1", np.nan, "o1", np.nan],
        "x": [1, 2, 3, 4],
    }
)

agg_df = df.groupby(by=['temp_playlist', 'objId'], dropna=False)["x"].agg(list)
print(agg_df.loc[agg_df.index[-1]])  # Raises KeyError: the key is (0, np.nan); expected [2, 4]

Issue Description

This issue is a follow-up of the discussion in this SO question.
It appears to be a bug; if it is instead the intended behavior, it should be documented.
As shown in the Reproducible Example, grouping the x data by the temp_playlist and objId columns produces a MultiIndex that contains the entry (0, nan). This entry is meaningful, and I wanted to access its data the same way I can for any other entry of agg_df.index, via agg_df.loc[<index_entry>]. This fails for the entry containing nan (agg_df.loc[agg_df.index[-1]]). However, it does work if that same entry is passed inside a list of index entries. So the behavior is at least inconsistent, if not an outright bug.
For more info, please consult the SO question, especially this answer.
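Until the label lookup is resolved, the NaN group can be reached without passing the (0, nan) tuple to .loc at all. The following is a minimal sketch of two workarounds that are not taken from this thread or the SO answers: positional access with .iloc, and a boolean mask built from the objId level. The variable names last_group, nan_mask and nan_groups are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "temp_playlist": [0, 0, 0, 0],
        "objId": ["o1", np.nan, "o1", np.nan],
        "x": [1, 2, 3, 4],
    }
)
agg_df = df.groupby(by=["temp_playlist", "objId"], dropna=False)["x"].agg(list)

# Workaround 1: positional access skips the label lookup entirely.
# The NaN group is the last row of this result.
last_group = agg_df.iloc[-1]

# Workaround 2: select rows whose objId level is missing with a boolean mask.
nan_mask = agg_df.index.get_level_values("objId").isna()
nan_groups = agg_df[nan_mask]

print(last_group)
print(nan_groups.tolist())
```

The mask-based form does not depend on where the NaN group happens to sit in the result, so it is probably the safer of the two when the grouping keys change.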

Expected Behavior

agg_df.loc[(0, np.nan)] should return [2, 4]

Installed Versions

python 3.8.5, pandas 1.3.1, numpy 1.20.3

@deponovo added the Bug and Needs Triage labels Sep 30, 2021
@phofl
Member

phofl commented Sep 30, 2021

Please provide a minimal reproducible example, see https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@deponovo
Contributor Author

Edited the original report.

@phofl
Member

phofl commented Sep 30, 2021

Are that many rows necessary? Is the groupby necessary to reproduce? An example is minimal if you can't remove anything without causing the bug to disappear.

@deponovo
Contributor Author

deponovo commented Sep 30, 2021

I found this while using groupby. This answer also states that the behavior can only be reproduced on a df obtained from the groupby, not on one created explicitly with the MultiIndex (see the 'Original Attempt to Reproduce Error' part).

Are that many rows necessary?

Maybe not; it is just the smallest example I prepared for this report.
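One way to look into the claim that only the groupby result is affected is to compare the internals of the two indexes. The sketch below prints the levels and codes of the MultiIndex produced by groupby next to one built explicitly with MultiIndex.from_tuples. The names grouped_index and explicit_index are illustrative, and whether the two representations actually differ is an assumption to check on the installed pandas version, so no output is shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "temp_playlist": [0, 0, 0, 0],
        "objId": ["o1", np.nan, "o1", np.nan],
        "x": [1, 2, 3, 4],
    }
)

# Index as produced by the groupby in the report.
grouped_index = (
    df.groupby(by=["temp_playlist", "objId"], dropna=False)["x"].agg(list).index
)

# The same two entries, built explicitly.
explicit_index = pd.MultiIndex.from_tuples(
    [(0, "o1"), (0, np.nan)], names=["temp_playlist", "objId"]
)

# Print how each index stores its values (levels) and positions (codes).
for label, index in [("groupby", grouped_index), ("explicit", explicit_index)]:
    print(label, "levels:", [list(level) for level in index.levels])
    print(label, "codes: ", [list(code) for code in index.codes])
```

If the NaN entry is stored differently in the two indexes, that difference would be a natural place to start looking for the cause of the KeyError.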

@phofl
Member

phofl commented Sep 30, 2021

Please trim your example down then.

I can only speak for myself, but my motivation decreases significantly if the example is unnecessarily complicated and contains a lot of other function calls.

@deponovo
Contributor Author

I updated the original report again.

@phofl
Member

phofl commented Sep 30, 2021

It would be great if you could remove the groupby if it's not necessary. Otherwise this looks good now.

@deponovo
Contributor Author

deponovo commented Sep 30, 2021

The groupby, as mentioned previously, is required.

@phofl
Member

phofl commented Sep 30, 2021

Sorry, I misread your previous answer.

@AlexKirko added the Groupby and MultiIndex labels and removed the Needs Triage label Sep 30, 2021
@AlexKirko
Member

I have checked this on the latest version and get the same KeyError. @deponovo, would you be interested in investigating the cause or contributing a fix?

@phofl
Member

phofl commented Sep 30, 2021

This is a regression caused by #35852

@phofl added the Regression label Sep 30, 2021
@CloseChoice
Member

take

@jreback added this to the 1.3.4 milestone Oct 10, 2021
@simonjayhawkins
Member

This is a regression caused by #35852

#35852 was merged for 1.1.2. AFAICT the same KeyError is raised in 1.1.1. In 1.0.5 and earlier, the code sample gives TypeError: groupby() got an unexpected keyword argument 'dropna'

@phofl
Member

phofl commented Oct 12, 2021

Can't reproduce this either now. I might have focused only on the factorize function.

@simonjayhawkins removed this from the 1.3.4 milestone Oct 13, 2021
@jreback added this to the 1.4 milestone Oct 17, 2021
@simonjayhawkins removed the Regression label Nov 8, 2021
@simonjayhawkins
Member

removing milestone
