keyerror in groupby checking #978

stevenlis · 2022-10-24T02:38:53Z

I'm new to pandera and tried to validate that each ID has exactly one unique sex, but I keep getting a KeyError and can't figure out why.
I thought a check_fn could take a pandas groupby object. Am I missing something here?

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    "id": ["A", "B", "A", "B"],
    "sex": ['F', 'M', 'F', 'M']
})

check_unique_sex = pa.Check(
    lambda g: g['sex'].nunique().max() == 1, groupby=['id']
)

schema = pa.DataFrameSchema({
    "id": pa.Column(str),
    "sex": pa.Column(str, check_unique_sex)
})

schema.validate(df)

SchemaError: Error while executing check function: KeyError("sex")
Traceback (most recent call last):
  File "/opt/miniconda3/envs/desktop/lib/python3.9/site-packages/pandera/schemas.py", line 2053, in validate
    _handle_check_results(
  File "/opt/miniconda3/envs/desktop/lib/python3.9/site-packages/pandera/schemas.py", line 2424, in _handle_check_results
    check_result = check(check_obj, *check_args)
  File "/opt/miniconda3/envs/desktop/lib/python3.9/site-packages/pandera/checks.py", line 407, in __call__
    check_output = check_fn(check_obj)
  File "<ipython-input-18-206422ca87e2>", line 12, in <lambda>
    lambda g: g['sex'].nunique().max() == 1, groupby=['id']
KeyError: 'sex'

What I usually do in pandas:

df.groupby('id').sex.nunique().max() == 1

Answered by cosmicBboy

cosmicBboy (Maintainer) · 2022-10-24T03:30:58Z

Hi @StevenLi-DS! Good question. The Check groupby behavior is a little janky right now; there's an issue for making it more intuitive: #488.

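Basically, with the current behavior, g is a dictionary mapping each key in "id" to a Series of the corresponding "sex" values. In other words, pandera unpacks the SeriesGroupBy object into {"key": pd.Series}. That is why g['sex'] raises a KeyError: it looks up a group key named "sex", and the only keys are "A" and "B". Not great, admittedly 😅 but #488 should make that better.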
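To make that dictionary structure concrete, here is a minimal sketch in plain pandas. It only emulates the {key: Series} mapping described above and is not pandera's internal code; the names groups and missing_key are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "B", "A", "B"],
    "sex": ["F", "M", "F", "M"],
})

# Emulate the described structure: group key -> Series of "sex" values.
# (Illustration only; this is not how pandera builds the mapping internally.)
groups = {key: series for key, series in df.groupby("id")["sex"]}

# Indexing the dict with "sex" looks for a group key called "sex",
# which does not exist, so a KeyError is raised.
try:
    groups["sex"]
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

# The intended check has to iterate over the groups instead.
all_unique = all(series.nunique() == 1 for series in groups.values())
```

Seen this way, the lambda in the question fails because it indexes by column name, while a check written against this mapping has to index by group key.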

In the meantime, here's a solution:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    "id": ["A", "B", "A", "B"],
    "sex": ['F', 'M', 'F', 'M']
})

# so this implements the check that you intended
check_unique_sex = pa.Check(
    # g is a dictionary mapping keys in "id", e.g. "A", "B" to Series objects grouped by "id"
    lambda g: all(g[i].nunique() == 1 for i in g), groupby=['id']
)

schema = pa.DataFrameSchema(
    {
        "id": pa.Column(str),
        "sex": pa.Column(str)
    },
    # but the recommended way is to use a dataframe-level check that pretty much uses the code you use with regular pandas
    checks=pa.Check(lambda df: df.groupby("id").sex.nunique().max() == 1)
)

schema.validate(df)

4 replies

stevenlis Oct 24, 2022
Author

Thanks for the detailed explanation, @cosmicBboy. However, it seems a dataframe-level check prints the entire schema in the SchemaError without even telling me which check failed 😂. Is this the expected behavior for now, or did I miss something?

cosmicBboy Oct 25, 2022
Maintainer

Ah! so you can name your custom checks so you know what error is being raised:

pa.Check(lambda df: df.groupby("id").sex.nunique().max() == 1, name="check_unique_sex_per_id")

Unfortunately, groupby checks currently don't offer more introspection or granular metadata about what exactly went wrong. There's #488, which will add support for an error callback that allows more useful error messages for groupby checks.

stevenlis Oct 25, 2022
Author

Thanks for your reply, man. This seems a bit weird since we are using a lambda/anonymous function here, and then we give it a name 😂. But I got the idea.

I've been playing with pandera for a while to get a sense of what it can do at the moment. I really do believe it could be very useful, but I also find the error messages a bit messy and hard to read in general. I think it would be easier to debug if it could print out a summary like:

Column-level
- ✅ column1: 
- - ✅ dtype: passed
- - ✅ check1: passed
- ✅ column2: 
- - ✅ dtype: passed
- ❌ column3:
- - ✅ dtype: passed
- - ❌ check1: failed
- - ✅ check2: passed
=================
Dataframe-level
- ✅ check1: passed
- ❌ check2: failed

and then you could focus on the places where checks failed and use validate(df) to print out the problematic rows. I also think it might be a good idea to print the results column by column, so that you know when each column's checks have finished, instead of printing everything at once at the end. This could be useful when you have a huge dataframe.

cosmicBboy Oct 26, 2022
Maintainer

> This seems a bit weird since we are using a lambda/anonymous function here, and then we give it a name 😂

Actually, you're naming the pa.Check, not the lambda function... it's either that, or good luck trying to parse the lambda code into a string! If this feels unnatural, you can always write def my_groupby_check(df): ... and pandera will use my_groupby_check as the check name.
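A small sketch of why the explicit name matters: a lambda carries no useful name of its own, while a def function does. The function name check_unique_sex_per_id and the sample frame are illustrative assumptions, and only plain Python and pandas are used here.

```python
import pandas as pd


def check_unique_sex_per_id(df: pd.DataFrame) -> bool:
    """Return True when every id maps to exactly one distinct sex value."""
    return bool(df.groupby("id")["sex"].nunique().max() == 1)


# The same logic written as an anonymous function
anonymous_check = lambda df: bool(df.groupby("id")["sex"].nunique().max() == 1)

# A def function exposes its own name; a lambda only reports "<lambda>"
named_fn_name = check_unique_sex_per_id.__name__
lambda_name = anonymous_check.__name__

df = pd.DataFrame({"id": ["A", "B", "A", "B"], "sex": ["F", "M", "F", "M"]})
result = check_unique_sex_per_id(df)
```

Per the reply above, passing the named function as pa.Check(check_unique_sex_per_id) lets pandera pick up the function name, so the name= argument becomes optional.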

> I really do believe it could be very useful, but I also find the error msg is a bit messy and hard to read in general.

If you have ideas on how to improve the error messages, do feel free to open an issue. We can discuss them there, and perhaps you can eventually make a PR for those improvements!
