keyerror in groupby checking #978
-
I'm new to pandera and tried to validate that each ID has exactly one unique sex, but I keep getting a KeyError and can't figure out why.

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    "id": ["A", "B", "A", "B"],
    "sex": ['F', 'M', 'F', 'M']
})
check_unique_sex = pa.Check(
    lambda g: g['sex'].nunique().max() == 1, groupby=['id']
)
schema = pa.DataFrameSchema({
    "id": pa.Column(str),
    "sex": pa.Column(str, check_unique_sex)
})
schema.validate(df)  # this call fails with the KeyError
What I usually do in pandas:
Replies: 1 comment 4 replies
-
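Hi @StevenLi-DS! Good question. The Check groupby behavior is a little janky right now; there's an issue for making it more intuitive: #488.

Basically, with the current behavior, the argument g is a dictionary mapping keys from "id" to Series objects of "sex", in effect unpacking the SeriesGroupBy object into {"key": pd.Series}. Not great, admittedly 😅 but #488 should make that better.

In the meantime, here's a solution:

import pandas as pd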
import pandera as pa

df = pd.DataFrame({
    "id": ["A", "B", "A", "B"],
    "sex": ['F', 'M', 'F', 'M']
})
# so this implements the check that you intended
check_unique_sex = pa.Check(
    # g is a dictionary mapping keys in "id", e.g. "A", "B" to Series objects grouped by "id"
    lambda g: all(g[i].nunique() == 1 for i in g), groupby=['id']
)
schema = pa.DataFrameSchema(
    {
        "id": pa.Column(str),
        "sex": pa.Column(str)
    },
    # but the recommended way is to use a dataframe-level check that pretty much uses the code you use with regular pandas
    checks=pa.Check(lambda df: df.groupby("id").sex.nunique().max() == 1)
)
schema.validate(df)
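
To see why g['sex'] fails in the original check, here is a rough sketch that rebuilds the same kind of dictionary with plain pandas. It only approximates what the reply describes (keys from "id" mapped to "sex" Series) and does not call pandera itself; the names groups, has_sex_key, all_unique and max_unique are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "B", "A", "B"],
    "sex": ["F", "M", "F", "M"],
})

# Approximation of what the check function receives as g:
# one "sex" Series per distinct value of "id".
groups = {key: series for key, series in df.groupby("id")["sex"]}

# The keys are the group labels from "id", not column names,
# so looking up "sex" in this dictionary cannot succeed.
has_sex_key = "sex" in groups

# Per-group version of the check: every group has exactly one distinct value.
all_unique = all(series.nunique() == 1 for series in groups.values())

# Dataframe-level version, the same expression used with regular pandas.
max_unique = df.groupby("id")["sex"].nunique().max()

print(sorted(groups), has_sex_key, all_unique, max_unique)
```

Both versions agree on this data; the dataframe-level form is shorter and avoids depending on the shape of g.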