Checking bad datatypes and invalid values with a single schema. #619
-
I'm trying to check that the values in my dataframe are either int or float and that they are larger than 0. If not, I want to find out which rows are invalid so I can count and remove them. I can't, however, figure out how to do that using a single schema. If I use a dataframe and a schema like this:

```python
bad_df = pd.DataFrame({
    "integer_col_1": [1, "xxx", 3, -4, "aaa"],
    "integer_col_2": [42, 56, "yyy", 12, 56],
    "float_col_1": [12.45, 78.11, "zzz", 11.1, -145.1],
})

schema = pa.DataFrameSchema(
    {
        "integer_col_1": pa.Column(int, pa.Check.greater_than(0)),
        "integer_col_2": pa.Column(int, pa.Check.greater_than(0)),
        "float_col_1": pa.Column(float, pa.Check.greater_than(0)),
    },
    coerce=True,
)
```

and then use the failure cases from the `SchemaErrors` exception to find the invalid rows:

```python
try:
    schema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
```

I get failure cases only for the invalid datatypes.
So I can extract which rows have invalid types, but not invalid values. I figured what's happening is that validation tries to coerce the columns, is unsuccessful, and then checks the columns as-is, which results in a TypeError instead of a boolean series. Is there a way to get the desired behaviour, or do I have to first drop the invalid types and validate again?
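The TypeError the question describes can be reproduced with plain pandas, independent of pandera. The snippet below is a minimal illustration of the failure mode: a vectorized comparison on a mixed-type object column raises as soon as it hits a string, while an element-wise predicate can trap the error per value.

```python
import pandas as pd

s = pd.Series([1, "xxx", 3, -4])  # object dtype, mixed int/str

# The vectorized comparison fails on the first string it encounters:
try:
    s > 0
except TypeError as exc:
    print("vectorized comparison raised:", exc)

# An element-wise version can trap the error per value instead:
def greater_than_zero(x):
    try:
        return x > 0
    except TypeError:
        return False

print(s.map(greater_than_zero).tolist())  # [True, False, True, False]
```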
-
hey @JiriFranek92 good question!

The reason you're getting the `TypeError` is that the built-in checks use vectorized implementations, so `Check.greater_than(0)` is basically `series > 0`. This means that if one of the values in a series does not support the vectorized operation, it'll raise that `TypeError`; `pandera` only reports the runtime errors raised by `pandas`. `pandera` does this to take advantage of the speed gains from the native pandas vectorized operations (the limitation being that it doesn't outright support your desired behavior).

To get the desired behavior you'll have to use `element_wise` checks and explicitly handle the `TypeError` to return a `False` value:

```python
import pandas as pd
import pandera as pa

bad_df = pd.DataFrame({
    "integer_col_1": [1, "xxx", 3, -4, "aaa"],
    "integer_col_2": [42, 56, "yyy", 12, 56],
    "float_col_1": [12.45, 78.11, "zzz", 11.1, -145.1],
})

def greater_than_zero(x):
    try:
        return x > 0
    except TypeError:
        return False

check_gt_zero = pa.Check(greater_than_zero, element_wise=True)

schema = pa.DataFrameSchema(
    {
        "integer_col_1": pa.Column(int, check_gt_zero),
        "integer_col_2": pa.Column(int, check_gt_zero),
        "float_col_1": pa.Column(float, check_gt_zero),
    },
    coerce=True,
)

try:
    schema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
```

The failure cases output now covers both the uncoercible values and the values failing the greater-than-zero check.
This makes me think of introducing an `on_error` option for the built-in checks, something like:

```python
# NOT WORKING CODE
pa.Check.greater_than(0, element_wise=True, on_error="false")
```

If this use case comes up several more times (or +1s on this discussion) we can write out an issue to open up for contributions from the community!
-
Thanks for the quick reply. Yes, the `element_wise` workaround does what I need. Anyway, this is for my pet/self-learning project, so I don't know if it's really that important a use case.