Checking bad datatypes and invalid values with a single schema. #619
-
I'm trying to check that the values in my dataframe are either int or float and that they are larger than 0. If not, I want to find out which rows are invalid so I can count and remove them. I can't, however, figure out how to do that using a single schema. If I use a dataframe and a schema like this:

```python
bad_df = pd.DataFrame({
    "integer_col_1": [1, "xxx", 3, -4, "aaa"],
    "integer_col_2": [42, 56, "yyy", 12, 56],
    "float_col_1": [12.45, 78.11, "zzz", 11.1, -145.1],
})

schema = pa.DataFrameSchema(
    {
        "integer_col_1": pa.Column(int, pa.Check.greater_than(0)),
        "integer_col_2": pa.Column(int, pa.Check.greater_than(0)),
        "float_col_1": pa.Column(float, pa.Check.greater_than(0)),
    },
    coerce=True,
)
```

and then use the failure cases from the `SchemaErrors` exception to find the invalid rows:

```python
try:
    schema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
```

I get failure cases only for the invalid datatypes.
So I can extract which rows have invalid types, but not invalid values. I figured what's happening is that validation tries to coerce the columns, is unsuccessful, and then checks the columns as-is, which results in a TypeError instead of a boolean series. Is there a way to get the desired behaviour, or do I have to first drop the invalid types and validate again?
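The TypeError the question describes can be reproduced with plain pandas, independent of pandera. The snippet below is a minimal illustration of the failure mode: a vectorized comparison on a mixed-type object column raises as soon as it hits a string, while an element-wise predicate can trap the error per value.

```python
import pandas as pd

s = pd.Series([1, "xxx", 3, -4])  # object dtype, mixed int/str

# The vectorized comparison fails on the first string it encounters:
try:
    s > 0
except TypeError as exc:
    print("vectorized comparison raised:", exc)

# An element-wise version can trap the error per value instead:
def greater_than_zero(x):
    try:
        return x > 0
    except TypeError:
        return False

print(s.map(greater_than_zero).tolist())  # [True, False, True, False]
```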
-
hey @JiriFranek92 good question!

The reason you're getting the `TypeError` is that the built-in checks use vectorized implementations, so `Check.greater_than(0)` is basically `series > 0`. This means that if one of the values in a series does not support the vectorized operation, it'll raise that `TypeError`; `pandera` only reports the runtime errors raised by `pandas`. `pandera` does this to take advantage of the speed gains from the native pandas vectorized operations (the limitation being that it doesn't outright support your desired behavior).

To get the desired behavior you'll have to use `element_wise` checks and explicitly handle the `TypeError` to return a `False` value:

```python
import pandas as pd
import pandera as pa

bad_df = pd.DataFrame({
    "integer_col_1": [1, "xxx", 3, -4, "aaa"],
    "integer_col_2": [42, 56, "yyy", 12, 56],
    "float_col_1": [12.45, 78.11, "zzz", 11.1, -145.1],
})

def greater_than_zero(x):
    try:
        return x > 0
    except TypeError:
        return False

check_gt_zero = pa.Check(greater_than_zero, element_wise=True)

schema = pa.DataFrameSchema(
    {
        "integer_col_1": pa.Column(int, check_gt_zero),
        "integer_col_2": pa.Column(int, check_gt_zero),
        "float_col_1": pa.Column(float, check_gt_zero),
    },
    coerce=True,
)

try:
    schema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
```

The failure cases output now covers both the uncoercible values and the values failing the greater-than-zero check.
This makes me think of introducing an `on_error` option for the built-in checks, something like:

```python
# NOT WORKING CODE
pa.Check.greater_than(0, element_wise=True, on_error="false")
```

If this use case comes up several more times (or +1s on this discussion) we can write out an issue to open up for contributions from the community!
-
Thanks for the quick reply. Yes, the `element_wise` workaround does what I need. Anyway, this is for my pet/self-learning project, so I don't know if it's really that important a use case.