Validate function run on DataFrameSchema with lazy=True doesn't report all error schema violations (reopen) #531

NathanCastroPacheco · 2021-06-28T15:49:35Z

Description:
Not all rows that violate a defined schema are reported when schema.validate(df, lazy=True) is run. When consecutive rows contain the same violating value, only the first of those rows is reported in pa.errors.SchemaErrors.failure_cases (this reopens #527).

I apologize for my delayed response on #527. I am opening a new issue in the hope that this bug in a key feature can be addressed soon. Also, I hope you had a good weekend! #528 didn't fix the issue, as can be verified by running the code snippet below.

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample

import pandas as pd
import pandera as pa

baseSchema = pa.DataFrameSchema(
    columns={
        "AnalysisPath": pa.Column(pa.String),
        "runID": pa.Column(pa.String),
        "SampleType": pa.Column(pa.String, pa.Check.isin(['DNA', 'RNA'])),
        "SampleValid": pa.Column(pa.String, pa.Check.isin(['Yes', 'No'])),
    },
    strict=False,
    coerce=True
)

df = pd.DataFrame.from_dict({'AnalysisPath': ['/', '/', '/', '/', '/'],
                            'runID': ['1', '2', '3', '4', '5'],
                            'SampleType': ['DNA', 'RNA', 'DNA', 'RNA', 'RNA'],
                            # Notice that only the first entry for SampleValid adheres to the defined schema constraints
                            'SampleValid': ['Yes', 'YES', 'YES', 'NO', 'NO']})
try:
    baseSchema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    # Should contain one row for every row in df where SampleValid violates the schema (all rows but the first).
    # Instead it only contains the rows with index 1 and index 3: the first rows with the SampleValid values 'YES' and 'NO', respectively.
    print(exc.failure_cases)

Expected behavior

To report all schema violations observed in the dataframe
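As a rough illustration of the expected result, the sketch below computes the violating rows with plain pandas, without going through pandera at all. The names `allowed`, `violations`, and `expected_failures` are made up for this example; it only shows which row indices a complete report should cover for the data in the code sample.

```python
import pandas as pd

# Same SampleValid values as in the code sample above.
df = pd.DataFrame({"SampleValid": ["Yes", "YES", "YES", "NO", "NO"]})

# Values permitted by the isin check on SampleValid.
allowed = ["Yes", "No"]

# Boolean mask marking rows whose value is outside the allowed set.
violations = ~df["SampleValid"].isin(allowed)

# Every violating row, repeated values included.
expected_failures = df.loc[violations, "SampleValid"]
expected_index = expected_failures.index.tolist()

print(expected_failures)
```

Under these assumptions, all rows except the first are selected, which is what the failure report is expected to list.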

Edit: This behavior occurs even if the rows with identical violating values in SampleValid are not consecutive, e.g.:

df = pd.DataFrame.from_dict({'AnalysisPath': ['/', '/', '/', '/', '/'],
                            'runID': ['1', '2', '3', '4', '5'],
                            'SampleType': ['DNA', 'RNA', 'DNA', 'RNA', 'RNA'],
                            # Notice that only the first entry for SampleValid adheres to the defined schema constraints
                            'SampleValid': ['Yes', 'YES', 'NO', 'YES', 'NO']})

Validation reports only the first row for each distinct value in the SampleValid column that violates the constraints.
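The symptom is consistent with failure cases being de-duplicated by value somewhere before they are reported, which the title of the linked PR #535 ("don't drop duplicates for series failure cases") also hints at. The sketch below is not pandera's code; it is a plain pandas illustration, with made-up variable names, of how a `drop_duplicates()` call on the failing values would keep only the first row per distinct value.

```python
import pandas as pd

allowed = ["Yes", "No"]

# SampleValid values from the non-consecutive example above.
values = pd.Series(["Yes", "YES", "NO", "YES", "NO"], name="SampleValid")

# All rows that violate the isin constraint.
failing = values[~values.isin(allowed)]

# Hypothetical de-duplication step (an assumption for illustration):
# drop_duplicates() keeps only the first occurrence of each value.
deduplicated = failing.drop_duplicates()

all_index = failing.index.tolist()
reported_index = deduplicated.index.tolist()

print("all violating rows:", all_index)
print("rows left after de-duplication:", reported_index)
```

In this sketch the de-duplicated result keeps one row for 'YES' and one for 'NO', matching the pattern described in the report, while the full set of violating rows is larger.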

Desktop (please complete the following information):

OS: macOS
Browser: Chrome
Version: dev branch

Screenshots

**The result of printing pa.errors.SchemaErrors.failure_cases (only two rows are present when four should be):**

cosmicBboy · 2021-07-12T13:06:50Z

fixed by #550

NathanCastroPacheco added the bug label Jun 28, 2021

cosmicBboy mentioned this issue Jun 29, 2021

bugfix: don't drop duplicates for series failure cases #535

Merged

cosmicBboy mentioned this issue Jul 12, 2021

bugfixes: lazy validation, strategies #550

Merged

cosmicBboy closed this as completed Jul 12, 2021
