Validate function run on DataFrameSchema with lazy=True doesn't report all error schema violations (reopen) #531

Closed
3 tasks done
NathanCastroPacheco opened this issue Jun 28, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@NathanCastroPacheco

NathanCastroPacheco commented Jun 28, 2021

Description:
Not all rows whose value(s) violate a defined schema are reported when schema.validate(df, lazy=True) is run. When consecutive rows contain the same violating value, only the first of those rows is reported in pa.errors.SchemaErrors.failure_cases (reopening of #527).

I apologize for my delayed response to #527 and am opening a new issue in the hope that this bug in a key feature can be addressed soon. Also, I hope you had a good weekend! #528 didn't fix the issue, as can be verified by running the code snippet below.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample

import pandas as pd
import pandera as pa

baseSchema = pa.DataFrameSchema(
    columns={
        "AnalysisPath": pa.Column(pa.String),
        "runID": pa.Column(pa.String),
        "SampleType": pa.Column(pa.String, pa.Check.isin(['DNA', 'RNA'])),
        "SampleValid": pa.Column(pa.String, pa.Check.isin(['Yes', 'No'])),
    },
    strict=False,
    coerce=True
)

df = pd.DataFrame.from_dict({
    'AnalysisPath': ['/', '/', '/', '/', '/'],
    'runID': ['1', '2', '3', '4', '5'],
    'SampleType': ['DNA', 'RNA', 'DNA', 'RNA', 'RNA'],
    # Note: only the first entry of SampleValid satisfies the schema constraints
    'SampleValid': ['Yes', 'YES', 'YES', 'NO', 'NO'],
})
try:
    baseSchema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    # Expected: one failure case for every row whose SampleValid breaks the schema rules (all rows but the first).
    # Actual: only the rows with index 1 and index 3 are reported, i.e. the first rows holding 'YES' and 'NO' respectively.
    print(exc.failure_cases)

Expected behavior

Every schema violation observed in the dataframe should be reported.
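
As a cross-check that does not depend on pandera, the set of rows expected in failure_cases can be computed with plain pandas. This is only a minimal sketch: the names allowed, violations, and expected_failure_index are illustrative and not part of the original report, and only the two columns needed for the check are reproduced.

```python
import pandas as pd

# Same SampleValid data as in the code sample above (other columns omitted).
df = pd.DataFrame({
    "runID": ["1", "2", "3", "4", "5"],
    "SampleValid": ["Yes", "YES", "YES", "NO", "NO"],
})

# Values the schema's isin check accepts for SampleValid.
allowed = ["Yes", "No"]

# Select every row whose SampleValid value is outside the allowed set.
violations = df.loc[~df["SampleValid"].isin(allowed), ["SampleValid"]]
expected_failure_index = violations.index.tolist()

print(expected_failure_index)  # [1, 2, 3, 4]
```

Comparing this list with the index column of exc.failure_cases shows which violating rows are missing from the report.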

Edit: This behavior also occurs when the rows with identical violating values in SampleValid are not consecutive, e.g.:

df = pd.DataFrame.from_dict({
    'AnalysisPath': ['/', '/', '/', '/', '/'],
    'runID': ['1', '2', '3', '4', '5'],
    'SampleType': ['DNA', 'RNA', 'DNA', 'RNA', 'RNA'],
    # Note: only the first entry of SampleValid satisfies the schema constraints
    'SampleValid': ['Yes', 'YES', 'NO', 'YES', 'NO'],
})

Validation reports only the first row for each distinct value in the SampleValid column that violates the constraints.
[Screenshot: Screen Shot 2021-06-28 at 9 11 58 AM]
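
The symptom for the non-consecutive data can be illustrated with plain pandas by comparing all failing rows with the first occurrence of each distinct failing value. This sketch only mimics the reported output; it is not a claim about how pandera builds failure_cases internally, and every variable name in it is hypothetical.

```python
import pandas as pd

# SampleValid column from the non-consecutive example above.
sample_valid = pd.Series(["Yes", "YES", "NO", "YES", "NO"], name="SampleValid")

# All rows that fail the isin(['Yes', 'No']) check.
failing = sample_valid[~sample_valid.isin(["Yes", "No"])]
expected_index = failing.index.tolist()

# Keeping only the first occurrence of each distinct failing value
# reproduces the shape of the output described in this issue.
observed_like_index = failing.drop_duplicates().index.tolist()

# Rows that violate the schema but would be absent from such a report.
missing_index = sorted(set(expected_index) - set(observed_like_index))

print(expected_index)       # [1, 2, 3, 4]
print(observed_like_index)  # [1, 2]
print(missing_index)        # [3, 4]
```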

Desktop (please complete the following information):

  • OS: macOS
  • Browser: Chrome
  • Version: dev branch

Screenshots

**The result of printing pa.errors.SchemaErrors.failure_cases (only two rows are present when four should be):**
[Screenshot: Screen Shot 2021-06-28 at 8 48 38 AM]

@cosmicBboy
Collaborator

fixed by #550
