Description:
Not all rows whose values violate a defined schema are reported when schema.validate(df, lazy=True) is run. When the value causing the violation is identical in consecutive rows, only the first of those rows is reported in pa.errors.SchemaErrors.failure_cases (reopening of #527).
I apologize for my delayed response for #527 and am opening a new issue in hopes that you can address this bug in a key feature soon. Also I hope you had a good weekend! #528 didn't fix the issue as can be verified by running the code snippet below.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the master branch of pandera.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample
import pandas as pd
import pandera as pa

baseSchema = pa.DataFrameSchema(
    columns={
        "AnalysisPath": pa.Column(pa.String),
        "runID": pa.Column(pa.String),
        "SampleType": pa.Column(pa.String, pa.Check.isin(['DNA', 'RNA'])),
        "SampleValid": pa.Column(pa.String, pa.Check.isin(['Yes', 'No'])),
    },
    strict=False,
    coerce=True
)

df = pd.DataFrame.from_dict({'AnalysisPath': ['/', '/', '/', '/', '/'],
                             'runID': ['1', '2', '3', '4', '5'],
                             'SampleType': ['DNA', 'RNA', 'DNA', 'RNA', 'RNA'],
                             # Notice only the first entry for SampleValid adheres to the defined schema constraints
                             'SampleValid': ['Yes', 'YES', 'YES', 'NO', 'NO']})

try:
    baseSchema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    # Should contain a row for every row in df where SampleValid does not meet schema rules (all rows but the first).
    # Instead only returns violation rows with index 1 and index 3: the two rows with the first SampleValid values of 'YES' and 'NO' respectively.
    print(exc.failure_cases)
Expected behavior
To report all schema violations observed in the dataframe
Edit: This behavior occurs even if the rows with identical values in SampleValid violating the constraints are not consecutive:
For example:
df = pd.DataFrame.from_dict({'AnalysisPath': ['/', '/', '/', '/', '/'],
                             'runID': ['1', '2', '3', '4', '5'],
                             'SampleType': ['DNA', 'RNA', 'DNA', 'RNA', 'RNA'],
                             # Notice only the first entry for SampleValid adheres to the defined schema constraints
                             'SampleValid': ['Yes', 'YES', 'NO', 'YES', 'NO']})
The validation only reports the first row with a given value in the SampleValid column that violates constraints
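For reference, the rows that should appear in failure_cases can be worked out with plain pandas, independently of pandera. This is only a minimal sketch of the expected result, not pandera's implementation: the helper name `expected_failures` and the constant `ALLOWED` are hypothetical, and the allowed values are copied from the `pa.Check.isin(['Yes', 'No'])` check in the schema above.

```python
import pandas as pd

# Assumption: mirrors pa.Check.isin(['Yes', 'No']) on the SampleValid column.
ALLOWED = ["Yes", "No"]


def expected_failures(values, allowed=ALLOWED):
    """Return the values not in `allowed`, keyed by their row index."""
    series = pd.Series(values, dtype="object")
    return series[~series.isin(allowed)]


# Identical violating values in consecutive rows (first sample).
consecutive = expected_failures(['Yes', 'YES', 'YES', 'NO', 'NO'])
# Identical violating values in non-consecutive rows (second sample).
interleaved = expected_failures(['Yes', 'YES', 'NO', 'YES', 'NO'])

print(consecutive.index.tolist())  # [1, 2, 3, 4]
print(interleaved.index.tolist())  # [1, 2, 3, 4]
```

In both samples, four rows (index 1 through 4) violate the check, so four rows would be expected in failure_cases rather than two.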
Desktop (please complete the following information):
OS: macOS
Browser: Chrome
Version: dev branch
Screenshots
**The result of printing pa.errors.SchemaErrors.failure_cases (only two rows are present when four should be):**