In the example given in the pandera documentation on drop_invalid_rows, with df = pd.DataFrame({"counter": ["1", "2", "3"]}), no row is actually dropped, even though there should be validation errors, since the values are strings rather than integers.
During debugging, I found that the bug occurs because, in pandera/backends/pandas/container.py, the following if-clause on line 118 evaluates to False if drop_invalid_rows is set to True:
if error_handler.collected_errors:
    if getattr(schema, "drop_invalid_rows", False):
        check_obj = self.drop_invalid_rows(check_obj, error_handler)
        return check_obj
    else:
        raise SchemaErrors(
            schema=schema,
            schema_errors=error_handler.schema_errors,
            data=check_obj,
        )
This in turn seems to be caused by the fact that pandera internally communicates validation errors by collecting exceptions via try-except clauses. When drop_invalid_rows is set to True, however, no exceptions are raised, which is why bool(error_handler.collected_errors) evaluates to False.
If drop_invalid_rows were not set to True, the validation errors would have raised exceptions in ArraySchemaBackend.validate in pandera/backends/pandas/array.py, and these would in turn have been collected by the try-except block in run_schema_component_checks in pandera/backends/pandas/container.py.
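The suspected mechanism can be illustrated with a toy model. This is not pandera's code: check_column, run_component_checks and ValidationFailure are hypothetical names chosen only to mirror the control flow described above, under the assumption that the column-level check stays silent when its own drop flag is set.

```python
class ValidationFailure(Exception):
    """Hypothetical stand-in for a schema error raised by a column check."""


def check_column(values, max_len, drop_invalid_rows=False):
    """Toy column check: raise on failure unless the column's drop flag is set."""
    bad_rows = [i for i, v in enumerate(values) if len(v) > max_len]
    if bad_rows and not drop_invalid_rows:
        raise ValidationFailure(f"rows {bad_rows} exceed max length {max_len}")
    # Assumed behaviour: with the flag set, nothing is raised here.
    return values


def run_component_checks(values, max_len, drop_invalid_rows=False):
    """Toy container step: it learns about failures only by catching exceptions."""
    collected_errors = []
    try:
        check_column(values, max_len, drop_invalid_rows)
    except ValidationFailure as exc:
        collected_errors.append(exc)
    return collected_errors


data = ["this string is too long", "fine"]
errors_without_flag = run_component_checks(data, max_len=5)
errors_with_flag = run_component_checks(data, max_len=5, drop_invalid_rows=True)

print(bool(errors_without_flag))  # True: the exception was caught and collected
print(bool(errors_with_flag))     # False: nothing was raised, so nothing was collected
```

In this model the container never gets the chance to drop anything, because the only signal it listens for (an exception) is suppressed by the same flag that asks for rows to be dropped.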
To fix this bug, I would humbly suggest considering refactoring the code so that it does not communicate via try-except statements. Validation errors should be collected into e.g. lists and these lists passed between functions. Exceptions should only be raised if lazy=False and not be used to pass data between functions.
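A minimal sketch of what such a design could look like, in plain Python rather than pandera's actual internals. All names here (Failure, check_max_length, validate, ValidationFailed) are hypothetical, and raising only at the top level for both lazy modes is one possible reading of the suggestion, not an established pandera behaviour.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """Hypothetical record describing one failed row."""
    row: int
    message: str


class ValidationFailed(Exception):
    """Hypothetical exception raised only at the top level."""

    def __init__(self, failures):
        super().__init__(f"{len(failures)} row(s) failed validation")
        self.failures = failures


def check_max_length(values, max_len):
    """Return a list of failures instead of raising an exception."""
    return [
        Failure(row=i, message=f"length {len(v)} > {max_len}")
        for i, v in enumerate(values)
        if len(v) > max_len
    ]


def validate(values, max_len, lazy=True, drop_invalid_rows=False):
    """Top-level entry point: the only place that decides to drop or raise."""
    failures = check_max_length(values, max_len)
    if not failures:
        return values
    if drop_invalid_rows:
        bad_rows = {f.row for f in failures}
        return [v for i, v in enumerate(values) if i not in bad_rows]
    # lazy=True reports every failure, lazy=False only the first one.
    raise ValidationFailed(failures if lazy else failures[:1])


result = validate(["this string is too long", "fine"], max_len=5, drop_invalid_rows=True)
print(result)  # ['fine']
```

Because the failures travel as return values, the drop flag cannot hide them from the caller: the list is always populated first, and the decision to drop or raise is made afterwards.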
Update: Further debugging on the main branch of pandera led me to realize that the bug does not occur for DataFrameSchema(drop_invalid_rows=True). This is why the unit tests for drop_invalid_rows=True are green in test_schemas: there, drop_invalid_rows is passed as an argument to DataFrameSchema and not to Column.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
import pandas as pd
import pandera as pa
schema_test = pa.DataFrameSchema({"c": pa.Column(str, drop_invalid_rows=True, checks=pa.Check.str_length(max_value=5))})
df_result = schema_test.validate(pd.DataFrame({"c": ["this string is too long", "fine"]}), lazy=True)
>>> print(df_result)
c
0 this string is too long
1 fine
Expected behavior
>>> print(df_result)
c
1 fine
Desktop (please complete the following information):
Windows 10
JohannHansing changed the title from "drop_invalid_rows=True has no effect for Pandas DataFrames" to "pa.Column(drop_invalid_rows=True) has no effect for Pandas DataFrames" on Oct 13, 2024
Describe the bug
This is my first bug report in an open source repo, so I apologize in advance if it's not done adequately.
The flag Column(drop_invalid_rows=True) has no effect when validating pandas dataframes. This can readily be observed in the documentation: https://pandera.readthedocs.io/en/stable/drop_invalid_rows.html