pa.Column(drop_invalid_rows=True) has no effect for Pandas DataFrames #1830

Open
3 tasks done
JohannHansing opened this issue Oct 11, 2024 · 1 comment
Labels
bug Something isn't working

Comments

JohannHansing commented Oct 11, 2024

Describe the bug

This is my first bug report in an open source repo, so I apologize in advance if it is not done adequately.

The flag Column(drop_invalid_rows=True) has no effect when validating pandas dataframes. This can readily be observed in the documentation:

https://pandera.readthedocs.io/en/stable/drop_invalid_rows.html

In the example given there, with df = pd.DataFrame({"counter": ["1", "2", "3"]}), no row is actually dropped, even though there should be validation errors, since the values are strings rather than integers.

During debugging, I found that the bug occurs due to the fact that, in pandera/backends/pandas/container.py, the following if-clause on line 118 evaluates to False when drop_invalid_rows is set to True:

        if error_handler.collected_errors:
            if getattr(schema, "drop_invalid_rows", False):
                check_obj = self.drop_invalid_rows(check_obj, error_handler)
                return check_obj
            else:
                raise SchemaErrors(
                    schema=schema,
                    schema_errors=error_handler.schema_errors,
                    data=check_obj,
                )

This in turn seems to be caused by the fact that pandera communicates validation errors internally by collecting exceptions via try-except clauses. When drop_invalid_rows is set to True, no exceptions are raised, which is why bool(error_handler.collected_errors) evaluates to False.

If drop_invalid_rows were not set to True, the validation errors would have raised exceptions in ArraySchemaBackend.validate (pandera/backends/pandas/array.py), which in turn would have been collected by the try-except block in run_schema_component_checks (pandera/backends/pandas/container.py).

To fix this bug, I would humbly suggest refactoring the code so that it does not communicate via try-except statements. Validation errors should be collected into e.g. lists, and these lists should be passed between functions. Exceptions should only be raised if lazy=False and should not be used to pass data between functions.
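
To make the suggestion above concrete, here is a minimal sketch of list-based error collection. It is a toy model in plain Python, not pandera's code: `ValidationFailure`, `ValidationFailed`, `check_column` and `validate` are hypothetical names invented for this illustration, and the data is a plain dict of lists rather than a DataFrame. One assumption worth flagging: in lazy mode the sketch still raises a single exception at the very end when rows are not dropped, since the caller has to learn about the failures somehow; the point is only that no exception is used to carry data between the inner functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ValidationFailure:
    """Hypothetical record of one failed check (not a pandera type)."""
    column: str
    row_index: int
    message: str


class ValidationFailed(Exception):
    """Hypothetical exception, raised only at the outer boundary."""

    def __init__(self, failures: list[ValidationFailure]):
        super().__init__(f"{len(failures)} validation failure(s)")
        self.failures = failures


def check_column(column: str, values: list, predicate: Callable, message: str) -> list[ValidationFailure]:
    """Return failures as data; this function never raises."""
    return [
        ValidationFailure(column, i, message)
        for i, value in enumerate(values)
        if not predicate(value)
    ]


def validate(data: dict, checks: dict, lazy: bool = True, drop_invalid_rows: bool = False) -> dict:
    """Collect failures in a list, then decide what to do with them."""
    failures: list[ValidationFailure] = []
    for column, (predicate, message) in checks.items():
        column_failures = check_column(column, data[column], predicate, message)
        if column_failures and not lazy:
            # Eager mode: stop at the first failure.
            raise ValidationFailed(column_failures[:1])
        failures.extend(column_failures)

    if not failures:
        return data
    if drop_invalid_rows:
        # The list of failures is available regardless of which flag is set.
        bad_rows = {f.row_index for f in failures}
        return {
            col: [v for i, v in enumerate(vals) if i not in bad_rows]
            for col, vals in data.items()
        }
    raise ValidationFailed(failures)


data = {"c": ["this string is too long", "fine"]}
checks = {"c": (lambda s: len(s) <= 5, "longer than 5 characters")}
cleaned = validate(data, checks, lazy=True, drop_invalid_rows=True)
print(cleaned)  # {'c': ['fine']}
```

Because the failures are ordinary return values, the decision to drop rows no longer depends on whether an inner function happened to raise.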

Update: Further debugging on the main branch of pandera led me to realize that the bug does not occur for DataFrameSchema(drop_invalid_rows=True). This is why the unit tests for drop_invalid_rows=True are green in test_schemas: there, drop_invalid_rows is passed as an argument to DataFrameSchema and not to Column.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa

schema_test = pa.DataFrameSchema({"c": pa.Column(str, drop_invalid_rows=True, checks=pa.Check.str_length(max_value=5))})
df_result = schema_test.validate(pd.DataFrame({"c": ["this string is too long", "fine"]}), lazy=True)
>>> print(df_result)
                         c
0  this string is too long
1                     fine

Expected behavior

>>> print(df_result)
                         c
1                     fine
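
For reference, the expected frame can be reproduced with pandas alone, without going through pandera. This is only a sketch of the intended result, using a boolean mask that mirrors the str_length(max_value=5) check from the sample; `is_valid` and `df_expected` are names chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"c": ["this string is too long", "fine"]})

# Same rule as the check in the sample: at most 5 characters.
is_valid = df["c"].str.len() <= 5

# Keep only the valid rows; the original index label (1) is preserved.
df_expected = df.loc[is_valid]
print(df_expected)
```

Filtering this way can also serve as a manual stopgap until the Column-level flag behaves as documented.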

Desktop (please complete the following information):

Windows 10

@JohannHansing JohannHansing added the bug Something isn't working label Oct 11, 2024
@JohannHansing JohannHansing changed the title drop_invalid_rows=True has no effect for Pandas DataFrames pa.Column(drop_invalid_rows=True) has no effect for Pandas DataFrames Oct 13, 2024

JohannHansing commented Oct 14, 2024

Another odd and perhaps related observation:

This produces a red unit test result, since the invalid rows are not dropped:

        (
            DataFrameSchema(
                {
                    "c": Column(int, checks=[Check(lambda x: x >= 3)]),
                },
                drop_invalid_rows=True,
            ),
            pd.DataFrame({"c": [1, 2, 3, 4, 5, 6],}),
            pd.DataFrame({"c": [3, 4, 5, 6]}),
        ),
    ],
)
def test_drop_invalid_for_dataframe_schema(schema, obj, expected_obj):

But the following does not fail, although I only replaced "c" with "numbers":

        (
            DataFrameSchema(
                {
                    "numbers": Column(int, checks=[Check(lambda x: x >= 3)]),
                },
                drop_invalid_rows=True,
            ),
            pd.DataFrame({"numbers": [1, 2, 3, 4, 5, 6],}),
            pd.DataFrame({"numbers": [3, 4, 5, 6]}),
        ),
    ],
)
def test_drop_invalid_for_dataframe_schema(schema, obj, expected_obj):
