-
-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement: drop invalid rows on validate with new param #1189
Enhancement: drop invalid rows on validate with new param #1189
Conversation
Signed-off-by: kykyi <[email protected]>
Signed-off-by: kykyi <[email protected]>
Signed-off-by: kykyi <[email protected]>
Signed-off-by: kykyi <[email protected]>
…match Signed-off-by: kykyi <[email protected]>
…lasses and functions Signed-off-by: kykyi <[email protected]>
Signed-off-by: kykyi <[email protected]>
Signed-off-by: kykyi <[email protected]>
Signed-off-by: kykyi <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
environment.yml
Outdated
@@ -47,7 +47,6 @@ dependencies: | |||
|
|||
# testing | |||
- isort >= 5.7.0 | |||
- codecov |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure what this change is, fairly sure it was removed in #1136
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I need to rebase with the main of the non-forked repo!
pandera/api/pandas/drop_invalid.py
Outdated
|
||
|
||
def drop_invalid(validate: Callable) -> Callable: | ||
"""Decorator around `validate()` methods to handle the dropping of invalid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todo: docstring format to be updated.
pandera/api/pandas/drop_invalid.py
Outdated
) | ||
|
||
|
||
def drop_invalid(validate: Callable) -> Callable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cosmicBboy not thrilled about the naming here, but couldn't find a convention to follow. Think it should be something a bit more verbose, but I guess it does what it says!
Rethinking this approach, don't think the decorator is the best idea |
pandera/backends/pandas/container.py
Outdated
] | ||
|
||
# run the checks | ||
error_handler = self.__run_checks_and_handle_errors( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cosmicBboy how do you feel about name mangling? Can just as easily swap to a single _
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's do a little refactoring here:
- if we move the core checks to its own
run_core_checks
method (same forrun_core_parsers
, we clean this up a bit (I'm not familiar with the name mangling, but I don't think it's needed here: exposing a public method for running the core checks makes sense to me. - we don't need the
try... except
here, since we have an error_handler object that we can use to check if there are schema errors. - The
drop_invalid
logic can live underneathif error_handler.collected_errors:
and return the check object. - Probably worth creating a
drop_invalid_data
method to encapsulate the dropping logic. I was also thinking about how to handle metadata-level failures (e.g. wrong data type): should those columns be dropped to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tips! I have updated the PR accordingly.
On the last point though, currently the PR would remove the rows in the check_obj
where the index was a failure cases, such as for the reason of being an incorrect data type. If the whole column was incorrect, then the whole check_obj
would be wiped 😱
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think removing invalid columns is slightly different than dropping invalid rows, and could make more sense to treat them as distinct cases with distinct apis?
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1189 +/- ##
==========================================
+ Coverage 97.23% 97.25% +0.01%
==========================================
Files 65 65
Lines 5066 5101 +35
==========================================
+ Hits 4926 4961 +35
Misses 140 140
☔ View full report in Codecov by Sentry. |
Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
…method Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work! see inline comment for refactoring suggestions
pandera/backends/pandas/container.py
Outdated
] | ||
|
||
# run the checks | ||
error_handler = self.__run_checks_and_handle_errors( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's do a little refactoring here:
- if we move the core checks to its own
run_core_checks
method (same forrun_core_parsers
, we clean this up a bit (I'm not familiar with the name mangling, but I don't think it's needed here: exposing a public method for running the core checks makes sense to me. - we don't need the
try... except
here, since we have an error_handler object that we can use to check if there are schema errors. - The
drop_invalid
logic can live underneathif error_handler.collected_errors:
and return the check object. - Probably worth creating a
drop_invalid_data
method to encapsulate the dropping logic. I was also thinking about how to handle metadata-level failures (e.g. wrong data type): should those columns be dropped to?
Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
This is awesome! Apologies for switching things up a little bit more... bit the more I think about it, the more I think it makes sense to make
Let me know if you need help making these modifications, I'm happy to do them! We'll also need docs for this new feature. |
@cosmicBboy don't be sorry, better to get it right 👌 So just to get the api right, are you imagining something like: schema = DataFrameSchema({"col": Column(int, checks=[Check(lambda x: x >= 3)])}, drop_invalid=True) # include kwarg here
df = pd.DataFrame({"col": [1, 2, 3, 4, 5]})
schema.validate(df) # rather than `.validate(df, drop_invalid=True)`
=> pd.DataFrame({"col": [3, 4, 5]}) Shouldn't be too onerous to change now that I am familiar with the repo 👍 |
yep! that's correct. The changes to the class-based API need to be made too, so the |
Add drop_invalid attr to BaseConfig Signed-off-by: Baden Ashford <[email protected]>
@cosmicBboy I have pushed up the requested changes! Am I correct that the docs should go in |
pandera/backends/pandas/array.py
Outdated
) | ||
|
||
if lazy and error_handler.collected_errors: | ||
if hasattr(schema, "drop_invalid") and schema.drop_invalid: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To guard against a repeat of #1188
Signed-off-by: Baden Ashford <[email protected]>
if hasattr(schema, "drop_invalid") and schema.drop_invalid: | ||
check_obj = self.drop_invalid_data(check_obj, error_handler) | ||
return check_obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird, it seems like the test_drop_invalid_for_dataframe_schema
test should cover this case no? Is this a codecov bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hasattr? That was just some defensive code as there was a recent issue with the new default attr not being available on pickled schemas. Can drop if you think it is overkill
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it's okay to have that in there, it's just that codecov is complaining that this part of the code wasn't executed, meaning that lines 113 weren't executed during CI. But it seems like test_drop_invalid_for_dataframe_schema
should have caused this code to run no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah no it doesn't appear so locally! Codecov saving us 😆
This was my bad. So this code in this file:
if is_table(check_obj[column_name]):
for i in range(check_obj[column_name].shape[1]):
validate_column(
check_obj[column_name].iloc[:, [i]], column_name
)
else:
if hasattr(schema, "drop_invalid") and schema.drop_invalid:
# replace the check_obj with the validated check_obj
check_obj = validate_column(
check_obj, column_name, return_check_obj=True
)
else:
validate_column(check_obj, column_name)
if lazy and error_handler.collected_errors:
if hasattr(schema, "drop_invalid") and schema.drop_invalid: # the line in question!
check_obj = self.drop_invalid_data(check_obj, error_handler)
return check_obj
else:
raise SchemaErrors(
schema=schema,
schema_errors=error_handler.collected_errors,
data=check_obj,
)
Raises an error on validate_column
but does not rescue it. The rescue is done higher up in ArraySchemaBackend.validate()
. This is now obvious as there is no except
between validate_column
and if lazy and error_handler.collected_errors
!
I've removed these lines and pushed the changes👌
Signed-off-by: Baden Ashford <[email protected]>
The PR is looking like it's in great shape! Two more things: How to handle invalid columnsBased on the current implementation of drop_invalid_data it looks like it's only considering data-level failure cases (which are associated with an
I'm leaning towards (c) because it's the easiest to reason about, and it also means no further code changes need to be made to this PR :) DocsWith this feature I think it's worth creating a new docs page ( Again, thanks so much for your work on this, it's a huge feature! |
Hi @cosmicBboy I have added some draft docs 📝 . I also noticed I hadn't properly implemented the drop logic for the Model based schema so have updated that too. Re scope of "drop invalid", I think your option c) sounds reasonable. Dropping columns seems like a different operation. As a Pandora user it could be kind of confusing for drop_invalid to remove columns as well as rows, as this could reduce your df to So if we are sticking with the implemented logic, I think the PR is good for a review ✅ 🙇 ! |
Also @cosmicBboy as a follow up to this PR, I'm thinking of adding a |
Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
Thanks! It looks like there are some rst formatting issues: https://github.com/unionai-oss/pandera/actions/runs/5221954728/jobs/9426870512?pr=1189#step:24:568 You can test the build out locally with pip install requirements-docs.txt
make docs |
What would that field do? In any case it should be a separate PR. |
@cosmicBboy idk if this is an ongoing issue with the repo or just me, but I get this dependency hell when I run
If I just manually install I then get:
😅 |
hmm, okay lemme see if I can debug... yeah I needa fix the pip environment for docs, I use conda and need to maintain both 😅 |
while I'm at it, mind if I update the option to |
Yeah change whatever you like! Maybe rows over row? |
@cosmicBboy are you still making changes or any action needed from me here? |
Signed-off-by: Niels Bantilan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the changes @kykyi, merging!
In response to issue: #44
Overview
Currently a user must implement
try/except
logic in order to drop invalid rows in a df or series being validated. This PR adds a newdrop_invalid
kwarg to the schema constructor to allow for invalid rows to be dropped on df validation.An important note is that for the drop_invalid logic to work, lazy must be true! I don't know if this a good thing or a bad thing, but it's a thing. Otherwise we would need to be more specific and probably handle this logic at the point of raising or not raising the
SchemaError
, rather than at the point of raising or not raising theSchemaError
s (plural) as this PR does 🤖How it looks
Before, some version of:
Now:
To-do before merge: