Allow fallback coercion function as column/field argument #1082
Comments
@a-recknagel thanks for the input... development on a parser was only stalling because of #913, but now that it's merged we can finally add a parser! While the A parser would pretty much behave as you intend with the
flowchart LR
Coerce --> Check
With the parser, you'll have
flowchart LR
Coerce --> Parse --> Check
Here's a quick sketch of what the user-facing API might look like:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"foo": pa.Column(
    int,
    # operate on the full series
    pa.Parser(lambda s: s.mask(s.isna(), 0)),
    # check or list of checks, as usual
    pa.Check.ge(0),
    coerce=True,
)})
df = pd.DataFrame({"foo": [7.7, None]})
Similar to the
# only operate on the coercion failures
pa.Parser(lambda s: ..., coercion_failures_only=True)
# element-wise transform
pa.Parser(lambda x: ..., element_wise=True)
# element-wise transform, only operate on the coercion failures
pa.Parser(lambda x: ..., element_wise=True, coercion_failures_only=True)
I think this would be a medium lift (2-3 weeks of free-time work) if this is scoped down to only work in columns/indexes for now, tho I could see it being applied at the dataframe-level too. |
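For readers who want to see the Coerce --> Parse --> Check ordering in motion, here is a minimal sketch in plain pandas. It does not use pandera; the functions `coerce`, `parse`, and `check` are hypothetical stand-ins for the three stages, and `pd.to_numeric(..., errors="coerce")` is only an assumed approximation of dtype coercion.

```python
import pandas as pd


def coerce(s: pd.Series) -> pd.Series:
    # Stand-in for dtype coercion: values that cannot be converted become NaN.
    return pd.to_numeric(s, errors="coerce")


def parse(s: pd.Series) -> pd.Series:
    # Stand-in for a user-defined parser that operates on the full series.
    return s.mask(s.isna(), 0)


def check(s: pd.Series) -> bool:
    # Stand-in for a check such as "all values are >= 0".
    return bool((s >= 0).all())


raw = pd.Series([7.7, None, "3"], dtype=object)
coerced = coerce(raw)      # None becomes NaN, "3" becomes 3.0
parsed = parse(coerced)    # NaN is replaced by 0
is_valid = check(parsed)   # the check sees only coerced and parsed data
```

Without the parse step the same check fails, because a NaN never compares as greater than or equal to zero. That is the gap the extra stage is meant to close.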
Sounds good, I'll close this issue then. Thanks for the in-depth answer. Just for curiosity's sake, can you give some input on these, too?
|
I actually go back and forth on this. In one sense, coercion IS parsing because coercion is fundamentally about transforming some data into a form that most appropriately matches the use case. On the other hand, I think it also makes sense to think of datatype coercion as a first-class step in the validation process, separate from user-defined parsers, which can further transform the coerced data (or failed coercion cases).
Good point! I think one of the limitations of
pa.Parse.clip_lower(0)  # convert all negative --> pa.Check.ge(0)
pa.Parse.fillna(-1)  # convert nulls to -1 --> pa.Column(..., not_nullable=True)
Re: adding user control over parsing with |
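The pairing of a parser with the check it guarantees can be illustrated with existing pandas methods. This is only a sketch: `Series.clip(lower=0)` and `Series.fillna(-1)` are assumed here to be the rough equivalents of the proposed `pa.Parse.clip_lower(0)` and `pa.Parse.fillna(-1)`, which are names from the proposal and not calls used in this example.

```python
import pandas as pd

s = pd.Series([-2.0, 5.0, None])

# Clipping at zero means every non-null value satisfies a "ge(0)" style check.
# Nulls pass through clip unchanged, so they are dropped before comparing.
clipped = s.clip(lower=0)
ge_zero_holds = bool((clipped.dropna() >= 0).all())

# Filling nulls means the column satisfies a "not nullable" constraint.
filled = s.fillna(-1)
not_null_holds = bool(filled.notna().all())
```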
Thank you so much for this brilliant library! Did this ever get written? Thanks! |
It's almost there! I'm gonna punt on writing the docs for you to extend the pandera API so I can cut a |
okay, so i'm going to update the proposal on #252 to get the gears turning. @a-recknagel I'm going to be focusing on the ibis integration #1105 next (to support in-DB validation, but also as a way of testing out how the new rewrite abstractions work for new data containers). Would you be down to contribute an MVP for the parsing feature? I'm hoping to write out the proposal (going in the same direction as #1082 (comment)) in the next week or two. |
I was thinking about proposing that. I'll try, thanks for the trust. |
My problem
I'd like to give a column/series the option to transform values that it otherwise can't handle. I use coerce=True extensively, and try to get rid of all pandas calls that "don't really do anything", meaning they don't change the information in the DataFrame and only its form. Those operations feel to me like they'd be better represented declaratively in the schema.
It could also provide a convenient workaround for requested features like #502 or bugs like #1037.
This request is effectively a lightweight variant of some of the options listed in #252, which seems to be stalling for now due to the complexity of the issue.
Possible solution
A new keyword in pa.Column and pa.Field that accepts a function:
Running a Python function with apply introduces performance issues, which is why I'm proposing it as fallback-only here. Thus, it wouldn't need to be run on the complete series, only on the failure_cases of a SchemaError, after the DType's better-performing coerce methods had their chance. By constraining it like that, I hope to get something soon-ish that could be deprecated once a proper parsing concept has been sussed out.
Context
validate, I don't know what to do about that or how to warn them that they're doing something strange
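The fallback-only idea from "Possible solution" can be sketched in plain pandas as follows. This is an illustration under stated assumptions, not pandera's implementation: `coerce_with_fallback` and `comma_decimal` are hypothetical names, and `pd.to_numeric(..., errors="coerce")` stands in for the DType's vectorized coercion.

```python
from typing import Any, Callable

import pandas as pd


def coerce_with_fallback(s: pd.Series, fallback: Callable[[Any], float]) -> pd.Series:
    # Fast, vectorized attempt first; values that fail come back as NaN.
    coerced = pd.to_numeric(s, errors="coerce")
    # A failure is a value that was present before coercion but is NaN after it.
    failed = coerced.isna() & s.notna()
    # Run the slow Python function only on the failed elements.
    coerced.loc[failed] = s.loc[failed].map(fallback)
    return coerced


def comma_decimal(value: Any) -> float:
    # Hypothetical fallback: accept a decimal comma such as "2,5".
    return float(str(value).replace(",", "."))


raw = pd.Series(["1.5", "2,5", None, "4"], dtype=object)
result = coerce_with_fallback(raw, comma_decimal)
```

Only `"2,5"` reaches the Python-level function here. The other values are handled by the vectorized path, and the original null stays null.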