
Allow fallback coercion function as column/field argument #1082

Closed
a-recknagel opened this issue Jan 26, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

a-recknagel (Contributor) commented Jan 26, 2023

My problem
I'd like to give a column/series the option to transform values that it otherwise can't handle. I use coerce=True extensively, and try to get rid of all pandas calls that "don't really do anything", meaning they don't change the information in the DataFrame, only its form. Those operations feel to me like they'd be better represented declaratively in the schema.

It could also provide a convenient workaround for requested features like #502 or bugs like #1037.

This request is effectively a lightweight variant of some of the options listed in #252, which seems to be stalling for now due to the complexity of the issue.

Possible solution
A new keyword in pa.Column and pa.Field that accepts a function:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"foo": pa.Column(
    int, 
    coerce=True, 
    fallback=lambda x: 0 if pd.isnull(x) else x  # doesn't exist yet
)})
df = pd.DataFrame({"foo": [7.7, None]})
print(schema.validate(df))
#    foo
# 0    7
# 1    0

Running a Python function with apply introduces performance issues, which is why I'm proposing it as fallback-only here. It wouldn't need to run on the complete series, only on the failure_cases of a SchemaError, after the DType's better-performing coerce methods have had their chance. By constraining it like that, I hope to get something soon-ish that could be deprecated once a proper parsing concept has been sussed out.
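To make the intent concrete, here is a rough sketch of the fallback-only idea in plain pandas, outside of pandera. The function name and the use of pd.to_numeric as a stand-in for the dtype's coerce step are my own assumptions, not proposed API:

```python
import pandas as pd

def coerce_with_fallback(series, dtype, fallback):
    # Stand-in for the dtype's fast, vectorized coerce step.
    coerced = pd.to_numeric(series, errors="coerce")
    # Only the values that failed coercion get the (slow) Python fallback.
    failed = coerced.isna()
    if failed.any():
        coerced = coerced.copy()
        coerced[failed] = series[failed].map(fallback)
    return coerced.astype(dtype)

s = pd.Series([7.7, None], name="foo")
result = coerce_with_fallback(s, int, lambda x: 0 if pd.isnull(x) else x)
# result: [7, 0]
```

The point of the constraint: the Python-level function runs over the failure cases only, so the common path stays vectorized.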

Context

  • I considered using a more complex design, with options to apply functions Series- or DataFrame-wide, and pre- or post-validation, but there is already Implement parsing functionality in dataframe schemas #252 for that discussion
  • If parsing is still on the table, it'd make sense to align the design of this proposal in a way that there is a clean migration path from this feature towards parsing
  • A drawback of this feature is that it would give users an easy way to break the idempotence of validate, I don't know what to do about that or how to warn them that they're doing something strange
@a-recknagel a-recknagel added the enhancement New feature or request label Jan 26, 2023
@a-recknagel a-recknagel changed the title Allow setting fallback-coercion method as field argument Allow fallback coercion function as column/field argument Jan 26, 2023
cosmicBboy (Collaborator) commented Jan 27, 2023

@a-recknagel thanks for the input... development on a parser was only stalling because of #913, but now that it's merged we can finally add a parser!

While the fallback solution will solve your problem, I think a first-class parser would be way more flexible / performant. Also, my thinking around parsing has evolved a lot since I wrote #252.

A parser would pretty much behave as you intend with the fallback kwarg, but the idea is that it exposes another step to the validation computation graph. Whereas before you have:

flowchart LR
    Coerce --> Check

With the parser, you'll have

flowchart LR
    Coerce --> Parse --> Check

Here's a quick sketch of what the user-facing API might look like:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"foo": pa.Column(
    int,
    # operate on the full series
    pa.Parser(lambda s: s.mask(s.isna(), 0)),
    # check or list of checks, as usual
    pa.Check.ge(0),
    coerce=True, 
)})
df = pd.DataFrame({"foo": [7.7, None]})

Similar to the pa.Check class, pa.Parser can be subject to modifiers:

# only operate on the coercion failures
pa.Parser(lambda s: ..., coercion_failures_only=True)

# element-wise transform
pa.Parser(lambda x: ..., element_wise=True)

# element-wise transform, only operate on the coercion failures
pa.Parser(lambda x: ..., element_wise=True, coercion_failures_only=True)

I think this would be a medium lift (2-3 weeks of free-time work) if this is scoped down to only work in columns/indexes for now, though I could see it being applied at the dataframe-level too.

a-recknagel (Contributor, Author) commented:

Sounds good, I'll close this issue then. Thanks for the in-depth answer.

Just for curiosity's sake, can you give some input on these, too?

  • do you think this interface lends itself to applying DataFrame-wide parsing functions as well? (Just re-read your last sentence.)
  • it sounds like coercion and parsing are two different concepts, and as such a column with a parser but coerce=False would be possible. Is that the case? It feels like it might run the risk of confusing users by providing two tools that aren't obviously different.
  • a parser could produce values that wouldn't be valid input, leading to a situation where consecutive validations of the "same" dataframe fail. Is that a concern? Maybe there should be a way to control it on the schema-level, like schema.parse_and_validate(df) or schema.validate(df, parse=True)?
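To illustrate the idempotence concern from the last bullet: a toy, hypothetical parser that shifts 0-based codes to 1-based produces valid output each time, yet running it twice on the "same" data keeps changing the values. This is pure pandas, not pandera API:

```python
import pandas as pd

# Hypothetical, non-idempotent parser: shift 0-based codes to 1-based.
parse = lambda s: s + 1

s = pd.Series([0, 1, 2])
once = parse(s)      # first validation:  [1, 2, 3]
twice = parse(once)  # second validation: [2, 3, 4] -- data drifted
```

parse(parse(s)) != parse(s), so validating an already-validated frame silently mutates it, which is exactly what schema.validate(df, parse=True)-style opt-in would guard against.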

cosmicBboy (Collaborator) commented:

> it sounds like coercion and parsing are two different concepts, and as such a column with a parser but coerce=False would be possible. Is that the case? It feels like it might run the risk of confusing users by providing two tools that aren't obviously different.

I actually go back and forth on this. In one sense, coercion IS parsing because coercion is fundamentally about transforming some data into a form that most appropriately matches the use case. On the other hand, I think it also makes sense to think of datatype coercion as a first-class step in the validation process, separate from user-defined parsers, which can further transform the coerced data (or failed coercion cases).

> a parser could produce values that wouldn't be valid input, leading to a situation where consecutive validations of the "same" dataframe fail. Is that a concern? Maybe there should be a way to control it on the schema-level, like schema.parse_and_validate(df) or schema.validate(df, parse=True)?

Good point! I think one of the limitations of pa.Parser should be that the output datatype must match the specified dtype. This adds a little overhead, but I think a simple series.dtype metadata check would be fairly cheap, and would inform users if they have a ParserSpecificationError. Since the pipeline is coerce -> parse -> check_dtype -> core_checks -> user_defined_checks, the parser is simply a transform layer that has to further convert the coerced data into a form that passes the validation checks. In the future it may be possible to infer the checks from builtin parsers:

pa.Parse.clip_lower(0)  # clip negative values to 0  --> pa.Check.ge(0)
pa.Parse.fillna(-1)  # convert nulls to -1  --> pa.Column(..., nullable=False)
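The proposed dtype constraint on parser output could be a metadata-only comparison, so it stays cheap regardless of series length. A minimal sketch, assuming the ParserSpecificationError name floated above (hypothetical, not existing pandera API):

```python
import pandas as pd

class ParserSpecificationError(TypeError):
    """Hypothetical error from the proposal: parser output dtype mismatch."""

def check_parser_output(parsed, expected_dtype):
    # Metadata-only check: compare dtypes, never scan the values themselves.
    if parsed.dtype != pd.api.types.pandas_dtype(expected_dtype):
        raise ParserSpecificationError(
            f"parser returned dtype {parsed.dtype}, expected {expected_dtype}"
        )
    return parsed

ok = check_parser_output(pd.Series([7, 0]), int)   # passes
# check_parser_output(pd.Series([7.0]), int)       # would raise
```

Because the check inspects series.dtype rather than the data, it slots into the coerce -> parse -> check_dtype pipeline with negligible overhead.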

Re: adding user control over parsing with parse=True, I'm actually inclined to deprecate the coerce kwarg and alias it to parse, which encompasses both coercing the dtype and applying user-defined parsers. This is nice and simple, and makes logical sense: if I want to coerce the data, it means my intent is to already apply transformations to the data to get it to a passing state, as defined by checks.

jtlz2 commented Feb 27, 2023

Thank you so much for this brilliant library!

Did this ever get written? Thanks!

a-recknagel (Contributor, Author) commented:

@jtlz2 no, the rewrite hasn't been released yet. Check on #252 if you want to keep an eye on this functionality being implemented.

cosmicBboy (Collaborator) commented:

It's almost there! I'm gonna punt on writing the docs for you to extend the pandera API so I can cut a 0.14.0 release sooner rather than later.

cosmicBboy (Collaborator) commented:

Okay, so I'm going to update the proposal on #252 to get the gears turning. @a-recknagel I'm going to be focusing on the ibis integration #1105 next (to support in-DB validation, but also as a way of testing out how the new rewrite abstractions work for new data containers). Would you be down to contribute an MVP for the parsing feature? I'm hoping to write out the proposal (going in the same direction as #1082 (comment)) in the next week or two.

a-recknagel (Contributor, Author) commented:

I was thinking about proposing that. I'll try, thanks for the trust.
