Is there a way to generate informative error on custom checks? #429

Closed
arielshulman opened this issue Mar 8, 2021 · 15 comments
Labels
question Further information is requested

Comments

@arielshulman

Hello,
I'm trying to create a custom check such as
Check(lambda g: g.mean() > 0.1, error="mean over 0.1")
Is there a way to add more info to the error message, such as f"mean over 0.1 and the current mean is {g.mean()}"?
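
For context, this is roughly what I'm running today (the column name "col" is just for illustration). The error string is static:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {"col": pa.Column(float, pa.Check(lambda g: g.mean() > 0.1, error="mean over 0.1"))}
)

# raises SchemaError mentioning the static message "mean over 0.1",
# with no way to interpolate the actual mean into it
schema.validate(pd.DataFrame({"col": [0.0, 0.1, 0.05]}))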

Thanks 🙏

@arielshulman arielshulman added the question Further information is requested label Mar 8, 2021
@jeffzi
Collaborator

jeffzi commented Mar 9, 2021

Hi @arielshulman

At the moment that's not possible. How would you add the error message?

The first idea that comes to mind is to catch a pandera-defined exception CheckError that contains the message, on top of the usual boolean return value. The custom exception distinguishes a validation error from an error in the code of the check function.

@arielshulman
Author

Hey @jeffzi,

I think it's a good option, but it involves a lot of refactoring, since the error parameter then becomes a bit redundant.

Another option I thought of is to make the error parameter optionally a callable which gets the Series/DataFrame, exactly like the fn parameter, and returns a string which will be the error message.
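
A rough sketch of the API I'm imagining (hypothetical, not implemented):

pa.Check(
    lambda s: s.mean() > 0.1,
    # hypothetical: error as a callable that receives the same object as check_fn
    error=lambda s: f"mean over 0.1 and the current mean is {s.mean()}",
)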

@arielshulman arielshulman changed the title Is there a way to generate informative error on cuatom checks? Is there a way to generate informative error on custom checks? Mar 10, 2021
@jeffzi
Collaborator

jeffzi commented Mar 10, 2021

I think it's a good option, but it involves a lot of refactoring, since the error parameter then becomes a bit redundant.

Do you mean refactoring on the user's end?

Another option I thought of, is to make the error parameter optionally a callable which gets the Series/DataFrame

With a callable you would most likely run the check again in that error function, only with a tiny modification to return the message.

@arielshulman
Author

Do you mean refactoring on the user's end?

Yes, but maybe I didn't understand exactly how you meant CheckError to be used. Could you elaborate, please?

With a callable you would most likely run the check again in that error function, only with a tiny modification to return the message.

The way I see it, it will definitely redo part of the computation used for validation, so yes, you're right 🤗

@jeffzi
Collaborator

jeffzi commented Mar 12, 2021

I was thinking of having 2 ways of signaling a failing check:

  1. Output a boolean or a Series of boolean values (current)
  2. Raise a specific error with a message that pandera should report. This is similar to pydantic validators.

Example:

# draft, will not run
import pandera as pa


def check_lt_10(series):
    if (series >= 10).any():
        raise pa.CheckError(f"Found elements >= 10: {series >= 10 }")
    return True


schema = pa.DataFrameSchema({"column1": pa.Column(pa.Int, pa.Check(check_lt_10))})

@cosmicBboy Do you have another idea?

@cosmicBboy
Collaborator

cosmicBboy commented Mar 13, 2021

Great discussion so far! I think coming up with a simple API for this is quite the puzzle...

The reason is that the whole premise of check_fn in pa.Check(check_fn, ...) is that it establishes a fairly simple contract: take in a series, dataframe, or grouped series/dataframe and return a boolean scalar or series/dataframe. I think this is one of the big selling points of pandera and helps users customize it very easily.

The ideas articulated so far are:

  1. have error accept a callback function that takes as input some data that's available during check execution
  2. raise a CheckError within the check function, which would then be reported by SchemaError/SchemaErrors.

I think the merit of (1) is its modularity: it's a simple feature to implement in pandera, and easy to understand and use from a user's perspective. The thing that bugs me about it by itself is the double computation: aggregating a series/df in check_fn and then doing it again in the error callback.

(2) is clean from a user's perspective, but I think it breaks the contract that check_fn is supposed to fulfill, which is to return a DataFrame[bool], Series[bool], or bool that is used in the error-reporting part of pandera to provide additional metadata about how a check failed. A CheckError might also contain this boolean value, so I think this is a valid option.

def check_mean_gt(series):
    mean = series.mean()
    result = mean > 0.1
    if not result:
        raise pa.CheckError(result, f"mean over 0.1 and the current mean is {mean}")
    return result

One proposal I have is this:

  1. Add an agg_fn option to the Check class. This would be a more generic version of groupby: what groupby does is modify the column/dataframe obj that's fed into check_fn such that you partition it into the discrete levels in the groupby column(s) (see here). agg_fn would be any arbitrary operation, the output of which would be fed into check_fn:
pa.Check(
    check_fn=lambda mean: mean > 0.1,
    agg_fn=lambda series: series.mean(),
    ...
)

Here agg_fn should return a scalar or a series smaller than the original series. This way, the check runtime has access to the original series and the aggregated value, so the error callback could potentially have access to both:

pa.Check(
    check_fn=lambda mean: mean > 0.1,
    agg_fn=lambda series: series.mean(),
    ...,
    error=lambda series, mean: f"mean over 0.1 and the current mean is {mean}",
)

The pro of (3) is that it combines well with option (1) in modularizing different concerns during the validation process: the mean is computed only once, and the error callback can use the original input and/or the aggregated value to format an error message. It also adds (what I think is) a key abstraction that's currently missing in pandera, which is a first-class representation of aggregating a column at validation time. I think the con is that it might be confusing for users... I've had at least one conversation where the groupby behavior was found to be a bit unintuitive... anyway, this is something good documentation can partially solve.

Thoughts @arielshulman @jeffzi ?

@arielshulman
Author

Thank you for the analysis.

In option (3) do you think the agg_fn could be an array of multiple aggregations, for example series.mean() and series.std()?

@cosmicBboy
Collaborator

yeah, the implementation for (3) at validation time (Check.__call__) would do something like:

check_output = check_fn(agg_fn(series))

So you can do something like

def agg_fn(series):
    return series.mean(), series.std()

def check_fn(check_obj):
    # unpack the values
    mean, std = check_obj
    ...

Or

def agg_fn(series):
    return series.agg(["mean", "std"])

def check_fn(agg_series):
    agg_series["mean"]
    agg_series["std"]
    ...

Of course specifying an agg_fn would raise an exception if element_wise == True or groupby is specified.

@jeffzi
Collaborator

jeffzi commented Mar 15, 2021

I think this is one of the big selling points of pandera and helps users customize it very easily.

I totally agree ⭐

I've had at least one conversation where the groupby behavior was found to be a bit unintuitive...

I'm very much in that camp. I think it's worth breaking down my reasons for disliking groupby, since (3) is similar.

  • For me, it's not intuitive. At first, I expected groupby to result in a pandas groupby object, not a dictionary indexed by tuples.
  • In my opinion, groupby encourages the use of complex inline lambdas (see examples).
  • Everything that can be written with the groupby arg can be done in a more intuitive, "pandas-like", and readable dataframe check.

I'm not in favor of (3):

  • It seems less intuitive and spreads the check across two places. In your example, agg_fn is coupled with check_fn, whereas with (2) all the code would be in a single place. That is arguably easier to follow for someone who did not write the check.
  • It arbitrarily limits expressiveness. (2) gives you total control over the error message instead of only giving you placeholders.
  • (3) will front-load computations that may not be necessary. For example, I might want to compute "std" in one situation and "mean" in another.

@arielshulman
Author

The reason I 💛 pandera over other methods I've tried is, as @cosmicBboy said, that everything is in front of your eyes when you look at the schema model.
It includes every column and its checks, even custom checks that aren't built into pandera.
That's why I do like the agg_fn solution: it continues this state of mind.
We can think of agg_fn as a layer of sub-calculations (variables, if you like) that helps you build a lambda check out of multiple steps.
When I think about it that way, maybe agg_fn should take an array of lambdas instead of encapsulating all the calculations in one lambda that returns multiple items.
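
Something like this, purely hypothetical (an agg_fn list isn't part of any proposal yet):

pa.Check(
    # hypothetical: each lambda computes one "variable", and check_fn
    # receives the results in the same order
    check_fn=lambda mean, std: mean > 0.1 and std < 1.0,
    agg_fn=[lambda s: s.mean(), lambda s: s.std()],
)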

Option (2), which @jeffzi suggested, is also great, but it's less inline-oriented.
I think if we're using a regular function anyway, then maybe we should consider using it as an Extension.

BTW, I couldn't find info about it: do Extensions offer a way to define a custom error message 🙄?

@cosmicBboy
Collaborator

Okay, after letting these ideas marinate for a few days, I think we can decompose this into two separate issues:

(I) more informative custom error messages

I think (2) is the simplest solution considered so far; however, the thing that still bugs me about it is that it's only really useful for checks that involve aggregations.

For example, how would we handle CheckErrors when a function is being applied to the elements of a series/dataframe? The point of getting back a Series[bool] or DataFrame[bool] is to be able to report the cases that failed the check after check_fn has been applied to the column in element-wise or vectorized fashion.

# what pandera does today
series = pd.Series([-1, 2, 3])

check_fn = lambda s: s > 0
check_output = check_fn(series)  # pd.Series([False, True, True])

# or in the element-wise case
check_output = series.map(lambda x: x > 0)  # pd.Series([False, True, True])

# pandera automatically handles getting statistics/reporting errors based on the failure cases

I might be missing something, but it seems to me that CheckError only makes sense when aggregating a series into a scalar or a small set of descriptive statistics, in order to report violations of assumptions about those statistics.

# vectorized-check
def vectorized_check(series: pd.Series):
    result = series > 0
    if (~result).any():
        # this would be redundant with whatever pandera already does to report number of failure cases
        raise pa.CheckError(...)
    return result

# element-wise check
def elementwise_check(x: int):
    result = x > 0
    if not result:
        # there would be a check error raised per element in the series for which this check fails,
        # it's unclear how to combine these custom check errors into a single error message
        raise pa.CheckError(...)
    return result

I'm not sure if introducing a new way of signalling check failure makes sense within the check_fn body if only to support the case of custom error messages in aggregation-type checks... it seems like we should instead support aggregation-type checks as a first-class citizen, and then implement a consistent interface for custom error message reporting.

(II) supporting various "check types" for specific use cases, e.g. checks that rely on groupby's, aggregation, or conditional partitioning of the data.

I do agree that the way groupby checks are implemented today is not great, but I did want to further discuss the reason why we'd want specialized checks.

Everything that can be written with the groupby arg can be done in a more intuitive, "pandas-like", and readable dataframe check.

@jeffzi I understand that dataframe-level checks are the ultimate fallback... they offer ultimate expressiveness because the user can basically do anything with the dataframe. However, there are advantages to constraining expressiveness in the service of other objectives, such as being able to type-annotate tabular data via schemas in a standardized way. With dataframe checks, all the internal logic of the check is locked inside the check_fn body, making it virtually impossible to collect schema metadata for the purposes of serialization (which would open up cross-library compatibility) or data synthesis (i.e. with hypothesis).
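
To illustrate: a built-in check carries named statistics that survive serialization, while a dataframe-level lambda's logic is opaque. This should run today, assuming pandera's yaml/io dependencies are installed:

import pandera as pa

# a built-in check has named statistics, so the schema can be serialized
schema = pa.DataFrameSchema({"col1": pa.Column(int, pa.Check.greater_than(0))})
print(schema.to_yaml())  # the greater_than statistic round-trips

# a dataframe-level lambda check has no such metadata: its logic lives only in
# the function body, so serialization and data synthesis can't introspect it
opaque = pa.DataFrameSchema(checks=pa.Check(lambda df: df["col1"] > 0))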

For me, it's not intuitive. At first, I expected groupby to result in a pandas groupby object, not a dictionary indexed by tuples.

Yes, this was admittedly an ad-hoc decision on my part in the earlier days of the library... the dictionary indexed by group keys (tuples if groupby is multiple columns) is exactly the way pandas groupby objects work, but I think we can have check_fn expect a groupby object instead.

In my opinion, groupby encourages the use of complex inline lambdas

I think this is up to the user's preference of how they want to use the library/Python... I think the current API supports using both defined functions and in-line functions, and with a bit of refactoring I think we can make groupby more intuitive.

Continuing the Discussion

I think CheckError presents a nice way of handling custom error messages for checks that aggregate a column into a small set of statistics that the user wants to check (and report on in a custom fashion); however, it's unclear to me how it provides a benefit for checks that involve boolean operations on elements of a column. I am strongly for maintaining the current behavior of check_fn and introducing different check types, however those might be implemented.

I do think adding options to the Check class becomes confusing and unwieldy, as groupby has demonstrated, so I would propose specialized classes, e.g. GroupbyCheck and AggCheck, for better separation of concerns (these could have corresponding decorators @groupby_check and @agg_check for the class-based API). It also opens up the possibility of a first-class ConditionalCheck, which I feel has been sorely missing from the pandera interface.
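
To make that concrete (all hypothetical, nothing here is implemented):

# hypothetical specialized check classes with consistent check_fn signatures
pa.GroupbyCheck(lambda grouped: grouped.mean() > 0, groupby="col2")  # grouped: a pandas GroupBy object
pa.AggCheck(lambda mean: mean > 0.1, agg="mean")  # check_fn receives the aggregated value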

Let me know what your thoughts are!

@jeffzi
Collaborator

jeffzi commented Mar 19, 2021

Both of you made very good points. I agree CheckError / dataframe checks do not allow reporting metadata, and that is indeed a deal-breaker. I can also see why some users appreciate inline checks (thanks @arielshulman for your input).

I like the idea of specialized Check classes. That would be much clearer than adding a growing number of options to Check. It would also be easy to understand that different check subtypes get different inputs (groupby, etc.).

Regarding the error message, I suggest one tweak to the idea of a callable error: we can allow check_fn to return a tuple whose first element is always a bool/Series, as it is now, and whose other elements are up to the user. The tuple is then passed to the error function. It remains to be seen whether that's still worth it in a world with specialized Checks :)
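
Roughly like this (hypothetical API, not implemented):

def check_mean_gt(series):
    mean = series.mean()
    # first element is the usual bool/Series, the rest is up to the user
    return mean > 0.1, mean

pa.Check(
    check_mean_gt,
    # hypothetical: the extra tuple elements are forwarded to the error callable
    error=lambda passed, mean: f"mean over 0.1 and the current mean is {mean}",
)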

however it's unclear to me how it [CheckError] provides a benefit for checks that involve boolean operations on elements of a column.

It's true that pandera already reports failure cases.

It also opens up the possibility for a first-class citizen ConditionalCheck, which I feel has been sorely missing in the pandera interface.

Can you give an example?

BTW, I couldn't find info about it: do Extensions offer a way to define a custom error message?

Yes, see below:

import pandera as pa
import pandera.extensions as extensions
import pandas as pd


@extensions.register_check_method(statistics=["min_value", "max_value"])
def is_between(pandas_obj, *, min_value, max_value):
    return (min_value <= pandas_obj) & (pandas_obj <= max_value)


schema = pa.DataFrameSchema(
    {
        "col": pa.Column(
            int, pa.Check.is_between(min_value=1, max_value=10, error="Oops")
        )
    }
)

data = pd.DataFrame({"col": [1, 5, 10, 11]})
print(schema(data))
#> Traceback (most recent call last):
#> ...
#> SchemaError: <Schema Column(name=col, type=<class 'int'>)> failed element-wise validator 0:
#> <Check is_between: Oops>
#> failure cases:
#>    index  failure_case
#> 0      3            11

Created on 2021-03-19 by the reprexpy package

@cosmicBboy
Collaborator

cosmicBboy commented Mar 21, 2021

I like the idea of specialized Check classes. That would be much clearer than adding a growing number of options to Check. It would also be easy to understand that different check subtypes get different inputs (groupby, etc.).

One thing to point out here is that going in this direction would also imply implementing ElementWiseCheck and WideCheck to disambiguate the function signature of Check (which would properly be called ColumnCheck). I actually remember why I went for keyword arguments instead of check subclasses: I wanted to avoid class proliferation and keep a small API surface, so I traded separate classes for a single Check class with polymorphic check_fn signatures depending on the keyword options.

edit: we could also decide to preserve the element_wise and dataframe-level check behavior in the Check class and say that Groupby/AggChecks are different enough to justify a new class.

I think in the short term my preference would be to: (i) refactor the groupby behavior so that check_fn takes a pandas groupby object, and (ii) implement an agg option such that check_fn expects the output of pandas.Series.agg. If we find that this adds confusion to the user experience, we can undertake the larger effort of having specialized check classes with consistent check_fn signatures.
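
For (ii), the usage would look something like this (proposed API, not implemented):

pa.Check(
    # check_fn receives the output of series.agg(["mean", "std"])
    lambda agg_out: agg_out["mean"] > 0.1 and agg_out["std"] < 1.0,
    agg=["mean", "std"],  # hypothetical kwarg wrapping Series.agg
)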

It also opens up the possibility for a first-class citizen ConditionalCheck, which I feel has been sorely missing in the pandera interface.

Can you give an example?

I can't find it now, but there was an issue someone opened asking how to apply a check conditioned on another column:

data = pd.DataFrame({
    "col1": [-1, -2, -3, 1, 2, 3],
    "col2": ["a", "a", "a", "b", "b", "b"],
})

# if col2 == "a", col1 is negative
# if col2 == "b", col1 is positive

Currently the two ways to implement this are with a groupby check on col1 or a wide check:

pa.DataFrameSchema({
    "col1": pa.Column(
        checks=[
           pa.Check(lambda groups: groups["a"] < 0, groupby="col2"),
           pa.Check(lambda groups: groups["b"] > 0, groupby="col2"),
        ]
    ),
})

# or
pa.DataFrameSchema(
    checks=[
        pa.Check(lambda df: df.query("col2 == 'a'")["col1"] < 0),
        pa.Check(lambda df: df.query("col2 == 'b'")["col1"] > 0),
    ]
)

These options might be good enough actually... but I did want to figure out a more intuitive way of doing it, like:

pa.DataFrameSchema({
    "col1": pa.Column(
        checks=pa.ConditionalCheck(lambda s: s < 0, where="col2", eq="a").otherwise(lambda s: s > 0),
    ),
})

Not really sure about the otherwise method, but you get the idea :) I guess we can explore this more if we end up going for specialized Check classes. What this first-class representation would allow for is to support data synthesis for conditional checks, though that might be putting the cart before the horse...

@cosmicBboy
Collaborator

Action Items

  • refactor groupby behavior to expect a pandas Groupby object
  • add agg kwarg to Check that simply wraps DataFrame.agg and Series.agg depending on the schema context
  • support a callback function for the error kwarg:
from typing import Union

import pandas as pd

# a series or dataframe can also be the output of an `agg` operation
CheckObj = Union[pd.Series, pd.DataFrame, pd.core.groupby.GroupBy]

def error_callback(check_obj: CheckObj) -> str:
    return f"check failed with values {check_obj}"

If the additional agg argument feels too clunky, we can go the direction of specialized GroupbyCheck and AggCheck subclasses.

I can take a shot at implementing this!

@cosmicBboy
Collaborator

cosmicBboy commented May 8, 2021

Closing this issue, tracking changes discussed here in #488
