Is there a way to generate informative error on custom checks? #429
At the moment that's not possible. How would you add the error message? The first idea that comes to mind is to catch a pandera-defined exception.
Hey @jeffzi, I think it's a good option, but it would involve a lot of code refactoring. Another option I thought of is to make the `error` argument a callable.
Do you mean refactors on the user end?
With a callable you would most likely run the check again in that error function, only with a tiny modification to return the message.
Yes, but maybe I didn't understand the exact way you meant the use of the callable.
The way I see it, it will definitely use part of the computation used for validation, so yes, you're right 🤗
I was thinking of having 2 ways of signaling a failing check: returning a falsy value as today, or raising a dedicated exception that carries the message. Example:

```python
# draft, will not run: pa.CheckError is a hypothetical exception
import pandera as pa

def check_lt_10(series):
    if (series >= 10).any():
        raise pa.CheckError(f"Found elements >= 10: {series[series >= 10]}")
    return True

schema = pa.DataFrameSchema({"column1": pa.Column(pa.Int, pa.Check(check_lt_10))})
```

@cosmicBboy Do you have another idea?
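Since `pa.CheckError` is only a draft here, the same signaling pattern can be sketched in plain pandas with a stock `ValueError` standing in for the proposed exception (the exception type and function name are stand-ins, not pandera API):

```python
import pandas as pd

def check_lt_10(series: pd.Series) -> bool:
    # stand-in for the proposed pa.CheckError: raise with an informative
    # message listing the offending elements, otherwise signal success
    failed = series[series >= 10]
    if not failed.empty:
        raise ValueError(f"Found elements >= 10: {failed.tolist()}")
    return True

try:
    check_lt_10(pd.Series([1, 5, 12]))
except ValueError as exc:
    print(exc)  # Found elements >= 10: [12]
```

The key property the draft is after: the error message is built from the same intermediate values the check already computed.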
Great discussion so far! I think coming up with a simple API for this is quite the puzzle... The reason is that it cuts into the whole premise of the `Check` abstraction. The ideas articulated so far are:

1. making the `error` argument a callable that produces the message
2. raising a pandera-defined exception (e.g. `pa.CheckError`) inside the check function

I think the merit of (1) is its modularity: it's a simple feature to implement in pandera and understand and use from a user's perspective. The thing that bugs me about it by itself is the double computation of aggregating a series/df in both the check function and the error callable.

(2) is clean from a user's perspective, but I think it breaks the contract that check functions simply return a boolean output:

```python
def check_mean_gt(series):
    mean = series.mean()
    result = mean > 0.1
    if not result:
        raise pa.CheckError(result, f"mean over 0.1 and the current mean is {mean}")
    return result
```

One proposal I have is this:

```python
pa.Check(
    check_fn=lambda mean: mean > 0.1,
    agg_fn=lambda series: series.mean(),
    ...
)
```

Where the output of `agg_fn` is passed into `check_fn`, and an `error` callable receives the same pre-computed value:

```python
pa.Check(
    check_fn=lambda mean: mean > 0.1,
    agg_fn=lambda series: series.mean(),
    ...,
    error=lambda series, mean: f"mean over 0.1 and the current mean is {mean}",
)
```

The pro of (3) is that it combines well with option (1) in modularizing different concerns during the validation process. The computation done to compute the mean is done only once, and the error callable can reuse it.

Thoughts @arielshulman @jeffzi ?
Thank you for the analysis. In option (3), do you think the output of `agg_fn` would be passed into `check_fn`?
yeah, the implementation for (3) at validation time would essentially be:

```python
check_output = check_fn(agg_fn(series))
```

So you can do something like:

```python
def agg_fn(series):
    return series.mean(), series.std()

def check_fn(check_obj):
    # unpack the values
    mean, std = check_obj
    ...
```

Or:

```python
def agg_fn(series):
    return series.agg(["mean", "std"])

def check_fn(agg_series):
    agg_series["mean"]
    agg_series["std"]
    ...
```

Of course, specifying an `agg_fn` would be optional.
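The two `agg_fn` return styles above differ only in how `check_fn` accesses the values; a quick pandas illustration (the variable names are just for the example):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# tuple-returning agg_fn: check_fn unpacks positionally
mean, std = (s.mean(), s.std())

# Series-returning agg_fn: check_fn indexes by label
agg_series = s.agg(["mean", "std"])

print(agg_series["mean"], mean)  # 2.5 2.5
```

`Series.agg` with a list of function names returns a Series indexed by those names, so both styles expose the same numbers.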
I totally agree ⭐
I'm very much in that camp. I think it's worth breaking down my reasons to dislike it.

I'm not in favor of (3):
Why I 💛 Pandera over other methods I've tried, as @cosmicBboy said, is that everything is in front of your eyes when you look at the schema model. Option (2), which @jeffzi suggested, is also great, but it is less inline-oriented. BTW, I couldn't find info about it: do Extensions offer a way to define a custom error message 🙄?
Okay, letting these ideas marinate for a few days, I think we can decompose this into two separate issues:

### (I) more informative custom error messages

I think (2) is the simplest solution considered so far, however the problem that still bugs me about it is that it's only really useful in the case of checks that involve aggregations. For example, how would we handle vectorized or element-wise checks?

```python
# what pandera does today
series = pd.Series([-1, 2, 3])
check_fn = lambda s: s > 0
check_output = check_fn(series)  # pd.Series([False, True, True])

# or in the element-wise case
check_output = series.map(lambda x: x > 0)  # pd.Series([False, True, True])

# pandera automatically handles getting statistics/reporting errors
# based on the failure cases
```

I might be missing something, but it seems to me that raising a check error in these cases is either redundant or ambiguous:

```python
# vectorized-check
def vectorized_check(series: pd.Series):
    result = series > 0
    if (~result).any():
        # this would be redundant with whatever pandera already does
        # to report number of failure cases
        raise pa.CheckError(...)
    return result

# element-wise check
def elementwise_check(x: int):
    result = x > 0
    if not result:
        # there would be a check error raised per element in the series
        # for which this check fails; it's unclear how to combine these
        # custom check errors into a single error message
        raise pa.CheckError(...)
    return result
```

I'm not sure if introducing a new way of signalling check failure makes sense within the `Check` API.

### (II) supporting various "check types" for specific use cases, e.g. checks that rely on groupby's, aggregation, or conditional partitioning of the data

I do agree that the way groupby checks are implemented today is not great, but I did want to further discuss the reason why we'd want specialized checks.
@jeffzi I understand that dataframe-level checks are the ultimate fallback... it offers ultimate expressiveness because the user can basically do anything with the dataframe. However, there are advantages to constraining expressiveness in the service of other objectives, such as being able to type-annotate tabular data via schemas in a standardized way. With dataframe checks, all the internal logic of the check is lost within the check function body.
Yes, this was admittedly an ad-hoc decision on my part in the earlier days of the library... the dictionary indexed by group keys could be improved.
I think this is up to the user's preference of how they want to use the library/Python... I think the current API supports using both defined functions and in-line functions, and with a bit of refactoring I think we can make both styles more ergonomic.

### Continuing the Discussion

I do think adding options to the `Check` class is worth exploring. Let me know what your thoughts are!
Both of you made very good points. I agree, and I like the idea of specialized Check classes. That would be much clearer than adding a growing number of options to `Check`. Regarding the error message, I suggest one tweak to the idea of a callable error: we can allow the `error` argument to be either a string or a callable.
It's true that pandera already reports failure cases.
Can you give an example?
Yes, see below:

```python
import pandera as pa
import pandera.extensions as extensions
import pandas as pd

@extensions.register_check_method(statistics=["min_value", "max_value"])
def is_between(pandas_obj, *, min_value, max_value):
    return (min_value <= pandas_obj) & (pandas_obj <= max_value)

schema = pa.DataFrameSchema(
    {
        "col": pa.Column(
            int, pa.Check.is_between(min_value=1, max_value=10, error="Oops")
        )
    }
)

data = pd.DataFrame({"col": [1, 5, 10, 11]})
print(schema(data))
#> Traceback (most recent call last):
#> ...
#> SchemaError: <Schema Column(name=col, type=<class 'int'>)> failed element-wise validator 0:
#> <Check is_between: Oops>
#> failure cases:
#>    index  failure_case
#> 0      3            11
```

Created on 2021-03-19 by the reprexpy package
One thing to point out here is that going this direction would also imply implementing specialized check classes.

edit: we could also decide to preserve the current behavior for backwards compatibility.

I think in the short term my preference would be to: (i) refactor the existing groupby checks.
I can't find it now, but there was an issue someone opened with a question of how to apply a check given a condition on another column:

```python
data = pd.DataFrame({
    "col1": [-1, -2, -3, 1, 2, 3],
    "col2": ["a", "a", "a", "b", "b", "b"],
})
# if col2 == "a", col1 is negative
# if col2 == "b", col1 is positive
```

Currently the two ways to implement this are with a wide check or a groupby check on col1:

```python
pa.DataFrameSchema({
    "col1": pa.Column(
        checks=[
            pa.Check(lambda groups: groups["a"] < 0, groupby="col2"),
            pa.Check(lambda groups: groups["b"] > 0, groupby="col2"),
        ]
    ),
})

# or

pa.DataFrameSchema(
    checks=[
        pa.Check(lambda df: df.query("col2 == 'a'")["col1"] < 0),
        pa.Check(lambda df: df.query("col2 == 'b'")["col1"] > 0),
    ]
)
```

These options might be good enough actually... but I did want to figure out a more intuitive way of doing it, like:

```python
pa.DataFrameSchema({
    "col1": pa.Column(
        checks=pa.ConditionalCheck(lambda s: s < 0, where="col2", eq="a").otherwise(lambda s: s > 0),
    ),
})
```

Not really sure about the `ConditionalCheck` API yet though.
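The semantics of the hypothetical `ConditionalCheck` above can be spelled out in plain pandas, which also shows what any implementation would have to compute:

```python
import pandas as pd

data = pd.DataFrame({
    "col1": [-1, -2, -3, 1, 2, 3],
    "col2": ["a", "a", "a", "b", "b", "b"],
})

# if col2 == "a", col1 must be negative; otherwise col1 must be positive
mask = data["col2"] == "a"
passed = bool(
    (data.loc[mask, "col1"] < 0).all() and (data.loc[~mask, "col1"] > 0).all()
)
print(passed)  # True
```

The `where`/`eq` arguments would effectively build the boolean mask, and `.otherwise(...)` would supply the check applied to the complement.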
### Action Items

```python
from typing import Union
import pandas as pd

# series or dataframe can be the output of `agg` operation
# (note: pandas exposes the groupby type as pd.core.groupby.GroupBy)
CheckObj = Union[pd.Series, pd.DataFrame, pd.core.groupby.GroupBy]

def error_callback(check_obj: CheckObj) -> str:
    return f"check failed with values {check_obj}"
```

If the additional `error` callable is agreed upon, I can give a shot at implementing this!
Closing this issue, tracking changes discussed here in #488
Hello,

I'm trying to create a custom check such as:

```python
Check(lambda g: g.mean() > 0.1, error="mean over 0.1")
```

Is there a way to add more info to the error message, such as `f"mean over 0.1 and the current mean is {g.mean()}"`? Thanks 🙏