Replies: 3 comments 1 reply
-
Hi @ilianikolaenko92, similar use cases have come up in the past, and I think there's a more concise way of defining the schema. First, we can load the JSON file specifying the max values into a dataframe (we'll see why later):

```python
import pandas as pd

max_bounds_json = {
    "machine_ids": ["1", "A"],
    "signal_max_bounds": {
        "1": {
            "signal1": 11000,
            "signal2": 550,
            "signal3": 17,
            "signal4": 3000
        },
        "A": {
            "signal1": 15000,
            "signal2": 700,
            "signal3": 20,
            "signal4": 6000
        },
    }
}
max_bounds = pd.DataFrame.from_dict(
    max_bounds_json["signal_max_bounds"], orient="index"
).rename_axis("machine_id")  # name the index so it can be referenced later
print(max_bounds)
#             signal1  signal2  signal3  signal4
# machine_id
# 1             11000      550       17     3000
# A             15000      700       20     6000
```

Then, we can define a `SchemaModel` as follows:

```python
import pandera as pa
from pandera.typing import DateTime, Index, Series

class Schema(pa.SchemaModel):
    timestamp: Series[DateTime] = pa.Field(unique=True, nullable=False)
    signal_field: Series[float] = pa.Field(
        ge=0, nullable=False, coerce=True, alias=r"signal\d+", regex=True,
    )
    # assume that machine names are the index of the dataframe
    index: Index[str] = pa.Field(isin=max_bounds_json["machine_ids"])

    @pa.check(r"signal\d+")
    def check_max_values(cls, series: Series) -> Series[bool]:
        max_values = series.index.to_series().map(max_bounds[series.name])
        return series <= max_values
```

There are a few things happening in this schema: the `alias=r"signal\d+", regex=True` field applies to every column matching that pattern, the index check restricts values to the known machine ids, and the custom check looks up each row's per-machine max bound by index label.
I didn't quite understand point (3) in your SO post: there's no need to insert extra data into the output of a pandera check... it can be any one of a `bool`, a boolean `Series`, or a boolean `DataFrame`.

On point (4) re: more complicated validation, we can chat in another discussion if you'd like, but in general I'd recommend trying to factor the schema to be as targeted as possible. For example, here's a mean-value check. Suppose we have another JSON file describing the min/max range of the mean per machine id:

```python
mean_range = pd.DataFrame.from_dict(
    {
        "1": {"min": 0, "max": 100},
        "A": {"min": 0, "max": 1000},
    },
    orient="index"
)
print(mean_range)
#    min   max
# 1    0   100
# A    0  1000
```

Then the schema would look something like:

```python
class Schema(pa.SchemaModel):
    ...

    @pa.check(r"signal\d+")
    def check_mean_values(cls, series: Series) -> Series[bool]:
        """Check that mean values are between some range per machine_id."""
        mean_per_machine = pd.concat(
            [series.groupby(level="machine_id").mean(), mean_range],
            axis="columns",
        )
        # use series.name rather than hard-coding "signal1" so the check
        # applies to every signal column the regex matches
        return mean_per_machine[series.name].between(
            mean_per_machine["min"], mean_per_machine["max"]
        )
```
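For concreteness, the body of that mean-range check can be exercised on its own with a hypothetical signal series (assuming the index level is named `machine_id`, as set up above):

```python
import pandas as pd

# per-machine allowed range of the mean, as loaded above
mean_range = pd.DataFrame.from_dict(
    {"1": {"min": 0, "max": 100}, "A": {"min": 0, "max": 1000}},
    orient="index",
)

# hypothetical signal values for two machines
series = pd.Series(
    [10.0, 20.0, 500.0, 700.0],
    index=pd.Index(["1", "1", "A", "A"], name="machine_id"),
    name="signal1",
)

# align the per-machine means with their allowed ranges column-wise
mean_per_machine = pd.concat(
    [series.groupby(level="machine_id").mean(), mean_range],
    axis="columns",
)
result = mean_per_machine[series.name].between(
    mean_per_machine["min"], mean_per_machine["max"]
)
print(result.to_dict())  # {'1': True, 'A': True}
```

Machine "1" has mean 15.0 (within 0–100) and machine "A" has mean 600.0 (within 0–1000), so both pass.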
-
Hi @cosmicBboy, unfortunately I have not explained the situation with my DataFrames clearly, which is why I can't quite map your proposed solution onto it. That's also why I haven't come up with a better approach so far than using wide checks to address one specific max-bound value for each signal in the dictionary of 4 machines, using the index name of a constructed DataFrame.
-
Hi @ilianikolaenko92, I don't quite follow this. Can you provide a copy-pasteable example of (i) the dataframe you're trying to validate and (ii) the source metadata for signal ranges per machine?
-
Hello :)
I am new to the pandera module and am creating my first, very simple validation schema using the class-based API.
My task for now is to validate the maximum values of fields that differ according to an equipment id in the DataFrames my to-be-validated function returns.
Looking for the best implementation of such a condition, I introduced a custom wide check with @pa.dataframe_check in the schema. I have opened a question about my concerns on Stack Overflow here.
I would be very grateful for your feedback and suggestions on using either wide or tidy checks in pandera! It is critical for the project that we stick to SchemaModel and @pa.check_types, though; the object-based API with DataFrameSchema instances is not really an option...