Replies: 3 comments 1 reply
-
Hi @ilianikolaenko92, similar use cases have come up in the past, and I think there's a more concise way of defining the schema. First, we can load the JSON file specifying the max values into a dataframe (we'll see why later):

```python
import pandas as pd

max_bounds_json = {
    "machine_ids": ["1", "A"],
    "signal_max_bounds": {
        "1": {
            "signal1": 11000,
            "signal2": 550,
            "signal3": 17,
            "signal4": 3000
        },
        "A": {
            "signal1": 15000,
            "signal2": 700,
            "signal3": 20,
            "signal4": 6000
        },
    }
}
max_bounds = pd.DataFrame.from_dict(
    max_bounds_json["signal_max_bounds"], orient="index"
).rename_axis("machine_id")  # name the index so it can be referenced later
print(max_bounds)
#             signal1  signal2  signal3  signal4
# machine_id
# 1             11000      550       17     3000
# A             15000      700       20     6000
```

Then, we can define a `SchemaModel` as follows:

```python
import pandera as pa
from pandera.typing import DateTime, Index, Series

class Schema(pa.SchemaModel):
    timestamp: Series[DateTime] = pa.Field(unique=True, nullable=False)
    signal_field: Series[float] = pa.Field(
        ge=0, nullable=False, coerce=True, alias=r"signal\d+", regex=True,
    )
    # assume that machine names are the index of the dataframe
    index: Index[str] = pa.Field(isin=max_bounds_json["machine_ids"])

    @pa.check(r"signal\d+")
    def check_max_values(cls, series: Series) -> Series[bool]:
        max_values = series.index.to_series().map(max_bounds[series.name])
        return series <= max_values
```

There are a few things happening in this schema: the `alias=r"signal\d+", regex=True` field applies to every column matching that pattern, the index check restricts values to the known machine ids, and the custom check looks up each row's per-machine max bound by index label.
I didn't quite understand point (3) in your SO post: there's no need to insert extra data into the output of a pandera check... it can be any one of a `bool`, a boolean `Series`, or a boolean `DataFrame`.

On point (4) re: more complicated validation, we can chat in another discussion if you'd like, but in general I'd recommend trying to factor the schema to be as targeted as possible. For example, here's a mean-value check. Suppose we have another JSON file describing the min/max range of the mean per machine id:

```python
mean_range = pd.DataFrame.from_dict(
    {
        "1": {"min": 0, "max": 100},
        "A": {"min": 0, "max": 1000},
    },
    orient="index"
)
print(mean_range)
#    min   max
# 1    0   100
# A    0  1000
```

Then the schema would look something like:

```python
class Schema(pa.SchemaModel):
    ...

    @pa.check(r"signal\d+")
    def check_mean_values(cls, series: Series) -> Series[bool]:
        """Check that mean values are between some range per machine_id."""
        mean_per_machine = pd.concat(
            [series.groupby(level="machine_id").mean(), mean_range],
            axis="columns",
        )
        # use series.name rather than hard-coding "signal1" so the check
        # applies to every signal column the regex matches
        return mean_per_machine[series.name].between(
            mean_per_machine["min"], mean_per_machine["max"]
        )
```
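For concreteness, the body of that mean-range check can be exercised on its own with a hypothetical signal series (assuming the index level is named `machine_id`, as set up above):

```python
import pandas as pd

# per-machine allowed range of the mean, as loaded above
mean_range = pd.DataFrame.from_dict(
    {"1": {"min": 0, "max": 100}, "A": {"min": 0, "max": 1000}},
    orient="index",
)

# hypothetical signal values for two machines
series = pd.Series(
    [10.0, 20.0, 500.0, 700.0],
    index=pd.Index(["1", "1", "A", "A"], name="machine_id"),
    name="signal1",
)

# align the per-machine means with their allowed ranges column-wise
mean_per_machine = pd.concat(
    [series.groupby(level="machine_id").mean(), mean_range],
    axis="columns",
)
result = mean_per_machine[series.name].between(
    mean_per_machine["min"], mean_per_machine["max"]
)
print(result.to_dict())  # {'1': True, 'A': True}
```

Machine "1" has mean 15.0 (within 0–100) and machine "A" has mean 600.0 (within 0–1000), so both pass.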
-
Hi @cosmicBboy, unfortunately I have not explained the situation with my DataFrames clearly, which is why I can't quite map your proposed solution onto it. That's also why I haven't come up with a better approach so far than using wide checks to address one specific max-bound value for each signal in the dictionary of 4 machines, using the index name of a constructed DataFrame.
-
Hi @ilianikolaenko92, I don't quite follow this. Can you provide a copy-pasteable example of (i) the dataframe you're trying to validate and (ii) the source metadata for signal ranges per machine?
-
Hello :)
I am new to the pandera module and am creating my first, very simple validation schema using the class-based API.
My task for now is to validate the maximum values of fields that differ according to an equipment id in the DataFrames my to-be-validated function returns.
Looking for the best implementation of such a condition, I introduced a custom wide check with @pa.dataframe_check in the schema. I have opened a question about my concerns on Stack Overflow here.
I would be very grateful for your feedback and suggestions on using either wide or tidy checks in pandera! It is critical for the project that we stick to SchemaModel and @pa.check_types, though; the object-based API with DataFrameSchema instances is not really an option...