
SeriesModel -- support for defining an index on a series. #688

Open
zevisert opened this issue Nov 24, 2021 · 9 comments
Labels
enhancement New feature or request

Comments

@zevisert
Contributor

zevisert commented Nov 24, 2021

Is your feature request related to a problem? Please describe.

We have SchemaModels, and we have inline types like P.Series[float], but we don't have a way to specify the kind of index that a series has. Consider this example function:

import pandas as pd
import pandera.typing as P

def is_positive_datetime_series(x: P.Series[P.Int32]) -> P.Series[bool]:
    if not isinstance(x.index, pd.DatetimeIndex):
        raise NotImplementedError

    return x > 0

Describe the solution you'd like

I'd like to be able to specify the index on a series, for places in my codebase that pass series with specific index types between functions.

Describe alternatives you've considered

I've thought of two solutions that would be acceptable:

Idea 1:

A schema model that borrows the idea of __root__ from pydantic:

import pandas as pd
import pandera as pa
import pandera.typing as P

class DatetimeAmountSeries(pa.SchemaModel):
    index: P.Index[P.DateTime]
    __root__: P.Series[P.Int32]

Idea 2:

More annotated type options for P.Series:

import pandera.typing as P
from typing import TypeAlias, Annotated

DatetimeAmountSeries: TypeAlias = Annotated[P.Series[P.Int32], P.Index[P.DateTime]]
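For reference, the metadata in an Annotated alias is purely positional, which is what a library would have to unpack at validation time. A minimal stdlib-only sketch, with placeholder types standing in for pandera's P.Series / P.Index aliases:

```python
from typing import Annotated, get_args

# Placeholders standing in for pandera's generic aliases, just to show
# the mechanism: Annotated keeps its metadata positionally.
SeriesStandIn = list            # stands in for P.Series[P.Int32]
IndexStandIn = "DatetimeIndex"  # stands in for P.Index[P.DateTime]

DatetimeAmountSeries = Annotated[SeriesStandIn, IndexStandIn]

# get_args returns the base type first, then the metadata in declaration order.
base, *metadata = get_args(DatetimeAmountSeries)
print(base)      # <class 'list'>
print(metadata)  # ['DatetimeIndex']
```

Since the metadata entries are unnamed, any consumer of this alias has to rely on their order or their types to tell the index annotation apart from other metadata.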

Additional context

@zevisert zevisert added the enhancement New feature or request label Nov 24, 2021
@zevisert
Contributor Author

Oh, forgot to mention -- both of the proposed ideas are already valid code, but they don't validate how I'd like them to.

@cosmicBboy
Collaborator

cosmicBboy commented Dec 5, 2021

hi @zevisert thanks for proposing this, it's a great idea!

I think the syntax of idea #1 will be more useful and expressive, since you can also provide pa.Field metadata and custom checks via the @pa.check decorator.

class DatetimeAmountSeries(pa.SchemaModel):
    index: P.Index[P.DateTime]
    __root__: P.Series[P.Int32]

After reading the pydantic docs on custom root types, one question I have about the __root__ keyword in pydantic is what specific use case it addresses? Basically want to make sure the semantics in pydantic and pandera match up.

A slight alternative to consider is that pandera should implicitly understand that a SchemaModel with a single Series attribute can validate series objects.

class DatetimeAmountSeries(pa.SchemaModel):
    name: P.Series[P.Int32] = pa.Field(check_name=True)  # validate the Series name. If False, don't check it.
    index: P.Index[P.DateTime]

We could introduce a pa.FieldModel to separate concerns and make single-field validation more explicit... I'm concerned that conflating the purpose of SchemaModel to validate both dataframes and series might lead to confusion.

On the other hand, it would be convenient to be able to reuse SchemaModels for both datastructures.

Do you have any thoughts @jeffzi ?

@jeffzi
Collaborator

jeffzi commented Dec 7, 2021

Idea 2 is very verbose and not intuitive. You need to remember the order of arguments since the Annotated mechanism does not allow naming arguments.

one question I have about the __root__ keyword in pydantic is what specific use case it addresses?

I think the use case is JSON (and JSON schema) output. Consider:

from typing import List
from pydantic import BaseModel


class Pets(BaseModel):
    species: List[str]


print(Pets(species=["dog", "cat"]).json())
#> {"species": ["dog", "cat"]}


class Pets(BaseModel):
    __root__: List[str]


print(Pets(__root__=["dog", "cat"]).json())
#> ["dog", "cat"]

The semantics do match up. __root__ indicates the type of the modeled pandas object.

I guess the default __root__ for regular SchemaModels should be __root__=DataFrame so that you can inherit a Series model and transform it to a dataframe model. __root__=DataFrame[Schema] seems dangerous though. Suppose your model inherits a base model A and you specify another model B in root: __root__=DataFrame[B].

A slight alternative to consider is that pandera should implicitly understand that a SchemaModel with a single Series attribute can validate series objects.

Users will be required to name the unique "column" of the series even if they don't care about it. On the other hand it circumvents the above problems and makes the model API consistent. We could introduce a SeriesModel to solve the dilemma:

class DatetimeAmountSeries(pa.SeriesModel):
    name: P.Series[P.Int32] = pa.Field(check_name=True, ge=0) 
    index: P.Index[P.DateTime]

class Schema(pa.SchemaModel):
    dttm_new_syntax: P.Series[DatetimeAmountSeries] # ignore index validation
    ddtm: P.Series[P.Int32] = pa.Field(ge=0) # equivalent to dttm_new_syntax

The above syntax makes Series validation more re-usable. At the moment, you can re-use a pre-defined Field but you will still have to specify the dtype in the annotation. You could say it's similar to pydantic custom types. That would also give us a better way to introduce non-native "dtypes": email, paths, etc.
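As a rough illustration of the "non-native dtype" idea: an email "dtype" is essentially a base dtype plus a check. A sketch in plain pandas (the function name and regex are illustrative, not a pandera API):

```python
import re

import pandas as pd

# Deliberately loose pattern, only for illustration.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_email_series(s: pd.Series) -> pd.Series:
    """Return a boolean mask: True where the value looks like an email."""
    return s.astype("string").str.match(EMAIL_RE)

mask = check_email_series(pd.Series(["a@b.co", "nope"]))
print(mask.tolist())  # [True, False]
```

A reusable Series model could bundle that check with the string dtype, so callers annotate with one name instead of repeating dtype plus check.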

@cosmicBboy
Collaborator

cosmicBboy commented Dec 31, 2021

I'm down for introducing a new Model base class that handles this case, although I'd like to propose a slightly different naming to be ArrayModel.

class DatetimeAmount(pa.ArrayModel):
    name: P.Int32 = pa.Field(check_name=True, ge=0)   # no need to specify `Series` type
    index: P.Index[P.DateTime]  # optional index, only for pandas.Series

class Schema(pa.SchemaModel):
    dttm_new_syntax: P.Series[DatetimeAmount] # ignore index validation
    ddtm: P.Series[P.Int32] = pa.Field(ge=0) # equivalent to dttm_new_syntax

Then the array model can be used as a Series like so:

from pandera.typing.pandas import Series

def function(series: Series[DatetimeAmount]): ...

# and eventually
from pandera.typing.numpy import Array
from pandera.typing.pytorch import Tensor

def function(np_array: Array[DatetimeAmount]): ...

def function(torch_tensor: Tensor[DatetimeAmount]): ...

I think this strikes a nice balance of being specific enough to the pandas domain while being able to model all sorts of array-like data structures like numpy arrays, pytorch tensors, xarray.DataArray, and pandas.Series.

Basically the pattern I want to explore here is to:

  • have ArrayModel encapsulate properties about a semantic (potentially n-dimensional) array
  • have SchemaModel encapsulate properties of a dict-like mapping of keys to n-dimensional arrays

pandas.DataFrame and xarray.Dataset are basically a mapping of keys to "alignable" arrays according to some type of coordinate system (pandas.Index, or coords in xarray).

This might be a little ambitious, i.e. pre-mature abstraction, but I do want to see how far we can take the whole idea of "defining a schema once, use it to validate a bunch of different data container types".
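To make the "define once, validate many containers" idea concrete, here is a toy sketch (names and logic are illustrative, not pandera's internals) where one model validates both a pandas.Series and a numpy array through their shared .dtype attribute:

```python
import numpy as np
import pandas as pd

class ArrayModel:
    """Toy sketch: one semantic array model, many container types."""

    dtype = None

    @classmethod
    def validate(cls, data):
        # pandas.Series and numpy.ndarray both expose .dtype,
        # so a single check covers both containers.
        if data.dtype != cls.dtype:
            raise TypeError(f"expected {cls.dtype}, got {data.dtype}")
        return data

class DatetimeAmount(ArrayModel):
    dtype = np.int32

series = DatetimeAmount.validate(pd.Series([1, 2], dtype="int32"))
array = DatetimeAmount.validate(np.array([1, 2], dtype=np.int32))
```

Extending the same pattern to torch tensors or xarray.DataArray would mostly be a matter of per-backend dtype translation.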

thoughts @zevisert @jeffzi ?

@jeffzi
Collaborator

jeffzi commented Jan 3, 2022

I agree with what you laid out. I don't think it's premature. Pandera has started opening up to new data containers. I'd rather explore the ArrayModel idea before consolidating support for non-pandas libraries.

One nitpick though: I agree it's nice not having to specify Series typing for ArrayModel, but for consistency I think we shouldn't have to specify index typing either. I suggest an argument pa.Field(index: bool) that would only apply to "arrays" supporting an index.

@zevisert
Contributor Author

zevisert commented Jan 3, 2022

Ditto on that @jeffzi! I think it's good timing to explore how we want to model your two bullet points @cosmicBboy.

Sure, a lower-level ArrayModel (maybe pa.Matrix?) makes sense to me.

I think the nitpick is reasonable. Pandas at least lets you get away with a default RangeIndex if none is specified. Come to think of it -- given that pd.Series() with no arguments produces the warning The default dtype for empty Series will be 'object' instead of 'float64' in a future version., perhaps P.Series[P.ArrayModel] could be an allowable, albeit not very useful, way to express a series with no dtype or index, much like class Lax(pandera.SchemaModel): pass does.

@cosmicBboy
Collaborator

I suggest an argument pa.Field(index:bool) that would only apply to "arrays" supporting an index.

Cool! This sounds good to me... it's also nice because it doesn't shoe-horn ArrayModel to use pandera.typing.pandas.Index as the index annotation.

@zevisert any interest in contributing a PR for this? @jeffzi is the expert when it comes to the SchemaModel stuff, but I can also help out with guidance if needed.

@skrawcz

skrawcz commented Jul 17, 2022

I'm not that well versed in the SchemaModel classes, but taking a further step back, would it make sense to have a more granular level, i.e. a "value check" on primitive data types?

Rationale:

  • checks can be composed into map operations or aggregation operations.
  • map operations check single values at a time (e.g. is > 10)
  • aggregation operations use multiple values from a list/tensor/dataframe (e.g. sum() > 10)

So why not start at the most granular thing and build from there?

Example use case context: right now with Hamilton people can return primitive types and they can't use pandera to express the check on them. E.g. the function returns the mean of some series.

Just spitballing here, but this is what I believe I'm suggesting:

class SpendAmount(pa.ValueModel):
    value: P.Int32 = pa.Field(ge=0, nan=False, le=1000)  # can only have `value` field?

class DatetimeAmount(pa.ArrayModel):
    name: SpendAmount = pa.Field(mean=dict(ge=20, le=30))  # making this aggregation-check syntax up
    index: P.Index[P.DateTime]  # optional index, only for pandas.Series

class MyDataFrameSchema(pa.SchemaModel):
    ...

Alternatively, if this doesn't fit here, then maybe a ValueSchema class analogous to DataFrameSchema and SeriesSchema?
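As a rough illustration of the idea (pure Python, not an existing pandera API -- ValueModel and the check names here are made up), a value-level model could just run scalar predicates:

```python
class ValueModel:
    """Toy sketch of scalar-level validation; not an existing pandera class."""

    checks = []

    @classmethod
    def validate(cls, value):
        # Run every (name, predicate) pair against the single scalar value.
        for name, predicate in cls.checks:
            if not predicate(value):
                raise ValueError(f"check {name!r} failed for {value!r}")
        return value

class SpendAmount(ValueModel):
    checks = [
        ("ge0", lambda v: v >= 0),
        ("le1000", lambda v: v <= 1000),
    ]

print(SpendAmount.validate(42))  # passes: 42
```

Map-style checks on an array would then just apply the same predicates element-wise, which is the composability argument above.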

@joaoe

joaoe commented Sep 1, 2023

Hi. Glad I found this, since I need a SeriesModel.

We have DataFrameModel plus DataFrameSchema. But we have only SeriesSchema and no SeriesModel.

My suggestion:

  1. The now-deprecated SchemaModel is un-deprecated and keeps all the current code that is common between DataFrameModel and SeriesModel
  2. The new DataFrameModel is just
    class DataFrameModel(SchemaModel):
        pass
  3. The new SeriesModel follows the same principle
    class SeriesModel(SchemaModel):
         # differences in behavior

Now SeriesModel does some things differently from DataFrameModel:

  1. Forces exactly one column for the Model
  2. The column can be overridden through inheritance to specify new metadata, but the name must stay the same.
  3. The default Field(check_name=None) for the single column in a Series is treated as False during validation, since we're often not too concerned about a series' name.
  4. Following on 3, check_name=... can be set on the pa.Field() or as a SeriesModel Config value, e.g. column_check_name=False|True.
  5. SeriesModel.to_schema() obviously returns a SeriesSchema object.
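Point 1 could be enforced at class-definition time. A minimal sketch (toy code, not pandera's internals) using __init_subclass__:

```python
class SeriesModel:
    """Toy sketch: reject SeriesModel subclasses without exactly one column field."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Every annotated attribute except `index` counts as a column.
        annotations = cls.__dict__.get("__annotations__", {})
        columns = [name for name in annotations if name != "index"]
        if len(columns) != 1:
            raise TypeError(
                f"SeriesModel requires exactly one column field, got {columns}"
            )

class Amount(SeriesModel):
    value: int  # the single column
    index: str  # index annotation, not counted as a column
```

Defining a subclass with zero or two column fields would raise TypeError immediately, so misuse fails at import time rather than at validation time.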

So, this matches a bit what @cosmicBboy said, e.g.

A slight alternative to consider is that pandera should implicitly understand that a SchemaModel with a single Series attribute can validate series objects.

however, I don't entirely agree with this:

I'm down for introducing a new Model base class that handles this case, although I'd like to propose a slightly different naming to be ArrayModel.

Everyone who works with pandas knows what a DataFrame and a Series are, so those names should be reused.

To conclude, the advantage of this proposal is that the changes needed to implement it are minimal and reuse much of the existing classes.
