-
-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SeriesModel -- support for defining an index on a series. #688
Comments
Oh, forgot to mention -- both of the proposed ideas are already valid code, but they don't validate how I'd like them to. |
hi @zevisert thanks for proposing this, it's a great idea! I think the syntax of idea # 1 will be more useful and expressive, since you can also provide class DatetimeAmountSeries(pa.SchemaModel):
index: P.Index[P.DateTime]
__root__: P.Series[P.Int32] After reading the pydantic docs on custom root types, one question I have about the A slight alternative to consider is that pandera should implicitly understand that a class DatetimeAmountSeries(pa.SchemaModel):
name: P.Series[P.Int32] = pa.Field(check_name=True) # validate the Series name. If False, don't check it.
index: P.Index[P.DateTime] We could introduce a On the other hand, it would be convenient to be able to reuse SchemaModels for both datastructures. Do you have any thoughts @jeffzi ? |
Idea 2 is very verbose and not intuitive. You need to remember the order of arguments since the
I think the use case is json(schema) output. Consider: from typing import List
from pydantic import BaseModel
class Pets(BaseModel):
species: List[str]
print(Pets(species=["dog", "cat"]).json())
#> {"species": ["dog", "cat"]}
class Pets(BaseModel):
__root__: List[str]
print(Pets(__root__=["dog", "cat"]).json())
#> ["dog", "cat"] The semantics do match up. I guess the default
Users will be required to name the unique "column" of the series even if they don't care about it. On the other hand it circumvents the above problems and makes the model API consistent. We could introduce a class DatetimeAmountSeries(pa.SeriesModel):
name: P.Series[P.Int32] = pa.Field(check_name=True, ge=0)
index: P.Index[P.DateTime]
class Schema(pa.SchemaModel):
dttm_new_syntax: P.Series[DatetimeAmountSeries] # ignore index validation
ddtm: P.Series[P.Int32] = pa.Field(ge=0) # equivalent to dttm_new_syntax The above syntax makes Series validation more re-usable. At the moment, you can re-use a pre-defined |
I'm down for introducing a new class DatetimeAmount(pa.ArrayModel):
name: P.Int32 = pa.Field(check_name=True, ge=0) # no need to specify `Series` type
index: P.Index[P.DateTime] # optional index, only for pandas.Series
class Schema(pa.SchemaModel):
dttm_new_syntax: P.Series[DatetimeAmount] # ignore index validation
ddtm: P.Series[P.Int32] = pa.Field(ge=0) # equivalent to dttm_new_syntax Then the array model can be used as a Series like so: from pandera.typing.pandas import Series
def function(series: Series[DatetimeAmount]): ...
# and eventually
from pandera.typing.numpy import Array
from pandera.typing.pytorch import Tensor
def function(np_array: Array[DatetimeAmount]): ...
def function(torch_tensor: Tensor[DatetimeAmount]): ... I think this strikes a nice balance of being specific enough to the pandas domain while being able to model all sorts of array-like data structures like numpy arrays, pytorch tensors, xarray.DataArray, and pandas.Series. Basically the pattern I want to explore here is to:
This might be a little ambitious, i.e. pre-mature abstraction, but I do want to see how far we can take the whole idea of "defining a schema once, use it to validate a bunch of different data container types". |
I agree with what you laid out. I don't think it's premature. Pandera has started opening up to new data containers. I'd rather explore the One nitpick though. I agree it's nice not having to specify |
Ditto on that @jeffzi! I think it's good timing to explore how we want to model your two bullet points @cosmicBboy. Sure a lower level I think the nitpick is reasonable. Pandas at least lets you get away with a default |
Cool! This sounds good to me... it's also nice because it doesn't shoe-horn @zevisert any interest in contributing a PR for this? @jeffzi is the expert when it comes to the |
I'm not that versed with the SchemaModel classes, but taking a further step back, would it make sense to have a more granular level, i.e. a "value check" over primitive data types? Rationale:
Example use case context: right now with Hamilton people can return primitive types and they can't use pandera to express the check on them. E.g. the function returns the mean of some series. Just spit balling here, but this is what I believe I'm suggesting: class SpendAmount(pa.ValueModel):
value: P.Int32 = pa.Field(ge=0, nan=False, le=1000) # can only have `value` field?
class DatetimeAmount(pa.ArrayModel):
name: SpendAmount = pa.Field(mean=dict(ge=20, le=30)) # making this aggregation syntax check up
index: P.Index[P.DateTime] # optional index, only for pandas.Series
class MyDataFrameSchema(pa.SchemaModel):
... alternatively if this doesn't fit here, then maybe a |
Hi. We have My suggestion.
Now
So, this matches a bit what @cosmicBboy said, e.g.
however I don't entirely agree on
Since everyone that works in pandas knows what a DataFrame and Series are, and those names should be reused. To conclude, the advantge of this proposal is that the changes needed to implement are minimal and already use much of the existing classes. |
Is your feature request related to a problem? Please describe.
We have
SchemaModels
, and we have inline types likeP.Series[float]
, but we don't have a way to specify the kind of index that a series has. Consider this example function:Describe the solution you'd like
I'd like to be able to specify the index on a series, for places in my codebase that pass series with specific index types between functions.
Describe alternatives you've considered
I've thought of two solutions would be acceptable:
Idea 1:
A schema model that borrows it's the idea of
__root__
from pydantic:Idea 2:
More annotated type options for
P.Series
:Additional context
The text was updated successfully, but these errors were encountered: