Pandera dataframe in Pydantic model .dict() and .json() compatibility #966

Closed
derinwalters opened this issue Oct 16, 2022 · 6 comments
Labels
question Further information is requested

Comments

@derinwalters
Contributor

In reading through the Pandera documentation, it's not clear to me how to intermingle Pandera dataframes within a Pydantic model and still be able to use .dict() and .json() methods successfully. I followed the steps on https://pandera.readthedocs.io/en/stable/pydantic_integration.html#using-pandera-schemas-in-pydantic-models and love how seamless it is. However, the .dict() method keeps the Pandera type and .json() fails altogether. The solution provided by Pandera's to_format is close, but I want to keep the validated dataframe intact while I perform operations then convert format later (not right away). Is there a way to do this?

from devtools import debug
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic

class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
myinst = PydanticModel(x=1, df=valid_df)
debug(myinst)
debug(myinst.dict())
debug(myinst.json())

Output:

test.py:26 <module>
    myinst: PydanticModel(
        x=1,
        df=<DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    ) (PydanticModel)
test.py:27 <module>
    myinst.dict(): {
        'x': 1,
        'df': <DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    } (dict) len=2
Traceback (most recent call last):
  File "/Users/derinw/x-bitbucket/juso/tests/test.py", line 28, in <module>
    debug(myinst.json())
  File "pydantic/main.py", line 505, in pydantic.main.BaseModel.json
  File "/Users/derinw/miniforge3/envs/juso-dev/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/Users/derinw/miniforge3/envs/juso-dev/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/derinw/miniforge3/envs/juso-dev/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "pydantic/json.py", line 90, in pydantic.json.pydantic_encoder
TypeError: Object of type 'DataFrame' is not JSON serializable
@derinwalters derinwalters added the question Further information is requested label Oct 16, 2022
@cosmicBboy
Collaborator

Hi @derinwalters, this is currently unexplored territory, so I'd appreciate some clarification on the use cases here.

For the .dict() method, is the expectation that the df key is turned into a list of records? or some other format?

I suspect once dict works the json() method should as well.

Are you familiar with how to create custom pydantic types? How does one extend a type within a BaseModel so that it can be converted to a JSON-serializable dict?

@cosmicBboy
Collaborator

cosmicBboy commented Oct 18, 2022

so looking at pydantic docs, this will work:

from devtools import debug
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic

class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

    class Config:
        json_encoders = {
            pd.DataFrame: lambda x: x.to_dict(orient="records")
        }

valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
myinst = PydanticModel(x=1, df=valid_df)
debug(myinst)
debug(myinst.dict())
debug(myinst.json())

Output:

foo.py:21 <module>
    myinst: PydanticModel(
        x=1,
        df=<DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    ) (PydanticModel)
foo.py:22 <module>
    myinst.dict(): {
        'x': 1,
        'df': <DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    } (dict) len=2
foo.py:23 <module>
    myinst.json(): '{"x": 1, "df": [{"str_col": "hello"}, {"str_col": "world"}]}' (str) len=60

The .dict() method is not really customizable, but the json_encoders configuration lets you serialize your validated data to JSON by telling pydantic how to handle certain, potentially unknown, types.
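
For intuition, the traceback in the original post shows pydantic's encoder being called from the standard library's json module, so the mechanism resembles the `default` hook of `json.dumps`. Below is a minimal stdlib-only sketch of that idea, not pydantic's actual implementation. `FakeFrame`, `ENCODERS`, and `encode_unknown` are hypothetical names, and `FakeFrame` is only a stand-in for a real DataFrame. It also shows that round-tripping through JSON is one way to get a plain dict with records, since .dict() leaves the dataframe untouched.

```python
import json


class FakeFrame:
    """Hypothetical stand-in for a DataFrame: column name -> list of values."""

    def __init__(self, columns):
        self.columns = columns

    def to_records(self):
        # Same shape as DataFrame.to_dict(orient="records"): one dict per row.
        names = list(self.columns)
        rows = zip(*(self.columns[name] for name in names))
        return [dict(zip(names, row)) for row in rows]


# Type -> encoder mapping, analogous in spirit to Config.json_encoders.
ENCODERS = {FakeFrame: lambda frame: frame.to_records()}


def encode_unknown(obj):
    # json.dumps calls this only for objects it cannot serialize on its own.
    for typ, encoder in ENCODERS.items():
        if isinstance(obj, typ):
            return encoder(obj)
    raise TypeError(f"Object of type {type(obj).__name__!r} is not JSON serializable")


payload = {"x": 1, "df": FakeFrame({"str_col": ["hello", "world"]})}
as_json = json.dumps(payload, default=encode_unknown)
as_plain_dict = json.loads(as_json)  # plain dict with a list of records under "df"
```

With the real model, the equivalent round trip would presumably be `json.loads(myinst.json())`, at the cost of one extra serialization pass.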

@derinwalters
Contributor Author

@cosmicBboy thank you so much for your suggestion. Leveraging the Config json_encoders seems like just the thing. I will give this a try and report back.

The use case is a hierarchical data class that I store in MongoDB and process locally. Recently I transitioned from a monolithic Pandas dataframe to lists of Pydantic class dictionaries where I convert to Pandas for manipulation. However, this incurs extra to-from conversion cost that never really seemed ideal. I don't remember exactly how, but last week I stumbled across Pandera and thought to myself "this is exactly what I was looking for!" and so here I am kicking the tires.

@cosmicBboy
Collaborator

However, this incurs extra to-from conversion cost that never really seemed ideal

Yep! this is pretty much the reason I built pandera, though at the time I wasn't aware of pydantic and was doing the same thing with the schema library.

@derinwalters
Contributor Author

I think the proposed solution works well enough for what I was asking. Thanks! I'm having a bit of trouble, though, figuring out how to properly validate columns of list-like and dictionary-like elements, which is rather straightforward in a row-by-row pydantic approach, and I will continue working on that. It looks like you're also already working on providing a default value option in #502, which is great.

@cosmicBboy
Collaborator

Great!

I'm having a bit of trouble though with figuring out how to properly validate columns of list-like and dictionary-like elements

There's issue #260 for this, but for now I'd recommend custom checks:

import pandera as pa
from pandera.typing import Series

class SimpleSchema(pa.SchemaModel):
    list_col: Series[object]
    dict_col: Series[object]

    @pa.check("list_col")
    def check_list(cls, series):
        # Element-wise check: every value must be a list.
        return series.map(lambda x: isinstance(x, list))  # check any other property about this column

    @pa.check("dict_col")
    def check_dict(cls, series):
        # Element-wise check: every value must be a dict.
        return series.map(lambda x: isinstance(x, dict))  # check any other property about this column
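
As an illustration of what "any other property" might look like, here is a small sketch of element-level predicates in plain Python. The helper names `is_list_of` and `is_dict_with_keys` are hypothetical, not part of pandera. The idea is that they could be passed to `series.map(...)` inside a custom check in place of the bare `isinstance` lambdas.

```python
def is_list_of(value, item_type):
    """True if value is a list whose items are all instances of item_type."""
    return isinstance(value, list) and all(isinstance(item, item_type) for item in value)


def is_dict_with_keys(value, required_keys):
    """True if value is a dict that contains every key in required_keys."""
    return isinstance(value, dict) and set(required_keys).issubset(value)


# Quick demonstration on plain Python values (no dataframe needed).
list_cells = [["a", "b"], [], ["a", 1], "ab"]
dict_cells = [{"id": 1, "tag": "x"}, {"id": 2}, ["id", "tag"]]

list_results = [is_list_of(cell, str) for cell in list_cells]
dict_results = [is_dict_with_keys(cell, {"id", "tag"}) for cell in dict_cells]

print(list_results)  # [True, True, False, False]
print(dict_results)  # [True, False, False]
```

Note that an empty list passes `is_list_of`. If empty lists should be rejected, that would need an extra length condition.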
