Pandera dataframe in Pydantic model .dict() and .json() compatibility #966

derinwalters · 2022-10-16T02:13:02Z

In reading through the Pandera documentation, it's not clear to me how to intermingle Pandera dataframes within a Pydantic model and still be able to use the .dict() and .json() methods successfully. I followed the steps on https://pandera.readthedocs.io/en/stable/pydantic_integration.html#using-pandera-schemas-in-pydantic-models and love how seamless it is. However, the .dict() method keeps the Pandera type and .json() fails altogether. The solution provided by Pandera's to_format is close, but I want to keep the validated dataframe intact while I perform operations, then convert the format later (not right away). Is there a way to do this?

from devtools import debug
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic

class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
myinst = PydanticModel(x=1, df=valid_df)
debug(myinst)
debug(myinst.dict())
debug(myinst.json())

test.py:26 <module>
    myinst: PydanticModel(
        x=1,
        df=<DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    ) (PydanticModel)
test.py:27 <module>
    myinst.dict(): {
        'x': 1,
        'df': <DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    } (dict) len=2
Traceback (most recent call last):
  File "/Users/derinw/x-bitbucket/juso/tests/test.py", line 28, in <module>
    debug(myinst.json())
  File "pydantic/main.py", line 505, in pydantic.main.BaseModel.json
  File "/Users/derinw/miniforge3/envs/juso-dev/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/Users/derinw/miniforge3/envs/juso-dev/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/derinw/miniforge3/envs/juso-dev/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "pydantic/json.py", line 90, in pydantic.json.pydantic_encoder
TypeError: Object of type 'DataFrame' is not JSON serializable
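
For context, the traceback passes through the standard library's `json.dumps`, which raises `TypeError` for any object it cannot encode unless a `default` hook handles that type. Below is a minimal sketch of that mechanism using only pandas and the standard library, without pydantic or Pandera. The `encode_unknown` helper is hypothetical and only for illustration.

```python
import json

import pandas as pd

df = pd.DataFrame({"str_col": ["hello", "world"]})
payload = {"x": 1, "df": df}

# Without a hook, json.dumps does not know how to encode a DataFrame.
try:
    json.dumps(payload)
    failed = False
except TypeError:
    failed = True


def encode_unknown(obj):
    """Hypothetical fallback: json calls this only for types it cannot encode."""
    if isinstance(obj, pd.DataFrame):
        return obj.to_dict(orient="records")
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# With the hook, the DataFrame is encoded as a list of row records.
encoded = json.dumps(payload, default=encode_unknown)
```

The dataframe itself is untouched; the conversion happens only at the moment of encoding.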

cosmicBboy · 2022-10-18T17:07:27Z

Hi @derinwalters, this is currently unexplored territory, so I would appreciate clarification on the use cases here.

For the .dict() method, is the expectation that the df key is turned into a list of records? Or some other format?
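
To make the format question concrete, here is a small sketch of a few layouts that pandas' `DataFrame.to_dict` can produce for the dataframe in this issue. It uses pandas only; which layout is appropriate depends on the consumer and is not something this thread settles.

```python
import pandas as pd

df = pd.DataFrame({"str_col": ["hello", "world"]})

# One dict per row.
as_records = df.to_dict(orient="records")

# One list per column.
as_lists = df.to_dict(orient="list")

# Index, column names, and row data kept as separate entries.
as_split = df.to_dict(orient="split")
```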

I suspect once dict works the json() method should as well.

Are you familiar with how to create custom pydantic types? How does one extend a type within a BaseModel so that it can be converted to a json-serializable dict?

cosmicBboy · 2022-10-18T18:20:43Z

So, looking at the pydantic docs, this will work:

from devtools import debug
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic

class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

    class Config:
        json_encoders = {
            pd.DataFrame: lambda x: x.to_dict(orient="records")
        }

valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
myinst = PydanticModel(x=1, df=valid_df)
debug(myinst)
debug(myinst.dict())
debug(myinst.json())

Output:

foo.py:21 <module>
    myinst: PydanticModel(
        x=1,
        df=<DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    ) (PydanticModel)
foo.py:22 <module>
    myinst.dict(): {
        'x': 1,
        'df': <DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    } (dict) len=2
foo.py:23 <module>
    myinst.json(): '{"x": 1, "df": [{"str_col": "hello"}, {"str_col": "world"}]}' (str) len=60

The .dict() method is not really customizable, but the json_encoders configuration lets you serialize your validated data to JSON by telling pydantic how to handle certain, potentially unknown types.
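
Because `.dict()` leaves the dataframe as-is, one option is to post-process its result only when a plain structure is needed, which keeps the validated dataframe intact on the model until then. Here is a rough sketch using pandas only. `frames_to_records` is a hypothetical helper, not part of pydantic or Pandera, and `model_dict` stands in for what `myinst.dict()` returns.

```python
import pandas as pd


def frames_to_records(value):
    """Recursively replace DataFrames with lists of row records."""
    if isinstance(value, pd.DataFrame):
        return value.to_dict(orient="records")
    if isinstance(value, dict):
        return {key: frames_to_records(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [frames_to_records(item) for item in value]
    return value


# Stand-in for the output of myinst.dict() (assumption for illustration).
model_dict = {"x": 1, "df": pd.DataFrame({"str_col": ["hello", "world"]})}

# The original dict still holds the DataFrame; `plain` is a converted copy.
plain = frames_to_records(model_dict)
```

The `isinstance` check against `pd.DataFrame` should also match Pandera's typed `DataFrame`, since it subclasses the pandas one.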

derinwalters · 2022-10-21T10:14:24Z

@cosmicBboy thank you so much for your suggestion. Leveraging the Config json_encoders seems like just the thing. I will give this a try and report back.

The use case is a hierarchical data class that I store in MongoDB and process locally. Recently I transitioned from a monolithic Pandas dataframe to lists of Pydantic class dictionaries where I convert to Pandas for manipulation. However, this incurs extra to-from conversion cost that never really seemed ideal. I don't remember exactly how, but last week I stumbled across Pandera and thought to myself "this is exactly what I was looking for!" and so here I am kicking the tires.

cosmicBboy · 2022-10-21T13:56:32Z

However, this incurs extra to-from conversion cost that never really seemed ideal

Yep! this is pretty much the reason I built pandera, though at the time I wasn't aware of pydantic and was doing the same thing with the schema library.

derinwalters · 2022-10-22T23:43:46Z

I think the proposed solution works well enough for what I was asking. Thanks! I'm having a bit of trouble, though, figuring out how to properly validate columns of list-like and dictionary-like elements, which is rather straightforward in a row-by-row pydantic approach, and will continue working on that. It looks like you're also already working on providing a default value option in #502, which is great.

cosmicBboy · 2022-10-24T14:20:51Z

Great!

I'm having a bit of trouble though with figuring out how to properly validate columns of list-like and dictionary-like elements

There's this issue, #260, but for now I'd recommend custom checks:

import pandera as pa
from pandera.typing import Series

class SimpleSchema(pa.SchemaModel):
    list_col: Series[object]
    dict_col: Series[object]

    @pa.check("list_col")
    def check_list(cls, series):
        # element-wise check; extend with any other property about this column
        return series.map(lambda x: isinstance(x, list))

    @pa.check("dict_col")
    def check_dict(cls, series):
        # element-wise check; extend with any other property about this column
        return series.map(lambda x: isinstance(x, dict))
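
As an illustration of the "any other property" part, the sketch below shows stricter element-wise predicates that a custom check could return. It uses pandas only so that it runs without Pandera. The helper names `is_list_of_str` and `has_keys`, and the required key `"id"`, are assumptions made up for this example.

```python
import pandas as pd


def is_list_of_str(value):
    """True if the element is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def has_keys(value, required=("id",)):
    """True if the element is a dict containing every required key."""
    return isinstance(value, dict) and set(required) <= value.keys()


df = pd.DataFrame(
    {
        "list_col": [["a", "b"], ["c"], "oops"],
        "dict_col": [{"id": 1}, {"id": 2}, {"name": "x"}],
    }
)

# Each result is a boolean Series, one value per row.
list_ok = df["list_col"].map(is_list_of_str)
dict_ok = df["dict_col"].map(has_keys)
```

Inside a check method, returning `series.map(is_list_of_str)` would follow the same pattern as the `isinstance` checks above.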

derinwalters added the "question" (Further information is requested) label Oct 16, 2022

derinwalters closed this as completed Oct 22, 2022
