pandas_dtype_strategy supports Category dtype #320

Open
cosmicBboy opened this issue Nov 11, 2020 · 2 comments

@cosmicBboy
Collaborator

Is your feature request related to a problem? Please describe.

Currently, the pandas_dtype_strategy function in #314 doesn't handle categorical data types. To be feature-complete, we'd want to support them, with the caveat that pandera doesn't currently support PandasDtype enums with additional metadata, such as the CategoricalDtype with its categories and ordered information.

Describe the solution you'd like

When constructing a field_element_strategy, programmatically fetch any Check.isin checks and get the allowed_values fields from those checks. Try to infer the pandas datatype of the underlying categorical value:

categories = []
category_pandas_dtype = None
if pandas_dtype.is_category:
    for check in checks:
        # pseudocode: match the checks that were created with Check.isin
        if check is Check.isin:
            # get categories and infer pandas dtype
            # and remove isin checks from the checks list
            ...
    elements = pandas_dtype_strategy(category_pandas_dtype, categories=categories)

...
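
To make the "fetch the isin checks" step concrete, here is a minimal sketch that uses only the standard library. `StubCheck` is a hypothetical stand-in, not pandera's `Check` class, and the `name` and `statistics` attributes are assumptions about how an isin check could expose its `allowed_values`. When several isin checks are present, the sketch keeps only the values allowed by all of them, since every check has to pass.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class StubCheck:
    """Hypothetical stand-in for a pandera Check (not the real class)."""

    name: str
    statistics: Dict[str, Any] = field(default_factory=dict)


def split_isin_checks(checks: List[StubCheck]) -> Tuple[List[Any], List[StubCheck]]:
    """Return (categories, remaining_checks).

    Categories are the values allowed by every isin check, in the order
    given by the first one. All non-isin checks are passed through.
    """
    categories: Optional[List[Any]] = None
    remaining: List[StubCheck] = []
    for check in checks:
        if check.name == "isin":
            allowed = list(check.statistics.get("allowed_values", []))
            if categories is None:
                categories = allowed
            else:
                # A value must satisfy all isin checks, so intersect.
                categories = [value for value in categories if value in allowed]
        else:
            remaining.append(check)
    return categories or [], remaining


def infer_element_type(categories: List[Any]) -> type:
    """Return the single Python type shared by all categories, else object."""
    kinds = {type(value) for value in categories}
    return kinds.pop() if len(kinds) == 1 else object
```

The inferred Python type would still need to be mapped onto a pandera dtype; that mapping is left out here because it depends on pandera internals.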

And then in pandas_dtype_strategy:

if pandas_dtype.is_category:
    return isin_strategy(
        pandas_dtype.String,
        strategy,
        allowed_values=kwargs.get("categories"),
    )

Since series/index/dataframe strategies cast the generated data to the correct data type, this workaround should work for now.
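
As a rough illustration of the "sample from the allowed values, then cast" idea, here is a sketch that uses only public hypothesis and pandas calls. The function name `categorical_series_strategy` and its parameters are made up for this example and are not part of pandera. It assumes a non-empty list of categories, because `sampled_from` cannot draw from an empty collection.

```python
import pandas as pd
from hypothesis import strategies as st


def categorical_series_strategy(categories, ordered=False, max_size=10):
    """Strategy for Series whose values are drawn from `categories`.

    Values are generated as plain Python objects first and cast to a
    categorical dtype afterwards, mirroring the workaround described above.
    """
    categories = list(categories)
    dtype = pd.CategoricalDtype(categories=categories, ordered=ordered)
    return st.lists(st.sampled_from(categories), max_size=max_size).map(
        lambda values: pd.Series(values, dtype=object).astype(dtype)
    )
```

Because every generated value comes from `categories`, the cast never introduces missing values.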

@cosmicBboy cosmicBboy added the enhancement New feature or request label Nov 11, 2020
@cosmicBboy cosmicBboy added this to the 0.7.0 Release milestone Jan 12, 2021
@cosmicBboy
Collaborator Author

This should be done after #369

@cosmicBboy cosmicBboy self-assigned this Jan 24, 2021
@zevisert
Contributor

zevisert commented Nov 22, 2021

#369 is done, and with Annotated types, we can embed the categories into the type!

import pandera as pa
import pandera.typing as P
from typing import Annotated

class MySchema(pa.SchemaModel):
    some_category: P.Series[Annotated[P.Category, ['foo', 'bar', 'baz', 'quux'], False]]

It seems like pandera could generate strategies from these annotations, and raise an error, as it does now, when there's no categories annotation. We might have to contribute to hypothesis first, though, since their pandas series strategy implementation seems to require creating numpy arrays with a specific dtype rather than pandas extension arrays.
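
For reference, reading the categories back out of an `Annotated` hint only needs the standard `typing` helpers. The sketch below uses hypothetical stand-ins (`Category`, `StubSchema`) instead of the pandera classes and leaves out the `Series[...]` wrapper for brevity. It assumes the metadata is exactly `(categories, ordered)`, as in the example above.

```python
from typing import Annotated, Any, List, Optional, Tuple
from typing import get_args, get_origin, get_type_hints


class Category:
    """Hypothetical stand-in for pandera.typing.Category."""


class StubSchema:
    """Hypothetical stand-in for a SchemaModel; only the annotation matters."""

    some_category: Annotated[Category, ["foo", "bar", "baz", "quux"], False]


def category_metadata(annotation: Any) -> Optional[Tuple[List[Any], bool]]:
    """Return (categories, ordered) for Annotated[Category, ...], else None."""
    if get_origin(annotation) is not Annotated:
        return None
    base, *metadata = get_args(annotation)
    if base is not Category or len(metadata) != 2:
        return None
    categories, ordered = metadata
    return list(categories), bool(ordered)


# include_extras=True keeps the Annotated metadata instead of stripping it.
hints = get_type_hints(StubSchema, include_extras=True)
```

A strategy generator could call `category_metadata(hints["some_category"])` and raise when it returns `None`, which matches the "error if there's no categories annotation" behavior suggested above.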
