
Refactor PandasDtype #490

Merged 16 commits on May 24, 2021
Conversation

Collaborator

@jeffzi jeffzi commented May 11, 2021

This PR implements a dtype hierarchy that replaces PandasDtype, as discussed in #369.

Goals of the refactor

  • Decouple pandera and pandas dtypes: opening up to other dataframe-like data structures in the python ecosystem, such as Apache Spark, Apache Arrow and xarray.
  • Built-in support for dynamic dtypes: e.g. categorical dtype implementations often take ordered and categories arguments.
  • Allow end-users to customize dtype coercion: e.g. pass a date format, or coerce ["oui", "non"] to boolean.
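To make the third goal concrete, here is a hypothetical sketch of a user-defined datatype whose coercion maps French "oui"/"non" strings to booleans. The class and method names are illustrative, not pandera's actual API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch: a custom boolean datatype with user-defined
# coercion rules. "YesNoBool" and its fields are invented for
# illustration; they are not part of pandera.
@dataclass(frozen=True)
class YesNoBool:
    truthy: str = "oui"
    falsy: str = "non"

    def coerce(self, values: List[str]) -> List[bool]:
        """Map the configured truthy/falsy strings to booleans."""
        mapping = {self.truthy: True, self.falsy: False}
        return [mapping[v] for v in values]
```

Because the datatype carries its own parameters (truthy/falsy), coercion behavior can vary per column without touching the schema machinery.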

Contract

From now on, I'm going to refer to numpy/pandas dtypes as native dtypes in contrast to pandera's DataType.

A DataType class implements the following contract:

  1. Subclass DataType
  2. DataType.coerce implements coercion logic that was previously located in DataFrameSchema.
  3. DataType.check implements an "equivalence" check between datatypes. The reason for not using __eq__ is that some dtypes are equivalent even if they are not represented by the same underlying pandas/numpy dtype. For example, pandas represents numpy.str_ as numpy.object_. Another example would be implementing a Number datatype that is equivalent to Float and Int.
  4. Implement __hash__ and be immutable. Required for internal registry. In practice, datatypes are implemented via dataclasses which give us those properties for free.
  5. __str__ should return the native alias.
class DataType:

    def coerce(self, obj: Any):
        """Coerce object to the dtype."""
        raise NotImplementedError()

    def check(self, datatype: "DataType") -> bool:
        """Validate that `datatype` is equivalent."""
        if not isinstance(datatype, DataType):
            return False
        return self == datatype

    def __hash__(self) -> int:
        """Must be implemented by subclasses."""
        raise NotImplementedError()

    def __str__(self) -> str:
        """Return the native dtype alias. Must be implemented by subclasses."""
        raise NotImplementedError()
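For illustration, a minimal concrete datatype satisfying the five-point contract above might look like the sketch below. The DataType base is repeated here so the snippet is self-contained; the real pandera classes differ in detail.

```python
from dataclasses import dataclass
from typing import Any

# Minimal stand-in for the abstract base described above.
class DataType:
    def coerce(self, obj: Any):
        raise NotImplementedError()

    def check(self, datatype: "DataType") -> bool:
        if not isinstance(datatype, DataType):
            return False
        return self == datatype

# Point 4: a frozen dataclass gives immutability, __eq__, and __hash__ for free.
@dataclass(frozen=True)
class Int64(DataType):
    def coerce(self, obj):
        # Point 2: coercion logic (illustrative; real coercion targets pandas objects).
        return [int(x) for x in obj]

    def __str__(self) -> str:
        # Point 5: return the native alias.
        return "int64"
```

With this in place, `Int64().check(Int64())` holds via dataclass equality, and `hash(Int64())` works for the internal registry.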

A parallel class hierarchy implements a bridge interface to pandas and numpy dtypes: PandasDataType, NumpyDataType and their sub-classes.

Pandera DataTypes should not be instantiated directly. Instead, they are generated by a factory method Engine.dtype() (e.g. PandasEngine.dtype) which can take any acceptable native dtype representation: string alias, builtin python types, pandas extension dtypes, numpy dtypes, etc. Basically anything that pandera.Column already accepts. DataFrameSchema holds an engine and delegates dtype translations to it.
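A hedged sketch of the factory idea: an engine keeps a registry that maps several equivalent native representations (string alias, builtin type, ...) to one canonical datatype. The names and registry contents below are invented for illustration; the real engine is richer.

```python
# Illustrative canonical datatype (not pandera's actual class).
class Int64:
    def __str__(self) -> str:
        return "int64"

class Engine:
    # Maps native representations to canonical datatypes.
    _registry: dict = {}

    @classmethod
    def register(cls, datatype, *equivalents) -> None:
        """Register every native equivalent of a datatype."""
        for native in equivalents:
            cls._registry[native] = datatype

    @classmethod
    def dtype(cls, native):
        """Translate a native dtype representation to a canonical datatype."""
        try:
            return cls._registry[native]
        except KeyError:
            raise TypeError(f"Unsupported native dtype: {native!r}") from None

# Many spellings, one datatype.
Engine.register(Int64(), "int64", "Int64", int)
```

The payoff is that downstream code never sees the variability of user input: whether the user wrote `"int64"`, `"Int64"`, or `int`, the schema works with the same canonical object.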

Two engines are implemented:

  • PandasEngine: full support for official pandas dtypes.
  • NumpyEngine: only dtypes supported by pandas.

Numpy dtypes are supported by a NumpyDataType hierarchy so that we can validate pure numpy arrays in the future if we choose to do so. NumpyEngine has not been tested.

Moreover, only the "abstract" DataType hierarchy should be exposed to end-users. Pandas/Numpy specifics should be kept internal. Users should instead provide native dtypes to the public API: e.g. pa.Column(pd.Int16Dtype) or pa.Column("Int16"),
not pa.Column(pandera.engines.pandas_engine.PandasInt16)! Exceptions are aliases such as pa.INT, pa.STRING, etc., provided for convenience and legacy reasons.
^ Feel free to disagree :)

Status of the implementation

New pandera dtypes are kept in dtypes_.py at the moment in order to allow co-existence of refactored and non-refactored features. Pylint and Mypy will be very upset; I'm waiting for feedback before sinking time into that fun part 🤡 Only tests related to refactored features are expected to pass.

  • DataFrameSchema api
  • SchemaModel api
  • decorators
  • inference
  • schema_statistics
  • is_float(), is_continuous(), etc.
  • strategies
  • docstrings
  • documentation

Breaking changes

  • Dropped support for pandas 0.25
  • No attempt to test on py3.6
  • DataFrameSchema.pdtype renamed to dtype. Removes the explicit reference to pandas.
  • DataFrameSchema.dtype renamed to dtypes and returns a dict of DataTypes instead of string aliases.
    • Better reflects the pandas API: DataFrame has a dtypes property that returns a dtype per column.
    • Now that dtypes are more powerful, it seems more intuitive to return them.
    • That change is further motivated by the fact that some parameterized datatypes cannot be distinguished by alias alone. For example, str(pandas.CategoricalDtype(categories=["A", "B"])) == "category".
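The last point can be demonstrated directly: two differently-parameterized categorical dtypes share the alias "category", so a string alias alone cannot round-trip the dtype.

```python
import pandas as pd

# Both dtypes stringify to "category" even though their parameters differ,
# so returning string aliases would lose the categories/ordered information.
ab = pd.CategoricalDtype(categories=["A", "B"])
xy = pd.CategoricalDtype(categories=["X", "Y"], ordered=True)
```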

Ideas for future improvements

  • Implement "meta" dtypes: Nominal, Number
  • Implement Date and Decimal Datatype. Those are useful for compatibility with Parquet files via PyArrow.

Let's address questions and potential shortcomings before extending the refactor.

TColl and others added 7 commits May 8, 2021 11:26
* parse frictionless schema

- using frictionless-py for some of the heavy lifting
- accept yaml/json/frictionless schema files/objects directly
- frictionless becomes a new requirement for io
- apply pre-commit formatting updates to other code in pandera.io
- add test to validate schema parsing, from yaml and json sources

* improve documentation

* update docstrings per code review

Co-authored-by: Niels Bantilan <[email protected]>

* add type hints

* standardise class properties for easier re-use in future

* simplify key check

* add missing alternative type

* update docstring

* align name with Column arg

* fix NaN check

* fix type assertion

* create empty dict if constraints not provided

Co-authored-by: Niels Bantilan <[email protected]>
Collaborator

@cosmicBboy cosmicBboy left a comment


🙏 thanks @jeffzi! this looks great, gonna take me a few days to look at this. See initial comments/questions below



@immutable
class Int8(Int16):
Collaborator


Can you clarify the purpose of this inheritance chain? e.g. can we have all the Int* types inherit from Int?

Collaborator Author

@jeffzi jeffzi May 12, 2021


We could use the chain to soften the dtype check, i.e. allowing a subtype (inspired by [numpy.can_cast](https://numpy.org/doc/stable/reference/generated/numpy.can_cast.html#numpy.can_cast)). I played with a casting argument in DataType.check that could be False for a strict check (current behavior) or True to allow safe downcasting. Then I realized almost anything can be cast to a string and gave up on the idea for now.

In the same vein, pandas and numpy dtypes inherit from the appropriate DataType so that we can implement a cross-engine is_numeric, etc., with a call to isinstance.
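That cross-engine idea can be sketched as follows: engine-specific datatypes inherit from abstract pandera datatypes, so one isinstance check covers every engine. The class names are illustrative, not pandera's actual hierarchy.

```python
# Abstract pandera datatypes (illustrative names).
class Number: ...
class Int(Number): ...
class String: ...

# Hypothetical engine-specific datatypes inheriting the abstract ones.
class PandasInt64(Int): ...
class NumpyInt64(Int): ...

def is_numeric(datatype: object) -> bool:
    # One check works regardless of which engine produced the datatype.
    return isinstance(datatype, Number)
```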

Collaborator Author

jeffzi commented May 12, 2021

Following your comment, I dropped the prefixes Numpy/Pandas from dtypes. It looks much better 👍

gonna take me a few days to look at this.

No problem, there is a lot to take in. I'm excited to get your feedback, I'm sure we can refine the api further.

Contributor

antonl commented May 14, 2021

It would be useful to update the ASV metrics before this is merged. The first implementation could come with some performance hit that would be good to quantify.

Collaborator

@cosmicBboy cosmicBboy left a comment


I think the io module needs to reflect changes to the pandas_dtype argument as well

@@ -230,9 +231,9 @@ def _set_column_handler(column, column_name):
}

@property
def dtype(self) -> Dict[str, str]:
def dtypes(self) -> Dict[str, str]:
Collaborator


note here that we should keep track of the breaking changes in the documentation.

we also don't have an official change log, which we should really start 😅

Collaborator Author

@jeffzi jeffzi May 14, 2021


I've listed breaking changes in the top post. But yeah, we should definitely keep track of breaking changes. Maybe using conventions for commit messages would help? For example, conventional commits has some tooling to generate changelogs, but even without tools it would make changes easier to spot.

Btw, the type annotation is wrong: it now returns a Dict[str, DataType], as mentioned in the breaking changes.

Collaborator

@cosmicBboy cosmicBboy May 16, 2021


we should also update the documentation to reflect this:
https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#get-pandas-datatypes

I think this part of the docs pre-dated the coerce keyword, so the code example doesn't really make sense anymore

schema = pa.DataFrameSchema(
    columns={
      "column1": pa.Column(pa.Int),
      "column2": pa.Column(pa.Category),
      "column3": pa.Column(pa.Bool)
    },
    coerce=True,
)

df = pd.DataFrame.from_dict(
  {
      "a": {"column1": 1, "column2": "valueA", "column3": True},
      "b": {"column1": 1, "column2": "valueB", "column3": True},
  },
  orient="index"
).astype(schema.dtype).sort_index(axis=1)

# At this point you can simply do
schema(df)

I think the dtypes and get_dtypes attributes might still be useful, if not to just conveniently inspect the (pandera engine) dtypes

Collaborator Author


I think the dtypes and get_dtypes attributes might still be useful, if not to just conveniently inspect the (pandera engine) dtypes

Totally!

Collaborator Author

jeffzi commented May 14, 2021

re: DataFrameSchema/SeriesSchema.dtype and DataFrameSchema.dtypes.

Now is a good time to introduce a breaking change and think about how those properties should behave. I think DataFrameSchema.dtypes (previously dtype) should be aligned with dtype and return a dtype, not a string alias. The question is: should they return dtypes provided by the user or pandera DataTypes?

@cosmicBboy correct me if I'm wrong, but I guess the reasoning for returning a dict of strings from .dtype was to give users standardized dtypes. At least, that is my reasoning for returning DataTypes. With them users can replicate coerce and get standard aliases with DataType.__str__.

@jeffzi jeffzi closed this May 14, 2021
@jeffzi jeffzi reopened this May 14, 2021
@cosmicBboy
Collaborator

@cosmicBboy correct me if I'm wrong, but I guess the reasoning for returning a dict of strings from .dtype was to give users standardized dtypes. At least, that is my reasoning for returning DataTypes. With them users can replicate coerce and get standard aliases with DataType.__str__.

I think I answered your question in the other thread.

I think DataFrameSchema.dtypes (previously dtype) should be aligned with dtype and return a dtype, not a string alias. The question is: should they return dypes provided by the user or pandera DataTypes?

I think these should be pandera DataTypes so that the interface is standardized and predictable. The problem I had with the old way was that it could be any one of str, pandas extension type, numpy type, etc. The type variability would then cascade to downstream parts of the validation logic, which is a pain to handle.

Collaborator Author

jeffzi commented May 18, 2021

My bad, I misread your comment. We are in agreement about returning Pandera DataTypes :)

I'll move on to adapt io and strategies 🚀

Collaborator

cosmicBboy commented May 21, 2021

hey @jeffzi what do you think about merging these changes onto a dtypes branch in the main repo and making another PR for io and strategies? I think I've grokked most of this PR, and it would be easier to review the upcoming changes to those modules in another one.

edit: just made the branch https://github.com/pandera-dev/pandera/tree/dtypes

@jeffzi jeffzi changed the base branch from release/0.7.0 to dtypes May 23, 2021 11:56
Collaborator Author

jeffzi commented May 23, 2021

I fixed the pylint and mypy errors, excluding those related to inference, strategies and io.

what do you think about merging these changes onto a dtypes branch in the main repo and making another PR for io and strategies

I agree. We can have a third PR for adding the documentation. We'll need to explain how to implement a custom data type.
