
Refactor PandasDtype #490

Merged 16 commits on May 24, 2021
Conversation

Collaborator

@jeffzi jeffzi commented May 11, 2021

This PR implements a dtype hierarchy that replaces PandasDtype, as discussed in #369.

Goals of the refactor

  • Decouple pandera and pandas dtypes: opening up to other dataframe-like data structures in the python ecosystem, such as Apache Spark, Apache Arrow and xarray.
  • Built-in support for dynamic dtypes: e.g. categorical dtype implementations often take ordered and categories arguments.
  • Allow end-users to customize dtype coercion: e.g. pass a date format, or coerce ["oui", "non"] to boolean.
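To make the third goal concrete, here is a hypothetical sketch of a user-defined datatype whose coercion maps French "oui"/"non" strings to booleans. The class and method names are illustrative, not pandera's actual API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch: a custom boolean datatype with user-defined
# coercion rules. "YesNoBool" and its fields are invented for
# illustration; they are not part of pandera.
@dataclass(frozen=True)
class YesNoBool:
    truthy: str = "oui"
    falsy: str = "non"

    def coerce(self, values: List[str]) -> List[bool]:
        """Map the configured truthy/falsy strings to booleans."""
        mapping = {self.truthy: True, self.falsy: False}
        return [mapping[v] for v in values]
```

Because the datatype carries its own parameters (truthy/falsy), coercion behavior can vary per column without touching the schema machinery.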

Contract

From now on, I'm going to refer to numpy/pandas dtypes as native dtypes in contrast to pandera's DataType.

A DataType class implements the following contract:

  1. Subclass DataType
  2. DataType.coerce implements coercion logic that was previously located in DataFrameSchema.
  3. DataType.check implements an "equivalence" check between datatypes. The reason for not using __eq__ is that some dtypes are equivalent even if they are not represented by the same underlying pandas/numpy dtype. For example, pandas represents numpy.str_ as numpy.object_. Another example would be implementing a Number datatype that is equivalent to Float and Int.
  4. Implement __hash__ and be immutable. Required for internal registry. In practice, datatypes are implemented via dataclasses which give us those properties for free.
  5. __str__ should return the native alias.
class DataType:

    def coerce(self, obj: Any):
        """Coerce object to the dtype."""
        raise NotImplementedError()

    def check(self, datatype: "DataType") -> bool:
        """Validate that `datatype` is equivalent."""
        if not isinstance(datatype, DataType):
            return False
        return self == datatype

    def __hash__(self) -> int:
        """Must be implemented by subclasses."""
        raise NotImplementedError()

    def __str__(self) -> str:
        """Return the native dtype alias. Must be implemented by subclasses."""
        raise NotImplementedError()
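For illustration, a minimal concrete datatype satisfying the five-point contract above might look like the sketch below. The DataType base is repeated here so the snippet is self-contained; the real pandera classes differ in detail.

```python
from dataclasses import dataclass
from typing import Any

# Minimal stand-in for the abstract base described above.
class DataType:
    def coerce(self, obj: Any):
        raise NotImplementedError()

    def check(self, datatype: "DataType") -> bool:
        if not isinstance(datatype, DataType):
            return False
        return self == datatype

# Point 4: a frozen dataclass gives immutability, __eq__, and __hash__ for free.
@dataclass(frozen=True)
class Int64(DataType):
    def coerce(self, obj):
        # Point 2: coercion logic (illustrative; real coercion targets pandas objects).
        return [int(x) for x in obj]

    def __str__(self) -> str:
        # Point 5: return the native alias.
        return "int64"
```

With this in place, `Int64().check(Int64())` holds via dataclass equality, and `hash(Int64())` works for the internal registry.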

A parallel class hierarchy implements a bridge interface to pandas and numpy dtypes: PandasDataType, NumpyDataType and their sub-classes.

Pandera DataTypes should not be instantiated directly. Instead, they are generated by a factory method Engine.dtype() (e.g. PandasEngine.dtype) which can take any acceptable native dtype representation: string alias, builtin python types, pandas extension dtypes, numpy dtypes, etc. Basically anything that pandera.Column already accepts. DataFrameSchema holds an engine and delegates dtype translations to it.
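A hedged sketch of the factory idea: an engine keeps a registry that maps several equivalent native representations (string alias, builtin type, ...) to one canonical datatype. The names and registry contents below are invented for illustration; the real engine is richer.

```python
# Illustrative canonical datatype (not pandera's actual class).
class Int64:
    def __str__(self) -> str:
        return "int64"

class Engine:
    # Maps native representations to canonical datatypes.
    _registry: dict = {}

    @classmethod
    def register(cls, datatype, *equivalents) -> None:
        """Register every native equivalent of a datatype."""
        for native in equivalents:
            cls._registry[native] = datatype

    @classmethod
    def dtype(cls, native):
        """Translate a native dtype representation to a canonical datatype."""
        try:
            return cls._registry[native]
        except KeyError:
            raise TypeError(f"Unsupported native dtype: {native!r}") from None

# Many spellings, one datatype.
Engine.register(Int64(), "int64", "Int64", int)
```

The payoff is that downstream code never sees the variability of user input: whether the user wrote `"int64"`, `"Int64"`, or `int`, the schema works with the same canonical object.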

Two engines are implemented:

  • PandasEngine: full support for official pandas dtypes.
  • NumpyEngine: only dtypes supported by pandas.

Numpy dtypes are supported by a NumpyDataType hierarchy so that we can validate pure numpy arrays in the future if we choose to do so. NumpyEngine has not been tested.

Moreover, only the "abstract" DataType hierarchy should be exposed to end-users. Pandas/Numpy specifics should be kept internal. Users should instead provide native dtypes to the public API: e.g. pa.Column(pd.Int16Dtype) or pa.Column("Int16"),
not pa.Column(pandera.engines.pandas_engine.PandasInt16)! Exceptions are aliases such as pa.INT, pa.STRING, etc., provided for convenience and legacy reasons.
^ Feel free to disagree :)

Status of the implementation

New pandera dtypes are kept in dtypes_.py at the moment in order to allow co-existence of refactored and non-refactored features. Pylint and Mypy will be very upset; I'm waiting for feedback before sinking time into that fun part 🤡 Only tests related to refactored features are expected to pass.

  • DataFrameSchema api
  • SchemaModel api
  • decorators
  • inference
  • schema_statistics
  • is_float(), is_continuous(), etc.
  • strategies
  • docstrings
  • documentation

Breaking changes

  • Dropped support for pandas 0.25
  • No attempt to test on py3.6
  • DataFrameSchema.pdtype renamed to dtype. Removes the explicit reference to pandas.
  • DataFrameSchema.dtype renamed to dtypes and returns a dict of DataTypes instead of string aliases.
    • Better reflects the pandas API: DataFrame has a dtypes property that returns a dtype per column.
    • Now that dtypes are more powerful, it seems more intuitive to return them.
    • That change is further motivated by the fact that some parameterized datatypes cannot be distinguished by alias alone. For example, str(pandas.CategoricalDtype(categories=["A", "B"])) == "category".
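The last point can be demonstrated directly: two differently-parameterized categorical dtypes share the alias "category", so a string alias alone cannot round-trip the dtype.

```python
import pandas as pd

# Both dtypes stringify to "category" even though their parameters differ,
# so returning string aliases would lose the categories/ordered information.
ab = pd.CategoricalDtype(categories=["A", "B"])
xy = pd.CategoricalDtype(categories=["X", "Y"], ordered=True)
```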

Ideas for future improvements

  • Implement "meta" dtypes: Nominal, Number
  • Implement Date and Decimal Datatype. Those are useful for compatibility with Parquet files via PyArrow.

Let's address questions and potential shortcomings before extending the refactor.

TColl and others added 7 commits May 8, 2021 11:26
* parse frictionless schema

- using frictionless-py for some of the heavy lifting
- accept yaml/json/frictionless schema files/objects directly
- frictionless becomes a new requirement for io
- apply pre-commit formatting updates to other code in pandera.io
- add test to validate schema parsing, from yaml and json sources

* improve documentation

* update docstrings per code review

Co-authored-by: Niels Bantilan <[email protected]>

* add type hints

* standardise class properties for easier re-use in future

* simplify key check

* add missing alternative type

* update docstring

* align name with Column arg

* fix NaN check

* fix type assertion

* create empty dict if constraints not provided

Co-authored-by: Niels Bantilan <[email protected]>
Collaborator

@cosmicBboy cosmicBboy left a comment


🙏 thanks @jeffzi! this looks great, gonna take me a few days to look at this. See initial comments/questions below



@immutable
class Int8(Int16):
Collaborator


Can you clarify the purpose of this inheritance chain? e.g. can we have all the Int* types inherit from Int?

Collaborator Author

@jeffzi jeffzi May 12, 2021


We could use the chain to soften the dtype check, i.e. allowing a subtype (inspired by [numpy.can_cast](https://numpy.org/doc/stable/reference/generated/numpy.can_cast.html#numpy.can_cast)). I played with a casting argument in DataType.check that could be False for a strict check (current behavior) or True to allow safe downcasting. Then I realized almost anything can be cast to a string and gave up on the idea for now.

In the same vein, pandas and numpy dtypes inherit from the appropriate DataType so that we can implement a cross-engine is_numeric, etc., with a call to isinstance.
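That cross-engine idea can be sketched as follows: engine-specific datatypes inherit from abstract pandera datatypes, so one isinstance check covers every engine. The class names are illustrative, not pandera's actual hierarchy.

```python
# Abstract pandera datatypes (illustrative names).
class Number: ...
class Int(Number): ...
class String: ...

# Hypothetical engine-specific datatypes inheriting the abstract ones.
class PandasInt64(Int): ...
class NumpyInt64(Int): ...

def is_numeric(datatype: object) -> bool:
    # One check works regardless of which engine produced the datatype.
    return isinstance(datatype, Number)
```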

Collaborator Author

jeffzi commented May 12, 2021

Following your comment, I dropped the prefixes Numpy/Pandas from dtypes. It looks much better 👍

gonna take me a few days to look at this.

No problem, there is a lot to take in. I'm excited to get your feedback, I'm sure we can refine the api further.

Contributor

antonl commented May 14, 2021

It would be useful to update the ASV metrics before this is merged. The first implementation could come with some performance hit that would be good to quantify.

Collaborator

@cosmicBboy cosmicBboy left a comment


I think the io module needs to reflect changes to the pandas_dtype argument as well

@@ -230,9 +231,9 @@ def _set_column_handler(column, column_name):
}

@property
def dtype(self) -> Dict[str, str]:
def dtypes(self) -> Dict[str, str]:
Collaborator


note here that we should keep track of the breaking changes in the documentation.

we also don't have an official change log, which we should really start 😅

Collaborator Author

@jeffzi jeffzi May 14, 2021


I've listed breaking changes in the top post. But yeah, we should definitely keep track of breaking changes. Maybe using conventions for commit messages would help? For example, conventional commits has some tooling to generate changelogs, but even without tools it would make changes easier to spot.

Btw, the type annotation is wrong: it now returns a Dict[str, DataType], as mentioned in the breaking changes.

Collaborator

@cosmicBboy cosmicBboy May 16, 2021


we should also update the documentation to reflect this:
https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#get-pandas-datatypes

I think this part of the docs pre-dated the coerce keyword, so the code example doesn't really make sense anymore

schema = pa.DataFrameSchema(
    columns={
      "column1": pa.Column(pa.Int),
      "column2": pa.Column(pa.Category),
      "column3": pa.Column(pa.Bool)
    },
    coerce=True,
)

df = pd.DataFrame.from_dict(
  {
      "a": {"column1": 1, "column2": "valueA", "column3": True},
      "b": {"column1": 1, "column2": "valueB", "column3": True},
  },
  orient="index"
).astype(schema.dtype).sort_index(axis=1)

# At this point you can simply do
schema(df)

I think the dtypes and get_dtypes attributes might still be useful, if not to just conveniently inspect the (pandera engine) dtypes

Collaborator Author


I think the dtypes and get_dtypes attributes might still be useful, if not to just conveniently inspect the (pandera engine) dtypes

Totally!

Collaborator Author

jeffzi commented May 14, 2021

re: DataFrameSchema/SeriesSchema.dtype and DataFrameSchema.dtypes.

Now is a good time to introduce a breaking change and think about how those properties should behave. I think DataFrameSchema.dtypes (previously dtype) should be aligned with dtype and return a dtype, not a string alias. The question is: should they return dtypes provided by the user or pandera DataTypes?

@cosmicBboy correct me if I'm wrong, but I guess the reasoning for returning a dict of strings from .dtype was to give users standardized dtypes. At least, that is my reasoning for returning DataTypes. With them users can replicate coerce and get standard aliases with DataType.__str__.

@jeffzi jeffzi closed this May 14, 2021
@jeffzi jeffzi reopened this May 14, 2021
@cosmicBboy
Collaborator

@cosmicBboy correct me if I'm wrong, but I guess the reasoning for returning a dict of strings from .dtype was to give users standardized dtypes. At least, that is my reasoning for returning DataTypes. With them users can replicate coerce and get standard aliases with DataType.__str__.

I think I answered your question in the other thread.

I think DataFrameSchema.dtypes (previously dtype) should be aligned with dtype and return a dtype, not a string alias. The question is: should they return dypes provided by the user or pandera DataTypes?

I think these should be pandera DataTypes so that the interface is standardized and predictable. The problem I had with the old way was that it could be any one of str, pandas extension type, numpy type, etc. The type variability would then cascade to downstream parts of the validation logic, which is a pain to handle.

Collaborator Author

jeffzi commented May 18, 2021

My bad, I misread your comment. We are in agreement about returning Pandera DataTypes :)

I'll move on to adapt io and strategies 🚀

Collaborator

cosmicBboy commented May 21, 2021

hey @jeffzi what do you think about merging these changes onto a dtypes branch in the main repo and making another PR for io and strategies? I think I've grokked most of this PR, and it would be easier to review the upcoming changes to those modules in another one.

edit: just made the branch https://github.com/pandera-dev/pandera/tree/dtypes

@jeffzi jeffzi changed the base branch from release/0.7.0 to dtypes May 23, 2021 11:56
Collaborator Author

jeffzi commented May 23, 2021

I fixed the pylint and mypy errors, excluding those related to inference, strategies and io.

what do you think about merging these changes onto a dtypes branch in the main repo and making another PR for io and strategies

I agree. We can have a third PR for adding the documentation. We'll need to explain how to implement a custom data type.
