Refactor PandasDtype #490
Conversation
* parse frictionless schema
  - using frictionless-py for some of the heavy lifting
  - accept yaml/json/frictionless schema files/objects directly
  - frictionless becomes a new requirement for io
  - apply pre-commit formatting updates to other code in pandera.io
  - add test to validate schema parsing, from yaml and json sources
* improve documentation
* update docstrings per code review
* add type hints
* standardise class properties for easier re-use in future
* simplify key check
* add missing alternative type
* update docstring
* align name with Column arg
* fix NaN check
* fix type assertion
* create empty dict if constraints not provided

Co-authored-by: Niels Bantilan <[email protected]>
🙏 thanks @jeffzi! this looks great, gonna take me a few days to look at this. See initial comments/questions below
```python
@immutable
class Int8(Int16):
```
Can you clarify the purpose of this inheritance chain? e.g. can we have all the `Int*` types inherit from `Int`?
We could use the chain to soften the dtype check, i.e. allowing a subtype (inspired by [numpy.can_cast](https://numpy.org/doc/stable/reference/generated/numpy.can_cast.html#numpy.can_cast)). I played with a `casting` argument in `DataType.check` that could be `False` for a strict check (current behavior) or `True` to allow safe downcasting. Then I realized almost anything can be cast to a string and gave up the idea for now.

In the same vein, pandas and numpy dtypes inherit the appropriate `DataType` so that we can implement a cross-engine `is_numeric`, etc. with a call to `isinstance`.
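For illustration, `numpy.can_cast` answers exactly the "is this conversion value-safe?" question the softened check would need (note the `casting` argument on `DataType.check` described above was only an experiment in this thread, not merged API):

```python
import numpy as np

# numpy's casting rules answer "is this conversion value-safe?":
assert np.can_cast(np.int8, np.int16)       # int8 fits in int16 -> safe
assert not np.can_cast(np.int32, np.int16)  # int32 may overflow int16

# The catch mentioned above: numpy considers casting most scalars to a
# sufficiently wide string "safe", so a string column would accept
# nearly everything under such a rule.
```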
Following your comment, I dropped the prefixes Numpy/Pandas from dtypes. It looks much better 👍
No problem, there is a lot to take in. I'm excited to get your feedback, I'm sure we can refine the api further.
It would be useful to update the ASV metrics before this is merged. The first implementation could come with some performance hit that would be good to quantify.
I think the `io` module needs to reflect changes to the `pandas_dtype` argument as well.
pandera/schemas.py (outdated)

```diff
@@ -230,9 +231,9 @@ def _set_column_handler(column, column_name):
     }

     @property
-    def dtype(self) -> Dict[str, str]:
+    def dtypes(self) -> Dict[str, str]:
```
Note here that we should keep track of the breaking changes in the documentation. We also don't have an official changelog, which we should really start 😅
I've listed breaking changes in the top post. But yeah, we should definitely keep track of breaking changes. Maybe using conventions for commit messages would help? For example, conventional commits has some tooling to generate changelogs, but even without tools it would be easier to spot changes.

Btw, the type annotation is wrong: it now returns a `Dict[str, DataType]`, as mentioned in the breaking changes.
We should also update the documentation to reflect this:
https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#get-pandas-datatypes

I think this part of the docs pre-dated the `coerce` keyword, so the code example doesn't really make sense anymore:
```python
schema = pa.DataFrameSchema(
    columns={
        "column1": pa.Column(pa.Int),
        "column2": pa.Column(pa.Category),
        "column3": pa.Column(pa.Bool),
    },
    coerce=True,
)
df = pd.DataFrame.from_dict(
    {
        "a": {"column1": 1, "column2": "valueA", "column3": True},
        "b": {"column1": 1, "column2": "valueB", "column3": True},
    },
    orient="index"
).astype(schema.dtype).sort_index(axis=1)

# At this point you can simply do
schema(df)
```
I think the `dtypes` and `get_dtypes` attributes might still be useful, if only to conveniently inspect the (pandera engine) dtypes.
> I think the dtypes and get_dtypes attributes might still be useful, if not to just conveniently inspect the (pandera engine) dtypes

Totally!
re:

> Now is a good time to introduce a breaking change and think about how those properties should behave.

I think (@cosmicBboy correct me if I'm wrong) the reasoning for returning a dict of strings from …
I think I answered your question in the other thread.
I think these should be pandera DataTypes so that the interface is standardized and predictable. The problem I had with the old way was that it could be any one of …
My bad, I misread your comment. We are in agreement about returning Pandera DataTypes :) I'll move on to adapt …
hey @jeffzi what do you think about merging these changes onto a separate branch?

edit: just made the branch https://github.com/pandera-dev/pandera/tree/dtypes
I fixed the pylint and mypy errors, excluding those related to inference, strategies and io.
I agree. We can have a third PR for adding the documentation. We'll need to explain how to implement a custom data type.
This PR implements a dtype hierarchy that replaces `PandasDtype`, as discussed in #369.

Goals of the refactor

Contract

From now on, I'm going to refer to numpy/pandas dtypes as native dtypes, in contrast to pandera's `DataType`.

A `DataType` class implements the following contract:

- `DataType.coerce` implements coercion logic that was previously located in `DataFrameSchema`.
- `DataType.check` implements an "equivalence" check between datatypes. The reason for not using `__eq__` is that some dtypes are equivalent even if they are not represented by the same underlying pandas/numpy dtype. For example, pandas represents `numpy.str_` as a `numpy.object_`. Another example is to implement a `Number` datatype that is equivalent to `Float` and `Int`.
- A `DataType` must implement `__hash__` and be immutable. Required for the internal registry. In practice, datatypes are implemented via `dataclasses`, which give us those properties for free.
- `__str__` should return the native alias.

A parallel class hierarchy implements a bridge interface to pandas and numpy dtypes: `PandasDataType`, `NumpyDataType` and their sub-classes.

Pandera `DataType`s should not be instantiated directly. Instead, they are generated by a factory method `Engine.dtype()` (e.g. `PandasEngine.dtype`) which can take any acceptable native dtype representation: string alias, builtin python types, pandas extension dtypes, numpy dtypes, etc. Basically anything that `pandera.Column` already accepts. `DataFrameSchema` holds an engine and delegates dtype translations to it.

Two engines are implemented:

- `PandasEngine`: full support for official pandas dtypes.
- `NumpyEngine`: only dtypes supported by pandas. Numpy dtypes are supported by a `NumpyDataType` hierarchy so that we can validate pure numpy arrays in the future if we choose to do so. `NumpyEngine` has not been tested.

Moreover, only the "abstract" `DataType` hierarchy should be exposed to end-users. Pandas/Numpy specifics should be kept to internals. Users should instead provide native dtypes to the public api: e.g. `pa.Column(pd.Int16Dtype)` or `pa.Column("Int16")`, not `pa.Column(pandera.engines.pandas_engine.PandasInt16)`! Exceptions are aliases such as `pa.INT`, `pa.STRING`, etc., provided for convenience and legacy reasons.

^ Feel free to disagree :)
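The contract and the `Engine.dtype()` factory described above can be sketched in a few lines (a hypothetical, self-contained illustration: `ToyEngine`, its `_registry`, and this `Int64` class are made up for the sketch, not the actual pandera implementation):

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass(frozen=True)  # frozen => immutable and hashable, as the registry requires
class DataType:
    def coerce(self, series: pd.Series) -> pd.Series:
        """Coercion logic previously located in DataFrameSchema."""
        raise NotImplementedError

    def check(self, other: "DataType") -> bool:
        """Equivalence check; may be looser than __eq__."""
        return self == other


@dataclass(frozen=True)
class Int64(DataType):
    def coerce(self, series: pd.Series) -> pd.Series:
        return series.astype("int64")

    def __str__(self) -> str:
        return "int64"  # native alias


class ToyEngine:
    """Toy Engine.dtype(): many native representations, one DataType."""

    _registry = {"int64": Int64()}

    @classmethod
    def dtype(cls, native) -> DataType:
        # np.dtype normalizes string aliases and numpy scalar types;
        # the real engine would also handle pandas extension dtypes.
        return cls._registry[np.dtype(native).name]


# string alias and numpy type resolve to the same DataType instance
assert ToyEngine.dtype("int64") is ToyEngine.dtype(np.int64)
assert str(ToyEngine.dtype("int64")) == "int64"
```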
Status of the implementation

New pandera dtypes are kept in `dtypes_.py` at the moment in order to allow co-existence of refactored and non-refactored features. Pylint and Mypy will be very upset; waiting for feedback before sinking time into that fun part 🤡 Only tests related to refactored features are expected to pass.

Breaking changes

- `DataFrameSchema.pdtype` renamed to `dtype`. Removes the explicit reference to pandas.
- `DataFrameSchema.dtype` renamed to `dtypes` and returns a dict of `DataType`s instead of string aliases.
- `.dtypes` property that returns a dict of dtypes.
- `str(pandas.CategoricalDtype(categories=["A", "B"])) == "category"`.
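The categorical point above is the motivating pitfall for returning `DataType`s rather than string aliases: the string alias drops information that the dtype object keeps:

```python
import pandas as pd

cat = pd.CategoricalDtype(categories=["A", "B"])
# The string alias collapses to "category": the categories are lost.
assert str(cat) == "category"
# The dtype object (and hence a pandera DataType) retains them.
assert list(cat.categories) == ["A", "B"]
```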
Ideas for future improvements

- `Nominal`, `Number`, `Date` and `Decimal` datatypes. Those are useful for compatibility with Parquet files via PyArrow.

Let's address questions and potential shortcomings before extending the refactor.