Add DataTypes documentation #536

Merged: 23 commits, Jul 2, 2021

jeffzi
Collaborator

@jeffzi jeffzi commented Jun 29, 2021

This PR is a follow-up to the dtypes refactor started in #490 and initially discussed in #369. It adds documentation for the DataType in the api reference section as well as a section about customizing data types.

I also added some niceties:

  • re-organized api reference into sub-sections to keep the side menu tidy. Sphinx defaults to lexical order and it was a mess with
    all the data type additions.
  • added xdoctest to CI to test docstrings
  • added doctest builder to sphinx-build when running locally (replicate github CI).
  • ignored python prompts when copying example from doc (strip >>> present in docstring examples).
  • unpinned sphinx, working so far...
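
As a rough illustration of what "testing docstrings" means in the bullets above, here is a minimal sketch that uses only the standard-library `doctest` module rather than xdoctest or the Sphinx doctest builder. The helper `is_numeric_name` and the wrapper `run_docstring_examples` are hypothetical names invented for this example; they are not part of pandera.

```python
import doctest
from typing import Callable


def is_numeric_name(dtype_name: str) -> bool:
    """Return True when a dtype name looks numeric (hypothetical helper).

    >>> is_numeric_name("int64")
    True
    >>> is_numeric_name("object")
    False
    """
    return dtype_name.startswith(("int", "uint", "float", "complex"))


def run_docstring_examples(func: Callable) -> doctest.TestResults:
    """Execute the ``>>>`` examples in ``func``'s docstring and report the outcome."""
    parser = doctest.DocTestParser()
    # The globals dict gives the examples access to the function under test.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    # summarize() returns a (failed, attempted) result; it prints only on failure.
    return runner.summarize(verbose=False)
```

Calling `run_docstring_examples(is_numeric_name)` runs both docstring examples and reports zero failures as long as the docstring and the implementation agree, which is the kind of drift a CI docstring check is meant to catch.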

I also had to pin the furo theme. The latest release changes the path to the assets referenced in conf.py. I did not want to mess with that in this PR...

@cosmicBboy Let me know what you think about the documentation. I was trying to remain concise and not bother users with irrelevant details. Someone who wants to add a completely new third-party library will likely get in touch and we could then explain the finer details.
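
For readers new to the idea of registering custom data types with an engine, the following is a toy sketch of the general pattern, written with the standard library only. It is not pandera's implementation: the `Engine.register_dtype` name is borrowed from the commit log of this PR, while the `equivalents` parameter, the `Engine.dtype` lookup, and the `Bool` and `Int64` classes are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Type, Union


@dataclass(frozen=True)
class DataType:
    """Base class for the toy data types in this sketch."""

    def check(self, other: "DataType") -> bool:
        # Two data types are considered equivalent when they share a class.
        return type(self) is type(other)


class Engine:
    """Toy registry that maps string aliases to DataType classes."""

    _registry: Dict[str, Type[DataType]] = {}

    @classmethod
    def register_dtype(
        cls,
        dtype_cls: Optional[Type[DataType]] = None,
        *,
        equivalents: Iterable[str] = (),
    ) -> Union[Type[DataType], Callable[[Type[DataType]], Type[DataType]]]:
        """Class decorator, usable with or without arguments."""

        def _wrap(klass: Type[DataType]) -> Type[DataType]:
            # Register the lowercased class name plus any extra aliases.
            for alias in (klass.__name__.lower(), *equivalents):
                cls._registry[alias] = klass
            return klass

        if dtype_cls is None:  # used as @Engine.register_dtype(...)
            return _wrap
        return _wrap(dtype_cls)  # used as bare @Engine.register_dtype

    @classmethod
    def dtype(cls, alias: str) -> DataType:
        """Resolve an alias to a DataType instance."""
        try:
            return cls._registry[alias]()
        except KeyError:
            raise TypeError(f"data type '{alias}' is not registered") from None


@Engine.register_dtype
class Bool(DataType):
    """Registered under the alias 'bool'."""


@Engine.register_dtype(equivalents=["integer"])
class Int64(DataType):
    """Registered under 'int64' and the extra alias 'integer'."""
```

With these definitions, `Engine.dtype("integer")` returns an `Int64` instance and an unknown alias raises `TypeError`. The real engine classes handle considerably more (coercion, library-specific dtypes), so treat this only as a picture of the registration idea.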

@jeffzi jeffzi requested a review from cosmicBboy June 29, 2021 22:24
@jeffzi
Collaborator Author

jeffzi commented Jun 29, 2021

CI does not pick up the PR, will re-create

@jeffzi jeffzi closed this Jun 29, 2021
@jeffzi
Collaborator Author

jeffzi commented Jun 29, 2021

No idea why GitHub checks are not added. They worked on the previous PR #504, and opening a new PR #537 did not solve the problem. The only modification to ci-tests.yml is an added test here.

@jeffzi jeffzi reopened this Jun 29, 2021
@cosmicBboy
Collaborator

cosmicBboy commented Jun 30, 2021

@jeffzi thanks! looks like there are some conflicts, probably because I messed around with the docs/dtype modules a little: 6127454 (feel free to undo my changes)

edit: just rolled it back

@codecov

codecov bot commented Jul 1, 2021

Codecov Report

Merging #536 (f0a33c2) into dtypes (460663d) will decrease coverage by 0.46%.
The diff coverage is 89.47%.

@@            Coverage Diff             @@
##           dtypes     #536      +/-   ##
==========================================
- Coverage   97.91%   97.45%   -0.47%     
==========================================
  Files          24       24              
  Lines        3116     3101      -15     
==========================================
- Hits         3051     3022      -29     
- Misses         65       79      +14     
Impacted Files                     Coverage Δ
pandera/checks.py                  98.54% <ø> (ø)
pandera/engines/numpy_engine.py    87.50% <ø> (ø)
pandera/schemas.py                 99.63% <ø> (ø)
pandera/engines/pandas_engine.py   95.00% <87.50%> (-1.11%) ⬇️
pandera/dtypes.py                  90.35% <100.00%> (-6.11%) ⬇️
pandera/engines/engine.py          93.40% <100.00%> (-0.22%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jeffzi
Collaborator Author

jeffzi commented Jul 1, 2021

looks like there are some conflicts,

Interesting, I didn't know conflicts could prevent CI from running. Thanks for the help :)

Code coverage could be improved but it's probably better to save that for another PR. I think dtypes branch looks ready to be merged into dev 🤞

@cosmicBboy
Collaborator

cosmicBboy commented Jul 2, 2021

Code coverage could be improved but it's probably better to save that for another PR. I think dtypes branch looks ready to be merged into dev 🤞

@jeffzi very soon! actually dtypes should be merged onto the release/0.7.0 branch, since dev is for bugfix releases.

Gonna have to do a little dance to get this to all work smoothly, but basically there are a couple of things I wanted to do as part of 0.7.0 release. Now that pandera has a steady stream of regular users and new ones coming in every few weeks (I'd say pandera is no longer a personal project), I'd like to break as little as possible and put a deprecation policy in place.

It's gonna be a little more work, but I think it'll be worth it so that we set a good precedent for not pulling the rug from under unsuspecting users :)

So what does this mean? #539

  1. Add back the Column(pandas_dtype=...) kwarg (to exist alongside dtype arg)
    • user cannot specify both pandas_dtype and dtype
    • if using pandas_dtype, DeprecationWarning is raised telling the user to use dtype
  2. Add back PandasDtype enum that simply aliases the types that you introduced (based on issues raised, at least a few people are using PandasDtype)
    • DeprecationWarning is raised whenever this enum is used
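
The first item of the plan above could look roughly like the following sketch. It uses a hypothetical stand-in `Column` class, not pandera's actual implementation, and the exact warning text and the choice of `ValueError` for the conflicting-arguments case are assumptions.

```python
import warnings
from typing import Any, Optional


class Column:
    """Hypothetical stand-in for a schema column (not pandera's code)."""

    def __init__(
        self,
        dtype: Optional[Any] = None,
        *,
        pandas_dtype: Optional[Any] = None,
    ) -> None:
        # The old and new arguments are mutually exclusive.
        if dtype is not None and pandas_dtype is not None:
            raise ValueError("specify either `dtype` or `pandas_dtype`, not both")
        if pandas_dtype is not None:
            # stacklevel=2 points the warning at the caller's line.
            warnings.warn(
                "`pandas_dtype` is deprecated; use `dtype` instead",
                DeprecationWarning,
                stacklevel=2,
            )
            dtype = pandas_dtype
        self.dtype = dtype
```

Existing code that passes `pandas_dtype=` keeps working and ends up with the same `dtype` attribute, while new code can switch to `dtype=` without seeing the warning. The same aliasing-plus-warning idea would apply to the PandasDtype enum in item 2.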

On a related note, I'm wondering if maintaining bugfix patches to 0.6.* would make sense. It might be a pain, but I've already gotten a request to loosen the pandas requirements due to legacy codebases using pandas==0.23.4 but wanting to use pandera (someone has even asked if they can fork and publish a pandera-legacy package on PyPI for legacy pandas).

@cosmicBboy cosmicBboy merged commit 478d362 into unionai-oss:dtypes Jul 2, 2021
cosmicBboy added a commit that referenced this pull request Jul 2, 2021
* delete print statements

* pin furo

* fix generated docs not removed by nox

* re-organize API section

* replace aliased pandas_engine data types with their aliases

* drop warning when calling Engine.register_dtype without arguments

* add data types to api reference doc

* add document for DataType refactor

* unpin sphinx and drop sphinx_rtd_theme

* add xdoctest

* ignore prompt when copying example from doc

* add doctest builder when running sphinx-build locally

* fix dtypes doc examples

* fix pandas_engine.DataType.check

* fix pylint

* remove whitespaces in dtypes doc

* Update docs/source/dtypes.rst

* Update dtypes.rst

* update docs structure

* update nox file

* force pip on doctests

* update test_schemas

* fix docs session not overriding html with doctest output

Co-authored-by: Niels Bantilan <[email protected]>
@jeffzi
Collaborator Author

jeffzi commented Jul 2, 2021

I agree with putting back pandas_dtype and PandasDtype + deprecation warnings (targeted to 0.8.0 or 0.9.0?).

Personally, I think maintaining 0.6.* patches is overkill; even pandas doesn't do that for versions < 1.0, especially since pandera is still maturing at a very fast pace. Legacy pandas indeed proved to be a pain to maintain. I don't really see the point of a pandera-legacy package vs. locking the requirements to 0.6.0. The fork won't be able to keep up with new features imho.

re: 0.7.0. Let me know where you'd prefer me to help. I mean, maybe you'd want to handle deprecation warnings yourself?

@cosmicBboy
Collaborator

re: 0.7.0. Let me know where you'd prefer me to help. I mean, maybe you'd want to handle deprecation warnings yourself?

Yeah I'll work on that, shouldn't be too much effort.

Personally, I think maintaining 0.6.* patches is overkill

Yeah, that's what I'm leaning towards right now... anyway it's open source, so whoever wants to make it work with legacy pandas can do it themselves :)

@jeffzi jeffzi deleted the feature/dtypes branch July 11, 2021 16:19
cosmicBboy added a commit that referenced this pull request Jul 15, 2021
* refactor PandasDtype into class hierarchy supported by engines

* refactor DataFrameSchema based on DataType hierarchy

* refactor SchemaModel based on DataType hierarchy

* revert fix coerce=True and dtype=None should be a noop

* apply code style

* fix running tests/core with nox

* consolidate dtype names

* consolidate engine internal naming

* disable inherited __init__ with immutable(init=False)

* delete duplicated immutable

* disambiguate dtype variables

* add warning on base pandas_engine, numpy_engine.DataType init

* fix pylint, mypy errors

* fix DataFrameSchema.dtypes return type

* enable CI on dtypes branch

* Refactor inference, schema_statistics, strategies and io using the DataType hierarchy (#504)

* fix pandas_engine.Interval

* fix Timedelta64 registration with pandas_engine.Engine

* add DataType helpers

* add DataType.continuous attribute

* add dtypes.is_numeric

* refactor schema_statistics based on DataType hierarchy

* refactor schema_inference based on DataType hierarchy

* fix numpy_engine.Timedelta64.type

* add is_subdtype helper

* add Engine.get_registered_dtypes

* fix Engine error when registering a base DataType

* fix pandas_engine DateTime string alias

* clean up test_dtypes

* fix test_extensions

* refactor strategies based on DataType hierarchy

* refactor io based on DataType hierarchy

* replace dtypes module by new DataType hierarchy

* fix black

* delete dtypes_.py

* drop legacy pandas and python 3.6 from CI

* fix mypy errors

* fix ci-docs

* fix conda dependencies

* fix lint, update noxfile

* simplify nox tests, fix test_io

* update ci build

* update nox

* pin nox, handle windows data types

* fix windows platform

* fix pandas_engine on windows platform

* fix test_dtypes on windows platform

* force pip on docs CI

* test out windows dtype stuff

* more messing around with windows

* more debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* revert ci

* increase cache

* testing

Co-authored-by: cosmicBboy <[email protected]>

* Add DataTypes documentation (#536)

* delete print statements

* pin furo

* fix generated docs not removed by nox

* re-organize API section

* replace aliased pandas_engine data types with their aliases

* drop warning when calling Engine.register_dtype without arguments

* add data types to api reference doc

* add document for DataType refactor

* unpin sphinx and drop sphinx_rtd_theme

* add xdoctest

* ignore prompt when copying example from doc

* add doctest builder when running sphinx-build locally

* fix dtypes doc examples

* fix pandas_engine.DataType.check

* fix pylint

* remove whitespaces in dtypes doc

* Update docs/source/dtypes.rst

* Update dtypes.rst

* update docs structure

* update nox file

* force pip on doctests

* update test_schemas

* fix docs session not overriding html with doctest output

Co-authored-by: Niels Bantilan <[email protected]>

* add deprecation warnings for pandas_dtype and PandasDtype enum (#547)

* remove auto-generated docs

* add deprecation warnings, support pandas>=1.3.0

* add deprecation warnings for PandasDtype enum

* fix sphinx

* fix windows

* fix windows

* add support for pyarrow backed string data type (#548)

* add support for pyarrow backed string data type

* fix regression for pandas < 1.3.0

* add verbosity to test run

* loosen strategies unit tests deadline, exclude windows ci

* loosen test_strategies.py tests

* use "dev" hypothesis profile for python 3.7

* add pandas==1.2.5 test

* fix ci

* ci typo

* don't install environment.yml on unit tests

* install nox in ci

* remove environment.yml

* update environment in ci

Co-authored-by: cosmicBboy <[email protected]>

Co-authored-by: Jean-Francois Zinque <[email protected]>
cosmicBboy added a commit that referenced this pull request Jul 22, 2021
cosmicBboy added a commit that referenced this pull request Jul 24, 2021
* Feature/420 (#454)

* parse frictionless schema

- using frictionless-py for some of the heavy lifting
- accept yaml/json/frictionless schema files/objects directly
- frictionless becomes a new requirement for io
- apply pre-commit formatting updates to other code in pandera.io
- add test to validate schema parsing, from yaml and json sources

* improve documentation

* update docstrings per code review

Co-authored-by: Niels Bantilan <[email protected]>

* add type hints

* standardise class properties for easier re-use in future

* simplify key check

* add missing alternative type

* update docstring

* align name with Column arg

* fix NaN check

* fix type assertion

* create empty dict if constraints not provided

Co-authored-by: Niels Bantilan <[email protected]>

* decouple pandera and pandas dtypes (#559)

* refactor PandasDtype into class hierarchy supported by engines

* refactor DataFrameSchema based on DataType hierarchy

* refactor SchemaModel based on DataType hierarchy

* revert fix coerce=True and dtype=None should be a noop

* apply code style

* fix running tests/core with nox

* consolidate dtype names

* consolidate engine internal naming

* disable inherited __init__ with immutable(init=False)

* delete duplicated immutable

* disambiguate dtype variables

* add warning on base pandas_engine, numpy_engine.DataType init

* fix pylint, mypy errors

* fix DataFrameSchema.dtypes return type

* enable CI on dtypes branch

* Refactor inference, schema_statistics, strategies and io using the DataType hierarchy (#504)

* fix pandas_engine.Interval

* fix Timedelta64 registration with pandas_engine.Engine

* add DataType helpers

* add DataType.continuous attribute

* add dtypes.is_numeric

* refactor schema_statistics based on DataType hierarchy

* refactor schema_inference based on DataType hierarchy

* fix numpy_engine.Timedelta64.type

* add is_subdtype helper

* add Engine.get_registered_dtypes

* fix Engine error when registering a base DataType

* fix pandas_engine DateTime string alias

* clean up test_dtypes

* fix test_extensions

* refactor strategies based on DataType hierarchy

* refactor io based on DataType hierarchy

* replace dtypes module by new DataType hierarchy

* fix black

* delete dtypes_.py

* drop legacy pandas and python 3.6 from CI

* fix mypy errors

* fix ci-docs

* fix conda dependencies

* fix lint, update noxfile

* simplify nox tests, fix test_io

* update ci build

* update nox

* pin nox, handle windows data types

* fix windows platform

* fix pandas_engine on windows platform

* fix test_dtypes on windows platform

* force pip on docs CI

* test out windows dtype stuff

* more messing around with windows

* more debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* revert ci

* increase cache

* testing

Co-authored-by: cosmicBboy <[email protected]>

* Add DataTypes documentation (#536)

* delete print statements

* pin furo

* fix generated docs not removed by nox

* re-organize API section

* replace aliased pandas_engine data types with their aliases

* drop warning when calling Engine.register_dtype without arguments

* add data types to api reference doc

* add document for DataType refactor

* unpin sphinx and drop sphinx_rtd_theme

* add xdoctest

* ignore prompt when copying example from doc

* add doctest builder when running sphinx-build locally

* fix dtypes doc examples

* fix pandas_engine.DataType.check

* fix pylint

* remove whitespaces in dtypes doc

* Update docs/source/dtypes.rst

* Update dtypes.rst

* update docs structure

* update nox file

* force pip on doctests

* update test_schemas

* fix docs session not overriding html with doctest output

Co-authored-by: Niels Bantilan <[email protected]>

* add deprecation warnings for pandas_dtype and PandasDtype enum (#547)

* remove auto-generated docs

* add deprecation warnings, support pandas>=1.3.0

* add deprecation warnings for PandasDtype enum

* fix sphinx

* fix windows

* fix windows

* add support for pyarrow backed string data type (#548)

* add support for pyarrow backed string data type

* fix regression for pandas < 1.3.0

* add verbosity to test run

* loosen strategies unit tests deadline, exclude windows ci

* loosen test_strategies.py tests

* use "dev" hypothesis profile for python 3.7

* add pandas==1.2.5 test

* fix ci

* ci typo

* don't install environment.yml on unit tests

* install nox in ci

* remove environment.yml

* update environment in ci

Co-authored-by: cosmicBboy <[email protected]>

Co-authored-by: Jean-Francois Zinque <[email protected]>

* improve coverage

* fix docs

* add pandas accessor tests

* pin sphinx

* fix lint

Co-authored-by: Tom Collingwood <[email protected]>
Co-authored-by: Jean-Francois Zinque <[email protected]>