Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/420 #454

Merged
merged 12 commits into from
Apr 20, 2021
Merged

Feature/420 #454

merged 12 commits into from
Apr 20, 2021

Conversation

TColl
Copy link
Contributor

@TColl TColl commented Apr 3, 2021

Relates to #420

Here's a first attempt at parsing frictionless schema, with the following capabilities:

  • read from frictionless schema json/yaml files on disk
  • accept a frictionless Schema object directly from memory
  • map field names, types, constraints directly into pandera equivalents

The code to handle this is in pandera.io, and I've created a new FrictionlessFieldParser class to convert each field in the schema accordingly.

Currently pa.io.from_frictionless_schema() is as far as I've got in terms of an API - how/where do you want this to be imported from?

Documentation will also need updating - there are docstrings for everything I've added but nothing more than that at this point.

I've also added tests in tests.io to validate this functionality.

@TColl TColl changed the base branch from release/0.8.0 to release/0.7.0 April 4, 2021 14:29
@TColl TColl changed the title Draft: Feature/420 Feature/420 Apr 4, 2021
@TColl
Copy link
Contributor Author

TColl commented Apr 4, 2021

@cosmicBboy I've swapped this over to target v0.7.0 and added an example to the docstring so documentation should be a bit clearer now.

I've just realised I branched from master rather than from 0.7.0 directly, sorry - do you want this fixing so only my commits get covered by this MR?

Please let me know if you need anything else from me! I can possibly take a look at parsing open API schema too, though I haven't used them before. It should be easy to refactor what I've done here for anyone to extend coverage to other schema quite consistently in future, too (e.g. an abstract base class for column parsing, which is extended by FrictionlessField, OpenAPIColumn etc with a consistent API for constructing a pandera schema from them).

@cosmicBboy cosmicBboy closed this Apr 4, 2021
@cosmicBboy cosmicBboy reopened this Apr 4, 2021
@cosmicBboy cosmicBboy closed this Apr 4, 2021
@cosmicBboy cosmicBboy reopened this Apr 4, 2021
@cosmicBboy
Copy link
Collaborator

thanks @TColl this looks fantastic! ✨

I've just realised I branched from master rather than from 0.7.0 directly, sorry - do you want this fixing so only my commits get covered by this MR?

No worries, the branch situation is still evolving as more folks start contributing to the project! I just updated the release/0.7.0 branch to be on par with master. Would you mind rebasing your changes on top of the latest release/0.7.0 branch and re-opening the PR?

Will add comments/feedback after the rebase

@codecov
Copy link

codecov bot commented Apr 4, 2021

Codecov Report

Merging #454 (6a4282f) into release/0.7.0 (3191be9) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@                Coverage Diff                @@
##           release/0.7.0     #454      +/-   ##
=================================================
+ Coverage          99.45%   99.46%   +0.01%     
=================================================
  Files                 21       21              
  Lines               2572     2637      +65     
=================================================
+ Hits                2558     2623      +65     
  Misses                14       14              
Impacted Files Coverage Δ
pandera/io.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3191be9...6a4282f. Read the comment docs.

@TColl
Copy link
Contributor Author

TColl commented Apr 4, 2021

thanks @TColl this looks fantastic! ✨

I've just realised I branched from master rather than from 0.7.0 directly, sorry - do you want this fixing so only my commits get covered by this MR?

No worries, the branch situation is still evolving as more folks start contributing to the project! I just updated the release/0.7.0 branch to be on par with master. Would you mind rebasing your changes on top of the latest release/0.7.0 branch and re-opening the PR?

Will add comments/feedback after the rebase

I've rebased now, though it's not pretty!

@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Apr 4, 2021

huh, weird... can you try making a new PR off of this branch? I think you'll have to make a new branch name from this commit in your fork https://github.com/TColl/pandera/commit/a98aeb92edda632fb2165d1c855884f8f75a2a77

it looks like https://github.com/TColl/pandera/tree/feature/420 doubles up on the improve documentation and parse frictionless schema commits since I made those changes to github actions file.

TColl added 2 commits April 4, 2021 22:55
- using frictionless-py for some of the heavy lifting
- accept yaml/json/frictionless schema files/objects directly
- frictionless becomes a new requirement for io
- apply pre-commit formatting updates to other code in pandera.io
- add test to validate schema parsing, from yaml and json sources
@TColl
Copy link
Contributor Author

TColl commented Apr 4, 2021

huh, weird... can you try making a new PR off of this branch? I think you'll have to make a new branch name from this commit in your fork TColl@a98aeb9

it looks like https://github.com/TColl/pandera/tree/feature/420 doubles up on the improve documentation and parse frictionless schema commits since I made those changes to github actions file.

I've force-pushed a clean branch over the old one and cherry-picked the changes back in from the messy one - should be good to review now 👍

Copy link
Collaborator

@cosmicBboy cosmicBboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work @TColl ! I added a few suggestions and nit, see inline comments

pandera/io.py Outdated Show resolved Hide resolved
them to an equivalent pandera Column schema.

For this implementation, we are using field names, constraints and types
but leaving other frictionless parameters out (e.g. foreign keys, type
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can go back and add this feature when #331 is done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in anticipation of this, I've standardised all properties of FrictionlessFieldParser to reflect the properties we need for a pandera Column object - so when #331 is complete it'll be clearer where to add the new properties in, too.

pandera/io.py Outdated
self.type = field.get("type", "string")

@property
def dtype(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffzi just to give you a heads up, this might be one area we should revisit as your work with #369 is merged

pandera/io.py Outdated Show resolved Hide resolved
pandera/io.py Outdated Show resolved Hide resolved
pandera/io.py Outdated Show resolved Hide resolved
pandera/io.py Outdated Show resolved Hide resolved
pandera/io.py Outdated Show resolved Hide resolved
pandera/io.py Outdated Show resolved Hide resolved
@TColl
Copy link
Contributor Author

TColl commented Apr 5, 2021

@cosmicBboy after making these latest changes I'm getting pylint E1136 (unscriptable object) errors for the newly-used typing.Optional and typing.Union type hints with the current pre-commit setup - I've raised #455 to cover this

@cosmicBboy
Copy link
Collaborator

hi @TColl I think for now you can put a #pylint: disable=unsubscriptable-object comment above the places where you get this error, this is accounted for in 3.9 tests here.

I think #455 would be a good time to look at other, less opinionated linting options... I've been finding that pylint has become more and more of a burden, though maybe that's because of python's growing pains adding more features in recent versions. If we do decide to stick with pylint unpinning it will probably yield more issues, so we can localize those changes in #455.

It looks like there are also some punit test issues](https://github.com/pandera-dev/pandera/pull/454/checks?check_run_id=2267711686), are you able to reproduce them locally? See here to run the testing suite that's on par with what CI runs.

@TColl
Copy link
Contributor Author

TColl commented Apr 6, 2021

hi @TColl I think for now you can put a #pylint: disable=unsubscriptable-object comment above the places where you get this error, this is accounted for in 3.9 tests here.

I've resorted to running pre-commit once to check those are the only errors I get, then committing with a --no-verify flag to bypass pylint failing on those specific issues. Neither option is great, but adding pylint pragma always feels like a failure to me!

I think #455 would be a good time to look at other, less opinionated linting options... I've been finding that pylint has become more and more of a burden, though maybe that's because of python's growing pains adding more features in recent versions. If we do decide to stick with pylint unpinning it will probably yield more issues, so we can localize those changes in #455.

I'm a fan of pylint personally, but I think #455 is a good place to have that discussion!

It looks like there are also some unit test issues, are you able to reproduce them locally? See here to run the testing suite that's on par with what CI runs.

I get the same result locally - some of them are trivial, if annoying (nan != nan for some versions of python), and can be fixed with some updating of the unit tests themselves to be more robust. Others are more concerning to me (the NoneType object has no 'get' method ones) and will need a code tweak or two to get it sorted. I'll get this done in the next couple of days.

On a side note, I'm a big fan of #397 purely for it halving the number of nox cases to run through - that full suite takes a while to burn through!

@TColl
Copy link
Contributor Author

TColl commented Apr 16, 2021

@cosmicBboy sorry for the delay (I got a puppy, which has taken up most of my spare time!) but the unit tests are now fixed.This was mostly my unit testing not capturing various nuances between pandas versions so the actual code is largely unchanged.

Copy link
Collaborator

@cosmicBboy cosmicBboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delay (I got a puppy, which has taken up most of my spare time!)

nice! 🐶

thanks for the PR 🙏, and congrats to your first contribution to pandera 🎉! This is also the first of many integrations with other schema formats, so thanks for kicking that off 🍾

@cosmicBboy cosmicBboy merged this pull request into unionai-oss:release/0.7.0 Apr 20, 2021
@TColl TColl deleted the feature/420 branch April 21, 2021 16:51
cosmicBboy added a commit that referenced this pull request May 8, 2021
* parse frictionless schema

- using frictionless-py for some of the heavy lifting
- accept yaml/json/frictionless schema files/objects directly
- frictionless becomes a new requirement for io
- apply pre-commit formatting updates to other code in pandera.io
- add test to validate schema parsing, from yaml and json sources

* improve documentation

* update docstrings per code review

Co-authored-by: Niels Bantilan <[email protected]>

* add type hints

* standardise class properties for easier re-use in future

* simplify key check

* add missing alternative type

* update docstring

* align name with Column arg

* fix NaN check

* fix type assertion

* create empty dict if constraints not provided

Co-authored-by: Niels Bantilan <[email protected]>
jeffzi pushed a commit that referenced this pull request May 24, 2021
* parse frictionless schema

- using frictionless-py for some of the heavy lifting
- accept yaml/json/frictionless schema files/objects directly
- frictionless becomes a new requirement for io
- apply pre-commit formatting updates to other code in pandera.io
- add test to validate schema parsing, from yaml and json sources

* improve documentation

* update docstrings per code review

Co-authored-by: Niels Bantilan <[email protected]>

* add type hints

* standardise class properties for easier re-use in future

* simplify key check

* add missing alternative type

* update docstring

* align name with Column arg

* fix NaN check

* fix type assertion

* create empty dict if constraints not provided

Co-authored-by: Niels Bantilan <[email protected]>
cosmicBboy added a commit that referenced this pull request Jul 22, 2021
* parse frictionless schema

- using frictionless-py for some of the heavy lifting
- accept yaml/json/frictionless schema files/objects directly
- frictionless becomes a new requirement for io
- apply pre-commit formatting updates to other code in pandera.io
- add test to validate schema parsing, from yaml and json sources

* improve documentation

* update docstrings per code review

Co-authored-by: Niels Bantilan <[email protected]>

* add type hints

* standardise class properties for easier re-use in future

* simplify key check

* add missing alternative type

* update docstring

* align name with Column arg

* fix NaN check

* fix type assertion

* create empty dict if constraints not provided

Co-authored-by: Niels Bantilan <[email protected]>
cosmicBboy added a commit that referenced this pull request Jul 24, 2021
* Feature/420 (#454)

* parse frictionless schema

- using frictionless-py for some of the heavy lifting
- accept yaml/json/frictionless schema files/objects directly
- frictionless becomes a new requirement for io
- apply pre-commit formatting updates to other code in pandera.io
- add test to validate schema parsing, from yaml and json sources

* improve documentation

* update docstrings per code review

Co-authored-by: Niels Bantilan <[email protected]>

* add type hints

* standardise class properties for easier re-use in future

* simplify key check

* add missing alternative type

* update docstring

* align name with Column arg

* fix NaN check

* fix type assertion

* create empty dict if constraints not provided

Co-authored-by: Niels Bantilan <[email protected]>

* decouple pandera and pandas dtypes (#559)

* refactor PandasDtype into class hierarchy supported by engines

* refactor DataFrameSchema based on DataType hierarchy

* refactor SchemaModel based on DataType hierarchy

* revert fix coerce=True and dtype=None should be a noop

* apply code style

* fix running tests/core with nox

* consolidate dtype names

* consolidate engine internal naming

* disable inherited __init__ with immutable(init=False)

* delete duplicated immutable

* disambiguate dtype variables

* add warning on base pandas_engine, numpy_engine.DataType init

* fix pylint, mypy errors

* fix DataFrameSchema.dtypes return type

* enable CI on dtypes branch

* Refactor inference, schema_statistics, strategies and io using the DataType hierarchy (#504)

* fix pandas_engine.Interval

* fix Timedelta64 registration with pandas_engine.Engine

* add DataType helpers

* add DataType.continuous attribute

* add dtypes.is_numeric

* refactor schema_statistics based on DataType hierarchy

* refactor schema_inference based on DataType hierarchy

* fix numpy_engine.Timedelta64.type

* add is_subdtype helper

* add Engine.get_registered_dtypes

* fix Engine error when registering a base DataType

* fix pandas_engine DateTime string alias

* clean up test_dtypes

* fix test_extensions

* refactor strategies based on DataType hierarchy

* refactor io based on DataType hierarchy

* replace dtypes module by new DataType hierarchy

* fix black

* delete dtypes_.py

* drop legacy pandas and python 3.6 from CI

* fix mypy errors

* fix ci-docs

* fix conda dependencies

* fix lint, update noxfile

* simplify nox tests, fix test_io

* update ci build

* update nox

* pin nox, handle windows data types

* fix windows platform

* fix pandas_engine on windows platform

* fix test_dtypes on windows platform

* force pip on docs CI

* test out windows dtype stuff

* more messing around with windows

* more debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* revert ci

* increase cache

* testing

Co-authored-by: cosmicBboy <[email protected]>

* Add DataTypes documentation (#536)

* delete print statements

* pin furo

* fix generated docs not removed by nox

* re-organize API section

* replace aliased pandas_engine data types with their aliases

* drop warning when calling Engine.register_dtype without arguments

* add data types to api reference doc

* add document for DataType refactor

* unpin sphinx and drop sphinx_rtd_theme

* add xdoctest

* ignore prompt when copying example from doc

* add doctest builder when running sphinx-build locally

* fix dtypes doc examples

* fix pandas_engine.DataType.check

* fix pylint

* remove whitespaces in dtypes doc

* Update docs/source/dtypes.rst

* Update dtypes.rst

* update docs structure

* update nox file

* force pip on doctests

* update test_schemas

* fix docs session not overriding html with doctest output

Co-authored-by: Niels Bantilan <[email protected]>

* add deprecation warnings for pandas_dtype and PandasDtype enum (#547)

* remove auto-generated docs

* add deprecation warnings, support pandas>=1.3.0

* add deprecation warnings for PandasDtype enum

* fix sphinx

* fix windows

* fix windows

* add support for pyarrow backed string data type (#548)

* add support for pyarrow backed string data type

* fix regression for pandas < 1.3.0

* add verbosity to test run

* loosen strategies unit tests deadline, exclude windows ci

* loosen test_strategies.py tests

* use "dev" hypothesis profile for python 3.7

* add pandas==1.2.5 test

* fix ci

* ci typo

* don't install environment.yml on unit tests

* install nox in ci

* remove environment.yml

* update environment in ci

Co-authored-by: cosmicBboy <[email protected]>

Co-authored-by: Jean-Francois Zinque <[email protected]>

* improve coverage

* fix docs

* add pandas accessor tests

* pin sphinx

* fix lint

Co-authored-by: Tom Collingwood <[email protected]>
Co-authored-by: Jean-Francois Zinque <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants