Docs/scaling - Bring Pandera to Spark and Dask #588

kvnkho · 2021-08-18T05:13:47Z

Here is a first pass of the scaling.rst we talked about. It shows how to scale pandera code to Spark and Dask using Fugue. Thanks for agreeing to collaborate; feedback is appreciated.

I don't know if you want to take it from here and find a place for this, but @goodwanghan and I can also make changes based on your feedback. Just let us know.

kvnkho · 2021-08-18T05:18:56Z

Sorry, this PR became a mess because I changed the base branch to dev. Will re-open.

codecov · 2021-08-26T13:16:25Z

Codecov Report

Merging #588 (5d33ef5) into dev (2779015) will increase coverage by 0.18%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##              dev     #588      +/-   ##
==========================================
+ Coverage   98.55%   98.73%   +0.18%     
==========================================
  Files          26       29       +3     
  Lines        3257     3327      +70     
==========================================
+ Hits         3210     3285      +75     
+ Misses         47       42       -5

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandera/checks.py | `98.50% <0.00%> (-0.06%)` | ⬇️ |
| pandera/io.py | `100.00% <0.00%> (ø)` | |
| pandera/errors.py | `100.00% <0.00%> (ø)` | |
| pandera/error_formatters.py | `95.45% <0.00%> (ø)` | |
| pandera/engines/numpy_engine.py | `100.00% <0.00%> (ø)` | |
| pandera/engines/type_aliases.py | `100.00% <0.00%> (ø)` | |
| pandera/check_utils.py | `100.00% <0.00%> (ø)` | |
| pandera/engines/utils.py | `100.00% <0.00%> (ø)` | |
| pandera/engines/pandas_engine.py | `99.29% <0.00%> (+<0.01%)` | ⬆️ |
| pandera/engines/engine.py | `98.82% <0.00%> (+5.41%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update 2779015...5d33ef5.

docs/source/scaling.rst

* add support for Any annotation in schema model (#594)
* add support for Any annotation in schema model
  the motivation behind this feature is to support column annotations that can have any type, to support use cases like the one described in #592, where custom checks can be applied to any column except for ones that are explicitly defined in the schema model class attributes
* update pylint, fix lint
* Docs/scaling - Bring Pandera to Spark and Dask (#588)
* scaling.rst
* edited conf
* finished first pass
* removing FugueWorkflow
* Update index.rst
* Update docs/source/scaling.rst
  Co-authored-by: Niels Bantilan <[email protected]>
* add support for timezone-aware datetime strategies
* fix le/ge strategies with datetime
* fix mypy errors

Co-authored-by: Niels Bantilan <[email protected]>
Co-authored-by: Kevin Kho <[email protected]>

@cosmicBboy

* Unique keyword arg (#580)
* add copy button to docs (#448)
* Add missing inplace arg to SchemaModel's validate (#450)
* link documentation to github (#449)
  Co-authored-by: Niels Bantilan <[email protected]>
* intermediate commit for review by @cosmicBboy
* link documentation to github (#449)
  Co-authored-by: Niels Bantilan <[email protected]>
* intermediate commit for review by @cosmicBboy
* WIP
* fix test errors, re-factor allow_duplicates handling
* fix io tests
* fix docs, remove _allow_duplicates private var
* update unique type signature in strategies
* completing tests for setters and lazy evaluation of unique kw
* small fix for the linting errors
* support dataframe-level uniqueness in strategies
* add docs, fix error formatting, add multiindex support
  Co-authored-by: Jean-Francois Zinque <[email protected]>
  Co-authored-by: tfwillems <[email protected]>
  Co-authored-by: fkroll8 <[email protected]>
  Co-authored-by: fkroll8 <[email protected]>
* Add support for timezone-aware datetime strategies (#595)
* add support for Any annotation in schema model (#594)
* add support for Any annotation in schema model
  the motivation behind this feature is to support column annotations that can have any type, to support use cases like the one described in #592, where custom checks can be applied to any column except for ones that are explicitly defined in the schema model class attributes
* update pylint, fix lint
* Docs/scaling - Bring Pandera to Spark and Dask (#588)
* scaling.rst
* edited conf
* finished first pass
* removing FugueWorkflow
* Update index.rst
* Update docs/source/scaling.rst
  Co-authored-by: Niels Bantilan <[email protected]>
* add support for timezone-aware datetime strategies
* fix le/ge strategies with datetime
* fix mypy errors
  Co-authored-by: Niels Bantilan <[email protected]>
  Co-authored-by: Kevin Kho <[email protected]>
* support frictionless primary keys with multiple fields

Co-authored-by: Jean-Francois Zinque <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: Kevin Kho <[email protected]>

@cosmicBboy

* Unique keyword arg (#580)
* add copy button to docs (#448)
* Add missing inplace arg to SchemaModel's validate (#450)
* link documentation to github (#449)
  Co-authored-by: Niels Bantilan <[email protected]>
* intermediate commit for review by @cosmicBboy
* link documentation to github (#449)
  Co-authored-by: Niels Bantilan <[email protected]>
* intermediate commit for review by @cosmicBboy
* WIP
* fix test errors, re-factor allow_duplicates handling
* fix io tests
* fix docs, remove _allow_duplicates private var
* update unique type signature in strategies
* completing tests for setters and lazy evaluation of unique kw
* small fix for the linting errors
* support dataframe-level uniqueness in strategies
* add docs, fix error formatting, add multiindex support
  Co-authored-by: Jean-Francois Zinque <[email protected]>
  Co-authored-by: tfwillems <[email protected]>
  Co-authored-by: fkroll8 <[email protected]>
  Co-authored-by: fkroll8 <[email protected]>
* Add support for timezone-aware datetime strategies (#595)
* add support for Any annotation in schema model (#594)
* add support for Any annotation in schema model
  the motivation behind this feature is to support column annotations that can have any type, to support use cases like the one described in #592, where custom checks can be applied to any column except for ones that are explicitly defined in the schema model class attributes
* update pylint, fix lint
* Docs/scaling - Bring Pandera to Spark and Dask (#588)
* scaling.rst
* edited conf
* finished first pass
* removing FugueWorkflow
* Update index.rst
* Update docs/source/scaling.rst
  Co-authored-by: Niels Bantilan <[email protected]>
* add support for timezone-aware datetime strategies
* fix le/ge strategies with datetime
* fix mypy errors
  Co-authored-by: Niels Bantilan <[email protected]>
  Co-authored-by: Kevin Kho <[email protected]>
* schemas with multi-index columns correctly report errors (#600)
  fixes #589
* strategies module supports undefined checks in regex columns (#599)
* Add support for empty data type annotation in SchemaModel (#602)
* remove artifacts of py3.6 support
* add support for empty data type annotation in SchemaModel
* fix frictionless version in dev dependencies
* fix setuptools version instead of frictionless
* fix setuptools pinning
* remove frictionless from core pandera deps (#609)
* support frictionless primary keys with multiple fields (#608)
* fix validation of check raising error without message (#613)
* docs/requirements.txt pin setuptools (#611)
* bump version 0.7.1

Co-authored-by: Jean-Francois Zinque <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: Kevin Kho <[email protected]>

kvnkho added 3 commits August 16, 2021 02:02

scaling.rst

1463bb2

edited conf

166076f

finished first pass

8e910e0

kvnkho changed the title ~~Docs/scaling~~ Docs/scaling - Bring Pandera to Spark and Dask Aug 18, 2021

kvnkho changed the base branch from master to dev August 18, 2021 05:16

kvnkho closed this Aug 18, 2021

removing FugueWorkflow

7e7c733

kvnkho reopened this Aug 21, 2021

kvnkho changed the base branch from dev to master August 21, 2021 19:23

cosmicBboy changed the base branch from master to dev August 26, 2021 13:06

Update index.rst

4231ddb

cosmicBboy reviewed Aug 26, 2021

docs/source/scaling.rst (outdated, resolved)

Update docs/source/scaling.rst

5d33ef5

kvnkho mentioned this pull request Aug 31, 2021

Great Expectations vs Pandera #590

Closed

cosmicBboy changed the base branch from dev to master September 1, 2021 18:05

cosmicBboy merged commit 84ea3c2 into unionai-oss:master Sep 1, 2021
