Releases: unionai-oss/pandera
Release 0.19.0: Polars validation support
✨ Highlights ✨
📣 Pandera now supports validation of polars.DataFrame and polars.LazyFrame 🐻‍❄️!
You can now do this:
import pandera.polars as pa
import polars as pl

class Schema(pa.DataFrameModel):
    state: str
    city: str
    price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})

lf = pl.LazyFrame(
    {
        'state': ['FL','FL','FL','CA','CA','CA'],
        'city': [
            'Orlando',
            'Miami',
            'Tampa',
            'San Francisco',
            'Los Angeles',
            'San Diego',
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
Schema.validate(lf).collect()
And of course you can do functional validation with decorators like so:
from pandera.typing.polars import LazyFrame

@pa.check_types
def function(lf: LazyFrame[Schema]) -> LazyFrame[Schema]:
    return lf.filter(pl.col("state").eq("CA"))

function(lf).collect()
You can read more about the integration here. Not all pandera features are supported at this point, but depending on community demand/contributions we'll slowly add them. To learn more about what's currently supported, check out this table.
Special shoutout to @AndriiG13 and @FilipAisot for their contributions on the built-in checks and polars datatypes, respectively, and to @evanrasmussen9, @baldwinj30, @obiii, @Filimoa, @philiporlando, @r-bar, @alkment, @jjfantini, and @robertdj for their early feedback and bug reports during the 0.19.0 beta.
What's Changed
- Support polars DataFrames, LazyFrames by @cosmicBboy, @AndriiG13, and @FilipAisot in #1373
- bugfix: optional columns in polars schema should no longer raise errors when not present by @cosmicBboy in #1532
- check_nullable does not uselessly compute isna() anymore in pandas backend by @smarie in #1538
- Polars LazyFrames are validated at the schema-level by default by @cosmicBboy in #1534
- Enable from_format_kwargs for dict format by @ektar in #1539
- Convert docs to myst by @cosmicBboy in #1542
- fix README(tab to space) by @np-yoe in #1544
- pandas DataFrameModel accepts python generic types by @cosmicBboy in #1547
- Backend registration happens at schema initialization by @cosmicBboy in #1548
- do not format if test is not necessary by @mattB1989 in #1530
- Register default backends when restoring state by @alkment in #1550
- Bump actions/setup-python from 4 to 5 by @dependabot in #1452
- fix: prevent environment pollution when importing pyspark by @sam-goodwin in #1552
- use rst to speed up api docs generation by @cosmicBboy in #1557
- Add _GenericAlias.call patch by @cosmicBboy in #1561
- support typeguard < 3 for better compatability by @cosmicBboy in #1563
- Add parse function to DataFrameModel in #1181
- localize GenericAlias patch to DataFrameBase subclasses by @cosmicBboy in #1571
- Bump idna from 3.4 to 3.7 by @dependabot in #1569
- docs: fix typo in env var name by @alekseik1 in #1562
- polars: fix element-wise checks, register backends by @cosmicBboy in #1572
- remove pytest ignore on modin, dask. pyspark tests with pandas >= 2 by @cosmicBboy in #1573
- make sure check name is propagated to error report by @cosmicBboy in #1574
- update ci to run pyspark, modin, dask with pandas >= v2 by @cosmicBboy in #1575
- use sphinx-design instead of sphinx-panels by @cosmicBboy in #1581
- Update bug_report.md by @philiporlando in #1585
- bugfix: polars column core checks now return check output by @cosmicBboy in #1586
- make pandera.typing.Series[TYPE] error in polars DataFrameModel more readable by @cosmicBboy in #1588
- implement timezone agnostic polars_engine.DateTime type by @cosmicBboy in #1589
- fix pyspark import error by @cosmicBboy in #1591
- fix pyspark tests when run on full test suite by @cosmicBboy in #1593
- Bugfix/1580 by @cosmicBboy in #1596
- Set pandas_io.from_frictionless_schema to use a raw string for docs by @mark-thm in #1597
- Add a generic Series type for polars by @baldwinj30 in #1595
- Add StructType and DDL extraction from Pandera schemas by @filipeo2-mck in #1570
- Clean up typing for pandas GenericDtype by @cosmicBboy in #1601
- Adding warning for unique in pyspark field and a test showing the issue as well as config when it works. by @zippeurfou in #1592
- bugfix/1607: coercion error should correctly report relevant failure cases by @cosmicBboy in #1608
- Create a common DataFrameSchema class, update mypy used in pre-commit by @cosmicBboy in #1609
- Dataframe column schema by @cosmicBboy in #1611
- bugfix: column-level coercion is properly implemented by @cosmicBboy in #1612
- update docs for polars by @cosmicBboy in #1613
- fix: properly coerce dtypes for columns with regex=True by @tesslinden in #1602
- rewrite Check class docstrings to remove pandas assumption by @cosmicBboy in #1614
- add tests for polars decorators by @cosmicBboy in #1615
New Contributors
- @smarie made their first contribution in #1538
- @ektar made their first contribution in #1539
- @np-yoe made their first contribution in #1544
- @alkment made their first contribution in #1550
- @sam-goodwin made their first contribution in #1552
- @alekseik1 made their first contribution in #1562
- @philiporlando made their first contribution in #1585
- @mark-thm made their first contribution in #1597
- @baldwinj30 made their first contribution in #1595
- @zippeurfou made their first contribution in #1592
- @tesslinden made their first contribution in #1602
Full Changelog: v0.18.3...v0.19.0
Beta release: v0.19.0b4
What's Changed
- fix pyspark tests when run on full test suite by @cosmicBboy in #1593
- Bugfix/1580 by @cosmicBboy in #1596
- Set pandas_io.from_frictionless_schema to use a raw string for docs by @mark-thm in #1597
- Add a generic Series type for polars by @baldwinj30 in #1595
- Add StructType and DDL extraction from Pandera schemas by @filipeo2-mck in #1570
- Clean up typing for pandas GenericDtype by @cosmicBboy in #1601
- Adding warning for unique in pyspark field and a test showing the issue as well as config when it works. by @zippeurfou in #1592
- bugfix/1607: coercion error should correctly report relevant failure cases by @cosmicBboy in #1608
- Create a common DataFrameSchema class, update mypy used in pre-commit by @cosmicBboy in #1609
New Contributors
- @mark-thm made their first contribution in #1597
- @baldwinj30 made their first contribution in #1595
- @zippeurfou made their first contribution in #1592
Full Changelog: v0.19.0b3...v0.19.0b4
Beta release: v0.19.0b3
What's Changed
- fix pyspark import error by @cosmicBboy in #1591
Full Changelog: v0.19.0b2...v0.19.0b3
Beta release 0.19.0b2
What's Changed
- do not format if test is not necessary by @mattB1989 in #1530
- Register default backends when restoring state by @alkment in #1550
- Bump actions/setup-python from 4 to 5 by @dependabot in #1452
- fix: prevent environment pollution when importing pyspark by @sam-goodwin in #1552
- use rst to speed up api docs generation by @cosmicBboy in #1557
- Add _GenericAlias.call patch by @cosmicBboy in #1561
- support typeguard < 3 for better compatability by @cosmicBboy in #1563
- Add parse function to DataFrameModel in #1181
- localize GenericAlias patch to DataFrameBase subclasses by @cosmicBboy in #1571
- Bump idna from 3.4 to 3.7 by @dependabot in #1569
- docs: fix typo in env var name by @alekseik1 in #1562
- polars: fix element-wise checks, register backends by @cosmicBboy in #1572
- remove pytest ignore on modin, dask. pyspark tests with pandas >= 2 by @cosmicBboy in #1573
- make sure check name is propagated to error report by @cosmicBboy in #1574
- update ci to run pyspark, modin, dask with pandas >= v2 by @cosmicBboy in #1575
- use sphinx-design instead of sphinx-panels by @cosmicBboy in #1581
- Update bug_report.md by @philiporlando in #1585
- bugfix: polars column core checks now return check output by @cosmicBboy in #1586
- make pandera.typing.Series[TYPE] error in polars DataFrameModel more readable by @cosmicBboy in #1588
- implement timezone agnostic polars_engine.DateTime type by @cosmicBboy in #1589
New Contributors
- @alkment made their first contribution in #1550
- @sam-goodwin made their first contribution in #1552
- @alekseik1 made their first contribution in #1562
- @philiporlando made their first contribution in #1585
Full Changelog: v0.19.0b1...v0.19.0b2
Beta release 0.19.0b1
What's Changed
- Support polars DataFrames, LazyFrames by @cosmicBboy in #1373
- bugfix: optional columns in polars schema should no longer raise errors when not present by @cosmicBboy in #1532
- check_nullable does not uselessly compute isna() anymore in pandas backend by @smarie in #1538
- Polars LazyFrames are validated at the schema-level by default by @cosmicBboy in #1534
- Enable from_format_kwargs for dict format by @ektar in #1539
- Convert docs to myst by @cosmicBboy in #1542
- fix README(tab to space) by @np-yoe in #1544
- pandas DataFrameModel accepts python generic types by @cosmicBboy in #1547
- Backend registration happens at schema initialization by @cosmicBboy in #1548
New Contributors
- @smarie made their first contribution in #1538
- @ektar made their first contribution in #1539
- @np-yoe made their first contribution in #1544
Full Changelog: v0.18.3...v0.19.0b1
Beta release 0.19.0b0: Polars integration
What's Changed
- Support polars DataFrames, LazyFrames by @cosmicBboy, @AndriiG13, and @FilipAisot in #1373
Full Changelog: v0.18.3...v0.19.0b0
Release v0.18.3: Bugfix issue with SeriesSchema Index validation
What's Changed
- bugfix: add index validation to SeriesSchema by @cosmicBboy in #1524
Full Changelog: v0.18.2...v0.18.3
Release v0.18.2: Docs fix - try pandera page.
docs fix release 0.18.2
Release v0.18.1: Granular control of validation on pandas dfs.
✨ Highlights ✨
Granular control of pandas validation #1490
There is now support for granular control of schema-level or data-level validations. This can be done via the PANDERA_VALIDATION_DEPTH environment variable. Schema-level (or metadata) validation includes things like column name checks and column data types, while data-level validation involves checks that operate on actual data values.
export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA # run both schema- and data-level checks (default)
export PANDERA_VALIDATION_DEPTH=SCHEMA_ONLY # only do schema-level checks
export PANDERA_VALIDATION_DEPTH=DATA_ONLY # only do data-level checks
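The effect of the three settings can be sketched in plain Python. This is an illustration of the semantics described above, not pandera's implementation: the Depth enum and the read_depth and enabled_check_kinds helpers are hypothetical names made up for this sketch.

```python
import os
from enum import Enum


class Depth(Enum):
    """Hypothetical stand-in for the three documented settings."""

    SCHEMA_ONLY = "SCHEMA_ONLY"
    DATA_ONLY = "DATA_ONLY"
    SCHEMA_AND_DATA = "SCHEMA_AND_DATA"


def read_depth(environ=None):
    """Read the setting, falling back to the documented default."""
    environ = os.environ if environ is None else environ
    raw = environ.get("PANDERA_VALIDATION_DEPTH", "SCHEMA_AND_DATA")
    return Depth(raw)


def enabled_check_kinds(depth):
    """Return which kinds of checks a given depth allows."""
    kinds = set()
    if depth in (Depth.SCHEMA_ONLY, Depth.SCHEMA_AND_DATA):
        kinds.add("schema")  # e.g. column names and column data types
    if depth in (Depth.DATA_ONLY, Depth.SCHEMA_AND_DATA):
        kinds.add("data")  # e.g. checks that look at actual values
    return kinds


# Show the check kinds each setting enables, using a dict as a fake environment.
for value in ("SCHEMA_AND_DATA", "SCHEMA_ONLY", "DATA_ONLY"):
    depth = read_depth({"PANDERA_VALIDATION_DEPTH": value})
    print(value, sorted(enabled_check_kinds(depth)))
```

Passing a dict instead of os.environ keeps the sketch free of side effects; in a real session the variable would be exported in the shell before Python starts.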
Efficient Hypothesis strategies #1503
Pandas data synthesis strategies now use comparison operator functions for more efficient data synthesis. This release also raises the minimum hypothesis version to 6.92.7.
What's Changed
- Fix copy-pasted docstring in PySpark accessor test by @deepyaman in #1448
- Mypy precommit by @cosmicBboy in #1468
- @check_types now properly passes in *args **kwargs and checks their types by @ecthompson99 in #1336
- Bump starlette from 0.27.0 to 0.36.2 in /dev by @dependabot in #1484
- Bump fastapi from 0.103.0 to 0.109.1 by @dependabot in #1482
- Bump actions/cache from 3 to 4 by @dependabot in #1478
- Bump codecov/codecov-action from 3 to 4 by @dependabot in #1477
- Bump jinja2 from 3.1.2 to 3.1.3 by @dependabot in #1459
- fix: pin multimethod dep version (#1485) by @schatimo in #1486
- Fix issue where str dtype in a multiindex dataframe schema results in invalid example by @gsugar87 in #1050
- Bump python-multipart from 0.0.6 to 0.0.7 by @dependabot in #1496
- Bump python-multipart from 0.0.6 to 0.0.7 in /dev by @dependabot in #1495
- Bump python-multipart from 0.0.6 to 0.0.7 in /ci by @dependabot in #1494
- Bump jinja2 from 3.1.2 to 3.1.3 in /ci by @dependabot in #1457
- Bump starlette from 0.27.0 to 0.36.2 in /dev by @dependabot in #1489
- Bugfix/1463 Pandas 2.2.0 FutureWarning resolution by using assignment instead of … by @derinwalters in #1464
- Bump jinja2 from 3.1.2 to 3.1.3 in /dev by @dependabot in #1458
- add pandas 2.2.0 to tests, use uv for pip compile by @cosmicBboy in #1502
- Efficient Hypothesis strategies by @Zac-HD in #1503
- remove headers in requirements files by @cosmicBboy in #1512
- Granular validations on pandas dfs by @kykyi in #1490
New Contributors
- @deepyaman made their first contribution in #1448
- @ecthompson99 made their first contribution in #1336
- @schatimo made their first contribution in #1486
- @gsugar87 made their first contribution in #1050
- @Zac-HD made their first contribution in #1503
Full Changelog: v0.18.0...v0.18.1
Release v0.18.0: Pandas schemas supports global configuration
✨ Highlight ✨
Pandera now supports the configuration environment variable PANDERA_VALIDATION_ENABLED. Setting it to False now globally deactivates validation:
export PANDERA_VALIDATION_ENABLED=False
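What a global off switch means in practice can be sketched as follows. This is a conceptual model only: validation_enabled and validate are hypothetical helpers, and the rule that only the literal string "False" disables validation is an assumption of this sketch, not a statement about how pandera parses the variable.

```python
import os


def validation_enabled(environ=None):
    """Hypothetical helper: only the string 'False' switches validation off."""
    environ = os.environ if environ is None else environ
    return environ.get("PANDERA_VALIDATION_ENABLED", "True") != "False"


def validate(data, check, environ=None):
    """Run `check` on `data` unless validation is globally disabled."""
    if not validation_enabled(environ):
        return data  # validation skipped: data passes through untouched
    if not check(data):
        raise ValueError("validation failed")
    return data


def in_range(prices):
    """Illustrative check: every price must lie between 5 and 20."""
    return all(5 <= p <= 20 for p in prices)


bad_prices = [8, 12, 99]

# With the switch off, even invalid data is returned unchanged.
result = validate(bad_prices, in_range, {"PANDERA_VALIDATION_ENABLED": "False"})
print(result is bad_prices)
```

The trade-off is the usual one for global switches: disabling validation removes its runtime cost, but invalid data then flows through silently.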
What's Changed
- Bump urllib3 from 2.0.4 to 2.0.7 by @dependabot in #1383
- Bump urllib3 from 2.0.5 to 2.0.7 in /dev by @dependabot in #1382
- Bump urllib3 from 2.0.4 to 2.0.7 in /ci by @dependabot in #1381
- Bugfix/1278 add_missing_columns assorted bugfixes by @derinwalters in #1372
- Fix lack of support for new TimestampNTZType in Spark 3.4 datatypes by @filipeo2-mck in #1385
- Current pip-compile usage does not have --no-emit-index-url by @filipeo2-mck in #1390
- Avoid throwing exception on Union types by @mjgp2 in #1378
- Fix optional fields in PySpark SQL by @filipeo2-mck in #1387
- Add support for unique validation in PySpark by @filipeo2-mck in #1396
- Enhancement to support GeoDataFrame, Geometry coercion, and CRS (Feature/1108) by @derinwalters in #1392
- fix issue for optional fields by @coobas in #1258
- Fix validating pyspark dataframes with regex columns by @lexanth in #1397
- Bump pyarrow from 13.0.0 to 14.0.1 by @dependabot in #1417
- Bump pyarrow from 13.0.0 to 14.0.1 in /dev by @dependabot in #1416
- Bump pyarrow from 13.0.0 to 14.0.1 in /ci by @dependabot in #1415
- [BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=True by @filipeo2-mck in #1403
- Add Date type to pandera.all by @diederikperdok in #1419
- Fix disabling validation for PySpark DataFrame Schemas by @maxispeicher in #1407
- Bump actions/checkout from 3 to 4 by @dependabot in #1361
- [PySpark] Improve validation performance by enabling cache()/unpersist() toggles by @filipeo2-mck in #1414
- Bump urllib3 from 2.0.5 to 2.0.7 by @dependabot in #1420
- Generate localized timestamps in multiindex examples by @rob-sil in #1426
- feature: support string column validation for pandas 2.1.3 by @karlma821 in #1425
- Add support for PANDERA_VALIDATION_ENABLED for pandas and Configuration docs by @noklam in #1354
- update total download badge and fix contributing instructions by @cosmicBboy in #1436
- update cache dataframe config args, fix tests by @cosmicBboy in #1437
- Bump jupyter-server from 2.7.3 to 2.11.2 in /dev by @dependabot in #1440
- Bump cryptography from 41.0.4 to 41.0.6 by @dependabot in #1435
- Bump jupyter-server from 2.7.2 to 2.11.2 by @dependabot in #1441
New Contributors
- @filipeo2-mck made their first contribution in #1385
- @mjgp2 made their first contribution in #1378
- @coobas made their first contribution in #1258
- @lexanth made their first contribution in #1397
- @diederikperdok made their first contribution in #1419
- @maxispeicher made their first contribution in #1407
- @rob-sil made their first contribution in #1426
- @karlma821 made their first contribution in #1425
- @noklam made their first contribution in #1354
Full Changelog: v0.17.2...v0.18.0