add pandas pyarrow backend support #1628
There seems to be a bug in pandas.
thanks @aaravind100! looks like there're failing tests:
You can use
You'll need If you can run one specific test env:
This might be for backwards compatibility reasons. I think when pandas < 2 supported pyarrow the string alias
Ah that makes sense. Do you suggest fixing it with a condition?
yes, using a condition is okay. also, we need to account for pandas < 2, see failures here: https://github.com/unionai-oss/pandera/actions/runs/9005141365/job/24742482507?pr=1628
I think the condition to define those classes needs to check a) if pyarrow is installed and b) if pandas >= 2. You can use this function: pandera/pandera/engines/utils.py Line 12 in 63140c9
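A minimal sketch of the two-part condition described above, using only the standard library. The helper names `is_installed` and `major_version` are hypothetical and are not the pandera utility referenced in the comment; the flag names merely mirror the ones used later in this thread, and how pandera actually computes them is an assumption here.

```python
from importlib import metadata
from importlib.util import find_spec


def is_installed(package: str) -> bool:
    """Return True if `package` can be found by the import system."""
    return find_spec(package) is not None


def major_version(package: str) -> int:
    """Return the major version of an installed distribution, or -1 if absent."""
    try:
        return int(metadata.version(package).split(".")[0])
    except (metadata.PackageNotFoundError, ValueError):
        return -1


# Hypothetical flags: a) is pyarrow installed, b) is pandas >= 2.
PYARROW_INSTALLED = is_installed("pyarrow")
PANDAS_2_0_0_PLUS = major_version("pandas") >= 2

if PYARROW_INSTALLED and PANDAS_2_0_0_PLUS:
    # Only in this branch would the pyarrow-backed dtype classes be defined.
    pass
```

Computing both flags once at import time keeps the class definitions and the tests gated on the same condition.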
c4b8e01 to a449052
@cosmicBboy i ended up removing "string[pyarrow]" in the equivalents for
Hey @cosmicBboy, what's the best way to run the entire CI test suite locally?
@aaravind100 you can simply do For version specific tests, you can do
thanks, let me run it across a subset before pushing a commit. This is what i meant by disk space usage 😄
du -h --max-depth=1 .
37K     ./dev
191M    ./tests
30K     ./pandera.egg-info
6.0K    ./.vscode
6.0K    ./scripts
5.0M    ./.git
18K     ./__pycache__
54G     ./.nox-mamba  # <--
1.7M    ./.hypothesis
4.9M    ./pandera
89K     ./.pytest_cache
223K    ./ci
18K     ./asv_bench
502K    ./docs
1.9M    ./htmlcov
5.5M    ./.mypy_cache
73K     ./.github
54G     .
lol yeah I experience this too... yeah just nuke
tests/core/test_pandas_engine.py
Outdated
    data_type
    for data_type in pandas_engine.Engine.get_registered_dtypes()
    if data_type
    != pandas_engine.ArrowString  # `string[pyarrow]` gets parsed to `string` by pandas
one strategy here is to do a check in the `test_pandas_data_type` function body and `pytest.skip` if pandas > 2 and pyarrow is installed.
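The skip-in-the-body strategy could look roughly like the sketch below. The two flags and the `REGISTERED_DTYPES` list are hypothetical stand-ins (plain strings instead of pandera dtype classes) so the example runs without pandas or pyarrow; only `pytest.mark.parametrize` and `pytest.skip` are real pytest APIs.

```python
import pytest

# Hypothetical stand-ins for the environment flags and the dtype registry.
PYARROW_INSTALLED = True
PANDAS_2_0_0_PLUS = True
REGISTERED_DTYPES = ["int64", "string", "string[pyarrow]"]


@pytest.mark.parametrize("data_type", REGISTERED_DTYPES)
def test_data_type(data_type):
    # Skip inside the test body instead of filtering the parametrize list,
    # so the skipped case still shows up in the test report.
    if (
        data_type == "string[pyarrow]"
        and PYARROW_INSTALLED
        and PANDAS_2_0_0_PLUS
    ):
        pytest.skip("`string[pyarrow]` gets parsed to `string` by pandas")
    assert isinstance(data_type, str)
```

Compared with filtering the parameter list, this keeps the excluded dtype visible as a skipped test rather than dropping it silently.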
i fixed it like this, i see this pattern in some other tests.
UNSUPPORTED_DTYPE_CLS: set[Any] = set()

# `string[pyarrow]` gets parsed to type `string` by pandas
if pandas_engine.PYARROW_INSTALLED and pandas_engine.PANDAS_2_0_0_PLUS:
    UNSUPPORTED_DTYPE_CLS.add(pandas_engine.ArrowString)


@pytest.mark.parametrize(
    "data_type",
    [
        data_type
        for data_type in pandas_engine.Engine.get_registered_dtypes()
        if data_type not in UNSUPPORTED_DTYPE_CLS
    ],
)
def test_pandas_data_type(data_type):
    ...
I just want to make sure it passes all the ci checks :D
lgtm
fdda32f to f06ef84
Ran the subset for python 3.10 and 3.11. Just this test for polars fails.
FAILED tests/polars/test_polars_container.py::test_dataframe_column_level_coerce - NotImplementedError
@aaravind100 if you rebase on the
Signed-off-by: Ajith Aravind <[email protected]>
`string[pyarrow]` gets parsed to type `string` by pandas Signed-off-by: Ajith Aravind <[email protected]>
f06ef84 to 3fd0468
Thanks, it passes now. Pushing the commit now.
tests/core/test_pandas_engine.py
Outdated
@@ -14,9 +15,20 @@
from pandera.engines import pandas_engine
from pandera.errors import ParserError


UNSUPPORTED_DTYPE_CLS: set[Any] = set()
use `typing.Set` here for compat with py3.8
oops that was a force of habit, let me fix it
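For context on the review comment: on Python 3.8 the built-in `set` is not subscriptable, and a module-level annotation such as `set[Any]` is evaluated at import time (unless `from __future__ import annotations` is in effect), so it raises `TypeError`. A sketch of the py3.8-compatible spelling, with a placeholder entry added purely for illustration:

```python
from typing import Any, Set

# `typing.Set` can be subscripted on Python 3.8, whereas the built-in
# `set[Any]` form requires Python 3.9 or newer.
UNSUPPORTED_DTYPE_CLS: Set[Any] = set()

# Placeholder entry for illustration only; the PR adds a dtype class here.
UNSUPPORTED_DTYPE_CLS.add(str)
```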
Signed-off-by: Ajith Aravind <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##             main    #1628       +/-   ##
===========================================
- Coverage   94.29%   83.27%   -11.02%
===========================================
  Files          91      116       +25
  Lines        7024     8646     +1622
===========================================
+ Hits         6623     7200      +577
- Misses        401     1446     +1045
View full report in Codecov by Sentry.
this is amazing @aaravind100! all tests are passing (codecov is a false positive). gonna do one last review of the code over the weekend, but this is a huge feature set to support in pandera ❤️
Adds support for a subset of common pyarrow data types.
fixes: #1262