
PDEP-10: Add pyarrow as a required dependency #52711

Merged: 40 commits, Jul 30, 2023

Commits
89a3a3b
Start pdep 10
mroeschke Apr 14, 2023
cf88b43
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke Apr 17, 2023
dafa709
finish drawbacks, fix other sections
mroeschke Apr 17, 2023
5e1fbd1
Add number
mroeschke Apr 17, 2023
44a3321
our current version is 7 not 6
mroeschke Apr 17, 2023
ea9f5e3
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke Apr 18, 2023
fbd1aa0
Clarify and fix typo
mroeschke Apr 18, 2023
6d667b4
Update web/pandas/pdeps/0010-required-pyarrow-dependency.md
phofl Apr 21, 2023
bed5f0b
Update web/pandas/pdeps/0010-required-pyarrow-dependency.md
phofl Apr 21, 2023
12622bb
Update web/pandas/pdeps/0010-required-pyarrow-dependency.md
phofl Apr 21, 2023
864b8d1
Add string as a preferential pyarrow type
mroeschke Apr 21, 2023
2d4f4fd
Add metric about number of pyarrow import checks
mroeschke Apr 21, 2023
bb332ca
Clarify with actual call
mroeschke Apr 21, 2023
a8275fa
Clarify with actual call
mroeschke Apr 21, 2023
1148007
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke Apr 28, 2023
b406dc1
Address some comments
mroeschke Apr 28, 2023
ecc4d5b
Update 0010-required-pyarrow-dependency.md
phofl Apr 28, 2023
ec1c0e3
Update 0010-required-pyarrow-dependency.md
phofl Apr 28, 2023
23eb251
add Patrick as an author, remove constraint on only bumping during ma…
mroeschke Apr 28, 2023
dd7c62a
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke May 9, 2023
2ddd82a
Change required proposal for 3.0 to be version requiring pyarrow & st…
mroeschke May 9, 2023
3c54d22
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke May 9, 2023
1b60fbb
Address typos
mroeschke May 9, 2023
70cdf74
Merge branch 'main' into pdep/pyarrow
mroeschke May 24, 2023
14602a6
Merge branch 'main' into pdep/pyarrow
mroeschke Jun 1, 2023
2cfb92f
Merge branch 'main' into pdep/pyarrow
mroeschke Jun 9, 2023
e0e406c
Merge branch 'main' into pdep/pyarrow
mroeschke Jun 20, 2023
f047032
Update 0010-required-pyarrow-dependency.md
phofl Jul 2, 2023
ed28c04
Update web/pandas/pdeps/0010-required-pyarrow-dependency.md
phofl Jul 3, 2023
99de932
Update 0010-required-pyarrow-dependency.md
phofl Jul 4, 2023
99fd739
Update 0010-required-pyarrow-dependency.md
phofl Jul 4, 2023
9384bc7
Update 0010-required-pyarrow-dependency.md
phofl Jul 4, 2023
c3beeb3
Update 0010-required-pyarrow-dependency.md
phofl Jul 4, 2023
8347e83
improve structure, list user benefits more clearly, add faq
MarcoGorelli Jul 5, 2023
d740403
restore little demo
MarcoGorelli Jul 5, 2023
959873e
remove masked part, note that pyarrow dtyeps will likely be ready by 3
MarcoGorelli Jul 5, 2023
f936280
Merge pull request #26 from MarcoGorelli/pdep10-amendments
mroeschke Jul 6, 2023
2db0037
Update 0010-required-pyarrow-dependency.md
phofl Jul 13, 2023
c2b8cfe
Merge branch 'main' into pdep/pyarrow
mroeschke Jul 25, 2023
4e05151
Update 0010-required-pyarrow-dependency.md
phofl Jul 30, 2023
215 changes: 215 additions & 0 deletions web/pandas/pdeps/0010-required-pyarrow-dependency.md
# PDEP-10: PyArrow as a required dependency for default string inference implementation

- Created: 17 April 2023
- Status: Accepted
- Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711)
              [#52509](https://github.com/pandas-dev/pandas/issues/52509)
- Author: [Matthew Roeschke](https://github.com/mroeschke)
          [Patrick Hoefler](https://github.com/phofl)
- Revision: 1

## Abstract

This PDEP proposes that:

- PyArrow becomes a required runtime dependency starting with pandas 3.0
- The minimum supported version of PyArrow starting with pandas 3.0 will be PyArrow 7.0.
- When the minimum version of PyArrow is bumped, PyArrow will be bumped to the highest version that has
been released for at least 2 years.
- The pandas 2.1 release notes will have a big warning that PyArrow will become a required dependency starting
with pandas 3.0. We will pin a feedback issue on the pandas issue tracker. The note in the release notes will point
to that issue.
- Starting in pandas 2.2, pandas will raise a ``FutureWarning`` at import time when PyArrow is not installed in the user's
environment. This ensures that only one warning is raised and that users can
easily silence it if necessary (see the sketch after this list). This warning will point to the feedback issue.
- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string`
instead of `object`. Additionally, the other dtypes listed below will be inferred with PyArrow-backed types instead of being stored as `object`.
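
For users who cannot install PyArrow immediately, the import-time ``FutureWarning`` mentioned above can be silenced
with the standard library's warnings machinery. A minimal sketch, assuming the warning is emitted when pandas is
imported (the exact warning message is not specified here):

```python
import warnings

# Suppress the single import-time FutureWarning about PyArrow becoming a
# required dependency (sketch; adjust the filter to the actual warning text).
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    import pandas as pd
```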

This will bring **immediate benefits to users**, as well as opening up the door for significant further
benefits in the future.

## Background

PyArrow is an optional dependency of pandas that provides a wide range of supplemental features to pandas:

- Since pandas 0.21.0, PyArrow provided I/O reading functionality for Parquet
- Since pandas 1.2.0, pandas integrated PyArrow into the `ExtensionArray` interface to provide an
optional string data type backed by PyArrow
- Since pandas 1.4.0, PyArrow provided I/O reading functionality for CSV
- Since pandas 1.5.0, pandas provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow
data types within the `ExtensionArray` interface
- Since pandas 2.0.0, all I/O readers have the option to return PyArrow-backed data types, and many methods
now utilize PyArrow compute functions to accelerate operations on PyArrow-backed data in pandas, notably string and datetime types.

As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as:

1. Consistent `NA` support for all data types;
2. Broader support of data types such as `decimal`, `date` and nested types;
3. Better interoperability with other dataframe libraries based on Arrow.
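
A minimal sketch of the first two points, using the `ArrowDtype` support that already exists in pandas 2.0
(exact reprs may differ):

```python
import datetime
import decimal

import pandas as pd
import pyarrow as pa

# 1. Consistent NA support: missing values behave the same way for every Arrow type.
ints = pd.Series([1, None, 3], dtype="int64[pyarrow]")
print(ints.isna().tolist())  # [False, True, False]

# 2. Broader type support: decimal and date values get real dtypes instead of object.
decimals = pd.Series(
    [decimal.Decimal("1.10"), decimal.Decimal("2.25")],
    dtype=pd.ArrowDtype(pa.decimal128(5, 2)),
)
dates = pd.Series([datetime.date(2023, 4, 17)], dtype=pd.ArrowDtype(pa.date32()))
print(decimals.dtype, dates.dtype)
```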

## Motivation

While all the functionality described in the previous paragraph is currently optional, PyArrow has significant
integration into many areas of pandas. With our roadmap noting that pandas strives for better Apache Arrow
interoperability [^1] and many projects [^2], within or beyond the Python ecosystem, adopting or interacting with
the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow
ecosystem (as well as improving interoperability with it).

### Immediate User Benefit 1: pyarrow strings

Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type
is `object`, which has significantly worse memory usage and performance than pyarrow strings.
With pyarrow string support available since pandas 1.2.0, requiring pyarrow for 3.0 will allow pandas to default
the inferred type to the more efficient pyarrow string type.

```python
In [1]: import pandas as pd

In [2]: pd.Series(["a"]).dtype
# Current behavior
Out[2]: dtype('O')

# Future behavior in 3.0
Out[2]: string[pyarrow]
```

Dask developers investigated performance and memory of pyarrow strings [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes),
and found them to be a significant improvement over the current `object` dtype.

Little demo:
```python
import string
import random

import pandas as pd


def random_string() -> str:
    return "".join(random.choices(string.printable, k=random.randint(10, 100)))


ser_object = pd.Series([random_string() for _ in range(1_000_000)])
ser_string = ser_object.astype("string[pyarrow]")
```

PyArrow-backed strings are significantly faster than NumPy object strings:

*str.len*

```python
In[1]: %timeit ser_object.str.len()
118 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In[2]: %timeit ser_string.str.len()
24.2 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

*str.startswith*

```python
In[3]: %timeit ser_object.str.startswith("a")
136 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In[4]: %timeit ser_string.str.startswith("a")
11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

### Immediate User Benefit 2: Nested Datatypes

Currently, if you try storing `dict`s in a pandas `Series`, you will again get the horrendous `object` dtype:
```python
In [6]: pd.Series([{'a': 1, 'b': 2}, {'a': 2, 'b': 99}])
Out[6]:
0 {'a': 1, 'b': 2}
1 {'a': 2, 'b': 99}
dtype: object
```

If `pyarrow` were required, this could instead be auto-inferred as `pyarrow.struct`, which again
would come with memory and performance improvements.
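
With PyArrow installed, the efficient representation is already available today via an explicit dtype. A sketch of
roughly what automatic inference would give; the explicit `ArrowDtype` spelling below is what works in pandas 2.x,
and the dtype actually inferred in 3.0 may be spelled differently:

```python
import pandas as pd
import pyarrow as pa

data = [{"a": 1, "b": 2}, {"a": 2, "b": 99}]

# Today: object dtype unless the user opts in explicitly.
ser_object = pd.Series(data)

# Explicit opt-in today; roughly what inference would produce under this proposal.
struct_type = pa.struct([("a", pa.int64()), ("b", pa.int64())])
ser_struct = pd.Series(data, dtype=pd.ArrowDtype(struct_type))

print(ser_object.dtype, ser_struct.dtype)
print(ser_object.memory_usage(deep=True), ser_struct.memory_usage(deep=True))
```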

### Immediate User Benefit 3: Interoperability

Other Arrow-backed dataframe libraries are growing in popularity. Having the same memory representation
would improve interoperability with them, as operations such as:
```python
import pandas as pd
import polars as pl

df = pd.DataFrame(
    {
        'a': ['one', 'two'],
        'b': [{'name': 'Billy', 'age': 3}, {'name': 'Bob', 'age': 4}],
    }
)
pl.from_pandas(df)
```
could be zero-copy. Users making use of multiple dataframe libraries would more easily be able to
switch between them.
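
A sketch of the hand-off, assuming the columns are already Arrow-backed (explicit today, inferred by default under
this proposal); conversion to a `pyarrow.Table` can then reuse the existing Arrow buffers rather than copying them:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": pd.Series(["one", "two"], dtype="string[pyarrow]")})

# The Arrow-backed column can be exported to pyarrow without first converting
# through NumPy object arrays.
table = pa.Table.from_pandas(df)
print(table.schema)
```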

### Future User Benefits

Requiring PyArrow would simplify related development within pandas and would allow functionality that is currently
implemented with NumPy, but better suited to PyArrow, to be improved, including:

- Avoiding runtime checks for whether PyArrow is available before performing PyArrow object inference during constructor or indexing operations
  > Review comment (Member): Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?
  >
  > Review comment (@MarcoGorelli, Jul 3, 2023): not sure this comment by Will has been addressed (unless I missed it?)

- NumPy object dtype will be avoided as much as possible. This means that every dtype that has a PyArrow equivalent is inferred automatically as such (a short sketch follows this list). This includes:
- decimal
- binary
- nested types (list or dict data)
- strings
- time
- date
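
A hedged before/after sketch for two of the listed dtypes; the explicit Arrow dtypes spelled out below are what
automatic inference would aim to produce, and the exact inferred dtypes in 3.0 may differ:

```python
import datetime

import pandas as pd
import pyarrow as pa

# Today: binary and time values fall back to object dtype.
print(pd.Series([b"\x00\x01", b"\x02"]).dtype)                         # object
print(pd.Series([datetime.time(12, 30), datetime.time(8, 0)]).dtype)   # object

# Explicit Arrow dtypes available today; roughly what inference would produce.
print(pd.Series([b"\x00\x01", b"\x02"], dtype=pd.ArrowDtype(pa.binary())).dtype)
print(
    pd.Series(
        [datetime.time(12, 30), datetime.time(8, 0)],
        dtype=pd.ArrowDtype(pa.time64("us")),
    ).dtype
)
```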

#### Developer benefits

First, this would simplify development of pyarrow-backed datatypes, as it would avoid
optional dependency checks.
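
For context, the kind of check being referred to looks roughly like the pattern below; this is a simplified sketch of
how pandas' internal `import_optional_dependency` helper is used, not a verbatim excerpt, and `maybe_use_pyarrow` is a
hypothetical function. Making PyArrow required removes the need for the fallback branch:

```python
# Simplified sketch of the optional-dependency pattern; not actual pandas code.
from pandas.compat._optional import import_optional_dependency


def maybe_use_pyarrow(values):
    # Hypothetical helper illustrating the runtime check.
    pa = import_optional_dependency("pyarrow", errors="ignore")
    if pa is None:
        # PyArrow not installed: fall back to a NumPy object-dtype path.
        return list(values)
    # PyArrow available: use the Arrow-backed path.
    return pa.array(values)
```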

Second, it could potentially remove redundant functionality:
- fastparquet engine in `read_parquet`;
- potentially simplifying the `read_csv` logic (needs more investigation);
- factorization;
- datetime/timezone ops.

## Drawbacks

Including PyArrow would naturally increase the installation size of pandas. For example, when installing pandas and PyArrow
using pip from wheels, NumPy and pandas together require about `70MB`, and PyArrow requires an additional `120MB`.
This increase in installation size would have negative implications for using pandas in space-constrained development or deployment environments
such as AWS Lambda.

Additionally, if a user is installing pandas in an environment where wheels are not available through a `pip install` or `conda install`,
the user will need to also build Arrow C++ and related dependencies when installing from source. These environments include

- Alpine Linux (commonly used as a base for Docker containers)
- WASM (pyodide and pyscript)
- Python development versions

Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadence. For example, when
supporting a newly released Python version, pandas will also need to take into account PyArrow's wheel support for that Python version
before releasing a new pandas version.

## F.A.Q.

**Q: Why can't pandas just use numpy string and numpy void datatypes instead of pyarrow string and pyarrow struct?**

**A**: NumPy strings aren't yet available, whereas pyarrow strings are. The NumPy void datatype would be different from pyarrow struct,
and would not bring the same interoperability benefit with other arrow-based dataframe libraries.

**Q: Are all pyarrow dtypes ready? Isn't it too soon to make them the default?**

**A**: They will likely be ready by 3.0 - however, we're not making them the default (yet).
For example, `pd.Series([1, 2, 3])` will continue to be auto-inferred to be
`np.int64`. We will only change the default for dtypes which currently have no `numpy`-backed equivalent and which are
stored as `object` dtype, such as strings and nested datatypes.
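
In other words (a sketch of the intended behaviour; the 3.0 dtype shown for strings is the proposed default, not
current behaviour):

```python
import pandas as pd

# Unchanged: numeric data keeps its NumPy-backed default dtype.
pd.Series([1, 2, 3]).dtype    # int64, today and in 3.0

# Changed: data currently stored as object gains a PyArrow-backed default.
pd.Series(["a", "b"]).dtype   # object today -> string[pyarrow] in 3.0 (proposed)
```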

### PDEP-10 History

- 17 April 2023: Initial version
- 8 May 2023: Changed proposal to make pyarrow required in pandas 3.0 instead of 2.1
  > Review comment (Member): IMO the revision history only needs to include the updates for published PDEPs.
  >
  > Review comment (Member): I don't have a real opinion on this, there were some requests that we should include it in this case.
  >
  > Review comment (Member): just an observation. I'll be downvoting this proposal anyway so no real need to change this.
  >
  > Review comment (Contributor): I think it's worthwhile to indicate any major changes that came about as a result of the discussion.

[^1]: <https://pandas.pydata.org/docs/development/roadmap.html#apache-arrow-interoperability>
[^2]: <https://arrow.apache.org/powered_by/>