From 89a3a3be9d08477e34a48cbc74e83d8d12dcd538 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 14 Apr 2023 10:53:07 -0700 Subject: [PATCH 01/29] Start pdep 10 --- .../pdeps/0010-required-pyarrow-dependency.md | 49 +++++++++++++++++++ 1 file changed, 49 insertions(+) create mode 100644 web/pandas/pdeps/0010-required-pyarrow-dependency.md diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md new file mode 100644 index 0000000000000..8c944db50f60c --- /dev/null +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -0,0 +1,49 @@ +# PDEP-10: PyArrow as a required dependency + +- Created: 13 April 2023 +- Status: Under discussion +- Discussion: [#?](https://github.com/pandas-dev/pandas/pull/?) + [#52509](https://github.com/pandas-dev/pandas/issues/52509) +- Author: [Matthew Roeschke](https://github.com/mroeschke) +- Revision: 1 + +## Abstract + +This PDEP proposes that: + +- PyArrow becomes a runtime dependency starting pandas 2.1 +- The minimum version of PyArrow supported starting pandas 2.1 is version 6. +- The minimum version of PyArrow will be bumped every major pandas release to the highest + PyArrow version that has been released for at least 2 years. + +## Background + +PyArrow has been an optional dependency of pandas since version 0.21.0. PyArrow +initially provided I/O reading functionality for formats such as Parquet and CSV. In pandas version 1.2, +pandas integrated PyArrow into the ExtensionArray interface to provide an optional string data type backed by PyArrow. +In pandas version 1.5 this functionality was expanded to support all data types that PyArrow supports. As of pandas version 2.0, +all I/O readers have the option to return PyArrow-backed data types, and a lot of methods now utilize PyArrow compute functions to +accelerate PyArrow-backed data in pandas, notibly string and datetime types. 
+ +## Motivation + +While all the functionality described in the previous paragraph is currently optional, PyArrow has significant integration into many areas +of pandas. With our roadmap noting that pandas strives for better Apache Arrow interoperability [^1] and many projects [^2], within or beyond the Python ecosystem, adopting or interacting with the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow +ecosystem to pandas users. + +Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve NumPy functionality that would be better suited +by PyArrow including: + +- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations +- Improve NumPy object data type support by default for analogous types that have native PyArrow support such as decimal, binary, and nested types + +## Drawbacks + +Including PyArrow would naturally increase the installation size of pandas. + +### PDEP-1 History + +- 13 April 2023: Initial version + +[^1] +[^2] From dafa709b2f8d6a9c6c5686ececb7800e533cc830 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Mon, 17 Apr 2023 14:03:07 -0700 Subject: [PATCH 02/29] finish drawbacks, fix other sections --- .../pdeps/0010-required-pyarrow-dependency.md | 35 ++++++++++++++----- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 8c944db50f60c..ba0c528387367 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -1,6 +1,6 @@ # PDEP-10: PyArrow as a required dependency -- Created: 13 April 2023 +- Created: 17 April 2023 - Status: Under discussion - Discussion: [#?](https://github.com/pandas-dev/pandas/pull/?) 
[#52509](https://github.com/pandas-dev/pandas/issues/52509) @@ -18,13 +18,20 @@ This PDEP proposes that: ## Background -PyArrow has been an optional dependency of pandas since version 0.21.0. PyArrow -initially provided I/O reading functionality for formats such as Parquet and CSV. In pandas version 1.2, -pandas integrated PyArrow into the ExtensionArray interface to provide an optional string data type backed by PyArrow. -In pandas version 1.5 this functionality was expanded to support all data types that PyArrow supports. As of pandas version 2.0, -all I/O readers have the option to return PyArrow-backed data types, and a lot of methods now utilize PyArrow compute functions to +PyArrow is an optional dependency of pandas that provides a wide range of supplimental feature to pandas: + +- Since pandas 0.21.0, PyArrow provided I/O reading functionality for Parquet +- Since pandas 1.2.0, pandas integrated PyArrow into the `ExtensionArray` interface to provide an optional string data type backed by PyArrow +- Since pandas 1.4.0, PyArrow provided I/0 reading functionality for CSV +- Since pandas 1.5.0, pandas provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow data types within the `ExtensionArray` interface +- Since pandas 2.0.0, All I/O readers have the option to return PyArrow-backed data types, and many methods now utilize PyArrow compute functions to accelerate PyArrow-backed data in pandas, notibly string and datetime types. +As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as: + +1. Consistent ``NA`` support for all data types +2. 
Broader support of data types such as ``decimal``, ``date`` and nested types + ## Motivation While all the functionality described in the previous paragraph is currently optional, PyArrow has significant integration into many areas @@ -35,15 +42,25 @@ Additionally, requiring PyArrow would simplify the related development within pa by PyArrow including: - Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations -- Improve NumPy object data type support by default for analogous types that have native PyArrow support such as decimal, binary, and nested types +- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as decimal, binary, and nested types ## Drawbacks -Including PyArrow would naturally increase the installation size of pandas. +Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow using pip from wheels, numpy and pandas +are about `70MB`, and PyArrow is around `120MB`. An increase of installation size would have negative impliciation using pandas in space-constrained development +or deployment environments such as AWS Lambda. + +Additionally, if a user is installing pandas in an environment where wheels are not available and needs to build from source, the user will need to build Arrow C++ and related dependencies. These environments include + +- Alpine linux (commonly used as a base for Docker containers) +- WASM (pyodide and pyscript) +- Python development versions + +Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadance. For example when supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version before releasing a new pandas version. 
### PDEP-1 History -- 13 April 2023: Initial version +- 17 April 2023: Initial version [^1] [^2] From 5e1fbd17bc2e999aeb33c2e9662b59a346c44d40 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Mon, 17 Apr 2023 15:03:14 -0700 Subject: [PATCH 03/29] Add number --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index ba0c528387367..809ab61915f2f 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -2,7 +2,7 @@ - Created: 17 April 2023 - Status: Under discussion -- Discussion: [#?](https://github.com/pandas-dev/pandas/pull/?) +- Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711) [#52509](https://github.com/pandas-dev/pandas/issues/52509) - Author: [Matthew Roeschke](https://github.com/mroeschke) - Revision: 1 From 44a33215f7396b151a8144e1a977d0d5fa0d283f Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Mon, 17 Apr 2023 15:44:19 -0700 Subject: [PATCH 04/29] our current version is 7 not 6 --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 809ab61915f2f..36b3df0bda9ad 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -12,7 +12,7 @@ This PDEP proposes that: - PyArrow becomes a runtime dependency starting pandas 2.1 -- The minimum version of PyArrow supported starting pandas 2.1 is version 6. +- The minimum version of PyArrow supported starting pandas 2.1 is version 7. 
- The minimum version of PyArrow will be bumped every major pandas release to the highest PyArrow version that has been released for at least 2 years. From fbd1aa02dcb31790d03a3bf4a81086e883f70095 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Tue, 18 Apr 2023 11:13:40 -0700 Subject: [PATCH 05/29] Clarify and fix typo --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 36b3df0bda9ad..3de373365970e 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -14,11 +14,12 @@ This PDEP proposes that: - PyArrow becomes a runtime dependency starting pandas 2.1 - The minimum version of PyArrow supported starting pandas 2.1 is version 7. - The minimum version of PyArrow will be bumped every major pandas release to the highest - PyArrow version that has been released for at least 2 years. + PyArrow version that has been released for at least 2 years, and the minimum PyArrow version will be + maintained for every minor version in the major version series. 
## Background -PyArrow is an optional dependency of pandas that provides a wide range of supplimental feature to pandas: +PyArrow is an optional dependency of pandas that provides a wide range of supplemental features to pandas: - Since pandas 0.21.0, PyArrow provided I/O reading functionality for Parquet - Since pandas 1.2.0, pandas integrated PyArrow into the `ExtensionArray` interface to provide an optional string data type backed by PyArrow From 6d667b483657b72107fee1daa65bfb563788df1b Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 21 Apr 2023 09:51:02 +0200 Subject: [PATCH 06/29] Update web/pandas/pdeps/0010-required-pyarrow-dependency.md Co-authored-by: Irv Lustig --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 3de373365970e..51ccb5a62188a 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -11,7 +11,7 @@ This PDEP proposes that: -- PyArrow becomes a runtime dependency starting pandas 2.1 +- PyArrow becomes a runtime dependency starting with pandas 2.1 - The minimum version of PyArrow supported starting pandas 2.1 is version 7. 
- The minimum version of PyArrow will be bumped every major pandas release to the highest PyArrow version that has been released for at least 2 years, and the minimum PyArrow version will be From bed5f0b84664b380b5a0db032883443182963a8e Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 21 Apr 2023 11:26:08 +0200 Subject: [PATCH 07/29] Update web/pandas/pdeps/0010-required-pyarrow-dependency.md Co-authored-by: Irv Lustig --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 51ccb5a62188a..ac86353d8a4cf 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -12,7 +12,7 @@ This PDEP proposes that: - PyArrow becomes a runtime dependency starting with pandas 2.1 -- The minimum version of PyArrow supported starting pandas 2.1 is version 7. +- The minimum version of PyArrow supported starting with pandas 2.1 is version 7 of PyArrow. - The minimum version of PyArrow will be bumped every major pandas release to the highest PyArrow version that has been released for at least 2 years, and the minimum PyArrow version will be maintained for every minor version in the major version series. 
From 12622bbb564f05e66267df8f4b9228d93505bd45 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 21 Apr 2023 11:26:17 +0200 Subject: [PATCH 08/29] Update web/pandas/pdeps/0010-required-pyarrow-dependency.md Co-authored-by: Irv Lustig --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index ac86353d8a4cf..cd83d97efb89f 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -25,7 +25,7 @@ PyArrow is an optional dependency of pandas that provides a wide range of supple - Since pandas 1.2.0, pandas integrated PyArrow into the `ExtensionArray` interface to provide an optional string data type backed by PyArrow - Since pandas 1.4.0, PyArrow provided I/0 reading functionality for CSV - Since pandas 1.5.0, pandas provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow data types within the `ExtensionArray` interface -- Since pandas 2.0.0, All I/O readers have the option to return PyArrow-backed data types, and many methods now utilize PyArrow compute functions to +- Since pandas 2.0.0, all I/O readers have the option to return PyArrow-backed data types, and many methods now utilize PyArrow compute functions to accelerate PyArrow-backed data in pandas, notibly string and datetime types. 
As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as: From 864b8d10497341f4f2b7c0b835fe189845606e2d Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 21 Apr 2023 11:14:58 -0700 Subject: [PATCH 09/29] Add string as a preferential pyarrow type --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index cd83d97efb89f..e0acb9c5eacbd 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -43,7 +43,7 @@ Additionally, requiring PyArrow would simplify the related development within pa by PyArrow including: - Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations -- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as decimal, binary, and nested types +- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as string, decimal, binary, and nested types ## Drawbacks From 2d4f4fd35ace69a56e38b200e3c1808a10164e93 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 21 Apr 2023 11:23:19 -0700 Subject: [PATCH 10/29] Add metric about number of pyarrow import checks --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 1 + 1 file changed, 1 insertion(+) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index e0acb9c5eacbd..89484c4736b32 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -43,6 +43,7 @@ Additionally, requiring PyArrow 
would simplify the related development within pa by PyArrow including: - Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations + - Currently, there are 17 runtime PyArrow import checks throughout the pandas code base - Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as string, decimal, binary, and nested types ## Drawbacks From bb332ca9c2118d657f53bd2223a43735b5ad81aa Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 21 Apr 2023 11:24:41 -0700 Subject: [PATCH 11/29] Clarify with actual call --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 89484c4736b32..67398e28b4905 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -43,7 +43,7 @@ Additionally, requiring PyArrow would simplify the related development within pa by PyArrow including: - Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations - - Currently, there are 17 runtime PyArrow import checks throughout the pandas code base + - Currently, there are 17 `import_optiona_dependency("pyarrow")` checks throughout the pandas code base - Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as string, decimal, binary, and nested types ## Drawbacks From a8275fa01ae6f0857c3cb7664c91c00860b74e58 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 21 Apr 2023 11:24:47 -0700 Subject: [PATCH 12/29] Clarify with actual call --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 
insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 67398e28b4905..dd6f74c933750 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -43,7 +43,7 @@ Additionally, requiring PyArrow would simplify the related development within pa by PyArrow including: - Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations - - Currently, there are 17 `import_optiona_dependency("pyarrow")` checks throughout the pandas code base + - Currently, there are 17 `import_optional_dependency("pyarrow")` checks throughout the pandas code base - Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as string, decimal, binary, and nested types ## Drawbacks From b406dc1fec6fd0668877689436b4bae5fa9838f4 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 28 Apr 2023 11:14:46 -0700 Subject: [PATCH 13/29] Address some comments --- .../pdeps/0010-required-pyarrow-dependency.md | 41 +++++++++++-------- 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index dd6f74c933750..69bc9e0b756b2 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -13,19 +13,22 @@ This PDEP proposes that: - PyArrow becomes a runtime dependency starting with pandas 2.1 - The minimum version of PyArrow supported starting with pandas 2.1 is version 7 of PyArrow. 
-- The minimum version of PyArrow will be bumped every major pandas release to the highest - PyArrow version that has been released for at least 2 years, and the minimum PyArrow version will be - maintained for every minor version in the major version series. +- The minimum version of PyArrow will be bumped only during a major release of pandas. +- When the minimum version of PyArrow is bumped, PyArrow will be bumped to the highest version that has + been released for at least 2 years. ## Background PyArrow is an optional dependency of pandas that provides a wide range of supplemental features to pandas: - Since pandas 0.21.0, PyArrow provided I/O reading functionality for Parquet -- Since pandas 1.2.0, pandas integrated PyArrow into the `ExtensionArray` interface to provide an optional string data type backed by PyArrow +- Since pandas 1.2.0, pandas integrated PyArrow into the `ExtensionArray` interface to provide an + optional string data type backed by PyArrow - Since pandas 1.4.0, PyArrow provided I/0 reading functionality for CSV -- Since pandas 1.5.0, pandas provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow data types within the `ExtensionArray` interface -- Since pandas 2.0.0, all I/O readers have the option to return PyArrow-backed data types, and many methods now utilize PyArrow compute functions to +- Since pandas 1.5.0, pandas provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow + data types within the `ExtensionArray` interface +- Since pandas 2.0.0, all I/O readers have the option to return PyArrow-backed data types, and many methods + now utilize PyArrow compute functions to accelerate PyArrow-backed data in pandas, notibly string and datetime types. 
As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as: @@ -35,30 +38,36 @@ As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data repres ## Motivation -While all the functionality described in the previous paragraph is currently optional, PyArrow has significant integration into many areas -of pandas. With our roadmap noting that pandas strives for better Apache Arrow interoperability [^1] and many projects [^2], within or beyond the Python ecosystem, adopting or interacting with the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow +While all the functionality described in the previous paragraph is currently optional, PyArrow has significant +integration into many areas of pandas. With our roadmap noting that pandas strives for better Apache Arrow +interoperability [^1] and many projects [^2], within or beyond the Python ecosystem, adopting or interacting with +the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow ecosystem to pandas users. 
-Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve NumPy functionality that would be better suited -by PyArrow including: +Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve NumPy +functionality that would be better suited by PyArrow including: - Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations - Currently, there are 17 `import_optional_dependency("pyarrow")` checks throughout the pandas code base -- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as string, decimal, binary, and nested types +- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as string, + decimal, binary, and nested types ## Drawbacks -Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow using pip from wheels, numpy and pandas -are about `70MB`, and PyArrow is around `120MB`. An increase of installation size would have negative impliciation using pandas in space-constrained development -or deployment environments such as AWS Lambda. +Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow +using pip from wheels, numpy and pandas are about `70MB`, and PyArrow is around `120MB`. An increase of installation size would +have negative impliciation using pandas in space-constrained development or deployment environments such as AWS Lambda. -Additionally, if a user is installing pandas in an environment where wheels are not available and needs to build from source, the user will need to build Arrow C++ and related dependencies. 
These environments include +Additionally, if a user is installing pandas in an environment where wheels are not available through a `pip install` or `conda install`, +the user will need to also build Arrow C++ and related dependencies when installing from source. These environments include - Alpine linux (commonly used as a base for Docker containers) - WASM (pyodide and pyscript) - Python development versions -Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadance. For example when supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version before releasing a new pandas version. +Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadance. For example when +supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version +before releasing a new pandas version. 
### PDEP-1 History From ecc4d5b3aac7770875ce8af38cb8899a9f556705 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 28 Apr 2023 23:40:52 +0200 Subject: [PATCH 14/29] Update 0010-required-pyarrow-dependency.md --- .../pdeps/0010-required-pyarrow-dependency.md | 65 ++++++++++++++++++- 1 file changed, 63 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 69bc9e0b756b2..d091aa3932fcc 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -49,8 +49,69 @@ functionality that would be better suited by PyArrow including: - Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations - Currently, there are 17 `import_optional_dependency("pyarrow")` checks throughout the pandas code base -- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as string, - decimal, binary, and nested types + +- Removing unnecessary functionality: + - fastparquet engine in ``read_parquet`` + - potentially simplifying the ``read_csv`` logic (needs more investigation) + +- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as: + - decimal + - binary + - nested types (like lists, dicts, ...) + - strings + +Out of this group, strings offer the most advantages for users. 
They use significantly less memory and are faster: + +**Performance:** + +```python +import string +import random + +import pandas as pd + + +def random_string() -> str: + return "".join(random.choices(string.printable, k=random.randint(10, 100))) + + +ser_object = pd.Series([random_string() for _ in range(1_000_000)]) +ser_string = ser_object.astype("string[pyarrow]") + +``` + +PyArrow backed strings are significantly faster than NumPy object strings: + +*str.len* + +```python +In[1]: %timeit ser_object.str.len() +118 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + +In[2]: %timeit ser_string.str.len() +24.2 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) +``` + +*str.startswith* + +```python +In[3]: %timeit ser_object.str.startswith("a") +136 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + +In[4]: %timeit ser_string.str.startswith("a") +11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) +``` + +Another advantage is I/O. PyArrow engines in pandas can provide a significant speedup. Currently, the data +are cast to NumPy dtypes, which requires roundtripping when converting back to PyArrow strings explicitly, which +hinders performance. + +**Memory** + +PyArrow backed strings use significantly less memory. Dask developers investigated this [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes). + +Short summary: PyArrow strings required 1/3 of the original memory. 
+ ## Drawbacks From ec1c0e3e29adb2db07c60db434ccc00d6f075bd9 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 28 Apr 2023 23:55:20 +0200 Subject: [PATCH 15/29] Update 0010-required-pyarrow-dependency.md --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index d091aa3932fcc..e6394220c8087 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -50,7 +50,7 @@ functionality that would be better suited by PyArrow including: - Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations - Currently, there are 17 `import_optional_dependency("pyarrow")` checks throughout the pandas code base -- Removing unnecessary functionality: +- Removing unnecessary functionality: - fastparquet engine in ``read_parquet`` - potentially simplifying the ``read_csv`` logic (needs more investigation) @@ -59,7 +59,7 @@ functionality that would be better suited by PyArrow including: - binary - nested types (like lists, dicts, ...) - strings - + Out of this group, strings offer the most advantages for users. 
They use significantly less memory and are faster: **Performance:** From 23eb251e10adb66f2301b9d9633bacf3215644a8 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 28 Apr 2023 15:15:15 -0700 Subject: [PATCH 16/29] add Patrick as an author, remove constraint on only bumping during major version --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index e6394220c8087..ec7f0ecfa51a2 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -5,6 +5,7 @@ - Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711) [#52509](https://github.com/pandas-dev/pandas/issues/52509) - Author: [Matthew Roeschke](https://github.com/mroeschke) + [Patrick Hoefler](https://github.com/phofl) - Revision: 1 ## Abstract @@ -13,7 +14,6 @@ This PDEP proposes that: - PyArrow becomes a runtime dependency starting with pandas 2.1 - The minimum version of PyArrow supported starting with pandas 2.1 is version 7 of PyArrow. -- The minimum version of PyArrow will be bumped only during a major release of pandas. - When the minimum version of PyArrow is bumped, PyArrow will be bumped to the highest version that has been released for at least 2 years. 
From 2ddd82ae9948d29317edfb274a5b0d0afaaf5db2 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Mon, 8 May 2023 17:38:18 -0700 Subject: [PATCH 17/29] Change required proposal for 3.0 to be version requiring pyarrow & string dtype inference default --- .../pdeps/0010-required-pyarrow-dependency.md | 37 +++++++++++++------ 1 file changed, 25 insertions(+), 12 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index ec7f0ecfa51a2..b1f438b34a9e2 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -1,4 +1,4 @@ -# PDEP-10: PyArrow as a required dependency +# PDEP-10: PyArrow as a required dependency for default string inference implementation - Created: 17 April 2023 - Status: Under discussion @@ -12,10 +12,14 @@ This PDEP proposes that: -- PyArrow becomes a runtime dependency starting with pandas 2.1 -- The minimum version of PyArrow supported starting with pandas 2.1 is version 7 of PyArrow. +- PyArrow becomes a runtime dependency starting with pandas 3.0 +- The minimum version of PyArrow supported starting with pandas 3.0 is version 7 of PyArrow. - When the minimum version of PyArrow is bumped, PyArrow will be bumped to the highest version that has been released for at least 2 years. +- Starting in pandas 2.1, pandas raises a ``FutureWarning`` when needing to infer string data that the future + data type result will be `ArrowDtype` with `pyarrow.string` instead of object +- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string` + instead of `object` ## Background @@ -33,8 +37,19 @@ accelerate PyArrow-backed data in pandas, notibly string and datetime types. As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as: -1. 
Consistent ``NA`` support for all data types -2. Broader support of data types such as ``decimal``, ``date`` and nested types +1. Consistent `NA` support for all data types +2. Broader support of data types such as `decimal`, `date` and nested types + +Additionally, when users pass string data into pandas constructors without specifying a data type, the result data type +is `object`. With pyarrow string support available since 1.2.0, requiring pyarrow for 3.0 will allow pandas to default +the inferred type to the more efficient pyarrow string type. + +```python +In [1]: import pandas as pd + +In [2]: pd.Series(["a"]).dtype +Out[2]: dtype('O') +``` ## Motivation @@ -48,16 +63,15 @@ Additionally, requiring PyArrow would simplify the related development within pa functionality that would be better suited by PyArrow including: - Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations - - Currently, there are 17 `import_optional_dependency("pyarrow")` checks throughout the pandas code base -- Removing unnecessary functionality: - - fastparquet engine in ``read_parquet`` - - potentially simplifying the ``read_csv`` logic (needs more investigation) +- Removing redundant functionality: + - fastparquet engine in `read_parquet` + - potentially simplifying the `read_csv` logic (needs more investigation) - Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as: - decimal - binary - - nested types (like lists, dicts, ...) + - nested types (list or dict data) - strings Out of this group, strings offer the most advantages for users. 
They use significantly less memory and are faster: @@ -76,8 +90,7 @@ def random_string() -> str: ser_object = pd.Series([random_string() for _ in range(1_000_000)]) -ser_string = ser_object.astype("string[pyarrow]") - +ser_string = ser_object.astype("string[pyarrow]")\ ``` PyArrow backed strings are significantly faster than NumPy object strings: From 1b60fbb4eb133ea38f6cad730049b058e9b37ecf Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Tue, 9 May 2023 12:01:27 -0700 Subject: [PATCH 18/29] Address typos --- .../pdeps/0010-required-pyarrow-dependency.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index b1f438b34a9e2..505423f8bf439 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -12,7 +12,7 @@ This PDEP proposes that: -- PyArrow becomes a runtime dependency starting with pandas 3.0 +- PyArrow becomes a required runtime dependency starting with pandas 3.0 - The minimum version of PyArrow supported starting with pandas 3.0 is version 7 of PyArrow. - When the minimum version of PyArrow is bumped, PyArrow will be bumped to the highest version that has been released for at least 2 years. @@ -40,7 +40,7 @@ As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data repres 1. Consistent `NA` support for all data types 2. Broader support of data types such as `decimal`, `date` and nested types -Additionally, when users pass string data into pandas constructors without specifying a data type, the result data type +Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type is `object`. 
With pyarrow string support available since 1.2.0, requiring pyarrow for 3.0 will allow pandas to default the inferred type to the more efficient pyarrow string type. @@ -48,7 +48,11 @@ the inferred type to the more efficient pyarrow string type. In [1]: import pandas as pd In [2]: pd.Series(["a"]).dtype +# Current behavior Out[2]: dtype('O') + +# Future behavior in 3.0 +Out[2]: string[pyarrow] ``` ## Motivation @@ -129,8 +133,9 @@ Short summary: PyArrow strings required 1/3 of the original memory. ## Drawbacks Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow -using pip from wheels, numpy and pandas are about `70MB`, and PyArrow is around `120MB`. An increase of installation size would -have negative impliciation using pandas in space-constrained development or deployment environments such as AWS Lambda. +using pip from wheels, numpy and pandas requires about `70MB`, and including PyArrow requires around `120MB`. An increase +of installation size would have negative implication using pandas in space-constrained development or deployment environments +such as AWS Lambda. Additionally, if a user is installing pandas in an environment where wheels are not available through a `pip install` or `conda install`, the user will need to also build Arrow C++ and related dependencies when installing from source. These environments include @@ -146,6 +151,7 @@ before releasing a new pandas version. 
### PDEP-1 History - 17 April 2023: Initial version +- 8 May 2023: Changed proposal to make pyarrow required in pandas 3.0 instead of 2.1 [^1] [^2] From f047032598bbdb2c84acc4d51e70bde0643ea1a8 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Sun, 2 Jul 2023 23:18:56 +0200 Subject: [PATCH 19/29] Update 0010-required-pyarrow-dependency.md --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 505423f8bf439..31f00e441c8ef 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -16,8 +16,11 @@ This PDEP proposes that: - The minimum version of PyArrow supported starting with pandas 3.0 is version 7 of PyArrow. - When the minimum version of PyArrow is bumped, PyArrow will be bumped to the highest version that has been released for at least 2 years. -- Starting in pandas 2.1, pandas raises a ``FutureWarning`` when needing to infer string data that the future - data type result will be `ArrowDtype` with `pyarrow.string` instead of object +- The pandas 2.1 release notes will have a big warning that PyArrow will become a required dependency starting + with pandas 3.0. +- Starting in pandas 2.2, pandas raises a ``FutureWarning`` when PyArrow is not installed in the users + environment when pandas is imported. This will ensure that only one warning is raised and users can + easily silence it if necessary. 
- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string` instead of `object` From ed28c044c7d5cfff2a7c642956a5cc99b44d9ebc Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Mon, 3 Jul 2023 12:06:13 +0200 Subject: [PATCH 20/29] Update web/pandas/pdeps/0010-required-pyarrow-dependency.md Co-authored-by: Simon Hawkins --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 31f00e441c8ef..a32faa4733fd8 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -151,7 +151,7 @@ Lastly, pandas development and releases will need to be mindful of PyArrow's dev supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version before releasing a new pandas version. 
-### PDEP-1 History +### PDEP-10 History - 17 April 2023: Initial version - 8 May 2023: Changed proposal to make pyarrow required in pandas 3.0 instead of 2.1 From 99de932e4a5285bb975a9b6bda813279febea9b2 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Tue, 4 Jul 2023 19:04:42 +0200 Subject: [PATCH 21/29] Update 0010-required-pyarrow-dependency.md --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index a32faa4733fd8..ed38ccb65db9f 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -75,7 +75,7 @@ functionality that would be better suited by PyArrow including: - fastparquet engine in `read_parquet` - potentially simplifying the `read_csv` logic (needs more investigation) -- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as: +- NumPy object dtype will be avoided as much as possible. This means that every dtype that has a PyArrow equivalent is inferred automatically as such. 
This includes: - decimal - binary - nested types (list or dict data) From 99fd73941cfb768338a2ac2a857d2a149ce7cc97 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Tue, 4 Jul 2023 19:05:56 +0200 Subject: [PATCH 22/29] Update 0010-required-pyarrow-dependency.md --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index ed38ccb65db9f..5fa94bd051e3c 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -17,7 +17,7 @@ This PDEP proposes that: - When the minimum version of PyArrow is bumped, PyArrow will be bumped to the highest version that has been released for at least 2 years. - The pandas 2.1 release notes will have a big warning that PyArrow will become a required dependency starting - with pandas 3.0. + with pandas 3.0. We will pin a feedback issue on the pandas issue tracker. - Starting in pandas 2.2, pandas raises a ``FutureWarning`` when PyArrow is not installed in the users environment when pandas is imported. This will ensure that only one warning is raised and users can easily silence it if necessary. 
From 9384bc74e1d0f39ab01f2bd53c7f6c2732ad8230 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Tue, 4 Jul 2023 19:11:29 +0200 Subject: [PATCH 23/29] Update 0010-required-pyarrow-dependency.md --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 5fa94bd051e3c..f0b5e0ad46441 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -17,10 +17,11 @@ This PDEP proposes that: - When the minimum version of PyArrow is bumped, PyArrow will be bumped to the highest version that has been released for at least 2 years. - The pandas 2.1 release notes will have a big warning that PyArrow will become a required dependency starting - with pandas 3.0. We will pin a feedback issue on the pandas issue tracker. + with pandas 3.0. We will pin a feedback issue on the pandas issue tracker. The note in the release notes will point + to that issue. - Starting in pandas 2.2, pandas raises a ``FutureWarning`` when PyArrow is not installed in the users environment when pandas is imported. This will ensure that only one warning is raised and users can - easily silence it if necessary. + easily silence it if necessary. This warning will point to the feedback issue. 
- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string` instead of `object` From c3beeb3c1b6278fc9fd6c0a11290055fe8dac185 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Tue, 4 Jul 2023 23:28:56 +0200 Subject: [PATCH 24/29] Update 0010-required-pyarrow-dependency.md --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index f0b5e0ad46441..a6894b375d0f2 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -23,7 +23,7 @@ This PDEP proposes that: environment when pandas is imported. This will ensure that only one warning is raised and users can easily silence it if necessary. This warning will point to the feedback issue. - Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string` - instead of `object` + instead of `object`. Additionally, we will infer all dtypes that are listed below as well instead of storing as objec. ## Background @@ -81,6 +81,8 @@ functionality that would be better suited by PyArrow including: - binary - nested types (list or dict data) - strings + - time + - date Out of this group, strings offer the most advantages for users. 
They use significantly less memory and are faster: From 8347e832ca69134060c77dae2d05bbcf12afcc00 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Wed, 5 Jul 2023 07:53:03 +0100 Subject: [PATCH 25/29] improve structure, list user benefits more clearly, add faq --- .../pdeps/0010-required-pyarrow-dependency.md | 146 ++++++++++-------- 1 file changed, 80 insertions(+), 66 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index a6894b375d0f2..8b4b4397c2131 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -23,7 +23,10 @@ This PDEP proposes that: environment when pandas is imported. This will ensure that only one warning is raised and users can easily silence it if necessary. This warning will point to the feedback issue. - Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string` - instead of `object`. Additionally, we will infer all dtypes that are listed below as well instead of storing as objec. + instead of `object`. Additionally, we will infer all dtypes that are listed below as well instead of storing as object. + +This will bring **immediate benefits to users**, as well as opening up the door for significant further +benefits in the future. ## Background @@ -41,11 +44,23 @@ accelerate PyArrow-backed data in pandas, notibly string and datetime types. As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as: -1. Consistent `NA` support for all data types -2. Broader support of data types such as `decimal`, `date` and nested types +1. Consistent `NA` support for all data types; +2. Broader support of data types such as `decimal`, `date` and nested types; +3. Better interoperability with other dataframe libraries based on Arrow. 
+ +## Motivation + +While all the functionality described in the previous paragraph is currently optional, PyArrow has significant +integration into many areas of pandas. With our roadmap noting that pandas strives for better Apache Arrow +interoperability [^1] and many projects [^2], within or beyond the Python ecosystem, adopting or interacting with +the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow +ecosystem (as well as improving interoperability with it). + +### Immediate User Benefit 1: pyarrow strings Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type -is `object`. With pyarrow string support available since 1.2.0, requiring pyarrow for 3.0 will allow pandas to default +is `object`, which has horrendous memory and performance implications. +With pyarrow string support available since 1.2.0, requiring pyarrow for 3.0 will allow pandas to default the inferred type to the more efficient pyarrow string type. ```python @@ -59,88 +74,74 @@ Out[2]: dtype('O') Out[2]: string[pyarrow] ``` -## Motivation - -While all the functionality described in the previous paragraph is currently optional, PyArrow has significant -integration into many areas of pandas. With our roadmap noting that pandas strives for better Apache Arrow -interoperability [^1] and many projects [^2], within or beyond the Python ecosystem, adopting or interacting with -the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow -ecosystem to pandas users. 
- -Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve NumPy -functionality that would be better suited by PyArrow including: - -- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations - -- Removing redundant functionality: - - fastparquet engine in `read_parquet` - - potentially simplifying the `read_csv` logic (needs more investigation) - -- NumPy object dtype will be avoided as much as possible. This means that every dtype that has a PyArrow equivalent is inferred automatically as such. This includes: - - decimal - - binary - - nested types (list or dict data) - - strings - - time - - date - -Out of this group, strings offer the most advantages for users. They use significantly less memory and are faster: +Dask developers investigated performance and memory of pyarrow strings [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes), +and found them to be a significant improvement over the current `object` dtype. -**Performance:** +### Immediate User Benefit 2: Nested Datatypes +Currently, if you try storing `dict`s in a pandas `Series`, you will again get the horrendeous `object` dtype: ```python -import string -import random - -import pandas as pd - - -def random_string() -> str: - return "".join(random.choices(string.printable, k=random.randint(10, 100))) - - -ser_object = pd.Series([random_string() for _ in range(1_000_000)]) -ser_string = ser_object.astype("string[pyarrow]")\ +In [6]: pd.Series([{'a': 1, 'b': 2}, {'a': 2, 'b': 99}]) +Out[6]: +0 {'a': 1, 'b': 2} +1 {'a': 2, 'b': 99} +dtype: object ``` -PyArrow backed strings are significantly faster than NumPy object strings: +If `pyarrow` were required, this could have been auto-inferred to be `pyarrow.struct`, which again +would come with memory and performance improvements. 
-*str.len* +### Immediate User Benefit 3: Interoperability +Other Arrow-backed dataframe libraries are growing in popularity. Having the same memory representation +would improve interoperability with them, as operations such as: ```python -In[1]: %timeit ser_object.str.len() -118 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) - -In[2]: %timeit ser_string.str.len() -24.2 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) +import pandas as pd +import polars as pl + +df = pd.DataFrame( + { + 'a': ['one', 'two'], + 'b': [{'name': 'Billy', 'age': 3}, {'name': 'Bob', 'age': 4}], + } +) +pl.from_pandas(df) ``` +could be zero-copy. Users making use of multiple dataframe libraries would more easily be able to +switch between them. -*str.startswith* +### Future User Benefits: -```python -In[3]: %timeit ser_object.str.startswith("a") -136 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) - -In[4]: %timeit ser_string.str.startswith("a") -11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) -``` +Requiring PyArrow would simplify the related development within pandas and potentially improve NumPy +functionality that would be better suited by PyArrow including: -Another advantage is I/O. PyArrow engines in pandas can provide a significant speedup. Currently, the data -are cast to NumPy dtypes, which requires roundtripping when converting back to PyArrow strings explicitly, which -hinders performance. +- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations -**Memory** +- NumPy object dtype will be avoided as much as possible. This means that every dtype that has a PyArrow equivalent is inferred automatically as such. This includes: + - decimal + - binary + - nested types (list or dict data) + - strings + - time + - date -PyArrow backed strings use significantly less memory. 
Dask developers investigated this [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes). +#### Developer benefits -Short summary: PyArrow strings required 1/3 of the original memory. +First, this would simplify development of pyarrow-backed datatypes, as it would avoid +optional dependency checks. +Second, it could potentially remove redundant functionality: +- fastparquet engine in `read_parquet`; +- potentially simplifying the `read_csv` logic (needs more investigation); +- MaskedDtypes/Arrays; +- factorization; +- datetime/timezone ops. ## Drawbacks Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow -using pip from wheels, numpy and pandas requires about `70MB`, and including PyArrow requires around `120MB`. An increase -of installation size would have negative implication using pandas in space-constrained development or deployment environments +using pip from wheels, numpy and pandas requires about `70MB`, and including PyArrow requires an additional `120MB`. +An increase of installation size would have negative implication using pandas in space-constrained development or deployment environments such as AWS Lambda. Additionally, if a user is installing pandas in an environment where wheels are not available through a `pip install` or `conda install`, @@ -154,6 +155,19 @@ Lastly, pandas development and releases will need to be mindful of PyArrow's dev supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version before releasing a new pandas version. +## F.A.Q. + +**Q: Why can't pandas just use numpy string and numpy void datatypes instead of pyarrow string and pyarrow struct?** + +**A**: NumPy strings aren't yet available, whereas pyarrow strings are. NumPy void datatype would be different to pyarrow struct, + not bringing the same interoperabitlity benefit with other arrow-based dataframe libraries. 
+ +**Q: Are all pyarrow dtypes ready? Isn't it too soon to make them the default?** + +**A**: We're not making them the default (yet). For example, `pd.Series([1, 2, 3])` will continue to be auto-inferred to be + `np.int64`. We will only change the default for dtypes which currently have no `numpy`-backed equivalent and which are + stored as `object` dtype, such as strings and nested datatypes. + ### PDEP-10 History - 17 April 2023: Initial version From d740403cb7d4520a88331d5f21f7d41f6f2f7d23 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Wed, 5 Jul 2023 09:22:31 +0100 Subject: [PATCH 26/29] restore little demo --- .../pdeps/0010-required-pyarrow-dependency.md | 38 +++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 8b4b4397c2131..5101e4982b85d 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -77,6 +77,44 @@ Out[2]: string[pyarrow] Dask developers investigated performance and memory of pyarrow strings [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes), and found them to be a significant improvement over the current `object` dtype. +Little demo: +```python +import string +import random + +import pandas as pd + + +def random_string() -> str: + return "".join(random.choices(string.printable, k=random.randint(10, 100))) + + +ser_object = pd.Series([random_string() for _ in range(1_000_000)]) +ser_string = ser_object.astype("string[pyarrow]")\ +``` + +PyArrow backed strings are significantly faster than NumPy object strings: + +*str.len* + +```python +In[1]: %timeit ser_object.str.len() +118 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + +In[2]: %timeit ser_string.str.len() +24.2 ms ± 187 µs per loop (mean ± std. dev. 
of 7 runs, 10 loops each) +``` + +*str.startswith* + +```python +In[3]: %timeit ser_object.str.startswith("a") +136 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + +In[4]: %timeit ser_string.str.startswith("a") +11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) +``` + ### Immediate User Benefit 2: Nested Datatypes Currently, if you try storing `dict`s in a pandas `Series`, you will again get the horrendeous `object` dtype: From 959873e708eb930c52b69234b645e2500389ccfd Mon Sep 17 00:00:00 2001 From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Wed, 5 Jul 2023 09:23:57 +0100 Subject: [PATCH 27/29] remove masked part, note that pyarrow dtyeps will likely be ready by 3 --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index 5101e4982b85d..f2f8a86f22fd8 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -171,7 +171,6 @@ optional dependency checks. Second, it could potentially remove redundant functionality: - fastparquet engine in `read_parquet`; - potentially simplifying the `read_csv` logic (needs more investigation); -- MaskedDtypes/Arrays; - factorization; - datetime/timezone ops. @@ -202,7 +201,8 @@ before releasing a new pandas version. **Q: Are all pyarrow dtypes ready? Isn't it too soon to make them the default?** -**A**: We're not making them the default (yet). For example, `pd.Series([1, 2, 3])` will continue to be auto-inferred to be +**A**: They will likely be ready by 3.0 - however, we're not making them the default (yet). + For example, `pd.Series([1, 2, 3])` will continue to be auto-inferred to be `np.int64`. 
We will only change the default for dtypes which currently have no `numpy`-backed equivalent and which are stored as `object` dtype, such as strings and nested datatypes. From 2db0037b10aaa14994b307cbe64ff82b7c1dc260 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Thu, 13 Jul 2023 09:21:49 -0500 Subject: [PATCH 28/29] Update 0010-required-pyarrow-dependency.md --- web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md index f2f8a86f22fd8..3fc911dd54018 100644 --- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md +++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md @@ -59,7 +59,7 @@ ecosystem (as well as improving interoperability with it). ### Immediate User Benefit 1: pyarrow strings Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type -is `object`, which has horrendous memory and performance implications. +is `object`, which has significantly much worse memory usage and performance as compared to pyarrow strings. With pyarrow string support available since 1.2.0, requiring pyarrow for 3.0 will allow pandas to default the inferred type to the more efficient pyarrow string type. 
From 4e0515153dabe3b929933b61e094ab3fc5e48afc Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Sun, 30 Jul 2023 17:33:29 +0200
Subject: [PATCH 29/29] Update 0010-required-pyarrow-dependency.md

---
 web/pandas/pdeps/0010-required-pyarrow-dependency.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0010-required-pyarrow-dependency.md b/web/pandas/pdeps/0010-required-pyarrow-dependency.md
index 3fc911dd54018..4d6e928ce68bd 100644
--- a/web/pandas/pdeps/0010-required-pyarrow-dependency.md
+++ b/web/pandas/pdeps/0010-required-pyarrow-dependency.md
@@ -1,7 +1,7 @@
 # PDEP-10: PyArrow as a required dependency for default string inference implementation
 
 - Created: 17 April 2023
-- Status: Under discussion
+- Status: Accepted
 - Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711)
               [#52509](https://github.com/pandas-dev/pandas/issues/52509)
 - Author: [Matthew Roeschke](https://github.com/mroeschke)