From fbeb69db49e659bc39ddd445264591d134c263f1 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Fri, 3 May 2024 17:12:42 +0200 Subject: [PATCH 01/24] PDEP: Dedicated string data type for pandas 3.0 --- web/pandas/pdeps/00xx-string-dtype.md | 245 ++++++++++++++++++++++++++ 1 file changed, 245 insertions(+) create mode 100644 web/pandas/pdeps/00xx-string-dtype.md diff --git a/web/pandas/pdeps/00xx-string-dtype.md b/web/pandas/pdeps/00xx-string-dtype.md new file mode 100644 index 0000000000000..3be59fce11e1b --- /dev/null +++ b/web/pandas/pdeps/00xx-string-dtype.md @@ -0,0 +1,245 @@ +# PDEP-XX: Dedicated string data type for pandas 3.0 + +- Created: May 3, 2024 +- Status: Under discussion +- Discussion: +- Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche) +- Revision: 1 + +## Abstract + +This PDEP proposes to introduce a dedicated string dtype that will be used by +default in pandas 3.0: + +* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available + or otherwise the numpy object-dtype alternative. +* The default string dtype will use missing value semantics using NaN consistent + with the other default data types. + +This will give users a long-awaited proper string dtype for 3.0, while 1) not +(yet) making PyArrow a _hard_ dependency, but only a dependency used by default, +and 2) leaving room for future improvements (different missing value semantics, +using NumPy 2.0, etc). + +# Dedicated string data type for pandas 3.0 + +## Background + +Currently, pandas by default stores text data in an `object`-dtype NumPy array. +The current implementation has two primary drawbacks: First, `object`-dtype is +not specific to strings: any Python object can be stored in an `object`-dtype +array, not just strings, and seeing `object` as the dtype for a column with +strings is confusing for users. 
Second: this is not efficient (all string +methods on a Series are eventually done by calling Python methods on the +individual string objects). + +To solve the first issue, a dedicated extension dtype for string data has +already been +[added in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#dedicated-string-data-type). +This has always been opt-in for now, requiring users to explicitly request the +dtype (with `dtype="string"` or `dtype=pd.StringDtype()`). The array backing +this string dtype was initially almost the same as the default implementation, +i.e. an `object`-dtype NumPy array of Python strings. + +To solve the second issue (performance), pandas contributed to the development +of string kernels in the PyArrow package, and a variant of the string dtype +backed by PyArrow was +[added in pandas 1.3](https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type). +This could be specified with the `storage` keyword in the opt-in string dtype +(`pd.StringDtype(storage="pyarrow")`). + +Since its introduction, the `StringDtype` has always been opt-in, and has used +the experimental `pd.NA` sentinel for missing values (which was also [introduced +in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)). +However, up to this date, pandas has not yet made the step to use `pd.NA` by +default. + +In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html) +proposed to start using a PyArrow-backed string dtype by default in pandas 3.0 +(i.e. infer this type for string data instead of object dtype). To ensure we +could use the variant of `StringDtype` backed by PyArrow instead of Python +objects (for better performance), it proposed to make `pyarrow` a new required +runtime dependency of pandas. 
+ +In the meantime, NumPy has also been working on a native variable-width string +data type, which will be available [starting with NumPy +2.0](https://numpy.org/devdocs/release/2.0.0-notes.html#stringdtype-has-been-added-to-numpy). +This can provide a potential alternative to PyArrow for implementing a string +data type in pandas that is not backed by Python objects. + +After acceptance of PDEP-10, two aspects of the proposal have been under +reconsideration: + +- Based on user feedback, it has been considered to relax the new `pyarrow` + requirement to not be a _hard_ runtime dependency. In addition, NumPy 2.0 can + potentially reduce the need to make PyArrow a required dependency specifically + for a dedicated pandas string dtype. +- The PDEP did not consider the usage of the experimental `pd.NA` as a + consequence of adopting one of the existing implementations of the + `StringDtype`. + +For the second aspect, another variant of the `StringDtype` was +[introduced in pandas 2.1](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings) +that is still backed by PyArrow but follows the default missing values semantics +pandas uses for all other default data types (and using `NaN` as the missing +value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)). +At the time, the `storage` option for this new variant was called +`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using `pd.NA`. + +This last dtype variant is what you currently (pandas 2.2) get for string data +when enabling the ``future.infer_string`` option (to enable the behaviour which +is intended to become the default in pandas 3.0). + +## Proposal + +To be able to move forward with a string data type in pandas 3.0, this PDEP proposes: + +1. For pandas 3.0, we enable a "string" dtype by default, which will use PyArrow + if installed, and otherwise falls back to an in-house functionally-equivalent + (but slower) version. +2. 
This default "string" dtype will follow the same behaviour for missing values + as our other default data types, and use `NaN` as the missing value sentinel. +3. The version that is not backed by PyArrow can reuse the existing numpy + object-dtype backed StringArray for its implementation. +4. We update installation guidelines to clearly encourage users to install + pyarrow for the default user experience. + +### Default inference of a string dtype + +By default, pandas will infer this new string dtype for string data (when +creating pandas objects, such as in constructors or IO functions). + +The existing `future.infer_string` option can be used to opt-in to the future +default behaviour: + +```python +>>> pd.options.future.infer_string = True +>>> pd.Series(["a", "b", None]) +0 a +1 b +2 NaN +dtype: string +``` + +This option will be expanded to also work when PyArrow is not installed. + +### Missing value semantics + +Given that all other default data types uses NaN semantics for missing values, +this proposal says that a new default string dtype should still use the same +default semantics. Further, it should result in default data types when doing +operations on the string column that result in a boolean or numeric data type +(e.g., methods like `.str.startswith(..)` or `.str.len(..)`, or comparison +operators like `==`, should result in default `int64` and `bool` data types). + +Because the current original `StringDtype` implementations already use `pd.NA` +and return masked integer and boolean arrays in operations, a new variant of the +existing dtypes that uses `NaN` and default data types is needed. + +### Object-dtype "fallback" implementation + +To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep +a "fallback" option in case PyArrow is not installed. 
The original `StringDtype` +backed by a numpy object-dtype array of Python strings can be used for this, and +only need minor updates to follow the above-mentioned missing value semantics +([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)). + +For pandas 3.0, this is the most realistic option given this implementation is +already available for a long time. Beyond 3.0, we can still explore further +improvements such as using nanoarrow or NumPy 2.0, but at that point that is an +implementation detail that should not have a direct impact on users (except for +performance). + +### Naming + +Given the long history of this topic, the naming of the dtypes is a difficult +topic. + +In the first place, we need to acknowledge that most users should not need to +use storage-specific options. Users are expected to specify `pd.StringDtype()` +or `"string"`, and that will give them their default string dtype (which +depends on whether PyArrow is installed or not). + +But for testing purposes and advanced use cases that want control over this, we +need some way to specify this and distinguish them from the other string dtypes. +Currently, the `StringDtype(storage="pyarrow_numpy")` is used, where +"pyarrow_numpy" is a rather confusing option. + +TODO see if we can come up with a better naming scheme + +## Alternatives + +### Why not delay introducing a default string dtype? + +To avoid introducing a new string dtype while other discussions and changes are +in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as +the default missing value sentinel? using the new NumPy 2.0 capabilities?), we +could also delay introducing a default string dtype until there is more clarity +for those other discussions. + +However: + +1. Delaying has a cost: it further postpones introducing a dedicated string + dtype that has massive benefits for our users, both in usability as (for the + significant part of the user base that has PyArrow installed) in performance. +2. 
In case we eventually transition to use `pd.NA` as the default missing value + sentinel, we will need a migration path for _all_ our data types, and thus + the challenges around this will not be unique to the string dtype. + +### Why not use the existing StringDtype with `pd.NA`? + +Because adding even more variants of the string dtype will make things only more +confusing? Indeed, this proposal unfortunately introduces more variants of the +string dtype. However, the reason for this is to ensure the actual default user +experience is _less_ confusing, and the new string dtype fits better with the +other default data types. + +If the new default string data type would use `pd.NA`, then after some +operations, a user can easily end up with a DataFrame that mixes columns using +`NaN` semantics and columns using `NA` semantics (and thus a DataFrame that +could have columns with two different int64, two different float64, two different +bool, etc dtypes). This would lead to a very confusing default experience. + +With the proposed new variant of the StringDtype, this will ensure that for the +_default_ experience, a user will only see only 1 kind of integer dtype, only +kind of 1 bool dtype, etc. For now, a user should only get columns with an +`ArrowDtype` and/or using `pd.NA` when explicitly opting into this. + +## Backward compatibility + +The most visible backwards incompatible change will be that columns with string +data will no longer have an `object` dtype. Therefore, code that assumes +`object` dtype (such as `ser.dtype == object`) will need to be updated. + +To allow testing your code in advance, the +`pd.options.future.infer_string = True` option is available. + +Otherwise, the actual string-specific functionality (such as the `.str` accessor +methods) should all keep working as is. By preserving the current missing value +semantics, this proposal is also backwards compatible on this aspect. 
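
As a hedged illustration of the `ser.dtype == object` update mentioned above, a dtype check can be written so that it does not depend on which dtype pandas infers for strings. This is a minimal sketch, not part of the PDEP: the helper name `is_text_column` is hypothetical, and the choice of the existing `pd.api.types.is_string_dtype` function as the replacement check is an assumption that may not suit every code base (for example, columns that mix strings with other objects).

```python
import pandas as pd


def is_text_column(ser: pd.Series) -> bool:
    """Return True for a column of strings, whether pandas stores it
    as object dtype or as a dedicated string dtype."""
    # Hypothetical helper: avoids hard-coding `ser.dtype == object`,
    # which stops matching once string data is inferred as a string dtype.
    return pd.api.types.is_string_dtype(ser)


names = pd.Series(["a", "b", "c"])   # string data, dtype depends on the pandas version
counts = pd.Series([1, 2, 3])        # integer data

print(is_text_column(names))
print(is_text_column(counts))
```

The same check can be run with `pd.options.future.infer_string = True` to see whether code paths that branch on the dtype keep behaving as before.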
+ +One other backwards incompatible change is present for early adopters of the +existing `StringDtype`. In pandas 3.0, calling `pd.StringDtype()` will start +returning the new default string dtype, while up to now this returned the +experimental string dtype using `pd.NA` introduced in pandas 1.0. Those users +will need to start specifying a keyword in the dtype constructor if they want to +keep using `pd.NA` (but if they just want to have a dedicated string dtype, they +don't need to change their code). + +## Timeline + +The future PyArrow-backed string dtype was already made available behind a feature +flag in pandas 2.1 (by `pd.options.future.infer_string = True`). + +Some small enhancements or fixes (or naming changes) might still be needed and +can be backported to pandas 2.2.x. + +The variant using numpy object-dtype could potentially also be backported to +2.2.x to allow easier testing. + +For pandas 3.0, this flag becomes enabled by default. + + +## PDEP-XX History + +- 3 May 2024: Initial version From f03f54d8c67a011a62779a363798d0dc2d9cf0f4 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Fri, 3 May 2024 18:41:02 +0200 Subject: [PATCH 02/24] small textual edits and typos --- web/pandas/pdeps/00xx-string-dtype.md | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/web/pandas/pdeps/00xx-string-dtype.md b/web/pandas/pdeps/00xx-string-dtype.md index 3be59fce11e1b..a96a9f96e9b9d 100644 --- a/web/pandas/pdeps/00xx-string-dtype.md +++ b/web/pandas/pdeps/00xx-string-dtype.md @@ -13,7 +13,7 @@ default in pandas 3.0: * In pandas 3.0, enable a "string" dtype by default, using PyArrow if available or otherwise the numpy object-dtype alternative. -* The default string dtype will use missing value semantics using NaN consistent +* The default string dtype will use missing value semantics (using NaN) consistent with the other default data types. 
This will give users a long-awaited proper string dtype for 3.0, while 1) not @@ -26,12 +26,12 @@ using NumPy 2.0, etc). ## Background Currently, pandas by default stores text data in an `object`-dtype NumPy array. -The current implementation has two primary drawbacks: First, `object`-dtype is +The current implementation has two primary drawbacks. First, `object` dtype is not specific to strings: any Python object can be stored in an `object`-dtype array, not just strings, and seeing `object` as the dtype for a column with strings is confusing for users. Second: this is not efficient (all string -methods on a Series are eventually done by calling Python methods on the -individual string objects). +methods on a Series are eventually calling Python methods on the individual +string objects). To solve the first issue, a dedicated extension dtype for string data has already been @@ -51,8 +51,9 @@ This could be specified with the `storage` keyword in the opt-in string dtype Since its introduction, the `StringDtype` has always been opt-in, and has used the experimental `pd.NA` sentinel for missing values (which was also [introduced in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)). -However, up to this date, pandas has not yet made the step to use `pd.NA` by -default. +However, up to this date, pandas has not yet taken the step to use `pd.NA` by +default, and thus the `StringDtype` deviates in missing value behaviour compared +to the default data types. In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html) proposed to start using a PyArrow-backed string dtype by default in pandas 3.0 @@ -125,15 +126,15 @@ This option will be expanded to also work when PyArrow is not installed. 
### Missing value semantics -Given that all other default data types uses NaN semantics for missing values, +Given that all other default data types use NaN semantics for missing values, this proposal says that a new default string dtype should still use the same default semantics. Further, it should result in default data types when doing operations on the string column that result in a boolean or numeric data type (e.g., methods like `.str.startswith(..)` or `.str.len(..)`, or comparison operators like `==`, should result in default `int64` and `bool` data types). -Because the current original `StringDtype` implementations already use `pd.NA` -and return masked integer and boolean arrays in operations, a new variant of the +Because the original `StringDtype` implementations already use `pd.NA` and +return masked integer and boolean arrays in operations, a new variant of the existing dtypes that uses `NaN` and default data types is needed. ### Object-dtype "fallback" implementation @@ -175,7 +176,7 @@ To avoid introducing a new string dtype while other discussions and changes are in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as the default missing value sentinel? using the new NumPy 2.0 capabilities?), we could also delay introducing a default string dtype until there is more clarity -for those other discussions. +in those other discussions. However: @@ -184,11 +185,12 @@ However: significant part of the user base that has PyArrow installed) in performance. 2. In case we eventually transition to use `pd.NA` as the default missing value sentinel, we will need a migration path for _all_ our data types, and thus - the challenges around this will not be unique to the string dtype. + the challenges around this will not be unique to the string dtype and + therefore not a reason to delay this. ### Why not use the existing StringDtype with `pd.NA`? 
-Because adding even more variants of the string dtype will make things only more +Wouldn't adding even more variants of the string dtype will make things only more confusing? Indeed, this proposal unfortunately introduces more variants of the string dtype. However, the reason for this is to ensure the actual default user experience is _less_ confusing, and the new string dtype fits better with the From 561de87b27fbbfdac37e122c120431e6712bb264 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Sun, 5 May 2024 13:55:06 +0200 Subject: [PATCH 03/24] address part of the feedback --- web/pandas/pdeps/00xx-string-dtype.md | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/web/pandas/pdeps/00xx-string-dtype.md b/web/pandas/pdeps/00xx-string-dtype.md index a96a9f96e9b9d..705da434caabf 100644 --- a/web/pandas/pdeps/00xx-string-dtype.md +++ b/web/pandas/pdeps/00xx-string-dtype.md @@ -2,7 +2,7 @@ - Created: May 3, 2024 - Status: Under discussion -- Discussion: +- Discussion: https://github.com/pandas-dev/pandas/pull/58551 - Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche) - Revision: 1 @@ -71,10 +71,11 @@ data type in pandas that is not backed by Python objects. After acceptance of PDEP-10, two aspects of the proposal have been under reconsideration: -- Based on user feedback, it has been considered to relax the new `pyarrow` - requirement to not be a _hard_ runtime dependency. In addition, NumPy 2.0 can - potentially reduce the need to make PyArrow a required dependency specifically - for a dedicated pandas string dtype. +- Based on user feedback (mostly around installation complexity and size), it + has been considered to relax the new `pyarrow` requirement to not be a _hard_ + runtime dependency. In addition, NumPy 2.0 could in the future potentially + reduce the need to make PyArrow a required dependency specifically for a + dedicated pandas string dtype. 
- The PDEP did not consider the usage of the experimental `pd.NA` as a consequence of adopting one of the existing implementations of the `StringDtype`. @@ -105,6 +106,9 @@ To be able to move forward with a string data type in pandas 3.0, this PDEP prop 4. We update installation guidelines to clearly encourage users to install pyarrow for the default user experience. +Those string dtypes enabled by default will then no longer be considered as +experimental. + ### Default inference of a string dtype By default, pandas will infer this new string dtype for string data (when @@ -141,15 +145,17 @@ existing dtypes that uses `NaN` and default data types is needed. To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep a "fallback" option in case PyArrow is not installed. The original `StringDtype` -backed by a numpy object-dtype array of Python strings can be used for this, and -only need minor updates to follow the above-mentioned missing value semantics +backed by a numpy object-dtype array of Python strings can be mostly reused for +this (adding a new variant of the dtype) and a new `StringArray` subclass only +needs minor changes to follow the above-mentioned missing value semantics ([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)). For pandas 3.0, this is the most realistic option given this implementation is already available for a long time. Beyond 3.0, we can still explore further -improvements such as using nanoarrow or NumPy 2.0, but at that point that is an -implementation detail that should not have a direct impact on users (except for -performance). +improvements such as using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503)) +or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)), +but at that point that is an implementation detail that should not have a +direct impact on users (except for performance). 
### Naming From 86f4e51bfc65f68866b9714409f06b9f3e136919 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Sun, 5 May 2024 13:56:16 +0200 Subject: [PATCH 04/24] Update web/pandas/pdeps/00xx-string-dtype.md Co-authored-by: Simon Hawkins --- web/pandas/pdeps/00xx-string-dtype.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/00xx-string-dtype.md b/web/pandas/pdeps/00xx-string-dtype.md index 705da434caabf..cd95845f9a3af 100644 --- a/web/pandas/pdeps/00xx-string-dtype.md +++ b/web/pandas/pdeps/00xx-string-dtype.md @@ -101,7 +101,7 @@ To be able to move forward with a string data type in pandas 3.0, this PDEP prop (but slower) version. 2. This default "string" dtype will follow the same behaviour for missing values as our other default data types, and use `NaN` as the missing value sentinel. -3. The version that is not backed by PyArrow can reuse the existing numpy +3. The version that is not backed by PyArrow can reuse (with minor code additions) the existing numpy object-dtype backed StringArray for its implementation. 4. We update installation guidelines to clearly encourage users to install pyarrow for the default user experience. 
From 30c7b4337918940f104096bad5ac2bfe199c3be9 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 10:59:06 +0200 Subject: [PATCH 05/24] rename file --- web/pandas/pdeps/{00xx-string-dtype.md => 0014-string-dtype.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename web/pandas/pdeps/{00xx-string-dtype.md => 0014-string-dtype.md} (99%) diff --git a/web/pandas/pdeps/00xx-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md similarity index 99% rename from web/pandas/pdeps/00xx-string-dtype.md rename to web/pandas/pdeps/0014-string-dtype.md index cd95845f9a3af..2741b97df51f4 100644 --- a/web/pandas/pdeps/00xx-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -1,4 +1,4 @@ -# PDEP-XX: Dedicated string data type for pandas 3.0 +# PDEP-14: Dedicated string data type for pandas 3.0 - Created: May 3, 2024 - Status: Under discussion From 54a43b3e4fa2d2e90cca5a35637f7f816981f29b Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 11:36:02 +0200 Subject: [PATCH 06/24] expand Missing value semantics section --- web/pandas/pdeps/0014-string-dtype.md | 53 ++++++++++++++++++++------- 1 file changed, 39 insertions(+), 14 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index 2741b97df51f4..c723280add4ef 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -101,8 +101,9 @@ To be able to move forward with a string data type in pandas 3.0, this PDEP prop (but slower) version. 2. This default "string" dtype will follow the same behaviour for missing values as our other default data types, and use `NaN` as the missing value sentinel. -3. The version that is not backed by PyArrow can reuse (with minor code additions) the existing numpy - object-dtype backed StringArray for its implementation. +3. 
The version that is not backed by PyArrow can reuse (with minor code + additions) the existing numpy object-dtype backed StringArray for its + implementation. 4. We update installation guidelines to clearly encourage users to install pyarrow for the default user experience. @@ -111,8 +112,9 @@ experimental. ### Default inference of a string dtype -By default, pandas will infer this new string dtype for string data (when -creating pandas objects, such as in constructors or IO functions). +By default, pandas will infer this new string dtype instead of object dtype for +string data (when creating pandas objects, such as in constructors or IO +functions). The existing `future.infer_string` option can be used to opt-in to the future default behaviour: @@ -130,16 +132,39 @@ This option will be expanded to also work when PyArrow is not installed. ### Missing value semantics -Given that all other default data types use NaN semantics for missing values, -this proposal says that a new default string dtype should still use the same -default semantics. Further, it should result in default data types when doing -operations on the string column that result in a boolean or numeric data type -(e.g., methods like `.str.startswith(..)` or `.str.len(..)`, or comparison -operators like `==`, should result in default `int64` and `bool` data types). +As mentioned in the background section, the original `StringDtype` has used +the experimental `pd.NA` sentinel for missing values. In addition to using +`pd.NA` as the scalar for a missing value, this essentially means +that: + +- String columns follow ["NA-semantics"](https://pandas.pydata.org/docs/user_guide/missing_data.html#na-semantics) + for missing values, where `NA` propagates in boolean operations such as + comparisons or predicates. +- Operations on the string column that give a numeric or boolean result use the + nullable Integer/Float/Boolean data types (e.g. 
`ser.str.len()` returns the
+  nullable `"Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
+  dtype (or `float64` in case of missing values)).
+
+However, up to this date, all other default data types still use NaN semantics
+for missing values. Therefore, this proposal says that a new default string
+dtype should also still use the same default missing value semantics and return
+default data types when doing operations on the string column, to be consistent
+with the other default dtypes at this point.
+
+In practice, this means that the default `"string"` dtype will use `NaN` as
+the missing value sentinel, and:
+
+- String columns will follow NaN-semantics for missing values, where `NaN` gives
+  False in boolean operations such as comparisons or predicates.
+- Operations on the string column that give a numeric or boolean result will use
+  the default data types (i.e. numpy `int64`/`float64`/`bool`).
 
 Because the original `StringDtype` implementations already use `pd.NA` and
 return masked integer and boolean arrays in operations, a new variant of the
-existing dtypes that uses `NaN` and default data types is needed.
+existing dtypes that uses `NaN` and default data types is needed. The original
+variant of `StringDtype` using `pd.NA` will still be available for those who
+want to keep using it (see below in the "Naming" subsection for how to specify
+this).
 
 ### Object-dtype "fallback" implementation
 
@@ -196,7 +221,7 @@ However:
 
 ### Why not use the existing StringDtype with `pd.NA`?
 
-Wouldn't adding even more variants of the string dtype will make things only more
+Wouldn't adding even more variants of the string dtype make things only more
 confusing? Indeed, this proposal unfortunately introduces more variants of the
 string dtype. However, the reason for this is to ensure the actual default user
 experience is _less_ confusing, and the new string dtype fits better with the
@@ -210,8 +235,8 @@ bool, etc dtypes).
This would lead to a very confusing default experience. With the proposed new variant of the StringDtype, this will ensure that for the _default_ experience, a user will only see only 1 kind of integer dtype, only -kind of 1 bool dtype, etc. For now, a user should only get columns with an -`ArrowDtype` and/or using `pd.NA` when explicitly opting into this. +kind of 1 bool dtype, etc. For now, a user should only get columns using `pd.NA` +when explicitly opting into this. ## Backward compatibility From 5b5835b2902a0d3d8db8ac6ce5e6357e250d3549 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 12:12:20 +0200 Subject: [PATCH 07/24] expand Naming subsection with storage+na_value proposal --- web/pandas/pdeps/0014-string-dtype.md | 66 ++++++++++++++++++++++++--- 1 file changed, 59 insertions(+), 7 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index c723280add4ef..b2e51795d8dfd 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -21,8 +21,6 @@ This will give users a long-awaited proper string dtype for 3.0, while 1) not and 2) leaving room for future improvements (different missing value semantics, using NumPy 2.0, etc). -# Dedicated string data type for pandas 3.0 - ## Background Currently, pandas by default stores text data in an `object`-dtype NumPy array. @@ -86,7 +84,9 @@ that is still backed by PyArrow but follows the default missing values semantics pandas uses for all other default data types (and using `NaN` as the missing value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)). At the time, the `storage` option for this new variant was called -`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using `pd.NA`. +`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using +`pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming" +subsection below). 
 This last dtype variant is what you currently (pandas 2.2) get for string data
 when enabling the ``future.infer_string`` option (to enable the behaviour which
 is intended to become the default in pandas 3.0).
@@ -194,10 +194,49 @@ depends on whether PyArrow is installed or not).
 
 But for testing purposes and advanced use cases that want control over this, we
 need some way to specify this and distinguish them from the other string dtypes.
-Currently, the `StringDtype(storage="pyarrow_numpy")` is used, where
-"pyarrow_numpy" is a rather confusing option.
-
-TODO see if we can come up with a better naming scheme
+In addition, users that want to continue using the original NA-variant of the
+dtype need a way to specify this.
+
+Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where
+the `"pyarrow_numpy"` storage was used to disambiguate from the existing
+`"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather
+confusing option and doesn't generalize well. Therefore, this PDEP proposes
+a new naming scheme as outlined below, and we will deprecate and remove
+"pyarrow_numpy" before pandas 3.0.
+
+The `storage` keyword of `StringDtype` is kept to disambiguate the underlying
+storage of the string data (using pyarrow or python objects), but an additional
+`na_value` is introduced to disambiguate the variants using NA semantics
+and NaN semantics.
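
The two missing value semantics that `na_value` distinguishes can be sketched with a small comparison. This is a hedged illustration rather than part of the PDEP: it deliberately uses numeric columns (numpy `float64` versus the nullable `"Int64"`), on the assumption that their behaviour is stable across current pandas versions, whereas the meaning of `dtype="string"` is exactly what this proposal changes.

```python
import numpy as np
import pandas as pd

# NaN semantics (default dtypes): comparing a missing value gives False,
# and the result is a plain numpy bool column.
default_col = pd.Series([1.0, np.nan])
default_result = default_col == 1.0

# NA semantics (opt-in nullable dtypes): the missing value propagates as
# pd.NA, and the result is a nullable "boolean" column.
nullable_col = pd.Series([1, None], dtype="Int64")
nullable_result = nullable_col == 1

print(default_result.dtype, default_result.tolist())
print(nullable_result.dtype, nullable_result.tolist())
```

Under the proposal, the default string dtype would behave like the first case, and a `StringDtype` created with `na_value=pd.NA` like the second.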
+
+Overview of the different ways to specify a dtype and the resulting concrete
+dtype of the data:
+
+| User specification | Concrete dtype | String alias | Note |
+|----------------------------------------|---------------------------------------------------|-------------------------|------|
+| Unspecified (inference) | `StringDtype(storage="pyarrow"|"python", na_value=np.nan)` | "string" | (1) |
+| `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow"|"python", na_value=np.nan)` | "string" | (1), (2) |
+| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) |
+| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) |
+| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | |
+| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | |
+| `StringDtype(na_value=pd.NA)` | `StringDtype(storage="pyarrow"|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) |
+| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) |
+
+Notes:
+
+- (1) You get "pyarrow" or "python" depending on pyarrow being installed.
+- (2) Those three rows are backwards incompatible (i.e. they work now but give
+  you the NA-variant), see the "Backward compatibility" section below.
+- (3) "pyarrow_numpy" is kept temporarily because this is already in a released
+  version, but we can deprecate it in 2.2.x and have it removed for 3.0.
+
+For the new default string dtype, only the `"string"` alias can be used to
+specify the dtype as a string, i.e. we would not provide a way to make the
+underlying storage (pyarrow or python) explicit through the string alias.
This +string alias is only a convenience shortcut and for most users `"string"` is +sufficient (they don't need to specify the storage), and the explicit +`pd.StringDtype(...)` is still available for more fine-grained control. ## Alternatives @@ -238,6 +277,19 @@ _default_ experience, a user will only see only 1 kind of integer dtype, only kind of 1 bool dtype, etc. For now, a user should only get columns using `pd.NA` when explicitly opting into this. +### Naming alternatives + +This PDEP now keeps the `pd.StringDtype` class constructor with the existing +`storage` keyword and with an additional `na_value` keyword. + +During the discussion, several alternatives were brought up, covering both +alternative keyword names and the use of a different constructor. This PDEP opted to +keep using the existing `pd.StringDtype()` for now to keep the changes as +minimal as possible, leaving a larger overhaul of the dtype system (potentially +including different constructor functions or namespace) for a future discussion. +See [GH-58613](https://github.com/pandas-dev/pandas/issues/58613) for the full +discussion.
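The mapping from user specification to concrete dtype in the naming table can be modelled with a small pure-Python sketch. This is only an illustration of the table: `resolve_string_dtype`, the `pyarrow_installed` flag, and the `"np.nan"` / `"pd.NA"` labels are hypothetical stand-ins and not pandas API.

```python
from typing import Optional, Tuple

# Stand-in labels for the two missing value sentinels (hypothetical, not pandas objects).
NAN = "np.nan"
NA = "pd.NA"


def resolve_string_dtype(
    storage: Optional[str] = None,
    na_value: str = NAN,
    pyarrow_installed: bool = True,
) -> Tuple[str, str, str]:
    """Return (storage, na_value, string alias) following the naming table."""
    if na_value not in (NAN, NA):
        raise ValueError(f"unknown na_value: {na_value!r}")
    if storage == "pyarrow_numpy":
        # Note (3): legacy spelling, only listed with NaN semantics.
        if na_value != NAN:
            raise ValueError("'pyarrow_numpy' only exists with NaN semantics")
        return ("pyarrow", NAN, "string[pyarrow_numpy]")
    if storage is None:
        # Note (1): the storage depends on pyarrow being installed.
        storage = "pyarrow" if pyarrow_installed else "python"
    if storage not in ("pyarrow", "python"):
        raise ValueError(f"unknown storage: {storage!r}")
    # Only the NA-variants expose the storage in their string alias.
    alias = "string" if na_value == NAN else f"string[{storage}]"
    return (storage, na_value, alias)


print(resolve_string_dtype())
print(resolve_string_dtype("python", NA))
```

Under this model, the default NaN-variant always reports the plain `"string"` alias, so the storage can only be made explicit through the `StringDtype(...)` constructor.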
+ ## Backward compatibility The most visible backwards incompatible change will be that columns with string From 9ede2e64616ddcc3a4c4b6a74b932675b0b95d03 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 14:19:22 +0200 Subject: [PATCH 08/24] Expand Backward compatibility section + add proposal for deprecation --- web/pandas/pdeps/0014-string-dtype.md | 76 ++++++++++++++++++++++----- 1 file changed, 62 insertions(+), 14 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index b2e51795d8dfd..bdb5ec534d6ae 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -244,9 +244,10 @@ sufficient (they don't need to specify the storage), and the explicit To avoid introducing a new string dtype while other discussions and changes are in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as -the default missing value sentinel? using the new NumPy 2.0 capabilities?), we -could also delay introducing a default string dtype until there is more clarity -in those other discussions. +the default missing value sentinel? using the new NumPy 2.0 capabilities? +overhauling all our dtypes to use a logical data type system?), we could also +delay introducing a default string dtype until there is more clarity in those +other discussions. However: @@ -258,6 +259,11 @@ However: the challenges around this will not be unique to the string dtype and therefore not a reason to delay this. +Making this change now for 3.0 will benefit the majority of our users, while +coming at a cost for a part of the users who already started using the +`"string"` dtype (they will have to update their code to continue to the variant +using `pd.NA`, see the "Backward compatibility" section below). + ### Why not use the existing StringDtype with `pd.NA`? Wouldn't adding even more variants of the string dtype make things only more @@ -294,22 +300,64 @@ discussion. 
The most visible backwards incompatible change will be that columns with string data will no longer have an `object` dtype. Therefore, code that assumes -`object` dtype (such as `ser.dtype == object`) will need to be updated. +`object` dtype (such as `ser.dtype == object`) will need to be updated. This +change is done as a hard break in a major release, as warning in advance for the +changed inference is deemed to noisy. To allow testing your code in advance, the `pd.options.future.infer_string = True` option is available. Otherwise, the actual string-specific functionality (such as the `.str` accessor -methods) should all keep working as is. By preserving the current missing value -semantics, this proposal is also backwards compatible on this aspect. - -One other backwards incompatible change is present for early adopters of the -existing `StringDtype`. In pandas 3.0, calling `pd.StringDtype()` will start -returning the new default string dtype, while up to now this returned the -experimental string dtype using `pd.NA` introduced in pandas 1.0. Those users -will need to start specifying a keyword in the dtype constructor if they want to -keep using `pd.NA` (but if they just want to have a dedicated string dtype, they -don't need to change their code). +methods) should generally all keep working as is. By preserving the current +missing value semantics, this proposal is also backwards compatible on this +aspect. + +### For existing users of `StringDtype` + +Users of the existing `StringDtype` will see more backwards incompatible +changes, though. In pandas 3.0, calling `pd.StringDtype()` (or specifying +`dtype="string"`) will start returning the new default string dtype using `NaN`, +while up to now this returned the string dtype using `pd.NA` introduced in +pandas 1.0. 
+ +For example, this code snippet returned the NA-variant of `StringDtype` with +pandas 1.x and 2.x: + +```python +>>> pd.Series(["a", "b", None], dtype="string") +0 a +1 b +2 +dtype: string +``` + +but will start returning the new default NaN-variant of `StringDtype` with +pandas 3.0. This means that the missing value sentinel will change from `pd.NA` +to `NaN`, and that operations will no longer return nullable dtypes but default +numpy dtypes (see the "Missing value semantics" section above). + +While this change will be transparent in many cases (e.g. checking for missing +values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of +a string predicate method keeps working regardless of the sentinel), this can be +a breaking change if you relied on the exact sentinel or resulting dtype. Since +pandas 1.0, the string dtype has been promoted quite a bit, and so we expect +that many users already have started using this dtype, even though officially +still labeled as "experimental". + +To smooth the upgrade experience for those users, we propose to add a +deprecation warning before 3.0 when such dtype is created, giving them two +options: + +- If the user just wants to have a dedicated "string" dtype (or the better + performance when using pyarrow) but is fine with using the default NaN + semantics, they can add `pd.options.future.infer_string = True` to their code + to suppress the warning and already opt-in to the future behaviour of pandas + 3.0. +- If the user specifically wants the variant of the string dtype that uses + `pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will + have to update their dtype specification from `"string"` / `pd.StringDtype()` + to `pd.StringDtype(na_value=pd.NA)` to suppress the warning and further keep + their code running as is. 
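As a rough illustration of the "transparent" cases mentioned above, the following sketch only uses operations that do not depend on which missing value sentinel backs the `"string"` dtype. It assumes nothing beyond the public `isna()` / `dropna()` / `fillna()` methods and `pd.isna`, and is intended as an example of sentinel-agnostic code rather than a complete migration recipe.

```python
import pandas as pd

# Opt-in string dtype: pd.NA-based in pandas 1.x/2.x, and NaN-based by
# default in pandas 3.0 according to this proposal.
ser = pd.Series(["a", "b", None], dtype="string")

# These operations do not depend on the exact missing value sentinel.
missing_mask = ser.isna().tolist()
without_missing = ser.dropna().tolist()
filled = ser.fillna("missing").tolist()

# Sentinel-agnostic scalar check, instead of comparing with `is pd.NA`.
last_is_missing = bool(pd.isna(ser.iloc[2]))

print(missing_mask, without_missing, filled, last_is_missing)
```

Code that instead checks `value is pd.NA`, or relies on nullable result dtypes such as `boolean` or `Int64`, is the kind of code that would need the explicit `pd.StringDtype(na_value=pd.NA)` specification described in the second option.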
## Timeline From f5faf4e0e5ef23e4cbb46ea1f2583eac6e6600ba Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 14:59:07 +0200 Subject: [PATCH 09/24] update timeline --- web/pandas/pdeps/0014-string-dtype.md | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index bdb5ec534d6ae..ffc1f3cd58f00 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -161,7 +161,7 @@ the missing value sentinel, and: Because the original `StringDtype` implementations already use `pd.NA` and return masked integer and boolean arrays in operations, a new variant of the -existing dtypes that uses `NaN` and default data types is needed. The original +existing dtypes that uses `NaN` and default data types was needed. The original variant of `StringDtype` using `pd.NA` will still be available for those who want to keep using it (see below in the "Naming" subsection for how to specify this). @@ -175,8 +175,8 @@ this (adding a new variant of the dtype) and a new `StringArray` subclass only needs minor changes to follow the above-mentioned missing value semantics ([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)). -For pandas 3.0, this is the most realistic option given this implementation is -already available for a long time. Beyond 3.0, we can still explore further +For pandas 3.0, this is the most realistic option given this implementation has +already been available for a long time. 
Beyond 3.0, we can still explore further improvements such as using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503)) or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)), but at that point that is an implementation detail that should not have a @@ -362,16 +362,23 @@ options: ## Timeline The future PyArrow-backed string dtype was already made available behind a feature -flag in pandas 2.1 (by `pd.options.future.infer_string = True`). +flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`). -Some small enhancements or fixes (or naming changes) might still be needed and -can be backported to pandas 2.2.x. +Some small enhancements or fixes might still be needed and can continue to be +backported to pandas 2.2.x. -The variant using numpy object-dtype could potentially also be backported to -2.2.x to allow easier testing. +The variant using numpy object-dtype can also be backported to the 2.2.x branch +to allow easier testing. We would propose to release this as 2.3.0 (created from +the 2.2.x branch, given that the main branch already includes many other changes +targeted for 3.0), together with the deprecation warning when creating a dtype +from `"string"` / `pd.StringDtype()`. -For pandas 3.0, this flag becomes enabled by default. +The 2.3.0 release would then have all future string functionality available +(both the pyarrow and object-dtype based variants of the default string dtype), +and warn existing users of the `StringDtype` in advance of 3.0 about how to +update their code. +For pandas 3.0, this `future.infer_string` flag becomes enabled by default. 
## PDEP-XX History From f554909e95e055745227e945e31dfc5fabc1c0bf Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 17:30:32 +0200 Subject: [PATCH 10/24] Apply suggestions from code review Co-authored-by: Irv Lustig --- web/pandas/pdeps/0014-string-dtype.md | 55 ++++++++++++++------------- 1 file changed, 28 insertions(+), 27 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index ffc1f3cd58f00..923ea67a54141 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -19,7 +19,7 @@ default in pandas 3.0: This will give users a long-awaited proper string dtype for 3.0, while 1) not (yet) making PyArrow a _hard_ dependency, but only a dependency used by default, and 2) leaving room for future improvements (different missing value semantics, -using NumPy 2.0, etc). +using NumPy 2.0 strings, etc). ## Background @@ -74,7 +74,7 @@ reconsideration: runtime dependency. In addition, NumPy 2.0 could in the future potentially reduce the need to make PyArrow a required dependency specifically for a dedicated pandas string dtype. -- The PDEP did not consider the usage of the experimental `pd.NA` as a +- PDEP-10 did not consider the usage of the experimental `pd.NA` as a consequence of adopting one of the existing implementations of the `StringDtype`. @@ -88,7 +88,7 @@ At the time, the `storage` option for this new variant was called `pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming" subsection below). -This last dtype variant is what you currently (pandas 2.2) get for string data +This last dtype variant is what users currently (pandas 2.2) get for string data when enabling the ``future.infer_string`` option (to enable the behaviour which is intended to become the default in pandas 3.0). @@ -96,15 +96,15 @@ is intended to become the default in pandas 3.0). 
To be able to move forward with a string data type in pandas 3.0, this PDEP proposes: -1. For pandas 3.0, we enable a "string" dtype by default, which will use PyArrow +1. For pandas 3.0, a "string" dtype is enabled by default, which will use PyArrow if installed, and otherwise falls back to an in-house functionally-equivalent (but slower) version. 2. This default "string" dtype will follow the same behaviour for missing values - as our other default data types, and use `NaN` as the missing value sentinel. + as other default data types, and use `NaN` as the missing value sentinel. 3. The version that is not backed by PyArrow can reuse (with minor code additions) the existing numpy object-dtype backed StringArray for its implementation. -4. We update installation guidelines to clearly encourage users to install +4. Installation guidelines are updated to clearly encourage users to install pyarrow for the default user experience. Those string dtypes enabled by default will then no longer be considered as @@ -145,7 +145,7 @@ that: nullable `'Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64` dtype (or `float64` in case of missing values)). -However, up to this date, all other default data types still use NaN semantics +However, up to this date, all other default data types still use `NaN` semantics for missing values. Therefore, this proposal says that a new default string dtype should also still use the same default missing value semantics and return default data types when doing operations on the string column, to be consistent @@ -176,9 +176,10 @@ needs minor changes to follow the above-mentioned missing value semantics ([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)). For pandas 3.0, this is the most realistic option given this implementation has -already been available for a long time. Beyond 3.0, we can still explore further +already been available for a long time. 
Beyond 3.0, further improvements such as using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503)) -or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)), +or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)) +can still be explored, but at that point that is an implementation detail that should not have a direct impact on users (except for performance). @@ -187,7 +188,7 @@ direct impact on users (except for performance). Given the long history of this topic, the naming of the dtypes is a difficult topic. -In the first place, we need to acknowledge that most users should not need to +In the first place, it should be acknowledged that most users should not need to use storage-specific options. Users are expected to specify `pd.StringDtype()` or `"string"`, and that will give them their default string dtype (which depends on whether PyArrow is installed or not). @@ -201,8 +202,8 @@ Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where the `"pyarrow_numpy"` storage was used to disambiguate from the existing `"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather confusing option and doesn't generalize well. Therefore, this PDEP proposes -a new naming scheme as outlined below, and we will deprecate and remove -"pyarrow_numpy" before pandas 3.0. +a new naming scheme as outlined below, and +"pyarrow_numpy" will be deprecated and removed before pandas 3.0. The `storage` keyword of `StringDtype` is kept to disambiguate the underlying storage of the string data (using pyarrow or python objects), but an additional @@ -227,12 +228,12 @@ Notes: - (1) You get "pyarrow" or "python" depending on pyarrow being installed. - (2) Those three rows are backwards incompatible (i.e. they work now but give - you the NA-variant), see the "Backward compatibility" section below. + the NA-variant), see the "Backward compatibility" section below. 
- (3) "pyarrow_numpy" is kept temporarily because this is already in a released version, but we can deprecate it in 2.2.x and have it removed for 3.0. For the new default string dtype, only the `"string"` alias can be used to -specify the dtype as a string, i.e. we would not provide a way to make the +specify the dtype as a string, i.e. a way would not be provided to make the underlying storage (pyarrow or python) explicit through the string alias. This string alias is only a convenience shortcut and for most users `"string"` is sufficient (they don't need to specify the storage), and the explicit @@ -245,23 +246,23 @@ sufficient (they don't need to specify the storage), and the explicit To avoid introducing a new string dtype while other discussions and changes are in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as the default missing value sentinel? using the new NumPy 2.0 capabilities? -overhauling all our dtypes to use a logical data type system?), we could also -delay introducing a default string dtype until there is more clarity in those +overhauling all our dtypes to use a logical data type system?), +introducing a default string dtype could also be delayed until there is more clarity in those other discussions. However: 1. Delaying has a cost: it further postpones introducing a dedicated string - dtype that has massive benefits for our users, both in usability as (for the + dtype that has massive benefits for users, both in usability as (for the significant part of the user base that has PyArrow installed) in performance. -2. In case we eventually transition to use `pd.NA` as the default missing value - sentinel, we will need a migration path for _all_ our data types, and thus +2. 
In case pandas eventually transitions to use `pd.NA` as the default missing value + sentinel, a migration path for _all_ our data types will be needed, and thus the challenges around this will not be unique to the string dtype and therefore not a reason to delay this. -Making this change now for 3.0 will benefit the majority of our users, while +Making this change now for 3.0 will benefit the majority of users, while coming at a cost for a part of the users who already started using the -`"string"` dtype (they will have to update their code to continue to the variant +`"string"` or `pd.StringDtype()` dtype (they will have to update their code to continue to the variant using `pd.NA`, see the "Backward compatibility" section below). ### Why not use the existing StringDtype with `pd.NA`? @@ -302,10 +303,10 @@ The most visible backwards incompatible change will be that columns with string data will no longer have an `object` dtype. Therefore, code that assumes `object` dtype (such as `ser.dtype == object`) will need to be updated. This change is done as a hard break in a major release, as warning in advance for the -changed inference is deemed to noisy. +changed inference is deemed too noisy. -To allow testing your code in advance, the -`pd.options.future.infer_string = True` option is available. +To allow testing code in advance, the +`pd.options.future.infer_string = True` option is available for users. Otherwise, the actual string-specific functionality (such as the `.str` accessor methods) should generally all keep working as is. By preserving the current @@ -339,12 +340,12 @@ numpy dtypes (see the "Missing value semantics" section above). While this change will be transparent in many cases (e.g. checking for missing values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of a string predicate method keeps working regardless of the sentinel), this can be -a breaking change if you relied on the exact sentinel or resulting dtype. 
Since +a breaking change if users relied on the exact sentinel or resulting dtype. Since pandas 1.0, the string dtype has been promoted quite a bit, and so we expect that many users already have started using this dtype, even though officially still labeled as "experimental". -To smooth the upgrade experience for those users, we propose to add a +To smooth the upgrade experience for those users, it is proposed to add a deprecation warning before 3.0 when such dtype is created, giving them two options: @@ -368,7 +369,7 @@ Some small enhancements or fixes might still be needed and can continue to be backported to pandas 2.2.x. The variant using numpy object-dtype can also be backported to the 2.2.x branch -to allow easier testing. We would propose to release this as 2.3.0 (created from +to allow easier testing. It is proposed to release this as 2.3.0 (created from the 2.2.x branch, given that the main branch already includes many other changes targeted for 3.0), together with the deprecation warning when creating a dtype from `"string"` / `pd.StringDtype()`. From ac2d21a3d3bb1e71128e2d481ee0a1a2e28bb66b Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 17:31:13 +0200 Subject: [PATCH 11/24] Apply suggestions from code review --- web/pandas/pdeps/0014-string-dtype.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index 923ea67a54141..d08d3c66c5915 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -128,7 +128,10 @@ default behaviour: dtype: string ``` -This option will be expanded to also work when PyArrow is not installed. +Right now (pandas 2.2), the existing option only enables the PyArrow-based +future dtype. For the remaining 2.x releases, this option will be expanded to +also work when PyArrow is not installed to enable the object-dtype fallback in +that case. 
### Missing value semantics @@ -230,7 +233,7 @@ Notes: - (2) Those three rows are backwards incompatible (i.e. they work now but give the NA-variant), see the "Backward compatibility" section below. - (3) "pyarrow_numpy" is kept temporarily because this is already in a released - version, but we can deprecate it in 2.2.x and have it removed for 3.0. + version, but we can deprecate it in 2.x and have it removed for 3.0. For the new default string dtype, only the `"string"` alias can be used to specify the dtype as a string, i.e. a way would not be provided to make the From 82027d27d77aa2726bc4feac1a648931fb6b3c3b Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 17:33:54 +0200 Subject: [PATCH 12/24] reflow after online edits --- web/pandas/pdeps/0014-string-dtype.md | 31 +++++++++++++-------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index d08d3c66c5915..c753ca148b81c 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -179,12 +179,11 @@ needs minor changes to follow the above-mentioned missing value semantics ([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)). For pandas 3.0, this is the most realistic option given this implementation has -already been available for a long time. Beyond 3.0, further -improvements such as using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503)) -or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)) -can still be explored, -but at that point that is an implementation detail that should not have a -direct impact on users (except for performance). +already been available for a long time. 
Beyond 3.0, further improvements such as +using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503)) +or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)) can +still be explored, but at that point that is an implementation detail that +should not have a direct impact on users (except for performance). ### Naming @@ -203,10 +202,10 @@ dtype need a way to specify this. Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where the `"pyarrow_numpy"` storage was used to disambiguate from the existing -`"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather -confusing option and doesn't generalize well. Therefore, this PDEP proposes -a new naming scheme as outlined below, and -"pyarrow_numpy" will be deprecated and removed before pandas 3.0. +`"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather confusing +option and doesn't generalize well. Therefore, this PDEP proposes a new naming +scheme as outlined below, and "pyarrow_numpy" will be deprecated and removed +before pandas 3.0. The `storage` keyword of `StringDtype` is kept to disambiguate the underlying storage of the string data (using pyarrow or python objects), but an additional @@ -249,8 +248,8 @@ sufficient (they don't need to specify the storage), and the explicit To avoid introducing a new string dtype while other discussions and changes are in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as the default missing value sentinel? using the new NumPy 2.0 capabilities? -overhauling all our dtypes to use a logical data type system?), -introducing a default string dtype could also be delayed until there is more clarity in those +overhauling all our dtypes to use a logical data type system?), introducing a +default string dtype could also be delayed until there is more clarity in those other discussions. 
However: @@ -263,10 +262,10 @@ However: the challenges around this will not be unique to the string dtype and therefore not a reason to delay this. -Making this change now for 3.0 will benefit the majority of users, while -coming at a cost for a part of the users who already started using the -`"string"` or `pd.StringDtype()` dtype (they will have to update their code to continue to the variant -using `pd.NA`, see the "Backward compatibility" section below). +Making this change now for 3.0 will benefit the majority of users, while coming +at a cost for a part of the users who already started using the `"string"` or +`pd.StringDtype()` dtype (they will have to update their code to continue to the +variant using `pd.NA`, see the "Backward compatibility" section below). ### Why not use the existing StringDtype with `pd.NA`? From 5b24c24abc0cbfa8f980367223a678661f01b449 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 18:33:59 +0200 Subject: [PATCH 13/24] Update web/pandas/pdeps/0014-string-dtype.md Co-authored-by: William Ayd --- web/pandas/pdeps/0014-string-dtype.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index c753ca148b81c..a0b370214a94e 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -222,7 +222,7 @@ dtype of the data: | `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) | | `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) | | `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | | -| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[python]" | | +| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | | | `StringDtype(na_value=pd.NA)` | 
`StringDtype(storage="pyarrow"|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) | | `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) | From f9c55f4b6bb986a84a8f1a5d136e5eec53d399a5 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 May 2024 20:10:49 +0200 Subject: [PATCH 14/24] Apply suggestions from code review Co-authored-by: Irv Lustig --- web/pandas/pdeps/0014-string-dtype.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index a0b370214a94e..1d057ca9a6aab 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -50,7 +50,7 @@ Since its introduction, the `StringDtype` has always been opt-in, and has used the experimental `pd.NA` sentinel for missing values (which was also [introduced in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)). However, up to this date, pandas has not yet taken the step to use `pd.NA` by -default, and thus the `StringDtype` deviates in missing value behaviour compared +default for any dtype, and thus the `StringDtype` deviates in missing value behaviour compared to the default data types. In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html) @@ -116,7 +116,7 @@ By default, pandas will infer this new string dtype instead of object dtype for string data (when creating pandas objects, such as in constructors or IO functions). -The existing `future.infer_string` option can be used to opt-in to the future +In pandas 2.2, the existing `future.infer_string` option can be used to opt-in to the future default behaviour: ```python @@ -202,9 +202,9 @@ dtype need a way to specify this. 
Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where the `"pyarrow_numpy"` storage was used to disambiguate from the existing -`"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather confusing +`"pyarrow"` option using `pd.NA`. However, `"pyarrow_numpy"` is a rather confusing option and doesn't generalize well. Therefore, this PDEP proposes a new naming -scheme as outlined below, and "pyarrow_numpy" will be deprecated and removed +scheme as outlined below, and `"pyarrow_numpy"` will be deprecated and removed before pandas 3.0. The `storage` keyword of `StringDtype` is kept to disambiguate the underlying @@ -258,7 +258,7 @@ However: dtype that has massive benefits for users, both in usability as (for the significant part of the user base that has PyArrow installed) in performance. 2. In case pandas eventually transitions to use `pd.NA` as the default missing value - sentinel, a migration path for _all_ our data types will be needed, and thus + sentinel, a migration path for _all_ pandas data types will be needed, and thus the challenges around this will not be unique to the string dtype and therefore not a reason to delay this. From 2c58c4cc02cd3e67dd409dff75cfb86edb348c96 Mon Sep 17 00:00:00 2001 From: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com> Date: Tue, 14 May 2024 13:28:50 -0400 Subject: [PATCH 15/24] Fixup table (#2) --- web/pandas/pdeps/0014-string-dtype.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index 1d057ca9a6aab..f99f9184140c2 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -215,16 +215,16 @@ and NaN semantics. 
Overview of the different ways to specify a dtype and the resulting concrete dtype of the data: -| User specification | Concrete dtype | String alias | Note | -|----------------------------------------|---------------------------------------------------|-------------------------|------| -| Unspecified (inference) | `StringDtype(storage="pyarrow"|"python", na_value=np.nan)` | "string" | (1) | -| `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow"|"python", na_value=np.nan)` | "string" | (1), (2) | -| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) | -| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) | -| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | | -| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | | -| `StringDtype(na_value=pd.NA)` | `StringDtype(storage="pyarrow"|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) | -| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) | +| User specification | Concrete dtype | String alias | Note | +|------------------------------------------|---------------------------------------------------------------|---------------------------------------|----------| +| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1) | +| `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow" \| "python", na_value=np.nan)` | "string" | (1), (2) | +| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) | +| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) | +| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | | +| `StringDtype("python", 
na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | | +| `StringDtype(na_value=pd.NA)` | `StringDtype(storage="pyarrow" \| "python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) | +| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) | Notes: From 8974c5bd2588017b0591aa4b3ea320e97455da50 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 20 May 2024 15:23:05 +0200 Subject: [PATCH 16/24] next round of updates (small text updates, add capitalized String alias) --- web/pandas/pdeps/0014-string-dtype.md | 44 ++++++++++++++++----------- 1 file changed, 26 insertions(+), 18 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index f99f9184140c2..592c5358a8e30 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -12,7 +12,7 @@ This PDEP proposes to introduce a dedicated string dtype that will be used by default in pandas 3.0: * In pandas 3.0, enable a "string" dtype by default, using PyArrow if available - or otherwise the numpy object-dtype alternative. + or otherwise a string dtype using numpy object-dtype under the hood as fallback. * The default string dtype will use missing value semantics (using NaN) consistent with the other default data types. @@ -69,11 +69,11 @@ data type in pandas that is not backed by Python objects. After acceptance of PDEP-10, two aspects of the proposal have been under reconsideration: -- Based on user feedback (mostly around installation complexity and size), it - has been considered to relax the new `pyarrow` requirement to not be a _hard_ - runtime dependency. In addition, NumPy 2.0 could in the future potentially - reduce the need to make PyArrow a required dependency specifically for a - dedicated pandas string dtype. 
+- Based on feedback from users and maintainers from other packages (mostly + around installation complexity and size), it has been considered to relax the + new `pyarrow` requirement to not be a _hard_ runtime dependency. In addition, + NumPy 2.0 could in the future potentially reduce the need to make PyArrow a + required dependency specifically for a dedicated pandas string dtype. - PDEP-10 did not consider the usage of the experimental `pd.NA` as a consequence of adopting one of the existing implementations of the `StringDtype`. @@ -250,13 +250,15 @@ in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as the default missing value sentinel? using the new NumPy 2.0 capabilities? overhauling all our dtypes to use a logical data type system?), introducing a default string dtype could also be delayed until there is more clarity in those -other discussions. +other discussions. Specifically, it would avoid temporarily switching to use +`NaN` for the string dtype, while in a future version we might switch back +to `pd.NA` by default. However: 1. Delaying has a cost: it further postpones introducing a dedicated string dtype that has massive benefits for users, both in usability as (for the - significant part of the user base that has PyArrow installed) in performance. + part of the user base that has PyArrow installed) in performance. 2. In case pandas eventually transitions to use `pd.NA` as the default missing value sentinel, a migration path for _all_ pandas data types will be needed, and thus the challenges around this will not be unique to the string dtype and @@ -264,8 +266,8 @@ However: Making this change now for 3.0 will benefit the majority of users, while coming at a cost for a part of the users who already started using the `"string"` or -`pd.StringDtype()` dtype (they will have to update their code to continue to the -variant using `pd.NA`, see the "Backward compatibility" section below). 
+`pd.StringDtype()` dtype (they will have to update their code to continue to use +the variant using `pd.NA`, see the "Backward compatibility" section below). ### Why not use the existing StringDtype with `pd.NA`? @@ -311,9 +313,14 @@ To allow testing code in advance, the `pd.options.future.infer_string = True` option is available for users. Otherwise, the actual string-specific functionality (such as the `.str` accessor -methods) should generally all keep working as is. By preserving the current -missing value semantics, this proposal is also backwards compatible on this -aspect. +methods) should generally all keep working as is. + +By preserving the current missing value semantics, this proposal is also mostly +backwards compatible on this aspect. When storing strings in object dtype, pandas +however did allow using `None` as the missing value indicator as well (and in +certain cases such as the `shift` method, pandas even introduced this itself). +For all the cases where currently `None` was used as the missing value sentinel, +this will change to use `NaN` consistently. ### For existing users of `StringDtype` @@ -359,17 +366,18 @@ options: - If the user specifically wants the variant of the string dtype that uses `pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will have to update their dtype specification from `"string"` / `pd.StringDtype()` - to `pd.StringDtype(na_value=pd.NA)` to suppress the warning and further keep - their code running as is. + to `"String"` / `pd.StringDtype(na_value=pd.NA)` to suppress the warning and + further keep their code running as is. + +A `"String"` alias (capitalized) would be added to make it easier for users to +continue using the variant using `pd.NA`, and such capitalized string alias is +consistent with other nullable dtypes (`"float64`" vs `"Float64"`). 
## Timeline The future PyArrow-backed string dtype was already made available behind a feature flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`). -Some small enhancements or fixes might still be needed and can continue to be -backported to pandas 2.2.x. - The variant using numpy object-dtype can also be backported to the 2.2.x branch to allow easier testing. It is proposed to release this as 2.3.0 (created from the 2.2.x branch, given that the main branch already includes many other changes From cca3a7f4e5393877e0cee735799e6224700a7f69 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 20 May 2024 20:40:36 +0200 Subject: [PATCH 17/24] use capitalized alias in the overview table --- web/pandas/pdeps/0014-string-dtype.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index 592c5358a8e30..80af77b8bb738 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -215,16 +215,16 @@ and NaN semantics. 
Overview of the different ways to specify a dtype and the resulting concrete dtype of the data: -| User specification | Concrete dtype | String alias | Note | -|------------------------------------------|---------------------------------------------------------------|---------------------------------------|----------| -| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1) | -| `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow" \| "python", na_value=np.nan)` | "string" | (1), (2) | -| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) | -| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) | -| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | | -| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | | -| `StringDtype(na_value=pd.NA)` | `StringDtype(storage="pyarrow" \| "python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) | -| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) | +| User specification | Concrete dtype | String alias | Note | +|---------------------------------------------|---------------------------------------------------------------|---------------------------------------|----------| +| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1) | +| `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow" \| "python", na_value=np.nan)` | "string" | (1), (2) | +| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) | +| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) | +| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | 
"String[pyarrow]" | | +| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "String[python]" | | +| `StringDtype(na_value=pd.NA)` or `"String"` | `StringDtype(storage="pyarrow" \| "python", na_value=pd.NA)` | "String[pyarrow]" or "String[python]" | (1) | +| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) | Notes: From 9c5342aabb0fdb3debf7c6c3099db58c28ecdd94 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 10 Jun 2024 18:05:26 +0200 Subject: [PATCH 18/24] New revision: keep back compat for 'string', introduce 'str' for the new default dtype --- web/pandas/pdeps/0014-string-dtype.md | 175 +++++++++++--------------- 1 file changed, 76 insertions(+), 99 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index 80af77b8bb738..a9b8f79cf7d1d 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -11,7 +11,7 @@ This PDEP proposes to introduce a dedicated string dtype that will be used by default in pandas 3.0: -* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available +* In pandas 3.0, enable a string dtype (`"str"`) by default, using PyArrow if available or otherwise a string dtype using numpy object-dtype under the hood as fallback. * The default string dtype will use missing value semantics (using NaN) consistent with the other default data types. @@ -96,10 +96,10 @@ is intended to become the default in pandas 3.0). To be able to move forward with a string data type in pandas 3.0, this PDEP proposes: -1. For pandas 3.0, a "string" dtype is enabled by default, which will use PyArrow +1. For pandas 3.0, a `"str"` string dtype is enabled by default, which will use PyArrow if installed, and otherwise falls back to an in-house functionally-equivalent (but slower) version. -2. 
This default "string" dtype will follow the same behaviour for missing values +2. This default string dtype will follow the same behaviour for missing values as other default data types, and use `NaN` as the missing value sentinel. 3. The version that is not backed by PyArrow can reuse (with minor code additions) the existing numpy object-dtype backed StringArray for its @@ -135,10 +135,9 @@ that case. ### Missing value semantics -As mentioned in the background section, the original `StringDtype` has used -the experimental `pd.NA` sentinel for missing values. In addition to using -`pd.NA` as the scalar for a missing value, this essentially means -that: +As mentioned in the background section, the original `StringDtype` has always +used the experimental `pd.NA` sentinel for missing values. In addition to using +`pd.NA` as the scalar for a missing value, this essentially means that: - String columns follow ["NA-semantics"](https://pandas.pydata.org/docs/user_guide/missing_data.html#na-semantics) for missing values, where `NA` propagates in boolean operations such as @@ -154,7 +153,7 @@ dtype should also still use the same default missing value semantics and return default data types when doing operations on the string column, to be consistent with the other default dtypes at this point. -In practice, this means that the default `"string"` dtype will use `NaN` as +In practice, this means that the default string dtype will use `NaN` as the missing value sentinel, and: - String columns will follow NaN-semantics for missing values, where `NaN` gives @@ -165,9 +164,8 @@ the missing value sentinel, and: Because the original `StringDtype` implementations already use `pd.NA` and return masked integer and boolean arrays in operations, a new variant of the existing dtypes that uses `NaN` and default data types was needed. 
The original -variant of `StringDtype` using `pd.NA` will still be available for those who -want to keep using it (see below in the "Naming" subsection for how to specify -this). +variant of `StringDtype` using `pd.NA` will continue to be available for those +who were already using it. ### Object-dtype "fallback" implementation @@ -185,23 +183,35 @@ or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)) can still be explored, but at that point that is an implementation detail that should not have a direct impact on users (except for performance). +For the original variant of `StringDtype` using `pd.NA`, currently the default +storage is `"python"` (the object-dtype based implementation). Also for this +variant, it is proposed follow the same logic for determining the default +storage, i.e. the default to `"pyarrow"` if available, and otherwise +fall back to `"python"`. + ### Naming Given the long history of this topic, the naming of the dtypes is a difficult topic. In the first place, it should be acknowledged that most users should not need to -use storage-specific options. Users are expected to specify `pd.StringDtype()` -or `"string"`, and that will give them their default string dtype (which -depends on whether PyArrow is installed or not). - -But for testing purposes and advanced use cases that want control over this, we -need some way to specify this and distinguish them from the other string dtypes. -In addition, users that want to continue using the original NA-variant of the -dtype need a way to specify this. - -Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where -the `"pyarrow_numpy"` storage was used to disambiguate from the existing +use storage-specific options. Users are expected to specify a generic name (such +as `"str"` or `"string"`), and that will give them their default string dtype +(which depends on whether PyArrow is installed or not). 
+ +For the generic string alias to specify the dtype, `"string"` is already used +for the `StringDtype` using `pd.NA`. This PDEP proposes to use `"str"` for the +new default `StringDtype` using `NaN`. This ensures backwards compatibility for +code using `dtype="string"`, and was also chosen because `dtype="str"` or +`dtype=str` currently already works to ensure your data is converted to +strings (only using object dtype for the result). + +But for testing purposes and advanced use cases that want control over the exact +variant of the `StringDtype`, we need some way to specify this and distinguish +them from the other string dtypes. + +Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used for the new variant using `NaN`, +where the `"pyarrow_numpy"` storage was used to disambiguate from the existing `"pyarrow"` option using `pd.NA`. However, `"pyarrow_numpy"` is a rather confusing option and doesn't generalize well. Therefore, this PDEP proposes a new naming scheme as outlined below, and `"pyarrow_numpy"` will be deprecated and removed @@ -217,29 +227,31 @@ dtype of the data: | User specification | Concrete dtype | String alias | Note | |---------------------------------------------|---------------------------------------------------------------|---------------------------------------|----------| -| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1) | -| `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow" \| "python", na_value=np.nan)` | "string" | (1), (2) | -| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) | -| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) | -| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "String[pyarrow]" | | -| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "String[python]" | | -| 
`StringDtype(na_value=pd.NA)` or `"String"` | `StringDtype(storage="pyarrow" \| "python", na_value=pd.NA)` | "String[pyarrow]" or "String[python]" | (1) | -| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) | +| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) | +| `"str"` or `StringDtype(na_value=np.nan)` | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) | +| `StringDtype("pyarrow", na_value=np.nan)` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "str" | | +| `StringDtype("python", na_value=np.nan)` | `StringDtype(storage="python", na_value=np.nan)` | "str" | | +| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | | +| `StringDtype("python")` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | | +| `"string"` or `StringDtype()` | `StringDtype(storage="pyarrow"\|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) | +| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (2) | Notes: - (1) You get "pyarrow" or "python" depending on pyarrow being installed. -- (2) Those three rows are backwards incompatible (i.e. they work now but give - the NA-variant), see the "Backward compatibility" section below. -- (3) "pyarrow_numpy" is kept temporarily because this is already in a released +- (2) "pyarrow_numpy" is kept temporarily because this is already in a released version, but we can deprecate it in 2.x and have it removed for 3.0. -For the new default string dtype, only the `"string"` alias can be used to -specify the dtype as a string, i.e. a way would not be provided to make the +For the new default string dtype, only the `"str"` alias can be used to +specify the dtype as a string, i.e. 
pandas would not provide a way to make the underlying storage (pyarrow or python) explicit through the string alias. This -string alias is only a convenience shortcut and for most users `"string"` is +string alias is only a convenience shortcut and for most users `"str"` is sufficient (they don't need to specify the storage), and the explicit -`pd.StringDtype(...)` is still available for more fine-grained control. +`pd.StringDtype(storage=..., na_value=np.nan)` is still available for more +fine-grained control. + +Also for the existing variant using `pd.NA`, specifying the storage through the +string alias could be deprecated, but that is left for a separate decision. ## Alternatives @@ -257,17 +269,16 @@ to `pd.NA` by default. However: 1. Delaying has a cost: it further postpones introducing a dedicated string - dtype that has massive benefits for users, both in usability as (for the + dtype that has significant benefits for users, both in usability as (for the part of the user base that has PyArrow installed) in performance. 2. In case pandas eventually transitions to use `pd.NA` as the default missing value - sentinel, a migration path for _all_ pandas data types will be needed, and thus + sentinel, a migration path for _all_ pandas data types will be needed, and thus the challenges around this will not be unique to the string dtype and therefore not a reason to delay this. -Making this change now for 3.0 will benefit the majority of users, while coming -at a cost for a part of the users who already started using the `"string"` or -`pd.StringDtype()` dtype (they will have to update their code to continue to use -the variant using `pd.NA`, see the "Backward compatibility" section below). +Making this change now for 3.0 will benefit the majority of users, and the PDEP +author believes this is worth the cost of the added complexity around "yet +another dtype" (also for other data types we already have multiple variants). 
### Why not use the existing StringDtype with `pd.NA`? @@ -290,17 +301,26 @@ when explicitly opting into this. ### Naming alternatives -This PDEP now keeps the `pd.StringDtype` class constructor with the existing -`storage` keyword and with an additional `na_value` keyword. +An initial version of this PDEP proposed to use the `"string"` alias and the +default `pd.StringDtype()` class constructor for the new default dtype. +However, that caused a lot of discussion around backwards compatibility for +existing users of the `StringDtype` using `pd.NA`. During the discussion, several alternatives have been brought up. Both -alternative keyword names as using a different constructor. This PDEP opted to -keep using the existing `pd.StringDtype()` for now to keep the changes as +alternative keyword names as using a different constructor. In the end, +this PDEP proposes to use a different string alias (`"str"`) but to keep +using the existing `pd.StringDtype` (with the existing `storage` keyword but +with an additional `na_value` keyword) for now to keep the changes as minimal as possible, leaving a larger overhaul of the dtype system (potentially including different constructor functions or namespace) for a future discussion. See [GH-58613](https://github.com/pandas-dev/pandas/issues/58613) for the full discussion. +One consequence is that when using the class constructor for the default dtype, +it has to be used with non-default arguments, i.e. a user needs to specify +`pd.StringDtype(na_value=np.nan)` to get the default dtype using `NaN`. +Therefore, the pandas documentation will focus on the usage of `dtype="str"`. + ## Backward compatibility The most visible backwards incompatible change will be that columns with string @@ -324,54 +344,14 @@ this will change to use `NaN` consistently. ### For existing users of `StringDtype` -Users of the existing `StringDtype` will see more backwards incompatible -changes, though. 
In pandas 3.0, calling `pd.StringDtype()` (or specifying -`dtype="string"`) will start returning the new default string dtype using `NaN`, -while up to now this returned the string dtype using `pd.NA` introduced in -pandas 1.0. - -For example, this code snippet returned the NA-variant of `StringDtype` with -pandas 1.x and 2.x: - -```python ->>> pd.Series(["a", "b", None], dtype="string") -0 a -1 b -2 -dtype: string -``` +Existing code that already opted in to use the `StringDtype` using `pd.NA` +should generally keep working as is. The latest version of this PDEP preserves +the behaviour of `dtype="string"` or `dtype=pd.StringDtype()` to mean the +`pd.NA` variant of the dtype. -but will start returning the new default NaN-variant of `StringDtype` with -pandas 3.0. This means that the missing value sentinel will change from `pd.NA` -to `NaN`, and that operations will no longer return nullable dtypes but default -numpy dtypes (see the "Missing value semantics" section above). - -While this change will be transparent in many cases (e.g. checking for missing -values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of -a string predicate method keeps working regardless of the sentinel), this can be -a breaking change if users relied on the exact sentinel or resulting dtype. Since -pandas 1.0, the string dtype has been promoted quite a bit, and so we expect -that many users already have started using this dtype, even though officially -still labeled as "experimental". - -To smooth the upgrade experience for those users, it is proposed to add a -deprecation warning before 3.0 when such dtype is created, giving them two -options: - -- If the user just wants to have a dedicated "string" dtype (or the better - performance when using pyarrow) but is fine with using the default NaN - semantics, they can add `pd.options.future.infer_string = True` to their code - to suppress the warning and already opt-in to the future behaviour of pandas - 3.0. 
-- If the user specifically wants the variant of the string dtype that uses - `pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will - have to update their dtype specification from `"string"` / `pd.StringDtype()` - to `"String"` / `pd.StringDtype(na_value=pd.NA)` to suppress the warning and - further keep their code running as is. - -A `"String"` alias (capitalized) would be added to make it easier for users to -continue using the variant using `pd.NA`, and such capitalized string alias is -consistent with other nullable dtypes (`"float64`" vs `"Float64"`). +It does propose the change the default storage to `"pyarrow"` (if available) for +the opt-in `pd.NA` variant as well, but this should not have much user-visible +impact. ## Timeline @@ -381,13 +361,10 @@ flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`). The variant using numpy object-dtype can also be backported to the 2.2.x branch to allow easier testing. It is proposed to release this as 2.3.0 (created from the 2.2.x branch, given that the main branch already includes many other changes -targeted for 3.0), together with the deprecation warning when creating a dtype -from `"string"` / `pd.StringDtype()`. +targeted for 3.0), together with the changes to the naming scheme. The 2.3.0 release would then have all future string functionality available -(both the pyarrow and object-dtype based variants of the default string dtype), -and warn existing users of the `StringDtype` in advance of 3.0 about how to -update their code. +(both the pyarrow and object-dtype based variants of the default string dtype). For pandas 3.0, this `future.infer_string` flag becomes enabled by default. 
From b5663cc3009cadecfd9a06044702e919e4a50798 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Tue, 11 Jun 2024 18:07:08 +0200 Subject: [PATCH 19/24] Apply suggestions from code review Co-authored-by: Irv Lustig --- web/pandas/pdeps/0014-string-dtype.md | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index a9b8f79cf7d1d..9743959c5db53 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -185,8 +185,8 @@ should not have a direct impact on users (except for performance). For the original variant of `StringDtype` using `pd.NA`, currently the default storage is `"python"` (the object-dtype based implementation). Also for this -variant, it is proposed follow the same logic for determining the default -storage, i.e. the default to `"pyarrow"` if available, and otherwise +variant, it is proposed to follow the same logic for determining the default +storage, i.e. default to `"pyarrow"` if available, and otherwise fall back to `"python"`. ### Naming @@ -214,8 +214,8 @@ Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used for the n where the `"pyarrow_numpy"` storage was used to disambiguate from the existing `"pyarrow"` option using `pd.NA`. However, `"pyarrow_numpy"` is a rather confusing option and doesn't generalize well. Therefore, this PDEP proposes a new naming -scheme as outlined below, and `"pyarrow_numpy"` will be deprecated and removed -before pandas 3.0. +scheme as outlined below, and `"pyarrow_numpy"` will be deprecated as an alias +in pandas 2.3 and removed in pandas 3.0. The `storage` keyword of `StringDtype` is kept to disambiguate the underlying storage of the string data (using pyarrow or python objects), but an additional @@ -240,7 +240,7 @@ Notes: - (1) You get "pyarrow" or "python" depending on pyarrow being installed. 
- (2) "pyarrow_numpy" is kept temporarily because this is already in a released - version, but we can deprecate it in 2.x and have it removed for 3.0. + version, but it will be deprecated it in 2.x and removed for 3.0. For the new default string dtype, only the `"str"` alias can be used to specify the dtype as a string, i.e. pandas would not provide a way to make the @@ -304,7 +304,8 @@ when explicitly opting into this. An initial version of this PDEP proposed to use the `"string"` alias and the default `pd.StringDtype()` class constructor for the new default dtype. However, that caused a lot of discussion around backwards compatibility for -existing users of the `StringDtype` using `pd.NA`. +existing users of `dtype=pd.StringDtype()` and `dtype="string"`, which use +`pd.NA` to represent missing values. During the discussion, several alternatives have been brought up. Both alternative keyword names as using a different constructor. In the end, @@ -340,7 +341,7 @@ backwards compatible on this aspect. When storing strings in object dtype, panda however did allow using `None` as the missing value indicator as well (and in certain cases such as the `shift` method, pandas even introduced this itself). For all the cases where currently `None` was used as the missing value sentinel, -this will change to use `NaN` consistently. +this will change to consistently use `NaN`. ### For existing users of `StringDtype` @@ -350,8 +351,8 @@ the behaviour of `dtype="string"` or `dtype=pd.StringDtype()` to mean the `pd.NA` variant of the dtype. It does propose the change the default storage to `"pyarrow"` (if available) for -the opt-in `pd.NA` variant as well, but this should not have much user-visible -impact. +the opt-in `pd.NA` variant as well, but this should have limited, if any, +user-visible impact. 
## Timeline From 1c4c2d9f1cecc6189ff681c6a825354ffa2d6e64 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Wed, 12 Jun 2024 16:50:51 +0200 Subject: [PATCH 20/24] Update web/pandas/pdeps/0014-string-dtype.md Co-authored-by: Irv Lustig --- web/pandas/pdeps/0014-string-dtype.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index 9743959c5db53..b6ff52d0bef6e 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -240,7 +240,7 @@ Notes: - (1) You get "pyarrow" or "python" depending on pyarrow being installed. - (2) "pyarrow_numpy" is kept temporarily because this is already in a released - version, but it will be deprecated it in 2.x and removed for 3.0. + version, but it will be deprecated in 2.x and removed for 3.0. For the new default string dtype, only the `"str"` alias can be used to specify the dtype as a string, i.e. pandas would not provide a way to make the From c44bfb583e13d8dfb9bc278e7895a7367226d621 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Wed, 12 Jun 2024 17:11:20 +0200 Subject: [PATCH 21/24] rephrase main points in proposal --- web/pandas/pdeps/0014-string-dtype.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index b6ff52d0bef6e..135e54e2ff1a5 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -96,14 +96,15 @@ is intended to become the default in pandas 3.0). To be able to move forward with a string data type in pandas 3.0, this PDEP proposes: -1. For pandas 3.0, a `"str"` string dtype is enabled by default, which will use PyArrow - if installed, and otherwise falls back to an in-house functionally-equivalent - (but slower) version. +1. For pandas 3.0, a `"str"` string dtype is enabled by default, i.e. 
this + string dtype will be used as the default dtype for text data when creating + pandas objects (e.g. inference in constructors, I/O functions). 2. This default string dtype will follow the same behaviour for missing values as other default data types, and use `NaN` as the missing value sentinel. -3. The version that is not backed by PyArrow can reuse (with minor code - additions) the existing numpy object-dtype backed StringArray for its - implementation. +3. The string dtype will use PyArrow if installed, and otherwise falls back to + an in-house functionally-equivalent (but slower) version. This fallback can + reuse (with minor code additions) the existing numpy object-dtype backed + StringArray for its implementation. 4. Installation guidelines are updated to clearly encourage users to install pyarrow for the default user experience. From bd52f39be06a0c985a7257766478a006c63fdde3 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Fri, 14 Jun 2024 20:03:17 +0200 Subject: [PATCH 22/24] tiny edit --- web/pandas/pdeps/0014-string-dtype.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md index 135e54e2ff1a5..8c79ced5f5992 100644 --- a/web/pandas/pdeps/0014-string-dtype.md +++ b/web/pandas/pdeps/0014-string-dtype.md @@ -49,9 +49,9 @@ This could be specified with the `storage` keyword in the opt-in string dtype Since its introduction, the `StringDtype` has always been opt-in, and has used the experimental `pd.NA` sentinel for missing values (which was also [introduced in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)). -However, up to this date, pandas has not yet taken the step to use `pd.NA` by -default for any dtype, and thus the `StringDtype` deviates in missing value behaviour compared -to the default data types. 
+However, up to this date, pandas has not yet taken the step to use `pd.NA` for
+for any default dtype, and thus the `StringDtype` deviates in missing value
+behaviour compared to the default data types.

 In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html)
 proposed to start using a PyArrow-backed string dtype by default in pandas 3.0
@@ -370,6 +370,6 @@ The 2.3.0 release would then have all future string functionality available

 For pandas 3.0, this `future.infer_string` flag becomes enabled by default.

-## PDEP-XX History
+## PDEP-14 History

 - 3 May 2024: Initial version

From f8fbc614d1eb0976309049a6a1ad0e2f252300fe Mon Sep 17 00:00:00 2001
From: Joris Van den Bossche
Date: Fri, 14 Jun 2024 20:12:05 +0200
Subject: [PATCH 23/24] mismatched quote

---
 web/pandas/pdeps/0014-string-dtype.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md
index 8c79ced5f5992..4e7ccf3b00442 100644
--- a/web/pandas/pdeps/0014-string-dtype.md
+++ b/web/pandas/pdeps/0014-string-dtype.md
@@ -145,7 +145,7 @@ used the experimental `pd.NA` sentinel for missing values. In addition to using
   comparisons or predicates.
 - Operations on the string column that give a numeric or boolean result use the
   nullable Integer/Float/Boolean data types (e.g. `ser.str.len()` returns the
-  nullable `'Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
+  nullable `"Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
   dtype (or `float64` in case of missing values)).

 However, up to this date, all other default data types still use `NaN` semantics

From d78462dbe9c1b62fa17df61c34b143108a20d566 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Mon, 22 Jul 2024 22:36:52 +0200
Subject: [PATCH 24/24] Update 0014-string-dtype.md

---
 web/pandas/pdeps/0014-string-dtype.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/web/pandas/pdeps/0014-string-dtype.md b/web/pandas/pdeps/0014-string-dtype.md
index 4e7ccf3b00442..5b74f71216454 100644
--- a/web/pandas/pdeps/0014-string-dtype.md
+++ b/web/pandas/pdeps/0014-string-dtype.md
@@ -1,7 +1,7 @@
 # PDEP-14: Dedicated string data type for pandas 3.0

 - Created: May 3, 2024
-- Status: Under discussion
+- Status: Accepted
 - Discussion: https://github.com/pandas-dev/pandas/pull/58551
 - Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche)
 - Revision: 1
@@ -61,7 +61,7 @@ objects (for better performance), it proposed to make `pyarrow` a new required
 runtime dependency of pandas.

 In the meantime, NumPy has also been working on a native variable-width string
-data type, which will be available [starting with NumPy
+data type, which was made available [starting with NumPy
 2.0](https://numpy.org/devdocs/release/2.0.0-notes.html#stringdtype-has-been-added-to-numpy).
 This can provide a potential alternative to PyArrow for implementing a string
 data type in pandas that is not backed by Python objects.