Skip to content

Commit

Permalink
[skip-ci] pdep6 draft
Browse files Browse the repository at this point in the history
  • Loading branch information
MarcoGorelli committed Dec 23, 2022
1 parent 9993ec4 commit 06fcb4c
Showing 1 changed file with 140 additions and 0 deletions.
140 changes: 140 additions & 0 deletions web/pandas/pdeps/0006-ban-upcasting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
# PDEP-6: Ban upcasting in setitem-like operations

- Created: 23 December 2022
- Status: Draft
- Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402)
- Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche))
- Revision: 1

## Abstract

The suggestion is that setitem-like operations would
not change a ``Series``' dtype.

Current behaviour:
```python
In [1]: ser = pd.Series([1, 2, 3], dtype='int64')

In [2]: ser[2] = 'potage'

In [3]: ser # dtype changed to 'object'!
Out[3]:
0 1
1 2
2 potage
dtype: object
```

Suggested behaviour:

```python
In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser[2] = 'potage' # raises!
---------------------------------------------------------------------------
TypeError: Invalid value 'potage' for dtype int64
```

## Motivation and Scope

Currently, pandas is extremely flexible in handling different dtypes.
However, this can potentially hide bugs, break user expectations, and unnecessarily copy data.

An example of it hiding a bug is:
```python
In [9]: ser = pd.Series(pd.date_range('2000', periods=3))

In [10]: ser[2] = '2000-01-04' # works, is converted to datetime64

In [11]: ser[2] = '2000-01-04x' # almost certainly a typo - but pandas doesn't error, it upcasts to object
```

The scope of this PDEP is limited to setitem-like operations which would operate inplace, such as:
- ``ser[0] == 2``;
- ``ser.fillna(0, inplace=True)``;
- ``ser.where(ser.isna(), 0, inplace=True)``

There may be more. What is explicitly excluded from this PDEP is any operation would have no change
of operating inplace to begin with, such as:
- ``ser.diff()``;
- ``pd.concat([ser, other])``;
- ``ser.mean()``.

These would keep being allowed to change Series' dtypes.

## Detailed description

Concretely, the suggestion is:
- if a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype;
- if a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise.

For a start, this would involve:

1. changing ``Block.setitem`` such that it doesn't have an ``except`` block in

```python
value = extract_array(value, extract_numpy=True)
try:
casted = np_can_hold_element(values.dtype, value)
except LossySetitemError:
# current dtype cannot store value, coerce to common dtype
nb = self.coerce_to_target_dtype(value)
return nb.setitem(indexer, value)
else:
```

2. making a similar change in ``Block.where``, ``EABlock.setitem``, ``EABlock.where``, and probably more places.

The above would already require several hundreds of tests to be adjusted.

### Ban upcasting altogether, or just upcasting to ``object``?

The trickiest part of this proposal concerns what to do when setting a float in an integer column:

```python
In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser[0] = 1.5
```

Possibly options could be:
1. just raise;
2. convert the float value to ``int``, preserve the Series' dtype;
3. upcast to ``float``, even if upcasting in setitem-like is banned for other conversions.

Let us compare with what other libraries do:
- ``numpy``: option 2
- ``cudf``: option 2
- ``polars``: option 2
- ``R data.frame``: option 3
- ``pandas`` (nullable dtype): option 1

If the objective of this PDEP is to prevent bugs, then option 2 is also not desirable:
someone might set ``1.5`` and later be surprised to learn that they actually set ``1``.

Option ``3`` would be inconsistent with the nullable dtypes' behaviour, would add complexity
to the codebase and to tests, and would be confusing to teach.

Option ``1`` is the maximally safe one in terms of protecting users from bugs, and would
also be consistent with the current behaviour of nullable dtypes. It would also be simple to teach:
"if you try to set an element of a ``Series`` to a new value, then that value must be compatible
with the Series' dtype, otherwise it will raise" is easy to understand. If we make an exception for
``int`` to ``float`` (and presumably also for ``interval[int]``, ``interval[float]``), then the rule
starts to become confusing.

## Usage and Impact

This would make pandas stricter, so there should not be any risk of introducing bugs. If anything, this would help prevent bugs.

Unfortunately, it would also risk annoy users who might have been intentionally upcasting.

Given that users can get around this as simply as with an ``.astype({'my_column': float})`` call,
I think it would be more beneficial to the community at large to err on the side of strictness.

## Timeline

Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0.

### PDEP History

- 23 December 2022: Initial draft

0 comments on commit 06fcb4c

Please sign in to comment.