CoW: __array__ not recognizing ea dtypes #51966
Conversation
pandas/core/generic.py (Outdated)
arr.flags.writeable = False
if arr is values and using_copy_on_write() and self._mgr.is_single_block:
    # Check if self._values coerced data
    if not is_1d_only_ea_dtype(self.dtypes.iloc[0]):
Is it possible to not have a first column? And do we never have any block in the case of a DataFrame with zero columns?
Also, does is_1d_only_ea_dtype guarantee that the data will always have been coerced? I suppose a general EA can give you a view when converting it to a numpy array (for .values it only needs a reshape to 2D, but that's still a view).
For example:
In [65]: df = pd.DataFrame({'a': pd.array(['a', 'b'], dtype="string")})
In [66]: np.shares_memory(df.values, df['a'].array._ndarray)
Out[66]: True
Ah, thanks very much. I was too focused on integers, where we are currently coercing to object, which causes a copy. With dtypes where converting to object does not cause a copy, this of course still shares memory. Adjusted the check.
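For reference, a minimal illustration of that difference (not taken from the PR; the private _data / _ndarray attributes are used here only to inspect the backing arrays):

import numpy as np
import pandas as pd

# numeric masked EA: coercion to object builds new Python objects -> copy
df_int = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64")})
print(np.shares_memory(df_int.values, df_int["a"].array._data))     # False

# string EA (python storage): already object-backed -> no copy, memory is shared
df_str = pd.DataFrame({"a": pd.array(["a", "b"], dtype="string")})
print(np.shares_memory(df_str.values, df_str["a"].array._ndarray))  # True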
If we have zero columns, then we have an empty array, correct? We can only modify an empty array by enlarging it, which would cause a copy anyway. Am I missing something?
Yes, but I was worrying about whether something like df_empty.to_numpy() could run into an IndexError (from self.dtypes.iloc[0] if self.dtypes has length 0).
Ah, I misunderstood you yesterday then. By empty you mean no columns, otherwise iloc would work, correct? In this case the is_single_block check fails and hence we don't get there. I'll add a test for this though, because we could easily cause a regression if we are not careful.
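A hedged sketch of what such a regression test could look like (the test name and assertion are illustrative, not the actual test added in this PR):

import numpy as np
import pandas as pd

def test_array_on_dataframe_without_columns():
    # zero columns: the manager has no blocks, so the is_single_block branch
    # is never reached and self.dtypes.iloc[0] must not be evaluated
    df = pd.DataFrame(index=range(3))
    arr = np.asarray(df)          # should not raise an IndexError
    assert arr.shape == (3, 0)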
In this case the is_single_block check fails and hence we don't get there
Yes, but I was also wondering whether it would be possible to have a zero-column DataFrame with one block (i.e. whether it's possible to have a block with an axis of size 0).
It seems to be theoretically possible, by constructing this manually:
In [82]: block = pd.core.internals.make_block(np.zeros((0, 10)), np.array([]))
In [86]: mgr = pd.core.internals.BlockManager([block], [pd.Index([]), pd.Index(range(10))])
In [88]: df = pd.DataFrame(mgr)
In [89]: df
Out[89]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [90]: df._mgr.is_single_block
Out[90]: True
But I don't know if there is any way you could get that through actual pandas operations.
I don't think so. It would be a bug if you could do this, right?
pandas/core/generic.py (Outdated)
if arr is values and using_copy_on_write() and self._mgr.is_single_block:
    # Check if self._values coerced data
    if not is_1d_only_ea_dtype(self.dtypes.iloc[0]) or not is_numeric_dtype(
        self.dtypes.iloc[0]
Why the or not is_numeric_dtype check? Can you expand the comment a bit more to explain those checks? (From just reading the code, I find it hard to reason about, also with the "not .. or not ..".)
Something else: could we also use astype_is_view here? (Like is done below.)
We have 2 different cases here:
- NumPy dtypes -> these are the ones caught by the not is_1d_only_ea_dtype part of the check
- EA dtypes -> they get coerced to object by self._values above, so we only have to catch the cases where the coercion to object does not trigger a copy; e.g. all numeric dtypes were copied already.
astype_is_view would work once we get rid of the conversion to object in the middle. I'll clarify the comment though.
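Put differently, the condition in the diff above can be read roughly as the sketch below (a simplified restatement of the reasoning, not the exact code from the PR):

from pandas.api.types import is_numeric_dtype
from pandas.core.dtypes.common import is_1d_only_ea_dtype  # internal helper

def values_may_share_memory(dtype) -> bool:
    # dtype is the dtype of the single block backing the DataFrame
    if not is_1d_only_ea_dtype(dtype):
        # case 1: plain NumPy dtype -> .values is a view into the block
        return True
    if not is_numeric_dtype(dtype):
        # case 2: non-numeric EA (e.g. string) -> conversion to object can be zero-copy
        return True
    # numeric EA -> the coercion to object above already made a copy
    return False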
So for the EA dtypes, you rely on the assumption that numeric dtypes were cast to object. But in theory, someone could implement their own EA which is indicated to be "numeric" but does not do this conversion to object dtype.
Now, in general I think we are lacking part of the story around having proper information about copies/views with generic EAs (that was also the case when doing astype_is_view).
That's actually what our numeric arrow dtypes do, but since those rely on pyarrow to convert themselves to numpy arrays, they are already set to read-only in case they are a view:
In [91]: arr = pd.array([1, 2], dtype=pd.ArrowDtype(pa.int64()))
In [92]: np_arr = np.asarray(arr)
In [93]: np_arr
Out[93]: array([1, 2])
In [94]: np_arr[0] = 100
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [94], in <cell line: 1>()
----> 1 np_arr[0] = 100
ValueError: assignment destination is read-only
Fair point, there is actually a more elegant solution: we can check whether both steps can be done without copying via astype_is_view.
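A hedged sketch of that idea (astype_is_view is a pandas-internal helper, so the import path and the exact call sites here are assumptions rather than the final implementation):

from pandas.core.dtypes.astype import astype_is_view  # internal helper

def result_is_view(block_dtype, values, arr) -> bool:
    # step 1: block dtype -> whatever self._values returned
    # step 2: that intermediate dtype -> the ndarray handed back to the user
    return astype_is_view(block_dtype, values.dtype) and astype_is_view(
        values.dtype, arr.dtype
    )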
Yeah, I ran into this as well (that they set themselves to read-only).
# Conflicts:
#   pandas/tests/copy_view/test_array.py
@jorisvandenbossche this would be nice to get into 2.0
Sorry for the slow reply here, looks good to me, I just found one more failure case ;)
pandas/core/generic.py (Outdated)
# TODO(CoW) also properly handle extension dtypes
arr = arr.view()
arr.flags.writeable = False
if arr is values and using_copy_on_write() and self._mgr.is_single_block:
The arr is values check might prevent catching some EA cases, since it seems that _values can still return an EA (so the EA->ndarray conversion only happens in arr = np.asarray(values, ...)).
Example (running with this PR):
In [1]: pd.options.mode.copy_on_write = True
In [2]: df = pd.DataFrame({"a": pd.date_range("2012-01-01", periods=3)})
In [3]: arr = np.asarray(df)
In [4]: arr.flags.writeable
Out[4]: True
In [5]: arr[0] = 0
In [6]: df
Out[6]:
a
0 1970-01-01
1 2012-01-02
2 2012-01-03
For Series you left out this check, so maybe that can be done here as well?
good point, thx
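For context, a small illustration of why the arr is values guard misses this case (assuming a datetime64 column, where _values returns a DatetimeArray whose ndarray conversion is zero-copy; _ndarray is used only for inspection):

import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range("2012-01-01", periods=3))
ea = ser.array                   # DatetimeArray, i.e. not an ndarray
arr = np.asarray(ea)             # the EA->ndarray conversion happens here, as a view
print(arr is ea)                 # False -> an "arr is values" check bails out
print(np.shares_memory(arr, ea._ndarray))  # True -> but memory is still shared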
arr = np.asarray(df)
if using_copy_on_write:
    # TODO(CoW): This should be True
Not this one, because without specifying dtype="int64" we create an object dtype array?
Yes, exactly: this triggers a copy, and hence the array should be writeable.
Merging to get into 2.0
… dtypes) (#52358) Backport PR #51966: CoW: __array__ not recognizing ea dtypes
Co-authored-by: Patrick Hoefler <[email protected]>