GH-41098: [Python] Add copy keyword in Array.__array__ for numpy 2.0+ compatibility #41071

Merged
merged 8 commits into from
Apr 15, 2024
15 changes: 14 additions & 1 deletion python/pyarrow/array.pxi
@@ -1543,7 +1543,20 @@ cdef class Array(_PandasConvertible):
     def _to_pandas(self, options, types_mapper=None, **kwargs):
         return _array_like_to_pandas(self, options, types_mapper=types_mapper)

-    def __array__(self, dtype=None):
+    def __array__(self, dtype=None, copy=None):
+        # TODO honor the copy=True case
Member

Can we raise an exception for now? Also, can you open a GH issue for it?

Member Author

> Can we raise an exception for now?

I don't think we should do that. Once numpy starts passing the keyword down (say, in numpy 2.1), np.array(obj) would start raising in cases that never raised before. Not raising means that some cases might incorrectly not return a copy (although the result is still marked as read-only), but many cases already copy anyway.

That said, I don't really know what numpy's strategy for enabling this keyword will be. If they simply stop copying the result of __array__, that would cause the same kind of change in many libraries that implement __array__.

Member Author

Though ideally, we would just implement copy=True directly as well (it was less critical for numpy 2.0 because it is not yet used, and it is also a bit harder to implement).
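
The three values of the `copy` keyword discussed in this thread can be summarized as a small decision table. The sketch below is pure Python and only illustrates the intended semantics. `plan_conversion` and its `zero_copy_possible` flag are hypothetical names, not part of pyarrow or numpy, and the `copy=True` branch shows the strict behavior that the TODO in this diff has not implemented yet.

```python
def plan_conversion(copy, zero_copy_possible):
    """Decide how to satisfy a `copy` request (hypothetical helper).

    Returns "view" or "copy", or raises ValueError when copy=False
    cannot be honored.
    """
    if copy is False:
        if not zero_copy_possible:
            raise ValueError(
                "Unable to avoid a copy while creating a numpy array as requested."
            )
        return "view"
    if copy is True:
        # Strict reading: always hand back fresh memory.
        return "copy"
    # copy=None: copy only when it cannot be avoided.
    return "view" if zero_copy_possible else "copy"
```

Under this reading, raising for `copy=True` is never necessary, which matches the objection above: only `copy=False` can be impossible to satisfy.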

Member Author

> and also a bit harder to implement

On second thought, it might be relatively easy to determine if to_numpy() returned a copy or not? I was first thinking we would have to mimic the logic based on the type ("if primitive and no nulls, then it is zero copy"), but we might be able to check if the resulting ndarray has a base pointing to a pyarrow object?

Although the simple logic of numeric+no nulls might be easier in practice.
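
The `base` check mentioned above can be tried on plain numpy arrays. This sketch shows only how `ndarray.base` distinguishes a view from an array that owns its memory. Whether the result of `to_numpy()` exposes a pyarrow object as its `base` is the open question in this comment and is not verified here. The helper name `is_view` is made up for illustration.

```python
import numpy as np


def is_view(arr):
    """Return True if `arr` borrows memory from another object (hypothetical helper)."""
    return arr.base is not None


source = np.arange(5)
window = source[1:4]         # slicing gives a view onto `source`
fresh = source[1:4].copy()   # .copy() allocates new memory

print(is_view(window), window.base is source)
print(is_view(fresh))
```

The type-based rule ("numeric and no nulls") avoids inspecting the result at all, at the cost of duplicating knowledge about which conversions are zero-copy.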

+        if copy is False:
+            try:
+                values = self.to_numpy(zero_copy_only=True)
+            except ArrowInvalid as exc:
+                raise ArrowInvalid(
+                    "Unable to avoid a copy while creating a numpy array as requested.\n"
+                    "If using `np.array(obj, copy=False)` replace it with "
+                    "`np.asarray(obj)` to allow a copy when needed"
+                )
+            # values is already a numpy array at this point, but calling np.array(..)
+            # again to handle the `dtype` keyword with a no-copy guarantee
+            return np.array(values, dtype=dtype, copy=False)
         values = self.to_numpy(zero_copy_only=False)
         if dtype is None:
             return values
6 changes: 4 additions & 2 deletions python/pyarrow/table.pxi
@@ -525,7 +525,8 @@ cdef class ChunkedArray(_PandasConvertible):

         return values

-    def __array__(self, dtype=None):
+    def __array__(self, dtype=None, copy=None):
+        # copy keyword can be ignored because to_numpy() already returns a copy
Member

Well, copy=False should then raise an error, no?

Member Author

Ah yes, that comment is from before I decided to handle the copy=False case for Array. Indeed, we should just raise an error here saying that a no-copy conversion is not possible.

Updated.
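
For a container that can never hand out its memory without copying, the approach agreed on here (raise when `copy=False` is requested) might look like the following. `AlwaysCopies` is a hypothetical stand-in class backed by a Python list, not pyarrow's `ChunkedArray`, and the error message is borrowed from the `Array.__array__` change in this PR.

```python
import numpy as np


class AlwaysCopies:
    """Hypothetical container whose data must be copied to become an ndarray."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        if copy is False:
            # A strict no-copy request cannot be satisfied for this container.
            raise ValueError(
                "Unable to avoid a copy while creating a numpy array as requested."
            )
        # copy=None and copy=True are both satisfied by building a new array.
        return np.array(self._values, dtype=dtype)


obj = AlwaysCopies([1, 2, 3])
result = np.asarray(obj)  # allowed to copy, so this succeeds
```

Calling `np.array(obj, copy=False)` reaches the `ValueError` branch only on numpy versions that forward the keyword to `__array__`. According to the test added in this PR, that is numpy 2.0 and later.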

         values = self.to_numpy()
         if dtype is None:
             return values
@@ -1533,7 +1534,8 @@ cdef class _Tabular(_PandasConvertible):
         raise TypeError(f"Do not call {self.__class__.__name__}'s constructor directly, use "
                         f"one of the `{self.__class__.__name__}.from_*` functions instead.")

-    def __array__(self, dtype=None):
+    def __array__(self, dtype=None, copy=None):
+        # copy keyword can be ignored as this always already returns a copy
Member

Same here?

         column_arrays = [
             np.asarray(self.column(i), dtype=dtype) for i in range(self.num_columns)
         ]
38 changes: 38 additions & 0 deletions python/pyarrow/tests/test_array.py
@@ -31,6 +31,7 @@

 import pyarrow as pa
 import pyarrow.tests.strategies as past
+from pyarrow.vendored.version import Version


 def test_total_bytes_allocated():
@@ -3301,6 +3302,43 @@ def test_array_from_large_pyints():
         pa.array([int(2 ** 63)])


+def test_numpy_array_protocol():
+    # test the __array__ method on pyarrow.Array
+    arr = pa.array([1, 2, 3])
+    result = np.asarray(arr)
+    expected = np.array([1, 2, 3], dtype="int64")
+    np.testing.assert_array_equal(result, expected)
+
+    # this should not raise a deprecation warning with numpy 2.0+
+    result = np.array(arr, copy=False)
+    np.testing.assert_array_equal(result, expected)
+
+    result = np.array(arr, dtype="int64", copy=False)
+    np.testing.assert_array_equal(result, expected)
+
+    # no zero-copy is possible
+    arr = pa.array([1, 2, None])
+    expected = np.array([1, 2, np.nan], dtype="float64")
+    result = np.asarray(arr)
+    np.testing.assert_array_equal(result, expected)
+
+    if Version(np.__version__) < Version("2.0"):
+        # copy keyword is not strict and not passed down to __array__
+        result = np.array(arr, copy=False)
+        np.testing.assert_array_equal(result, expected)
+
+        result = np.array(arr, dtype="float64", copy=False)
+        np.testing.assert_array_equal(result, expected)
+    else:
+        # starting with numpy 2.0, the copy=False keyword is assumed to be strict
+        with pytest.raises(ValueError, match="Unable to avoid a copy"):
+            np.array(arr, copy=False)
+
+        arr = pa.array([1, 2, 3])
+        with pytest.raises(ValueError):
+            np.array(arr, dtype="float64", copy=False)
+
+
 def test_array_protocol():

     class MyArray: