-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CoW: Track references in unstack if there is no copy #57487
Conversation
might be worth back porting, but not necessary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this!
if isinstance(values, np.ndarray): | ||
base, new_base = values.base, new_values.base | ||
elif isinstance(values, NDArrayBackedExtensionArray): | ||
base, new_base = values._ndarray.base, new_values._ndarray.base | ||
else: | ||
base, new_base = 1, 2 # type: ignore[assignment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure this base checking always works? (compared to doing something like np.shares_memory
)
From some quick tests, it seems that even if you chain multiple no-copy transformations (like we do with reshape().swapaxes().reshape()
), the base object stays the same.
If the base object is backed by something non-numpy (eg parrow), this doesn't seem to be the case, but, I think in a DataFrame we probably always end up with an array with a base object that is another numpy array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we reshape into a 3dim array in between, so anything that is 1d copies there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to come up with a (contrived) example where the base might be different:
In [58]: parr = pa.array(range(100))
In [59]: arr = np.asarray(parr).base
In [60]: arr
Out[60]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99]])
In [61]: arr.base is parr
Out[61]: True
In [62]: arr.reshape(10, 10).base is arr.base
Out[62]: False
So in the above after reshaping the base is not equal to the calling array's base.
However, when putting this array in a DataFrame and getting _values
, it seems we still get the array wrapped in another array, so the direct base is again a numpy array:
In [63]: df = DataFrame(arr, copy=False)
In [64]: df._values
Out[64]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99]])
In [65]: df._values.base
Out[65]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99]])
In [66]: df._values.base.base
Out[66]:
<pyarrow.lib.Int64Array object at 0x7f325ca29ae0>
[
0,
...
99
]
pandas/core/reshape/reshape.py
Outdated
mgr = result._mgr | ||
mgr.blocks[0].refs = obj._mgr.blocks[0].refs | ||
mgr.blocks[0].refs.add_reference(mgr.blocks[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this use result._mgr.add_references(obj._mgr)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah thanks, I was looking for this one but couldn't find it (was looking in the wrong place..)
this one should be good to merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @phofl |
* CoW: Track references in unstack if there is no copy * Update * Update * Update --------- Co-authored-by: Matthew Roeschke <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.cc @jorisvandenbossche @rhshadrach
@rhshadrach was correct, we can get there without making a copy. it's a special case, but nevertheless