
CoW: Track references in unstack if there is no copy #57487

Merged: 7 commits into pandas-dev:main on Apr 1, 2024

Conversation

@phofl (Member) commented on Feb 18, 2024:

cc @jorisvandenbossche @rhshadrach

@rhshadrach was correct: we can get there without making a copy. It's a special case, but worth handling nevertheless.

@phofl (Member, Author) commented on Feb 18, 2024:

Might be worth backporting, but not necessary.

@jorisvandenbossche (Member) left a review:

Thanks for looking into this!

Comment on lines +248 to +253
if isinstance(values, np.ndarray):
base, new_base = values.base, new_values.base
elif isinstance(values, NDArrayBackedExtensionArray):
base, new_base = values._ndarray.base, new_values._ndarray.base
else:
base, new_base = 1, 2 # type: ignore[assignment]
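(Illustrative sketch, not part of the diff: the hunk above compares .base attributes to decide whether the reshape pipeline returned a view of the original values. A minimal numpy-only analogue, with the names and shapes chosen purely for illustration:)

import numpy as np

values = np.arange(8).reshape(4, 2)      # stand-in for the values being unstacked
# A reshape/swapaxes/reshape chain in which one axis has length 1 stays a view:
new_values = values.reshape(4, 1, 2).swapaxes(1, 2).reshape(4, 2)

base, new_base = values.base, new_values.base
print(base is new_base)  # True -> no copy was made, so references must be tracked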
@jorisvandenbossche (Member) commented on these lines:

Are we sure this base checking always works (compared to doing something like np.shares_memory)? From some quick tests, it seems that even if you chain multiple no-copy transformations (like we do with reshape().swapaxes().reshape()), the base object stays the same.

If the base object is backed by something non-numpy (e.g. pyarrow), this doesn't seem to be the case, but I think that in a DataFrame we probably always end up with an array whose base object is another numpy array.
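(For illustration only, a small numpy sketch of the behaviour described here; it is not taken from the PR:)

import numpy as np

a = np.arange(12)

# Chained no-copy transformations: numpy collapses chains of views,
# so the base keeps pointing at the original array.
view = a.reshape(3, 4).swapaxes(0, 1)
print(view.base is a)               # True

# A chain whose final reshape has to copy gets a fresh base instead.
copied = a.reshape(3, 4).swapaxes(0, 1).reshape(12)
print(copied.base is a)             # False
print(np.shares_memory(copied, a))  # False; shares_memory answers the same question directly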

@phofl (Member, Author) replied:

We reshape into a 3-dimensional array in between, so anything that is 1D gets copied there.

@jorisvandenbossche (Member) replied:

I was trying to come up with a (contrived) example where the base might be different:

In [58]: parr = pa.array(range(100))

In [59]: arr = np.asarray(parr).base

In [60]: arr
Out[60]: 
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
        96, 97, 98, 99]])

In [61]: arr.base is parr
Out[61]: True

In [62]: arr.reshape(10, 10).base is arr.base
Out[62]: False

So in the above, after reshaping, the base is not equal to the calling array's base.

However, when putting this array in a DataFrame and getting _values, it seems we still get the array wrapped in another array, so the direct base is again a numpy array:

In [63]: df = DataFrame(arr, copy=False)

In [64]: df._values
Out[64]: 
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
        96, 97, 98, 99]])

In [65]: df._values.base
Out[65]: 
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
        96, 97, 98, 99]])

In [66]: df._values.base.base
Out[66]: 
<pyarrow.lib.Int64Array object at 0x7f325ca29ae0>
[
  0,
  ...
  99
]

Comment on lines 256 to 258
mgr = result._mgr
mgr.blocks[0].refs = obj._mgr.blocks[0].refs
mgr.blocks[0].refs.add_reference(mgr.blocks[0])
@jorisvandenbossche (Member) commented on these lines:

Can this use result._mgr.add_references(obj._mgr)?

@phofl (Member, Author) replied:

Oh yeah, thanks. I was looking for this one but couldn't find it (I was looking in the wrong place).
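(In other words, the quoted hunk should collapse to something like the following sketch; add_references is a pandas-internal BlockManager method, so the exact call is taken from the suggestion above rather than from public API docs:)

result._mgr.add_references(obj._mgr)  # replaces the manual refs copy + add_reference above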

@phofl (Member, Author) commented on Mar 16, 2024:

This one should be good to merge.

@mroeschke added this to the 3.0 milestone on Mar 20, 2024, and added the "Reshaping" (Concat, Merge/Join, Stack/Unstack, Explode) and "Copy / view semantics" labels.
@mroeschke (Member) left a review:

LGTM

@mroeschke merged commit 07f6c4d into pandas-dev:main on Apr 1, 2024 (52 checks passed).
@mroeschke (Member) commented:

Thanks @phofl

@phofl deleted the cow_unstack branch on April 2, 2024 at 16:18.
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request on May 7, 2024:
* CoW: Track references in unstack if there is no copy

* Update

* Update

* Update

---------

Co-authored-by: Matthew Roeschke <[email protected]>