Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add copy-on-write to DataFrame.drop #49689

Merged
merged 4 commits into from
Nov 16, 2022

Conversation

seljaks
Copy link
Contributor

@seljaks seljaks commented Nov 14, 2022

Progress towards #49473
Add copy-on-write support to df.drop.

Not sure if the test is checking the correct behavior. It was passing even before I set copy=None, but BlockManager.reindex_indexer() has copy=True by default. So my assumption is that there's an error in the test itself.

Also test only tries dropping a column. Wasn't sure how to write a test for dropping a row since there's no get_array equivalent for that.

@jorisvandenbossche Please advise when you have a moment.

@seljaks seljaks changed the title Add cow to df drop ENH: Add copy-on-write to DataFrame.drop Nov 14, 2022
@jorisvandenbossche
Copy link
Member

Also test only tries dropping a column. Wasn't sure how to write a test for dropping a row since there's no get_array equivalent for that.

Rows generally can't be dropped without making a copy, so it's OK to focus here on dropping columns.

@seljaks
Copy link
Contributor Author

seljaks commented Nov 14, 2022

Still confused what happening internally. For example: df.reindex also uses BlockManager.reindex_indexer and I've noticed that the test for it still passes even if you pass in copy=True:

def test_reindex_columns(using_copy_on_write):
    # Case: reindexing the column returns a new dataframe
    # + afterwards modifying the result
    df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})
    df_orig = df.copy()
    df2 = df.reindex(columns=["a", "c"], copy=True)   #<- This should return a copy even if using CoW?

    if using_copy_on_write:
        # still shares memory (df2 is a shallow copy)
        assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
    else:
        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
    # mutating df2 triggers a copy-on-write for that column
    df2.iloc[0, 0] = 0
    assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
    if using_copy_on_write:
        assert np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
    tm.assert_frame_equal(df, df_orig)

The copy parameter is passed down to BlockManager.reindex_indexer but seems to have no effect on the outcome of the test.
@jorisvandenbossche is this expected behavior?

@jorisvandenbossche
Copy link
Member

The copy parameter is passed down to BlockManager.reindex_indexer but seems to have no effect on the outcome of the test.

Indeed, looking a bit in more detail at the current drop / BlockManager.reindex_indexer implementation, that seems to be the case. Basically, the copy keyword that is being forwarded to BlockManager.reindex_indexer is not really used in that method: it's only used for the special case when indexer is None (and so when we don't actually need to reindex actual values, just returning the same data with potentially a new index):

if indexer is None:
if new_axis is self.axes[axis] and not copy:
return self
result = self.copy(deep=copy)
result.axes = list(self.axes)
result.axes[axis] = new_axis
return result

For the rest of the BlockManager.reindex_indexer implementation, copy isn't used anymore. It already assumes that the operation simply always makes a copy in the default behaviour, and so CoW is also already implemented here for the case of reindexing (or dropping, in this case) columns, which is done in _slice_take_blocks_ax0.

So to summarize: in hindsight this already worked the way we want it (so that your test was already passing is expected), but it is of course still good to explicitly test this!

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 15, 2022

I've noticed that the test for it still passes even if you pass in copy=True:

That's a good question, and something we have to decide in general what we want to do with those methods that already have a copy argument. I think that copy=True, if passed manually by the user, probably should keep meaning that we make an eager copy (so not a lazy CoW copy). And assuming we make copy=None the default (which basically means copy=False when copy_on_write option is enabled), there should also not be much reason for the user to explicitly do this, though (since behaviour wise it doesn't matter, it will just impact when a copy is being made under the hood).

If we want to support that here, this would require some changes in _slice_take_blocks_ax0, though (which is a quite complex method ..). I don't think this is necessarily a high priority to do, and could also be done as a separate PR (but certainly welcome to take a look at it)

(and to be clear, this is relevant for the reindex method, but not for drop)

UPDATE: opened a dedicated issue for this: #50535

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Since the test is related to the reindex one, could you move it to just below test_reindex_columns?

@@ -52,6 +52,24 @@ def test_copy_shallow(using_copy_on_write):
# DataFrame methods returning new DataFrame using shallow copy


def test_drop_on_column(using_copy_on_write):
df = DataFrame(
{"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]}, index=[10, 11, 12]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]}, index=[10, 11, 12]
{"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]}

(I used a custom index for the reset_index test, but for the drop test, a custom index shouldn't matter)

@seljaks
Copy link
Contributor Author

seljaks commented Nov 15, 2022

I've moved the test and deleted the index. Also added an explicit check that an eager copy is made when not using CoW.

Thank you for taking the time to give an in-depth reply and for your guidance in general. The internals are a bit clearer to me now. Will probably try implementing CoW on a few other methods and then maybe diving into _slice_take_blocks_ax0, but it's logic seems quite daunting.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

@jorisvandenbossche jorisvandenbossche merged commit b3db4b9 into pandas-dev:main Nov 16, 2022
MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this pull request Nov 18, 2022
mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants