ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473
Comments
This PR 49557 is an attempt to handle the |
Hi, I'll take a look at |
Took a look at |
Found an inconsistency while looking at
import pandas as pd
pd.options.mode.copy_on_write = True
df = pd.DataFrame({"a": [1, 2], "b": [0.4, 0.5]})
df2 = df.iloc[slice(None), slice(None)]  # or [:, :]
print("before:", df, df2, sep="\n")
df2.iloc[0, 0] = 0  # does not trigger copy_on_write
print("after:", df, df2, sep="\n")  # both df and df2 are 0 at [0, 0]
@jorisvandenbossche This doesn't seem like the desired result. Should I open a separate issue for this or submit a PR linking here when I have a fix? Sidenote: not sure how to test |
Specifically for A PR to just add tests to confirm that right now they use CoW is certainly welcome. |
Sorry for the slow reply here. That's indeed an inconsistency in So when that is merged, the
I think for the For In numpy,
but an actual scalar is not:
In the case of pandas,
which I think means that the |
I will take a look at |
It looks like reindex is already handled for index as well (if possible). Would be good to double check so that I did not miss anything |
I will take a look at Edit: @phofl has already covered |
Hm forgot to link, already opened a PR, will go ahead and link all of them now Edit: Done |
I removed |
With the Copy-on-Write implementation (see #36195 / proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit, and overview follow up issue #48998), we can avoid doing an actual copy of the data in DataFrame and Series methods that typically return a copy / new object.
A typical example is the following:
By default, the rename() method returns a new object (DataFrame) with a copy of the data of the original DataFrame (and thus, mutating values in df2 never mutates df). With CoW enabled (pd.options.mode.copy_on_write = True), we can still return a new object, but now pointing to the same data under the hood (avoiding an initial copy), while preserving the observed behaviour of df2 being a copy / not mutating df when df2 is mutated (through the CoW mechanism, only copying the data in df2 when actually needed upon mutation, i.e. a delayed or lazy copy).
The way this is done in practice for a method like rename() or reset_index() is by using the fact that copy(deep=None) will mean a true deep copy (current default behaviour) if CoW is not enabled, and this "lazy" copy when CoW is enabled. For example:
pandas/pandas/core/frame.py, lines 6246 to 6249 in 7bf8d6b
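The user-visible behaviour described above can be sketched as follows. This is only an illustration, not the frame.py lines referenced; the frame contents and the variable names (shared_before, shared_after, original_value) are assumptions made for the example, and it assumes a pandas version where the mode.copy_on_write option exists (newer versions may emit a deprecation warning when setting it).

```python
import numpy as np
import pandas as pd

# Enable CoW only for this block (assumption: the option exists in the
# installed pandas; recent versions always behave as if it were True).
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

    # rename() returns a new object, but no data is copied yet.
    df2 = df.rename(columns=str.upper)
    shared_before = np.shares_memory(df["a"].to_numpy(), df2["A"].to_numpy())

    # The first write to df2 triggers the delayed (lazy) copy.
    df2.iloc[0, 0] = 100
    shared_after = np.shares_memory(df["a"].to_numpy(), df2["A"].to_numpy())

    # df still holds its original value.
    original_value = df.iloc[0, 0]

print(shared_before, shared_after, original_value)
```

The point of checking np.shares_memory before and after the write is that the copy is deferred rather than avoided: df2 still behaves like an independent copy as soon as it is modified.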
The initial CoW implementation in #46958 only added this logic to a few methods (to ensure this mechanism was working): rename, reset_index, reindex (when reindexing the columns), select_dtypes, to_frame and copy itself.
But there are more methods that can make use of this mechanism, and this issue is meant to serve as the overview issue to summarize and keep track of the progress on this front.
There is a class of methods that perform an actual operation on the data and return newly calculated data (eg typically reductions or the methods wrapping binary operators) that don't have to be considered here. It's only methods that can (potentially, in certain cases) return the original data that could make use of this optimization.
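A small sketch of that distinction, under the same assumption that the mode.copy_on_write option is available: a binary operator has to compute new values, so there is nothing to share, while a method that only changes labels can hand back the original data lazily. The names computed and relabelled are illustrative.

```python
import numpy as np
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3]})

    # Binary operator: the result holds newly calculated values.
    computed = df + 1

    # Only the index is replaced; the column values are unchanged.
    relabelled = df.reset_index(drop=True)

    computed_shares = np.shares_memory(
        df["a"].to_numpy(), computed["a"].to_numpy()
    )
    relabelled_shares = np.shares_memory(
        df["a"].to_numpy(), relabelled["a"].to_numpy()
    )

print(computed_shares, relabelled_shares)
```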
Series / DataFrame methods to update (I added a ? for the ones I wasn't directly sure about, have to look into what those exactly do to be sure, but left them here to keep track of those, can remove from the list once we know more):
add_prefix / add_suffix -> TST/CoW: copy-on-write tests for add_prefix and add_suffix #49991
align -> ENH: Add lazy copy to align #50432
asfreq -> ENH: Add test for asfreq CoW when doing noop #50916
assign -> ENH/TST: expand copy-on-write to assign() method #50010
astype -> ENH: Add lazy copy to astype #50802
between_time -> ENH: Add lazy copy for take and between_time #50476
bfill / backfill -> ENH: Add CoW optimization to interpolate #51249
clip -> TST: Add tests for clip with CoW #51492
convert_dtypes -> ENH: Implement CoW for convert_dtypes #51265
copy (tackled in initial implementation in #46958)
drop -> ENH: Add copy-on-write to DataFrame.drop #49689
drop_duplicates (in case no duplicates are dropped) -> ENH: Add lazy copy for drop duplicates #50431
droplevel -> ENH: test CoW for drop_level #50552
dropna -> ENH: Use lazy copy for dropna #50429
eval -> ENH / CoW: Add lazy copy to eval #53746
ffill / pad -> ENH: Add CoW optimization to interpolate #51249
fillna -> ENH: Add CoW optimization for fillna #51279
filter -> TST: Copy on Write for filter #50589
get -> TST: add CoW tests for xs() and get() #51292
head -> TST/CoW: copy-on-write tests for df.head and df.tail #49963
infer_objects -> ENH: Use lazy copy in infer objects #50428
insert ?
interpolate -> ENH: Add CoW optimization to interpolate #51249
isetitem -> TST: CoW with df.isetitem() #50692
items -> TST: Test CoW with DataFrame.items() #50595
iterrows ? -> CoW: Ensure that iterrows does not allow mutating parent #51271
join / merge -> ENH: enable lazy copy in merge() for CoW #51297
mask -> ENH: Add lazy copy to where #51336 where, but could use an independent test -> TST / CoW: Add test for mask #53745
pipe -> ENH: Add lazy copy to pipe #50567
pop -> TST: Add test for CoW in pop #50569
reindex
reindex_like -> ENH: Use cow for reindex_like #50426
rename (tackled in initial implementation in #46958)
rename_axis -> ENH: add lazy copy (CoW) mechanism to rename_axis #50415
reorder_levels -> ENH: add copy on write for df reorder_levels GH49473 #50016
replace -> ENH: Add lazy copy to replace #50746
reset_index (tackled in initial implementation in #46958)
round (for columns that are not rounded) -> ENH: Add lazy copy to concat and round #50501
select_dtypes (tackled in initial implementation in #46958)
set_axis -> ENH/CoW: use lazy copy in set_axis method #49600
set_flags -> TST: Test cow for set_flags #50489
set_index -> ENH/CoW: use lazy copy in set_index method #49557
shift -> ENH: Add lazy copy to shift #50753
sort_index / sort_values (optimization if nothing needs to be sorted)
    sort_index -> ENH: Add lazy copy for sort_index #50491
    sort_values -> ENH: Add lazy copy for sort_values #50643
squeeze -> TST: Test squeeze with CoW #50590
style. (phofl: I don't think there is anything to do here)
swapaxes -> ENH: Add lazy copy for swapaxes no op #50573
swaplevel -> ENH: Add lazy copy to swaplevel #50478
T / transpose -> BUG: transpose not respecting CoW #51430
tail -> TST/CoW: copy-on-write tests for df.head and df.tail #49963
take (optimization if everything is taken?) -> ENH: Add lazy copy for take and between_time #50476
to_timestamp / to_period -> ENH: Add lazy copy to to_timestamp and to_period #50575
transform -> BUG / CoW: Series.transform not respecting CoW #53747
truncate -> ENH: Add lazy copy for truncate #50477
tz_convert / tz_localize -> ENH: Add lazy copy for tz_convert and tz_localize #50490
unstack (in optimized case where each column is a slice?)
update -> TST: add CoW test for update() #51426
where -> ENH: Add lazy copy to where #51336
xs -> TST: add CoW tests for xs() and get() #51292
Series.to_frame() (tackled in initial implementation in #46958)
Top-level functions:
pd.concat -> ENH: Add lazy copy to concat and round #50501
pd.merge et al? -> ENH: enable lazy copy in merge() for CoW #51297, ENH: Avoid copy when possible in merge #51327 join
Want to contribute to this issue?
Pull requests tackling one of the bullet points above are certainly welcome!
Update the method to use a lazy copy: often this means calling copy(deep=None) somewhere, but for some methods it will be more involved.
Add a test in /pandas/tests/copy_view/test_methods.py (you can mimic one of the existing ones, e.g. test_select_dtypes).
Run PANDAS_COPY_ON_WRITE=1 pytest pandas/tests/copy_view/test_methods.py to test it with CoW enabled (pandas will check that environment variable). The test needs to pass with both CoW disabled and enabled.
There is a using_copy_on_write fixture that can be used within the test function to test different expected results depending on whether CoW is enabled or not.
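As a rough, self-contained sketch of what such a test might look like (this is not copied from test_methods.py): the function name check_select_dtypes is hypothetical, and the plain boolean argument stands in for the using_copy_on_write fixture so the example can run outside the pandas test suite. It only uses public API (np.shares_memory, pandas.testing.assert_frame_equal) and assumes the mode.copy_on_write option exists in the installed pandas.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def check_select_dtypes(using_copy_on_write: bool) -> None:
    """Hypothetical stand-in for a test that receives the CoW fixture."""
    df = pd.DataFrame(
        {"a": np.array([1, 2, 3], dtype="int64"), "c": [0.1, 0.2, 0.3]}
    )
    df_orig = df.copy()

    df2 = df.select_dtypes(include="int64")

    if using_copy_on_write:
        # Lazy copy: the result still points at the parent's data.
        assert np.shares_memory(df2["a"].to_numpy(), df["a"].to_numpy())

    # Mutating the result must never leak into the parent.
    df2.iloc[0, 0] = 0
    assert not np.shares_memory(df2["a"].to_numpy(), df["a"].to_numpy())
    assert_frame_equal(df, df_orig)


# Here the check is run with CoW switched on; in the pandas test suite the
# fixture supplies the active mode instead.
with pd.option_context("mode.copy_on_write", True):
    check_select_dtypes(using_copy_on_write=True)
    result = "passed"

print(result)
```

The branch on using_copy_on_write holds the only expectation that differs between the two modes; the final comparison against df_orig should hold either way.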