BUG: DataFrame(data, ...) creates a copy when 'data' is a NumPy array (pandas 3.0+) #58913
Comments
Thanks for the report. It appears this was changed/enforced with copy-on-write in #57254 cc @phofl
I think it's because, with copy-on-write, it was also decided that the underlying numpy array should be made read-only, which would require a copy so as not to affect the input.
In [1]: import numpy as np
...: import pandas as pd
+ /opt/miniconda3/envs/pandas-dev/bin/ninja
[1/1] Generating write_version_file with a custom command
In [2]: arr = np.array([3, 1, 2])
In [3]: df = pd.DataFrame(arr)
In [4]: arr.flags
Out[4]:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
In [5]: df.values.flags
Out[5]:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : False
ALIGNED : True
WRITEBACKIFCOPY : False
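The read-only point can be illustrated without pandas at all. The following is a minimal numpy-only sketch, not pandas internals; the names `view` and `snapshot` are made up for illustration. It shows that marking a view read-only blocks writes through the view, but does not stop changes made through the original array from showing up, which is why a copy is needed to isolate the input.

```python
import numpy as np

arr = np.array([3, 1, 2])

# A read-only view still shares memory with `arr`.
view = arr.view()
view.flags.writeable = False

# A read-only copy is isolated from later writes to `arr`.
snapshot = arr.copy()
snapshot.flags.writeable = False

# Writing through the read-only view is rejected by numpy.
try:
    view[1] = 0
except ValueError as exc:
    print("write through the view rejected:", exc)

# Writing through the original array is still allowed ...
arr[0] = 99

# ... and is visible through the view, but not through the copy.
print(view[0], snapshot[0])
print(np.shares_memory(view, arr), np.shares_memory(snapshot, arr))
```

The view reports `OWNDATA : False` and `WRITEABLE : False` in its flags, the same combination shown for `df.values` above.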
Thanks very much @mroeschke . I'm not familiar enough with the tradeoffs, so don't have a strong opinion about whether the
IIRC the read-only behavior was also needed in order to avoid mutations of a pandas object by modifying the underlying numpy array (e.g. I agree it would be nice to not have to always copy a numpy array. I'm not sure if there exists a work-around to only copy if the array is about to be written to.
The copy behaviour change in If And so the idea was to protect the general user from this by making For example, if you are a library implementing some
In the Copy-on-Write migration guide (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html) there is a small section on this "Constructors now copy NumPy arrays by default", which says:
Would it be useful to expand that? (note that this is in the user guide, we still need to add an entry to the 3.0.0 release notes pointing to that, as I just see that the release notes don't mention anything about CoW)
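For readers who want to control the copy explicitly rather than rely on the default, a minimal sketch using the `copy` keyword of the `DataFrame` constructor follows. It deliberately does not exercise the default (which is what changed between versions); the variable names are illustrative, and whether memory is shared with `copy=False` is checked with `np.shares_memory` rather than assumed.

```python
import numpy as np
import pandas as pd

arr = np.arange(6, dtype="float64").reshape(3, 2)

# Explicit copy: the frame gets its own buffer.
df_copy = pd.DataFrame(arr, copy=True)

# Explicit opt-out: the frame is expected to wrap the input array.
df_view = pd.DataFrame(arr, copy=False)

print("copy=True shares memory: ", np.shares_memory(df_copy.to_numpy(), arr))
print("copy=False shares memory:", np.shares_memory(df_view.to_numpy(), arr))
```

A library that needs to avoid an extra copy, like the LightGBM test mentioned in the issue description, could pass `copy=False` explicitly so that its behavior does not depend on the installed pandas version's default.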
I'm not sure whether this is relevant or interesting, but it looks like polars decided not to make a copy by default.
import numpy as np
import polars as pl
X = np.arange(4).reshape(2, 2, order="F")
df = pl.DataFrame(X)
print(df)
X[0, 0] = 42
print(df)
Interesting. It indeed doesn't copy the numpy array, so as you show polars isn't protected from the array getting modified. But it still seems to ensure that when you modify the dataframe, another dataframe viewing the same data doesn't change (I assume it does a copy on write at that point):
import numpy as np
import polars as pl
X = np.arange(4).reshape(2, 2, order="F")
df1 = pl.DataFrame(X)
df2 = pl.DataFrame(X)
df1[0, "column_0"] = 42
print(df1)
print(df2)
So The current CoW implementation in pandas is not set up to track numpy objects like that (we only keep track of other pandas objects viewing the same data), although in theory I think we could also implement such a "one way" tracking (mutating via a pandas object would trigger CoW, but mutating the numpy array propagates those changes). But I think for pandas it is probably a lot more common that people combine it with numpy and might run into a mutation unexpectedly having or not having an effect, so I still think the
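The "one way" tracking idea can be sketched with a toy wrapper. This is a hypothetical illustration only: `CowWrapper` is an invented name, and the code does not reflect how pandas or polars are actually implemented. The wrapper shares the input buffer at construction and copies it the first time a write goes through the wrapper.

```python
import numpy as np


class CowWrapper:
    """Toy wrapper: shares the input buffer until the first write through it."""

    def __init__(self, data):
        self._data = data    # no copy at construction
        self._owns = False   # set to True once a private copy has been made

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if not self._owns:
            # Copy on first write so other holders of the buffer are unaffected.
            self._data = self._data.copy()
            self._owns = True
        self._data[key] = value


X = np.arange(4).reshape(2, 2)
a = CowWrapper(X)
b = CowWrapper(X)

# Mutating the source array propagates to both wrappers (no protection).
X[0, 0] = 42
print(a[0, 0], b[0, 0])

# Mutating through `a` triggers a copy, so `X` and `b` keep the old value.
a[1, 1] = -1
print(a[1, 1], b[1, 1], X[1, 1])
```

The trade-off discussed in the thread is visible here: construction is cheap, but anyone still holding `X` can change what `b` sees.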
Thank you so much! I hadn't found that while searching, but that exactly answers my question here. And I really appreciate your explanation in #58913 (comment), totally makes sense to me why the default behavior is changing. Thanks very much for keeping the
I think that's perfect (clear and concise) as-is. All of my questions have been answered, and I don't have any suggestions for changes to behavior or docs. So I think that this issue could be closed... but I'll let a maintainer close this, since there is some other discussion that's started here that goes a bit beyond my original questions, and you might find this thread a good place to keep that discussion.
I'll close this issue out since the behavior was documented in the migration guide
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Starting with pandas==3.0.0, it appears that DataFrame(data) creates a copy when data is a numpy array.
Expected Behavior
I expected df.values and the result of df.to_numpy() (with copy argument omitted) to return the same numpy array that the DataFrame was created from.
I think that that has been the behavior of most combinations of pandas and numpy for at least the last 2 years. I think that because we've been running a test in LightGBM with similar code, to confirm that lightgbm isn't creating unnecessary copies in its pandas support since January 2022 (microsoft/LightGBM#4927), and that test is now failing with pandas>=3.0.0.
Apologies in advance if this is intentional behavior. I did try to look through the git blame, issues, and PRs. Did not see anything in these possibly-related discussions:
copy keyword (except in constructors) #56022
Installed Versions
Observed this in a conda environment on an M2 macbook, using Python 3.11.9
output of 'conda info' (click me)
How I installed stable versions of `numpy`, `pandas`, and `pyarrow` (click me)
How I gradually replaced those versions with latest nightlies (click me)
numpy and pyarrow:
pandas:
python -m pip install \
    --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
    --prefer-binary \
    --pre \
    --upgrade \
    'pandas>=3.0.0.dev0'
output of 'pd.show_versions()' with all nightlies installed (click me)
python -c "import pandas; pandas.show_versions()"
result: