
BUG: DataFrame(data, ...) creates a copy when 'data' is a NumPy array (pandas 3.0+) #58913

Closed
3 tasks done
jameslamb opened this issue Jun 4, 2024 · 9 comments
Labels
Bug · Constructors (Series/DataFrame/Index/pd.array constructors) · Copy / view semantics · Needs Discussion (requires discussion from core team before further action)

Comments

@jameslamb
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

X = np.random.default_rng().uniform(size=(10, 2)).astype(np.float32)
df = pd.DataFrame(X)
df_arr = df.to_numpy(dtype=np.float32)

np.shares_memory(X, df.values)
# numpy==1.26.4    , pandas==2.2.2           : True
# numpy==2.1.0.dev0, pandas==2.2.2           : True
# numpy==2.1.0.dev0, pandas==3.0.0.dev0+1067 : False

np.shares_memory(X, df_arr)
# numpy==1.26.4    , pandas==2.2.2           : True
# numpy==2.1.0.dev0, pandas==2.2.2           : True
# numpy==2.1.0.dev0, pandas==3.0.0.dev0+1067 : False

Issue Description

Starting with pandas==3.0.0, it appears that DataFrame(data) creates a copy when data is a numpy array.

Expected Behavior

I expected df.values and the result of df.to_numpy() (with copy argument omitted) to return the same numpy array that the DataFrame was created from.

I believe this has been the behavior for most combinations of pandas and numpy for at least the last two years: since January 2022 we've been running a test in LightGBM with similar code to confirm that lightgbm isn't creating unnecessary copies in its pandas support (microsoft/LightGBM#4927), and that test is now failing with pandas>=3.0.0.

Apologies in advance if this is intentional behavior. I did try to look through the git blame, issues, and PRs, but did not see anything in these possibly-related discussions:

Installed Versions

Observed this in a conda environment on an M2 macbook, using Python 3.11.9

output of 'conda info':
     active environment : lgb-dev
    active env location : /Users/jlamb/miniforge3/envs/lgb-dev
            shell level : 1
       user config file : /Users/jlamb/.condarc
 populated config files : /Users/jlamb/miniforge3/.condarc
          conda version : 24.3.0
    conda-build version : not installed
         python version : 3.12.3.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=m2
                          __conda=24.3.0=0
                          __osx=14.4.1=0
                          __unix=0=0
       base environment : /Users/jlamb/miniforge3  (writable)
      conda av data dir : /Users/jlamb/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-arm64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /Users/jlamb/miniforge3/pkgs
                          /Users/jlamb/.conda/pkgs
       envs directories : /Users/jlamb/miniforge3/envs
                          /Users/jlamb/.conda/envs
               platform : osx-arm64
             user-agent : conda/24.3.0 requests/2.31.0 CPython/3.12.3 Darwin/23.4.0 OSX/14.4.1 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
                UID:GID : 501:20
             netrc file : None
           offline mode : False
How I installed stable versions of `numpy`, `pandas`, and `pyarrow`:
conda install \
    -c conda-forge \
    --yes \
        'numpy' \
        'pandas' \
        'pyarrow'
How I gradually replaced those versions with the latest nightlies:

numpy and pyarrow:

python -m pip install \
    --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
    --prefer-binary \
    --pre \
    --upgrade \
        'numpy>=2.1.0.dev0'

python -m pip install \
    --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
    --prefer-binary \
    --pre \
    --upgrade \
        'pyarrow>=17.0.0.dev227'

pandas:

python -m pip install \
    --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
    --prefer-binary \
    --pre \
    --upgrade \
        'pandas>=3.0.0.dev0'
output of 'pd.show_versions()' with all nightlies installed:
python -c "import pandas; pandas.show_versions()"

result:

INSTALLED VERSIONS
------------------
commit                : 76c7274985215c487248fa5640e12a9b32a06e8c
python                : 3.11.9.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.4.0
Version               : Darwin Kernel Version 23.4.0: Fri Mar 15 00:19:22 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8112
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+1067.g76c7274985
numpy                 : 2.1.0.dev0+git20240531.da5a779
pytz                  : 2024.1
dateutil              : 2.9.0
setuptools            : 69.5.1
pip                   : 24.0
Cython                : None
pytest                : 8.2.0
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : None
IPython               : None
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pyarrow               : 17.0.0.dev227
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.13.0
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
@jameslamb added the Bug and Needs Triage labels on Jun 4, 2024
@jameslamb changed the title from "BUG: DataFrame(data, ...).to_numpy() creates a copy when 'data' is a NumPy array (pandas 3.0+)" to "BUG: DataFrame(data, ...) creates a copy when 'data' is a NumPy array (pandas 3.0+)" on Jun 4, 2024
@mroeschke
Member

mroeschke commented Jun 4, 2024

Thanks for the report.

It appears this was changed/enforced with copy-on-write in #57254 cc @phofl

I think it's because with copy-on-write it was also decided that the underlying numpy array should be made read-only, which requires a copy so that the input array isn't affected.

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: arr = np.array([3, 1, 2])

In [3]: df = pd.DataFrame(arr)

In [4]: arr.flags
Out[4]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

In [5]: df.values.flags
Out[5]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False

@mroeschke added the Needs Discussion, Constructors, and Copy / view semantics labels and removed the Needs Triage label on Jun 4, 2024
@jameslamb
Contributor Author

Thanks very much @mroeschke.

I'm not familiar enough with the tradeoffs, so I don't have a strong opinion about whether the pandas 2.x behavior for this case should be restored. I wasn't sure whether this was intentional, so I just wanted to report it.

@mroeschke
Member

IIRC the read-only behavior was also needed to prevent mutating a pandas object by modifying the underlying numpy array (e.g. df.values[0, 0] = "new_value"), which would sidestep the reference-counting mechanism behind copy-on-write.

I agree it would be nice not to always copy a numpy array. I'm not sure whether there is a workaround that only copies if the array is about to be written to.
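
For illustration, a minimal sketch of what that protection looks like from the user's side, assuming the pandas 3.0 dev behavior shown in the flags output above (the names here are just for the example):

import numpy as np
import pandas as pd

# Under copy-on-write, .values hands back a read-only view, so writing
# through it cannot sidestep the reference-counting machinery.
df = pd.DataFrame(np.array([[3.0, 1.0], [2.0, 4.0]]))

vals = df.values
print(vals.flags.writeable)   # False

try:
    vals[0, 0] = 99.0         # would mutate df behind CoW's back
except ValueError as exc:
    print(exc)                # "assignment destination is read-only"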

@jorisvandenbossche
Member

The copy behaviour change in pd.DataFrame was indeed intentional. Apart from making the array read-only in .values / to_numpy, the main reason for doing this by default is that we want to be able to give (at least by default) the guarantee that "changing values in a specific df can only be done by explicitly mutating that df (e.g. df.loc[...] = ...), and doing so only affects that df" (which is one of the main behaviour aspects of the CoW proposal).

If df = pd.DataFrame(arr) did not create a copy of the data, then modifying arr would also update the df. And if you created two DataFrames from the same array, modifying one df in place would also modify the other (they share the same data, but we don't track that they do, so we can't trigger a copy-on-write). Both cases would violate that general behavioural guarantee.

So the idea is to protect the general user by making copy=True the default (in the case of a 2D numpy array; for other inputs that is not necessarily the case, e.g. for pandas input we don't need to copy), while still giving advanced users who know what they are doing the option to specify copy=False.

For example, if you are a library implementing some read_format function returning a DataFrame, and your implementation first creates numpy arrays, then it is of course safe (and best practice, to avoid an unnecessary copy) to specify copy=False in the constructor, provided those numpy arrays are only intermediate objects local to your implementation and are never returned to the user.
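
A minimal sketch of that opt-out, assuming the pandas 3.0 constructor semantics described above (the expected outputs follow from the reproducer at the top of this issue):

import numpy as np
import pandas as pd

X = np.random.default_rng().uniform(size=(10, 2)).astype(np.float32)

# Default in pandas 3.0: the constructor copies the array.
df_default = pd.DataFrame(X)
print(np.shares_memory(X, df_default.values))   # False

# Explicit opt-out for code that fully controls the array's lifetime.
df_nocopy = pd.DataFrame(X, copy=False)
print(np.shares_memory(X, df_nocopy.values))    # True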

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 5, 2024

In the Copy-on-Write migration guide (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html) there is a small section on this "Constructors now copy NumPy arrays by default", which says:

The Series and DataFrame constructors now copy a NumPy array by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed in place outside of pandas. You can set copy=False to avoid this copy.

Would it be useful to expand that?

(note that this is in the user guide; we still need to add an entry to the 3.0.0 release notes pointing to it, as I just noticed the release notes don't mention anything about CoW)
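
For readers landing here, a small sketch of the behavior that section of the migration guide describes, assuming the pandas 3.0 default (under pandas 2.x, where the array was not copied by default, the last line would print 99.0):

import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])
df = pd.DataFrame(arr)   # pandas 3.0: copies by default

arr[0, 0] = 99.0         # in-place change outside of pandas
print(df.iloc[0, 0])     # 1.0 -- the DataFrame is no longer affected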

@s-banach

s-banach commented Jun 6, 2024

I'm not sure whether this is relevant or interesting, but it looks like polars decided not to make a copy by default.

import numpy as np
import polars as pl

X = np.arange(4).reshape(2, 2, order="F")
df = pl.DataFrame(X)
print(df)
X[0, 0] = 42
print(df)
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i32      ┆ i32      │
╞══════════╪══════════╡
│ 0        ┆ 2        │
│ 1        ┆ 3        │
└──────────┴──────────┘
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i32      ┆ i32      │
╞══════════╪══════════╡
│ 42       ┆ 2        │
│ 1        ┆ 3        │
└──────────┴──────────┘

@jorisvandenbossche
Member

Interesting. It indeed doesn't copy the numpy array, so as you show, polars isn't protected from the array getting modified. But it still seems to ensure that when you modify the dataframe, another dataframe viewing the same data doesn't change (I assume it does a copy on write at that point):

import numpy as np
import polars as pl

X = np.arange(4).reshape(2, 2, order="F")
df1 = pl.DataFrame(X)
df2 = pl.DataFrame(X)
df1[0, "column_0"] = 42
print(df1)
print(df2)
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 42       ┆ 2        │
│ 1        ┆ 3        │
└──────────┴──────────┘
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 0        ┆ 2        │
│ 1        ┆ 3        │
└──────────┴──────────┘

So df2 was not changed because of modifying df1 inplace.

The current CoW implementation in pandas is not set up to track numpy objects like that (we only keep track of other pandas objects viewing the same data), although in theory I think we could also implement such "one-way" tracking (mutating via a pandas object would trigger CoW, but mutating the numpy array would propagate those changes).

But I think for pandas it is probably a lot more common that people combine it with numpy and would run into the unexpected mutation having (or not having) an effect, so I still think copy=True by default is the safer option.
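
A minimal sketch of the pandas-to-pandas tracking mentioned above, assuming the CoW semantics documented in the migration guide:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.0).reshape(2, 2), columns=["a", "b"])
df2 = df1[:]              # new DataFrame viewing the same data, tracked by CoW

df1.iloc[0, 0] = 42.0     # mutating df1 triggers a copy of the shared block
print(df2.iloc[0, 0])     # 0.0 -- df2 is unaffected, analogous to the polars case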

@jameslamb
Contributor Author

In the Copy-on-Write migration guide (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html) there is a small section on this "Constructors now copy NumPy arrays by default"

Thank you so much! I hadn't found that while searching, but that exactly answers my question here.

And I really appreciate your explanation in #58913 (comment); it totally makes sense to me why the default behavior is changing. Thanks very much for keeping the copy keyword argument so it's still possible to avoid that copy for some use cases.

Would it be useful to expand that?

I think that's perfect (clear and concise) as-is.

All of my questions have been answered, and I don't have any suggestions for changes to behavior or docs. So I think that this issue could be closed... but I'll let a maintainer close this, since there is some other discussion that's started here that goes a bit beyond my original questions, and you might find this thread a good place to keep that discussion.

@mroeschke
Member

I'll close this issue out since the behavior is documented in the migration guide.
