iloc breaks on read-only dataframe #10043
Maybe to clarify: I would be happy if the sliced DF is a copy, I just don't want to copy the original data [I think]. |
by definition what you are doing is ALWAYS going to be a copy. Frames are stored in columns.
you can just access using [17] if you want to get the actual numpy array. Keep in mind, this is ONLY valid for a single-dtyped frame. |
Would it be OK for scikit-learn to rely on that? This also isn't great, since I've passed DataFrames into scikit-learn that are mixes of int / float. I'm guessing that's a pretty common thing. |
Getting a copy of the slice is fine; I'd just like to avoid copying the whole initial data. Can you elaborate on what it means that this is only valid for single-dtyped frames? I would like something that works on both mixed-type and single-dtype frames and produces a dataframe. To elaborate on the use-case: |
Slicing doesn't touch the original data at all. It will make a copy. (It will hand you a view if it's possible, but this is cross-sectional slicing, so that is numpy-dependent, and ONLY if you have a single dtype.) If you have multiple dtypes, you will ALWAYS get a copy. What is the actual error you are getting? |
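To illustrate the point above with a small sketch (the column names and values here are made up, not from the thread): a single-dtype frame is stored as one homogeneous block, so a no-copy view of the underlying data is at least possible, while mixed dtypes force consolidation into a freshly allocated array, i.e. always a copy.

```python
import numpy as np
import pandas as pd

# One homogeneous float block: a view of the underlying data is possible.
single = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["a", "b"])

# Two blocks (int and float): .values must upcast and allocate a new array.
mixed = pd.DataFrame({"i": [1, 2, 3], "f": [1.0, 2.0, 3.0]})

print(single.values.dtype)  # float64 -- one block
print(mixed.values.dtype)   # float64 -- ints upcast, data necessarily copied
```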
External libraries should not be using the private pandas block system. I agree that it is weird that read-only arrays don't work here. To fix this on our end, we should probably wrap our calls to the internal |
The traceback is in #9928; I agree with @shoyer that it seems to be caused by what is described in the cython thread. The solution sounds OK to me, but as I said I don't know much about the internals of pandas. The non-writeable array comes from joblib. I am not entirely sure about the background and I will investigate whether we can get rid of it. I was not aware of the cython behavior when I opened the issue and hoped I was actually doing something wrong. |
Ultimately, I think this issue should be solved on the Cython end of things -- there are plenty of legitimate uses for readonly memory views. In fact @richardhansen reports in this stackoverflow answer that he has a rough patch for this. In the meantime, does anyone know how to create a "writeable" array from a readonly array while (a) not making any copies of the underlying data and (b) not modifying the source array? This does not seem to work:

```
array = np.arange(10.0)
array.setflags(write=False)
array2 = np.array(array)
array2.setflags(write=True)
print array.flags.writeable
```
|
Ok, I think it is fair of you to punt this to cython. I have to work around it in sklearn anyhow for the moment (even if you fixed it, we want to support current stable pandas etc). |
Feel free to close. |
@amueller The problem with calling |
brainfart. With copy=False it also gives "True". |
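To spell out the exchange above as a sketch: neither variant of the snippet satisfies both constraints, because the default copies the data and `copy=False` hands back the one shared buffer, so re-enabling writes also flips the source's flag.

```python
import numpy as np

array = np.arange(10.0)
array.setflags(write=False)

# np.array copies by default, so (a) "no copy" is violated:
copied = np.array(array)
assert not np.shares_memory(array, copied)

# With copy=False no data is copied, but re-enabling writes touches the
# one shared buffer, so (b) "no modification of the source" is violated --
# hence the "True" mentioned above:
same = np.array(array, copy=False)
assert np.shares_memory(array, same)
same.setflags(write=True)
print(array.flags.writeable)  # True
```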
Here is a more realistic use case for the same underlying problem. Assume we have some data stored in a file on disk by some data generating process, for instance a process that saves 100 random numbers. In a more realistic setting this file could be several hundreds of GB and would not necessarily fit in memory. Now assume that we have a second program that wants to use the pandas API to manipulate this data, but also wants to leverage the memory mapping feature of numpy to avoid copying the memory between processes running on the same host, and to only load from the drive the memory pages actually used by the program. Wrapping the memory mapped data works fine:

```
>>> import pandas as pd
>>> import numpy as np
>>> data = np.load('/tmp/data.npy', mmap_mode='r')
>>> df = pd.DataFrame(data, columns=['column'])
```

Slicing the data frame by rows works fine:

```
>>> df.iloc[:3]
     column
0  0.419577
1 -1.050912
2 -0.562929
```

However fancy indexing on the DataFrame breaks:

```
>>> df.iloc[[1, 2, 3]]
Traceback (most recent call last):
  File "<ipython-input-55-7550557fa06e>", line 1, in <module>
    df.iloc[[1, 2, 3]]
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/indexing.py", line 1217, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/indexing.py", line 1508, in _getitem_axis
    return self._get_loc(key, axis=axis)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/indexing.py", line 92, in _get_loc
    return self.obj._ixs(key, axis=axis)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/frame.py", line 1714, in _ixs
    result = self.take(i, axis=axis)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/generic.py", line 1351, in take
    convert=True, verify=True)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 3269, in take
    axis=axis, allow_dups=True)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 3156, in reindex_indexer
    for blk in self.blocks]
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 3156, in <listcomp>
    for blk in self.blocks]
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 851, in take_nd
    allow_fill=True, fill_value=fill_value)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/common.py", line 823, in take_nd
    func(arr, indexer, out, fill_value)
  File "pandas/src/generated.pyx", line 3472, in pandas.algos.take_2d_axis0_float64_float64 (pandas/algos.c:89674)
  File "stringsource", line 614, in View.MemoryView.memoryview_cwrapper (pandas/algos.c:175798)
  File "stringsource", line 321, in View.MemoryView.memoryview.__cinit__ (pandas/algos.c:172387)
ValueError: buffer source array is read-only
```

However I don't see any compelling reason why pandas would require write access to the underlying memory mapped buffer.

Edit: in my original comment I used to create the data frame with |
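The workflow above can be run end to end with the following sketch. The writer step is a reconstruction (the original save snippet did not survive extraction), and a temporary directory stands in for `/tmp/data.npy`; whether the fancy-indexing step raises depends on the pandas version, so it is wrapped in a `try`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Writer process (reconstructed): save 100 random numbers to disk.
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, np.random.randn(100))

# Reader process: memory-map the file read-only, so pages are loaded on
# demand and nothing is copied into process memory up front.
data = np.load(path, mmap_mode="r")
df = pd.DataFrame(data, columns=["column"])

print(df.iloc[:3])  # plain row slicing works fine

try:
    df.iloc[[1, 2, 3]]  # fancy indexing hits the Cython take path
except ValueError as exc:
    # pandas of this era raised "buffer source array is read-only"
    print("ValueError:", exc)
```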
@ogrisel as mentioned above, as soon as you get buffer into cython, you'll run into this issue, so pandas can't really do much. |
Indeed I had not properly read the end of the discussion. An alternative would be to use the ndarray Cython type instead of the Cython memoryview for those functions. I did a quick patch on the
and it seems to make |
@ogrisel this was originally done for a non-trivial perf benefit, which requires the memory to be correctly aligned (which is done at a higher level). So would have to see how perf is different. |
you have difficulty and effort tags. nice. |
stolen from astropy |
sorry for OT: @jreback are they working well for you? We mostly use a very vague "easy" tag. |
Maybe a "temporary" workaround would be to have |
@ogrisel yes that would work, and shouldn't be too complex. @amueller I think too soon to tell if the labels are working; We used to have a 'good as first PR' which is now our 'Difficulty Novice'. These were useful at PyCon hackathon. I think also will be useful in the long run, but will take some time to classify things. |
Do you want to do it yourself, or would you rather I give it a try? |
@ogrisel be my guest! |
I submitted a first draft in #10070. Feedback appreciated. |
I'm not familiar with pandas, but maybe this gross hack will do what you want:

```
from cpython.buffer cimport PyBUF_WRITABLE, \
    PyObject_CheckBuffer, PyObject_GetBuffer

cdef class ForceWritableBufferWrapper:
    cdef object buf

    def __cinit__(self, object buf):
        if not PyObject_CheckBuffer(buf):
            raise TypeError("argument must follow the buffer protocol")
        self.buf = buf

    def __getbuffer__(self, Py_buffer *view, int flags):
        PyObject_GetBuffer(self.buf, view, flags & ~PyBUF_WRITABLE)
        view.readonly = 0

    def __releasebuffer__(self, Py_buffer *view):
        pass
```

This Cython extension type provides a writable buffer interface for any buffer object, even if the underlying buffer is read-only. Of course this must only be used if you are absolutely certain nobody will write to the buffer, otherwise Bad Things(tm) will happen. You can probably do some |
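For readers without a Cython toolchain, a pure-Python cousin of the wrapper above can be sketched with ctypes, which does not carry numpy's read-only flag along when re-exposing the same memory. This is illustrative only, and the same caveat applies: writing through the alias while the source must stay immutable invites undefined behavior.

```python
import ctypes

import numpy as np

arr = np.arange(5.0)
arr.setflags(write=False)

# Re-expose the same memory through a ctypes array (no copy is made),
# then view it again as a numpy array; the new view is writeable.
buf = (ctypes.c_double * arr.size).from_address(arr.ctypes.data)
alias = np.ctypeslib.as_array(buf)

assert np.shares_memory(arr, alias)
assert alias.flags.writeable  # the read-only flag did not propagate
```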
@richardhansen interesting idea. Do you know if this works with |
closed by #10070 |
I haven't tested that specific class, but it should work with any object that implements the buffer protocol.
Yes. Just like with most other Cython code, you shouldn't need |
This is picking up #9928 again. I don't know if the behavior is expected, but it is a bit odd to me. Maybe I'm doing something wrong; I'm not that familiar with the pandas internals.

We call

```
df.iloc[indices]
```

and that breaks with a read-only dataframe. I feel that it shouldn't, though, as it is not writing. Minimal reproducing example:

Is there a way to slice the rows of the dataframe in another way that doesn't need a writeable array?
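The minimal reproducing example did not survive extraction; the following is a hypothetical reconstruction of the reported setup (array shape, column names, and indices are made up). Whether the `iloc` call raises depends on the pandas version, so it is guarded.

```python
import numpy as np
import pandas as pd

# A DataFrame backed by a read-only numpy array, as joblib can produce.
arr = np.arange(12.0).reshape(4, 3)
arr.setflags(write=False)
df = pd.DataFrame(arr, columns=["a", "b", "c"])

indices = [1, 2]
try:
    sub = df.iloc[indices]          # positional indexing with a list
    print(sub.shape)                # fixed pandas: works, returns a copy
except ValueError as exc:
    # pandas of this era: "buffer source array is read-only"
    print("ValueError:", exc)
```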