iloc breaks on read-only dataframe #10043
Maybe to clarify: I would be happy if the sliced DF is a copy, I just don't want to copy the original data [I think]. |
by definition what you are doing is ALWAYS going to be a copy. Frames are stored in columns.
you can just access using [17] if you want to get the actual numpy array. Keep in mind, this is ONLY valid for a single-dtyped frame. |
Would it be OK for scikit-learn to rely on that? This also isn't great, since I've passed DataFrames into scikit-learn that are mixes of int / float. I'm guessing that's a pretty common thing. |
Getting a copy of the slice is fine; I'd just like to avoid copying the whole initial data. Can you elaborate on what it means that this is only valid for single-dtyped frames? I would like something that works on both mixed-type and single-dtype frames and produces a dataframe. To elaborate on the use-case: |
Slicing doesn't touch the original data at all. It will make a copy. (It will hand you a view if it's possible, but this is cross-sectional slicing, so that is numpy-dependent, and ONLY if you have a single dtype.) If you have multiple dtypes, you will ALWAYS get a copy. What is the actual error you are getting? |
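To illustrate the point above with a small sketch (the column names and values here are made up, not from the thread): a single-dtype frame is stored as one homogeneous block, so a no-copy view of the underlying data is at least possible, while mixed dtypes force consolidation into a freshly allocated array, i.e. always a copy.

```python
import numpy as np
import pandas as pd

# One homogeneous float block: a view of the underlying data is possible.
single = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["a", "b"])

# Two blocks (int and float): .values must upcast and allocate a new array.
mixed = pd.DataFrame({"i": [1, 2, 3], "f": [1.0, 2.0, 3.0]})

print(single.values.dtype)  # float64 -- one block
print(mixed.values.dtype)   # float64 -- ints upcast, data necessarily copied
```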
External libraries should not be using the private pandas block system. I agree that it is weird that read-only arrays don't work here. To fix this on our end, we should probably wrap our calls to the internal |
The traceback is in #9928; I agree with @shoyer that it seems to be caused by what is described in the cython thread. The solution sounds OK to me, but as I said I don't know much about the internals of pandas. The non-writeable array comes from joblib. I am not entirely sure about the background and I will investigate whether we can get rid of it. I was not aware of the cython behavior when I opened the issue and hoped I was actually doing something wrong. |
Ultimately, I think this issue should be solved on the Cython end of things -- there are plenty of legitimate uses for readonly memory views. In fact @richardhansen reports in this stackoverflow answer that he has a rough patch for this. In the meantime, does anyone know how to create a "writeable" array from a readonly array while (a) not making any copies of the underlying data and (b) not modifying the source array? This does not seem to work:

```
array = np.arange(10.0)
array.setflags(write=False)
array2 = np.array(array)
array2.setflags(write=True)
print array.flags.writeable
```
|
Ok, I think it is fair of you to punt this to cython. I have to work around it in sklearn anyhow for the moment (even if you fixed it, we want to support current stable pandas etc). |
Feel free to close. |
@amueller The problem with calling |
brainfart. With copy=False it also gives "True". |
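To spell out the exchange above as a sketch: neither variant of the snippet satisfies both constraints, because the default copies the data and `copy=False` hands back the one shared buffer, so re-enabling writes also flips the source's flag.

```python
import numpy as np

array = np.arange(10.0)
array.setflags(write=False)

# np.array copies by default, so (a) "no copy" is violated:
copied = np.array(array)
assert not np.shares_memory(array, copied)

# With copy=False no data is copied, but re-enabling writes touches the
# one shared buffer, so (b) "no modification of the source" is violated --
# hence the "True" mentioned above:
same = np.array(array, copy=False)
assert np.shares_memory(array, same)
same.setflags(write=True)
print(array.flags.writeable)  # True
```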
Here is a more realistic use case for the same underlying problem. Assume we have some data stored in a file on disk by some data generating process, for instance a process that saves 100 random numbers. In a more realistic setting this file could be several hundreds of GB and would not necessarily fit in memory. Now assume that we have a second program that wants to use the pandas API to manipulate this data, but also wants to leverage the memory mapping feature of numpy to avoid copying the memory between processes running on the same host, and to only load from the drive the memory pages actually used by the program. Wrapping the memory mapped data works fine:

```
>>> import pandas as pd
>>> import numpy as np
>>> data = np.load('/tmp/data.npy', mmap_mode='r')
>>> df = pd.DataFrame(data, columns=['column'])
```

Slicing the data frame by rows works fine:

```
>>> df.iloc[:3]
     column
0  0.419577
1 -1.050912
2 -0.562929
```

However fancy indexing on the DataFrame breaks:

```
>>> df.iloc[[1, 2, 3]]
Traceback (most recent call last):
  File "<ipython-input-55-7550557fa06e>", line 1, in <module>
    df.iloc[[1, 2, 3]]
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/indexing.py", line 1217, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/indexing.py", line 1508, in _getitem_axis
    return self._get_loc(key, axis=axis)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/indexing.py", line 92, in _get_loc
    return self.obj._ixs(key, axis=axis)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/frame.py", line 1714, in _ixs
    result = self.take(i, axis=axis)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/generic.py", line 1351, in take
    convert=True, verify=True)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 3269, in take
    axis=axis, allow_dups=True)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 3156, in reindex_indexer
    for blk in self.blocks]
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 3156, in <listcomp>
    for blk in self.blocks]
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 851, in take_nd
    allow_fill=True, fill_value=fill_value)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/common.py", line 823, in take_nd
    func(arr, indexer, out, fill_value)
  File "pandas/src/generated.pyx", line 3472, in pandas.algos.take_2d_axis0_float64_float64 (pandas/algos.c:89674)
  File "stringsource", line 614, in View.MemoryView.memoryview_cwrapper (pandas/algos.c:175798)
  File "stringsource", line 321, in View.MemoryView.memoryview.__cinit__ (pandas/algos.c:172387)
ValueError: buffer source array is read-only
```

However I don't see any compelling reason why pandas would require write access to the underlying memory mapped buffer.

Edit: in my original comment I used to create the data frame with |
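The workflow above can be run end to end with the following sketch. The writer step is a reconstruction (the original save snippet did not survive extraction), and a temporary directory stands in for `/tmp/data.npy`; whether the fancy-indexing step raises depends on the pandas version, so it is wrapped in a `try`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Writer process (reconstructed): save 100 random numbers to disk.
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, np.random.randn(100))

# Reader process: memory-map the file read-only, so pages are loaded on
# demand and nothing is copied into process memory up front.
data = np.load(path, mmap_mode="r")
df = pd.DataFrame(data, columns=["column"])

print(df.iloc[:3])  # plain row slicing works fine

try:
    df.iloc[[1, 2, 3]]  # fancy indexing hits the Cython take path
except ValueError as exc:
    # pandas of this era raised "buffer source array is read-only"
    print("ValueError:", exc)
```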
@ogrisel as mentioned above, as soon as you get buffer into cython, you'll run into this issue, so pandas can't really do much. |
Indeed I had not properly read the end of the discussion. An alternative would be to use the ndarray Cython type instead of the Cython memoryview for those functions. I did a quick patch on the
and it seems to make |
@ogrisel this was originally done for a non-trivial perf benefit, which requires the memory to be correctly aligned (which is done at a higher level). So would have to see how perf is different. |
you have difficulty and effort tags. nice. |
stolen from astropy |
sorry for OT: @jreback are they working well for you? We mostly use a very vague "easy" tag. |
Maybe a "temporary" workaround would be to have |
@ogrisel yes that would work, and shouldn't be too complex. @amueller I think too soon to tell if the labels are working; We used to have a 'good as first PR' which is now our 'Difficulty Novice'. These were useful at PyCon hackathon. I think also will be useful in the long run, but will take some time to classify things. |
Do you want to do it yourself, or would you rather I give it a try? |
@ogrisel be my guest! |
I submitted a first draft in #10070. Feedback appreciated. |
I'm not familiar with pandas, but maybe this gross hack will do what you want:

```
from cpython.buffer cimport PyBUF_WRITABLE, \
    PyObject_CheckBuffer, PyObject_GetBuffer

cdef class ForceWritableBufferWrapper:
    cdef object buf

    def __cinit__(self, object buf):
        if not PyObject_CheckBuffer(buf):
            raise TypeError("argument must follow the buffer protocol")
        self.buf = buf

    def __getbuffer__(self, Py_buffer *view, int flags):
        PyObject_GetBuffer(self.buf, view, flags & ~PyBUF_WRITABLE)
        view.readonly = 0

    def __releasebuffer__(self, Py_buffer *view):
        pass
```

This Cython extension type provides a writable buffer interface for any buffer object, even if the underlying buffer is read-only. Of course this must only be used if you are absolutely certain nobody will write to the buffer, otherwise Bad Things(tm) will happen. You can probably do some |
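For readers without a Cython toolchain, a pure-Python cousin of the wrapper above can be sketched with ctypes, which does not carry numpy's read-only flag along when re-exposing the same memory. This is illustrative only, and the same caveat applies: writing through the alias while the source must stay immutable invites undefined behavior.

```python
import ctypes

import numpy as np

arr = np.arange(5.0)
arr.setflags(write=False)

# Re-expose the same memory through a ctypes array (no copy is made),
# then view it again as a numpy array; the new view is writeable.
buf = (ctypes.c_double * arr.size).from_address(arr.ctypes.data)
alias = np.ctypeslib.as_array(buf)

assert np.shares_memory(arr, alias)
assert alias.flags.writeable  # the read-only flag did not propagate
```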
@richardhansen interesting idea. Do you know if this works with |
closed by #10070 |
I haven't tested that specific class, but it should work with any object that implements the buffer protocol.
Yes. Just like with most other Cython code, you shouldn't need |
This is picking up #9928 again. I don't know if the behavior is expected, but it is a bit odd to me. Maybe I'm doing something wrong; I'm not that familiar with the pandas internals.

We call

```
df.iloc[indices]
```

and that breaks with a read-only dataframe. I feel that it shouldn't, though, as it is not writing. Minimal reproducing example:

Is there a way to slice the rows of the dataframe in another way that doesn't need a writeable array?
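The minimal reproducing example did not survive extraction; the following is a hypothetical reconstruction of the reported setup (array shape, column names, and indices are made up). Whether the `iloc` call raises depends on the pandas version, so it is guarded.

```python
import numpy as np
import pandas as pd

# A DataFrame backed by a read-only numpy array, as joblib can produce.
arr = np.arange(12.0).reshape(4, 3)
arr.setflags(write=False)
df = pd.DataFrame(arr, columns=["a", "b", "c"])

indices = [1, 2]
try:
    sub = df.iloc[indices]          # positional indexing with a list
    print(sub.shape)                # fixed pandas: works, returns a copy
except ValueError as exc:
    # pandas of this era: "buffer source array is read-only"
    print("ValueError:", exc)
```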