Segmentation fault when converting arrow table to r data frame #11
Comments
I can't reproduce a segfault for the first example with …
I suggest updating rpy2 to the latest release. Besides that, the warning messages you show suggest that there is more going on with your R process than what is in the example, for example the warning about loading …
What version of the arrow package for R? There was a bug identified in 11.0.0 that wasn't present in 10.0.0 (and is fixed on dev) that might be responsible.
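For reference, one way to check both versions from the same rpy2 session (a minimal sketch, assuming the standard `utils` package is available):

```python
from rpy2.robjects.packages import importr
import pyarrow

# Report the R arrow package version and the pyarrow version side by side.
utils = importr('utils')
print(utils.packageVersion('arrow'))   # version of the arrow package for R
print(pyarrow.__version__)             # version of pyarrow on the Python side
```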
@paleolimbot: I just updated R to 4.2.3-Patched and ran … Running Python through gdb lands on …
I was thinking of apache/arrow#34489. Does installing the R package from nightly help? https://arrow.apache.org/docs/r/articles/install_nightly.html
Still a segfault when exiting the process with arrow-nightly. gdb says: …
I worked out a smaller example. In a nutshell the sequence is: pandas DataFrame → pyarrow Table → R arrow Table → R data.frame → call an R function that runs cbind() on the columns → exit the Python process.

Oddly, the segfault happens with one of the column types but not the other (see the comments in the R code below):

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra

base = importr('base')

rcode = """
function(df) {
  # cbind(df$col1, df$col2)  # segfault on exit
  # cbind(df$col2, df$col2)  # segfault on exit
  cbind(df$col1, df$col1)    # no segfault on exit
}
"""
rfunction = rpy2.robjects.r(rcode)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2": ["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(r_tbl)
output = rfunction(r_df)
```
Thanks for the quick answers! I'm quite new to gdb, but according to some discussions the clock_nanosleep error could also be gdb resolving the location incorrectly, and not necessarily point to the actual issue. I'm not sure, though, how to investigate this further.
It does seem unlikely that nanosleep is segfaulting. Can you see what all the other threads are doing at the time of the segfault? (Maybe …)
With …, the backtraces for all the threads are:

```
Thread 5 (Thread 0x7f979c2dd700 (LWP 787) "python3"):
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x55bb15c01b30) at ../sysdeps/nptl/futex-internal.h:186
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55bb15c01ae0, cond=0x55bb15c01b08) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x55bb15c01b08, mutex=0x55bb15c01ae0) at pthread_cond_wait.c:638
#3 0x00007f97adee490c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f97a134a465 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /usr/local/lib/R/site-library/arrow/libs/arrow.so
#5 0x00007f97adee9ed0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007f97b7e5cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7 0x00007f97b7f73a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f97a30dc700 (LWP 786) "python3"):
#0 0x00007f97b7f3a561 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7f97a3eab610 <cli.tick_ts>, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1 0x00007f97b7f3fd43 in __GI___nanosleep (requested_time=requested_time@entry=0x7f97a3eab610 <cli.tick_ts>, remaining=remaining@entry=0x0) at nanosleep.c:27
#2 0x00007f97a3e8dd82 in clic_thread_func (arg=<optimized out>) at thread.c:37
#3 clic_thread_func (arg=<optimized out>) at thread.c:23
#4 0x00007f97b7e5cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f97b7f73a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f97acfff700 (LWP 764) "jemalloc_bg_thd"):
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x7f97ad60a5f0) at ../sysdeps/nptl/futex-internal.h:186
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f97ad60a638, cond=0x7f97ad60a5c8) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x7f97ad60a5c8, mutex=0x7f97ad60a638) at pthread_cond_wait.c:638
#3 0x00007f97afa493c4 in background_thread_sleep (tsdn=<optimized out>, interval=<optimized out>, info=<optimized out>) at src/background_thread.c:232
#4 background_work_sleep_once (ind=0, info=<optimized out>, tsdn=<optimized out>) at src/background_thread.c:307
#5 background_thread0_work (tsd=<optimized out>) at src/background_thread.c:452
#6 background_work (ind=<optimized out>, tsd=<optimized out>) at src/background_thread.c:490
#7 background_thread_entry (ind_arg=<optimized out>) at src/background_thread.c:522
#8 0x00007f97b7e5cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9 0x00007f97b7f73a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f97b4c92700 (LWP 763) "python3"):
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x7f97b73596e0 <thread_status+96>) at ../sysdeps/nptl/futex-internal.h:186
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f97b7359690 <thread_status+16>, cond=0x7f97b73596b8 <thread_status+56>) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x7f97b73596b8 <thread_status+56>, mutex=0x7f97b7359690 <thread_status+16>) at pthread_cond_wait.c:638
#3 0x00007f97b56d7deb in blas_thread_server () from /usr/local/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
#4 0x00007f97b7e5cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f97b7f73a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f97b7d01740 (LWP 759) "python3"):
#0 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:78
#1 0x00007f97b7f621de in __spawnix (pid=pid@entry=0x7fff4553758c, file=file@entry=0x7f97b800d152 "/bin/sh", file_actions=file_actions@entry=0x0, attrp=0x7fff455372e0, attrp@entry=0x7fff45537790, argv=argv@entry=0x7fff455375d0, envp=0x55bb0c9501f0, xflags=0, exec=0x7f97b7f3ffc0 <execve>) at ../sysdeps/unix/sysv/linux/spawni.c:382
#2 0x00007f97b7f62817 in __spawni (pid=pid@entry=0x7fff4553758c, file=file@entry=0x7f97b800d152 "/bin/sh", acts=acts@entry=0x0, attrp=attrp@entry=0x7fff45537790, argv=argv@entry=0x7fff455375d0, envp=<optimized out>, xflags=0) at ../sysdeps/unix/sysv/linux/spawni.c:431
#3 0x00007f97b7f6205b in __GI___posix_spawn (pid=pid@entry=0x7fff4553758c, path=path@entry=0x7f97b800d152 "/bin/sh", file_actions=file_actions@entry=0x0, attrp=attrp@entry=0x7fff45537790, argv=argv@entry=0x7fff455375d0, envp=<optimized out>) at spawn.c:30
#4 0x00007f97b7ebca29 in do_system (line=0x7fff45537930 "rm -Rf /tmp/RtmpP6I3cA") at ../sysdeps/posix/system.c:148
#5 0x00007f97a6e024b6 in R_system () from /usr/lib/libR.so
#6 0x00007f97a6e613cb in R_CleanTempDir () from /usr/lib/libR.so
#7 0x00007f97a709ed64 in _cffi_f_R_CleanTempDir (self=<optimized out>, noarg=<optimized out>) at build/temp.linux-x86_64-cpython-310/_rinterface_cffi_api.c:2279
#8 0x00007f97b819f024 in cfunction_vectorcall_NOARGS (func=<built-in method R_CleanTempDir of _cffi_backend.Lib object at remote 0x7f97a70fa020>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/methodobject.c:489
#9 0x00007f97b8194f4a in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7f979c2edcf8, callable=<built-in method R_CleanTempDir of _cffi_backend.Lib object at remote 0x7f97a70fa020>, tstate=0x55bb0b442080) at ./Include/cpython/abstract.h:114
#10 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f979c2edcf8, callable=<built-in method R_CleanTempDir of _cffi_backend.Lib object at remote 0x7f97a70fa020>) at ./Include/cpython/abstract.h:123
#11 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fff45537e90, tstate=<optimized out>) at Python/ceval.c:5891
#12 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4181
#13 0x00007f97b81a1878 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7f979c2edb70, for file /usr/local/lib/python3.10/site-packages/rpy2/rinterface_lib/embedded.py, line 322, in endr (fatal=0, rlib=<_cffi_backend.Lib at remote 0x7f97a70fa020>), tstate=0x55bb0b442080) at ./Include/internal/pycore_ceval.h:46
#14 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, con=0x7f97a7128cb0, tstate=0x55bb0b442080) at Python/ceval.c:5065
#15 _PyFunction_Vectorcall (func=<function at remote 0x7f97a7128ca0>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:342
#16 0x00007f97b826f311 in atexit_callfuncs (state=0x55bb0b427010) at ./Modules/atexitmodule.c:98
#17 0x00007f97b826eb1b in _PyAtExit_Call (interp=<optimized out>) at ./Modules/atexitmodule.c:118
#18 Py_FinalizeEx () at Python/pylifecycle.c:1731
#19 0x00007f97b82672a3 in Py_RunMain () at Modules/main.c:668
#20 0x00007f97b823d6c9 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
#21 0x00007f97b7e9ad0a in __libc_start_main (main=0x55bb09a1c140 <main>, argc=2, argv=0x7fff455381d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff455381c8) at ../csu/libc-start.c:308
#22 0x000055bb09a1c07a in _start ()
```

The error I get when running one of the minimal examples above is the following: …
I simplified the example that reproduces the issue further. The issue might be with the way the R arrow package or pyarrow keeps references to the underlying arrays, and walks the nested references and frees them when Tables are referenced in both Python and R. Also, this seems specific to string arrays.

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra

base = importr('base')

code = """
function(df) {
  # df$col1     # no segfault on exit
  # I(df$col1)  # no segfault on exit
  # df$col2     # no segfault on exit
  I(df$col2)    # segfault on exit
}
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2": ["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(r_tbl)
output = rfunction(r_df)
```
Thank you for this! I know you checked the nightly builds, but do you know if this bug is also present in 10.0.0? (That would help narrow down the change that introduced it.) We're about to do a release and I'd love to fix this!

It smells to me like a problem with the R package... I don't think anything about the C data interface in the C++ bindings changed recently (but I will check). If I'm reading your example correctly, it seems like this problem is specific to … It looks like:

```r
x <- as.vector(arrow::as_arrow_array("xs"))
arrow:::is_arrow_altrep(x)
#> [1] TRUE
arrow:::test_arrow_altrep_is_materialized(x)
#> [1] FALSE
print(I(x))
#> [1] "xs"
arrow:::test_arrow_altrep_is_materialized(x)
#> [1] TRUE

# Could also try
# arrow:::test_arrow_altrep_force_materialize()
# as a more explicit test
```

Created on 2023-04-04 with reprex v2.0.2
Something else to try is an explicit …
Calling R's …
I did some sleuthing and added a note to the issue on the Arrow side... it's almost certainly something we need to fix there.
@lgautier What platform are you on? (I can generate a pyarrow wheel with a potential fix from a development branch but I need to know which wheel to generate...)
Building from source is a pain. You should be able to pick your OS/Python version from here: apache/arrow#34948 (comment), click the green "Crossbow" symbol, then click "Summary", and then click "wheel" towards the bottom of the page. (Or just tell me your OS/Python version and I'll find a better link for you!)
Still a segfault (Python 3.10, numpy 1.24, pandas 2.0.0)
Thank you for trying! Given that neither the PR nor running …
I even tried with R's garbage collection made exhaustive (…). The stack trace in gdb shows that this happens in …
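For reference, a minimal sketch (an illustration, not code quoted from this thread) of explicitly triggering garbage collection in both runtimes from the rpy2 session; as noted elsewhere in this thread, the segfault on exit persists even after this kind of cleanup:

```python
import gc
import rpy2.robjects as ro

# R side: drop all global bindings and run a full garbage collection.
ro.r('rm(list = ls()); invisible(gc(full = TRUE))')

# Python side: drop any remaining references (e.g. del pd_tbl, r_tbl, r_df)
# and then run the collector explicitly.
gc.collect()
```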
Thanks for continuing to dig into this! Even though the error is coming from Python, I worry that the reason there's an array that needs cleaning up at all is still R's fault (even if that array originated in Python). Arrow's memory representation doesn't rely on the duplication of identical strings (it copies them into a big long buffer, and in this case that big long buffer would have come from Python anyway).

I do think that a "freeing one time too many" type of thing might be happening, although the crash seems more consistent with attempting to acquire the GIL during finalization rather than a straight double free. Just curious... are you installing both arrow and pyarrow in a conda environment?
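For illustration (a sketch, not code from this thread): in pyarrow, a string array is backed by a validity bitmap, an int32 offsets buffer, and one contiguous data buffer holding the characters of every element, which is what the "big long buffer" above refers to:

```python
import pyarrow as pa

# The same data as col2 in the examples above: ten identical one-character strings.
arr = pa.array(["a" for _ in range(10)])

# String arrays carry three buffers: a validity bitmap (may be None when there
# are no nulls), int32 offsets, and one shared data buffer with all characters.
validity, offsets, data = arr.buffers()
print(arr.type)            # string
print(data.to_pybytes())   # the bytes of all ten values live in this one buffer
```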
I am not using conda. R is compiled from source, …
The call stack when it segfaults indicates that the shared library (…).

If trying to acquire the GIL while the Python process has already shut down is the issue, as you suspect, then this is happening here: https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/python/pyarrow/src/arrow/python/numpy_convert.cc#L56. And then it means that this code should have been called before Python has shut down.

The segfault happens even when both the R and Python arrow objects are deleted and garbage collection is performed for both languages. Or even when the R code that creates the R … fails with an error before returning (note the "Error here" comment in the snippet below):

```python
code = """
function(df) {
  # df$col1           # no segfault on exit
  # I(df$col1)        # no segfault on exit
  # df$col2           # no segfault on exit
  tmp <- I(df$col2)   # segfault on exit
  "a" + 1             # Error here
  tmp
}
"""
```

This means that materializing a …
Thank you again for this! I'll investigate from both ends: I think there were some PRs that added exit handlers to pyarrow recently. Also, I know of at least one place in the R code base where strings and non-strings get handled differently that was touched recently (this is the ALTREP stuff I keep mentioning).
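For context, the gdb backtrace earlier in this thread shows the crash happening inside Python's atexit machinery (`Py_FinalizeEx` → `atexit_callfuncs` → rpy2's `endr` → `R_CleanTempDir`), i.e. in a callback that runs while the interpreter is finalizing. A minimal sketch of that mechanism (the callback name below is made up for illustration; rpy2's real registration lives in its embedded-R module):

```python
import atexit

def shutdown_embedded_r() -> None:
    # Placeholder body: in rpy2 the registered callback is endr(), which ends
    # the embedded R session and triggers R_CleanTempDir() ("rm -Rf /tmp/Rtmp...").
    print("shutting down embedded R")

# Callbacks registered this way run during Py_FinalizeEx, after much of the
# interpreter has already started shutting down; they run in reverse order of
# registration relative to any other exit handlers (e.g. ones added by pyarrow).
atexit.register(shutdown_embedded_r)
```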
I just tried again with release 12.0.0 (both on the …). I have observed that the following small change in my minimal example (see earlier in this thread) …

This seems like additional evidence supporting some form of unwanted or incomplete copy of the underlying data / memory regions. The issue is only present with strings, so maybe this is caused by mismatched expectations between shallow and deep copies of a string array?
Fixed upstream (apache/arrow#35812).
…not own any Array references (#35812)

This was identified and 99% debugged by @lgautier on rpy2/rpy2-arrow#11. Thank you!

I have no idea why this does anything; however, the `RStringViewer` class *was* holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference?

The reprex I was using (provided by @lgautier) was:

Install fresh deps:

```bash
pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
```

Run this python script:

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra

base = importr('base')
nanoarrow = importr('nanoarrow')

code = """
function(df) {
  # df$col1     # no segfault on exit
  # I(df$col1)  # no segfault on exit
  # df$col2     # no segfault on exit
  I(df$col2)    # segfault on exit
}
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2": ["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(nanoarrow.as_nanoarrow_array_stream(r_tbl))
output = rfunction(r_df)
print(output)
```

Before this PR (installing R/arrow from main) I get:

```
(.venv) dewey@Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
[1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
zsh: segmentation fault  python reprex-arrow.py
```

After this PR I get:

```
(.venv) dewey@Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
[1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

(with no segfault)

I wonder if this also will help with #35391 since it's also a segfault involving the Python <-> R bridge.

* Closes: #34897

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
I am trying to translate the pandas → Arrow Table → R data frame conversion example from the official documentation ("From pandas.DataFrame to R data.frame through an Arrow Table") to plain Python without IPython magic.
This produces a segmentation fault in zsh and bash on exit, i.e. once the file has finished running: the code executes, outputs correct results, and then segfaults.
Here is a minimal reproducible example:
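(A minimal sketch of that conversion, assuming the same DataFrame and rpy2-arrow calls used in the follow-up examples later in this thread; the original scripts may differ in detail:)

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2_arrow.arrow as pyra

base = importr('base')

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2": ["a" for _ in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)      # pandas -> pyarrow Table
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)  # pyarrow Table -> R arrow Table
r_df = base.as_data_frame(r_tbl)               # R arrow Table -> R data.frame
# The script completes and prints correct results; the segfault happens on exit.
```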
This is another example with the same error:
This second example works fine without the print statement in the last line. As soon as I use r_df in any other function (in R or Python) a segmentation fault occurs on exit.
I am running the code as a Python file with this command:

```
python3 -q -X faulthandler myfile.py
```

with the output below. Unfortunately, faulthandler does not output a trace pointing me to the C module where the potential failure could be. The segmentation fault occurs both on my Mac and inside a Docker container running Linux:
macOS Ventura 13.1 (zsh shell):
Python version: 3.10.6 (with pyenv)
R version: 4.2.1
rpy2==3.5.4
rpy2-arrow==0.0.7
Docker OS: Debian GNU/Linux (bash shell):
Python version: 3.10.6
R version: 4.0.4
rpy2==3.5.4
rpy2-arrow==0.0.7