
Segmentation fault when converting arrow table to r data frame #11

Closed
jasmincl opened this issue Mar 24, 2023 · 26 comments


@jasmincl

I am trying to translate the pandas -> Arrow table -> R data.frame conversion example from the official documentation ("From pandas.DataFrame to R data.frame through an Arrow Table") to plain Python, without IPython magic.

This produces a segmentation fault in both zsh and bash on exit, i.e. after the script has finished running: the code executes, prints correct results, and then segfaults.

Here is a minimal reproducible example:

import faulthandler; faulthandler.enable()
import pandas as pd
import pyarrow
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
import rpy2.robjects.conversion
import rpy2_arrow.arrow as pyra
from rpy2.robjects.conversion import localconverter
base = importr('base')

code = """
    function(df) {
        cbind(df$col1,df$col2)
    }
"""
rfunction = rpy2.robjects.r(code)

df = pd.DataFrame({
    "col1": range(10),
    "col2": ["a" for num in range(10)]
})

conv_arrow = rpy2.robjects.conversion.Converter(
    'Pandas to data.frame',
    template=pyra.converter)

@conv_arrow.py2rpy.register(pd.DataFrame)
def py2rpy_pandas(dataf):
    pa_tbl = pyarrow.Table.from_pandas(dataf)
    return base.as_data_frame(pa_tbl)

conv = (
    rpy2.robjects.default_converter
    + pandas2ri.converter
    + conv_arrow
)

with localconverter(conv):
    output = rfunction(df)

This is another example with the same error:

import faulthandler; faulthandler.enable()
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2_arrow.arrow as pyra

base = importr('base')


df = pd.DataFrame({
    "col1": range(10),
    "col2": ["a" for num in range(10)]
})

df_pyarrow = pyarrow.Table.from_pandas(df)
r_df = pyra.pyarrow_table_to_r_table(df_pyarrow)
r_df = base.as_data_frame(r_df) 
print(r_df)

This second example works fine without the print statement in the last line. As soon as I use r_df in any other function (in R or Python), a segmentation fault occurs on exit.

I am running the code as a Python file with this command: python3 -q -X faulthandler myfile.py, with the output below. Unfortunately, faulthandler does not output a traceback pointing me to the C module where the failure might originate.

R[write to console]: 
Attaching package: ‘dplyr’

R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag

R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

R[write to console]: There were 25 warnings (use warnings() to see them)
R[write to console]: 

zsh: segmentation fault  python3 -q -X faulthandler 
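As a side note, here is a minimal, stdlib-only sketch (unrelated to rpy2 or arrow) of what faulthandler can and cannot do: it installs handlers for fatal signals and can dump the Python stacks of running threads on demand, but a crash that happens during or after interpreter shutdown may be too late for it to report anything useful.

```python
# Minimal stdlib sketch: faulthandler installs handlers for fatal
# signals (SIGSEGV, SIGFPE, ...) and can dump Python-level stacks.
# A segfault inside a C extension after interpreter finalization,
# however, may happen too late for faulthandler to report.
import faulthandler
import tempfile

faulthandler.enable()
assert faulthandler.is_enabled()

# Dump the current thread's Python stack to a real file descriptor
# (faulthandler writes to an fd, so io.StringIO would not work here).
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)
    f.seek(0)
    dump = f.read()

assert "most recent call first" in dump
```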

The segmentation fault occurs both on my Mac and inside a Docker container running Linux:

macOS Ventura 13.1 (zsh shell):
Python version: 3.10.6 (with pyenv)
R version: 4.2.1
rpy2==3.5.4
rpy2-arrow==0.0.7

Docker OS: Debian GNU/Linux (bash shell):
Python version: 3.10.6
R version: 4.0.4
rpy2==3.5.4
rpy2-arrow==0.0.7

@lgautier
Member

I can't reproduce a segfault for the first example with

  • Linux
  • R: 4.2.1-Patched
  • rpy2: 3.5.10
  • rpy2-arrow: 0.0.7

I suggest updating rpy2 to the latest release.

Besides that, the warning messages you show suggest that there is more going on in your R process than what is in the example, for instance the warning about loading dplyr. Do you have an .RData file loading data from an older R session, or a custom startup script for R?

@paleolimbot
Collaborator

What version of arrow for R are you using? There was a bug identified in 11.0.0 (not present in 10.0.0, and fixed on dev) that might be responsible.

@lgautier
Member

lgautier commented Mar 25, 2023

@paleolimbot : I just updated R to 4.2.3-Patched and ran update.packages(). Now I see a segfault when exiting the Python process after running the first example. arrow is 11.0.0.3.

Running Python through gdb lands on ../sysdeps/unix/sysv/linux/clock_nanosleep.c as the point of failure. Could this be related to apache/arrow#33424 ?

@paleolimbot
Collaborator

I was thinking of apache/arrow#34489. Does installing the R package from the nightly builds help? https://arrow.apache.org/docs/r/articles/install_nightly.html

@lgautier
Member

Still a segfault when exiting the process with arrow-nightly.

gdb says:

Thread 18 "python" received signal SIG32, Real-time event 32.
[Switching to Thread 0x7fffd1871700 (LWP 44062)]
0x00007ffff7e9623f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fffd189f680 <cli.tick_ts>, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78	../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.

@lgautier
Member

I worked out a smaller example. In a nutshell, the sequence is:

pandas DataFrame -> pyarrow table -> R arrow Table -> R data.frame -> R cbind() on columns of that data.frame

Oddly, the segfault happens with one of the column types but not the other (see the R code below).

import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra
base = importr('base')

rcode = """
    function(df) {
        # cbind(df$col1,df$col2)  # segfault on exit
        # cbind(df$col2, df$col2) # segfault on exit
        cbind(df$col1, df$col1)  # no segfault on exit
    }
"""
rfunction = rpy2.robjects.r(rcode)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2":["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(r_tbl)

output = rfunction(r_df)

@jasmincl
Author

Thanks for the quick answers!
I tried the failing example code with different pyarrow versions, 8.0.0 through 11.0.0, and also with the latest rpy2, the arrow nightly builds, and R 4.2.3, both on my Mac and inside my Debian Docker container. Still the same issue, and I also managed to reproduce the clock_nanosleep error in gdb.

I'm quite new to gdb, but according to some discussions the clock_nanosleep frame could be due to incorrect symbol resolution in gdb and may not point to the actual issue. I'm not sure, though, how to investigate this further.

@paleolimbot
Collaborator

It does seem unlikely that nanosleep is segfaulting. Can you see what all the other threads are doing at the time of the segfault? (Maybe thread apply all bt... I forget the exact incantation.)

@jasmincl
Author

With (gdb) thread apply all bt I get the following output on my Linux system. Not sure if that is helpful:

Thread 5 (Thread 0x7f979c2dd700 (LWP 787) "python3"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x55bb15c01b30) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55bb15c01ae0, cond=0x55bb15c01b08) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x55bb15c01b08, mutex=0x55bb15c01ae0) at pthread_cond_wait.c:638
#3  0x00007f97adee490c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f97a134a465 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /usr/local/lib/R/site-library/arrow/libs/arrow.so
#5  0x00007f97adee9ed0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f97b7e5cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f97b7f73a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f97a30dc700 (LWP 786) "python3"):
#0  0x00007f97b7f3a561 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7f97a3eab610 <cli.tick_ts>, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007f97b7f3fd43 in __GI___nanosleep (requested_time=requested_time@entry=0x7f97a3eab610 <cli.tick_ts>, remaining=remaining@entry=0x0) at nanosleep.c:27
#2  0x00007f97a3e8dd82 in clic_thread_func (arg=<optimized out>) at thread.c:37
#3  clic_thread_func (arg=<optimized out>) at thread.c:23
#4  0x00007f97b7e5cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f97b7f73a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f97acfff700 (LWP 764) "jemalloc_bg_thd"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x7f97ad60a5f0) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f97ad60a638, cond=0x7f97ad60a5c8) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f97ad60a5c8, mutex=0x7f97ad60a638) at pthread_cond_wait.c:638
#3  0x00007f97afa493c4 in background_thread_sleep (tsdn=<optimized out>, interval=<optimized out>, info=<optimized out>) at src/background_thread.c:232
#4  background_work_sleep_once (ind=0, info=<optimized out>, tsdn=<optimized out>) at src/background_thread.c:307
#5  background_thread0_work (tsd=<optimized out>) at src/background_thread.c:452
#6  background_work (ind=<optimized out>, tsd=<optimized out>) at src/background_thread.c:490
#7  background_thread_entry (ind_arg=<optimized out>) at src/background_thread.c:522
#8  0x00007f97b7e5cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9  0x00007f97b7f73a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

--Type <RET> for more, q to quit, c to continue without paging--
Thread 2 (Thread 0x7f97b4c92700 (LWP 763) "python3"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x7f97b73596e0 <thread_status+96>) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f97b7359690 <thread_status+16>, cond=0x7f97b73596b8 <thread_status+56>) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f97b73596b8 <thread_status+56>, mutex=0x7f97b7359690 <thread_status+16>) at pthread_cond_wait.c:638
#3  0x00007f97b56d7deb in blas_thread_server () from /usr/local/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
#4  0x00007f97b7e5cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f97b7f73a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f97b7d01740 (LWP 759) "python3"):
#0  clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:78
#1  0x00007f97b7f621de in __spawnix (pid=pid@entry=0x7fff4553758c, file=file@entry=0x7f97b800d152 "/bin/sh", file_actions=file_actions@entry=0x0, attrp=0x7fff455372e0, attrp@entry=0x7fff45537790, argv=argv@entry=0x7fff455375d0, envp=0x55bb0c9501f0, xflags=0, exec=0x7f97b7f3ffc0 <execve>) at ../sysdeps/unix/sysv/linux/spawni.c:382
#2  0x00007f97b7f62817 in __spawni (pid=pid@entry=0x7fff4553758c, file=file@entry=0x7f97b800d152 "/bin/sh", acts=acts@entry=0x0, attrp=attrp@entry=0x7fff45537790, argv=argv@entry=0x7fff455375d0, envp=<optimized out>, xflags=0) at ../sysdeps/unix/sysv/linux/spawni.c:431
#3  0x00007f97b7f6205b in __GI___posix_spawn (pid=pid@entry=0x7fff4553758c, path=path@entry=0x7f97b800d152 "/bin/sh", file_actions=file_actions@entry=0x0, attrp=attrp@entry=0x7fff45537790, argv=argv@entry=0x7fff455375d0, envp=<optimized out>) at spawn.c:30
#4  0x00007f97b7ebca29 in do_system (line=0x7fff45537930 "rm -Rf /tmp/RtmpP6I3cA") at ../sysdeps/posix/system.c:148
#5  0x00007f97a6e024b6 in R_system () from /usr/lib/libR.so
#6  0x00007f97a6e613cb in R_CleanTempDir () from /usr/lib/libR.so
#7  0x00007f97a709ed64 in _cffi_f_R_CleanTempDir (self=<optimized out>, noarg=<optimized out>) at build/temp.linux-x86_64-cpython-310/_rinterface_cffi_api.c:2279
#8  0x00007f97b819f024 in cfunction_vectorcall_NOARGS (func=<built-in method R_CleanTempDir of _cffi_backend.Lib object at remote 0x7f97a70fa020>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/methodobject.c:489
#9  0x00007f97b8194f4a in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7f979c2edcf8, callable=<built-in method R_CleanTempDir of _cffi_backend.Lib object at remote 0x7f97a70fa020>, tstate=0x55bb0b442080) at ./Include/cpython/abstract.h:114
#10 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f979c2edcf8, callable=<built-in method R_CleanTempDir of _cffi_backend.Lib object at remote 0x7f97a70fa020>) at ./Include/cpython/abstract.h:123
#11 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fff45537e90, tstate=<optimized out>) at Python/ceval.c:5891
#12 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4181
#13 0x00007f97b81a1878 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7f979c2edb70, for file /usr/local/lib/python3.10/site-packages/rpy2/rinterface_lib/embedded.py, line 322, in endr (fatal=0, rlib=<_cffi_backend.Lib at remote 0x7f97a70fa020>), tstate=0x55bb0b442080) at ./Include/internal/pycor--Type <RET> for more, q to quit, c to continue without paging--c
e_ceval.h:46
#14 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, con=0x7f97a7128cb0, tstate=0x55bb0b442080) at Python/ceval.c:5065
#15 _PyFunction_Vectorcall (func=<function at remote 0x7f97a7128ca0>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:342
#16 0x00007f97b826f311 in atexit_callfuncs (state=0x55bb0b427010) at ./Modules/atexitmodule.c:98
#17 0x00007f97b826eb1b in _PyAtExit_Call (interp=<optimized out>) at ./Modules/atexitmodule.c:118
#18 Py_FinalizeEx () at Python/pylifecycle.c:1731
#19 0x00007f97b82672a3 in Py_RunMain () at Modules/main.c:668
#20 0x00007f97b823d6c9 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
#21 0x00007f97b7e9ad0a in __libc_start_main (main=0x55bb09a1c140 <main>, argc=2, argv=0x7fff455381d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff455381c8) at ../csu/libc-start.c:308
#22 0x000055bb09a1c07a in _start ()

The error I get when running one of the minimal examples above is the following:

Starting program: /usr/local/bin/python3 segfault.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
[New Thread 0x7ffff4852700 (LWP 152)]
[New Thread 0x7fffec9ff700 (LWP 153)]
[Detaching after vfork from child process 154]
[Detaching after vfork from child process 156]
[Detaching after vfork from child process 171]
[Detaching after vfork from child process 173]
[New Thread 0x7fffe2c98700 (LWP 175)]
[New Thread 0x7fffdbe99700 (LWP 176)]
   col1 col2
1     0    a
2     1    a
3     2    a
4     3    a
5     4    a
6     5    a
7     6    a
8     7    a
9     8    a
10    9    a

[Detaching after vfork from child process 177]

Thread 4 "python3" received signal SIG32, Real-time event 32.
[Switching to Thread 0x7fffe2c98700 (LWP 175)]
0x00007ffff7afa561 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fffe3a63610 <cli.tick_ts>,
    rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
48	../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.

@lgautier
Member

lgautier commented Apr 2, 2023

I simplified the example further to help pin down the issue. The issue might be in the way R's arrow (or arrow itself) keeps references to the underlying arrays, and how it walks those nested references and frees them when Tables are referenced from both Python and R. This also seems specific to string arrays.

import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra
base = importr('base')

code = """
    function(df) {
        # df$col1  # no segfault on exit
        # I(df$col1)  # no segfault on exit
        # df$col2  # no segfault on exit
        I(df$col2)  # segfault on exit
    }
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2":["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(r_tbl)

output = rfunction(r_df)
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00005555557868b0 in new_threadstate (interp=0x0, init=1) at ../Python/pystate.c:616
616	../Python/pystate.c: No such file or directory.
(gdb) up
#1  0x00005555557a4ee2 in PyThreadState_New (interp=<optimized out>)
    at ../Python/pystate.c:684
684	../Python/pystate.c: No such file or directory.
(gdb) up
#2  PyGILState_Ensure () at ../Python/pystate.c:1504
1504	in ../Python/pystate.c
(gdb) up
#3  0x00007fffe0ee6003 in arrow::py::NumPyBuffer::~NumPyBuffer() ()
   from /opt/software/python/py310_env/lib/python3.10/site-packages/pyarrow/libarrow_python.so
(gdb) up
#4  0x00007fffe0ecb56d in std::_Sp_counted_ptr_inplace<arrow::ArrayData, std::allocator<arrow::ArrayData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /opt/software/python/py310_env/lib/python3.10/site-packages/pyarrow/libarrow_python.so
(gdb) up
#5  0x00007fffe0ecb4d5 in std::_Sp_counted_ptr_inplace<arrow::ArrayData, std::allocator<arrow::ArrayData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /opt/software/python/py310_env/lib/python3.10/site-packages/pyarrow/libarrow_python.so
(gdb) up
#6  0x00007fffde32f04a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
   from /opt/software/python/py310_env/lib/python3.10/site-packages/pyarrow/libarrow.so.1100
(gdb) up
#7  0x00007fffdee02a4c in arrow::(anonymous namespace)::ReleaseExportedArray(ArrowArray*) ()
   from /opt/software/python/py310_env/lib/python3.10/site-packages/pyarrow/libarrow.so.1100
(gdb) up
#8  0x00007fffd0aab6f7 in std::_Sp_counted_ptr_inplace<arrow::(anonymous namespace)::ImportedArrayData, std::allocator<arrow::(anonymous namespace)::ImportedArrayData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /usr/local/packages/R/4.2/lib/R/library/arrow/libs/arrow.so
(gdb) up
#9  0x00007fffd04eda4a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555c8b2a70)
    at /usr/include/c++/11/bits/shared_ptr_base.h:168
168		    _M_dispose();
(gdb) 

@paleolimbot
Collaborator

Thank you for this!

I know you checked the nightly builds, but do you know if this bug is also present in 10.0.0? (That would help narrow down the change that introduced it). We're about to do a release and I'd love to fix this! It smells to me like a problem with the R package...I don't think anything about the C data interface in the C++ bindings changed recently (but I will check).

If I'm reading your example correctly, it seems like this problem is specific to character() arrays that have been "materialized": In that example df$col2 would be a ChunkedArray pretending to be a character() and I() might "materialize" the array. In 11.0.0 there was a PR that changed the way that worked.

It looks like I() does, in fact, materialize the array:

x <- as.vector(arrow::as_arrow_array("xs"))
arrow:::is_arrow_altrep(x)
#> [1] TRUE
arrow:::test_arrow_altrep_is_materialized(x)
#> [1] FALSE

print(I(x))
#> [1] "xs"
arrow:::test_arrow_altrep_is_materialized(x)
#> [1] TRUE

# Could also try
# arrow:::test_arrow_altrep_force_materialize()
# as a more explicit test

Created on 2023-04-04 with reprex v2.0.2

@paleolimbot
Collaborator

Something else to try is an explicit gc() before shutting down the session. One thing that could be going wrong is that R is trying to release memory at a very inconvenient time (session exit), and I wonder whether explicitly releasing that memory before session exit would work or result in the same crash.
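To illustrate why an explicit collection can matter in plain Python terms (a stdlib-only sketch, nothing arrow-specific): reference counting alone cannot reclaim reference cycles, so an explicit gc.collect() moves their finalization to a point of your choosing instead of leaving it to interpreter shutdown.

```python
import gc

class Node:
    freed = 0
    def __init__(self):
        self.ref = None
    def __del__(self):
        Node.freed += 1

# Create a reference cycle so refcounting alone cannot free the objects.
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b
assert Node.freed == 0  # cycle still alive: refcounts never reached zero

gc.collect()            # the cycle collector frees both nodes now,
                        # rather than during interpreter teardown
assert Node.freed == 2
```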

@lgautier
Member

lgautier commented Apr 6, 2023

Calling R's gc() before shutting down the session does not solve the segfault. Neither does Python's gc.collect().

@paleolimbot
Collaborator

I did some sleuthing and added a note to the issue on the Arrow side...it's almost certainly something we need to fix there.

@paleolimbot
Collaborator

@lgautier What platform are you on? (I can generate a pyarrow wheel with a potential fix from a development branch but I need to know which wheel to generate...).

@lgautier
Member

lgautier commented Apr 7, 2023

x86_64. Depending on what is involved I might be able to build from source.

@paleolimbot
Collaborator

Building from source is a pain. You should be able to pick your OS/Python version from here: apache/arrow#34948 (comment) , click the green "Crossbow" symbol, and then click "Summary", and then click "wheel" towards the bottom of the page. (Or just tell me your OS/Python version and I'll find a better link for you!)

@lgautier
Member

Still a segfault (Python 3.10, numpy 1.24, pandas 2.0.0)

(gdb) backtrace
#0  0x00005555557868b0 in new_threadstate (interp=0x0, init=1) at ../Python/pystate.c:616
#1  0x00005555557a4ee2 in PyThreadState_New (interp=<optimized out>) at ../Python/pystate.c:684
#2  PyGILState_Ensure () at ../Python/pystate.c:1504
#3  0x00007fffe0f315a3 in arrow::py::NumPyBuffer::~NumPyBuffer() ()
   from /home/guest/software/py310_env/lib/python3.10/site-packages/pyarrow/libarrow_python.so
#4  0x00007fffe0f16c5d in std::_Sp_counted_ptr_inplace<arrow::ArrayData, std::allocator<arrow::ArrayData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /home/guest/software/py310_env/lib/python3.10/site-packages/pyarrow/libarrow_python.so
#5  0x00007fffe0f16bc5 in std::_Sp_counted_ptr_inplace<arrow::ArrayData, std::allocator<arrow::ArrayData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /home/guest/software/py310_env/lib/python3.10/site-packages/pyarrow/libarrow_python.so
#6  0x00007fffddbc297a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
   from /home/guest/software/py310_env/lib/python3.10/site-packages/pyarrow/libarrow.so.1200
#7  0x00007fffde6909ac in arrow::(anonymous namespace)::ReleaseExportedArray(ArrowArray*) ()
   from /home/guest/software/py310_env/lib/python3.10/site-packages/pyarrow/libarrow.so.1200
#8  0x00007fffd03746f7 in std::_Sp_counted_ptr_inplace<arrow::(anonymous namespace)::ImportedArrayData, std::allocator<arrow::(anonymous namespace)::ImportedArrayData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /usr/local/packages/R/4.2/lib/R/library/arrow/libs/arrow.so
#9  0x00007fffcfdb6a4a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555c017cb0)
    at /usr/include/c++/11/bits/shared_ptr_base.h:168
#10 0x00007fffd0378031 in std::_Sp_counted_ptr_inplace<arrow::(anonymous namespace)::ImportedBuffer, std::allocator<arrow::(anonymous namespace)::ImportedBuffer>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /usr/local/packages/R/4.2/lib/R/library/arrow/libs/arrow.so
#11 0x00007fffcfe7ecde in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x555559f5fe20)
    at /usr/include/c++/11/bits/shared_ptr_base.h:168
#12 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55555cd82208, __in_chrg=<optimized out>)
    at /usr/include/c++/11/bits/shared_ptr_base.h:705
#13 std::__shared_ptr<arrow::Buffer, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55555cd82200, __in_chrg=<optimized out>)
--Type <RET> for more, q to quit, c to continue without paging--c
   /11/bits/shared_ptr_base.h:1154
#14 std::shared_ptr<arrow::Buffer>::~shared_ptr (this=0x55555cd82200, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/shared_ptr.h:122
#15 std::_Destroy<std::shared_ptr<arrow::Buffer> > (__pointer=0x55555cd82200) at /usr/include/c++/11/bits/stl_construct.h:151
#16 std::_Destroy_aux<false>::__destroy<std::shared_ptr<arrow::Buffer>*> (__last=0x55555cd82210, __first=0x55555cd82200) at /usr/include/c++/11/bits/stl_construct.h:163
#17 std::_Destroy<std::shared_ptr<arrow::Buffer>*> (__last=0x55555cd82210, __first=<optimized out>) at /usr/include/c++/11/bits/stl_construct.h:196
#18 std::_Destroy<std::shared_ptr<arrow::Buffer>*, std::shared_ptr<arrow::Buffer> > (__last=0x55555cd82210, __first=<optimized out>) at /usr/include/c++/11/bits/alloc_traits.h:848
#19 std::vector<std::shared_ptr<arrow::Buffer>, std::allocator<std::shared_ptr<arrow::Buffer> > >::~vector (this=0x5555579454f8, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:680
#20 arrow::ArrayData::~ArrayData (this=0x5555579454d0, __in_chrg=<optimized out>) at /tmp/Rtmp3j0oQh/R.INSTALL2da0c279091ac/arrow/libarrow/arrow-11.0.0.100000321/include/arrow/array/data.h:77
#21 __gnu_cxx::new_allocator<arrow::ArrayData>::destroy<arrow::ArrayData> (__p=0x5555579454d0, this=0x5555579454d0) at /usr/include/c++/11/ext/new_allocator.h:168
#22 std::allocator_traits<std::allocator<arrow::ArrayData> >::destroy<arrow::ArrayData> (__p=0x5555579454d0, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:535
#23 std::_Sp_counted_ptr_inplace<arrow::ArrayData, std::allocator<arrow::ArrayData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x5555579454c0) at /usr/include/c++/11/bits/shared_ptr_base.h:528
#24 0x00007fffd0190042 in arrow::StringArray::~StringArray() () from /usr/local/packages/R/4.2/lib/R/library/arrow/libs/arrow.so
#25 0x00007fffcfdb6a4a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555b140940) at /usr/include/c++/11/bits/shared_ptr_base.h:168
#26 0x00007ffff7c45495 in __run_exit_handlers (status=0, listp=0x7ffff7e19838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#27 0x00007ffff7c45610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#28 0x00007ffff7c29d97 in __libc_start_call_main (main=main@entry=0x55555577f2f0 <main>, argc=argc@entry=2, argv=argv@entry=0x7fffffffe508) at ../sysdeps/nptl/libc_start_call_main.h:74
#29 0x00007ffff7c29e40 in __libc_start_main_impl (main=0x55555577f2f0 <main>, argc=2, argv=0x7fffffffe508, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe4f8) at ../csu/libc-start.c:392
#30 0x000055555577f225 in _start ()

@paleolimbot
Collaborator

Thank you for trying! Given that neither the PR nor running gc() explicitly fixes this, I'm wondering if my mental model of what's happening here is incomplete. I will try one other fix from the R side (we can request that external pointers not be deleted on exit, which may help).

@lgautier
Member

I even tried an exhaustive R garbage collection (gc(full = TRUE)), with and without Python's gc.collect() (and, when combined, both orderings of the two calls). Still a segfault.

The stack trace in gdb shows that this happens in pyarrow rather than in R's arrow. I also note that the issue is only present with string arrays. Freeing the individual strings in the array twice would lead to a segfault. Also, IIRC, R uses a memory optimization trick to avoid duplicating identical strings (see installChar in the R C API). Maybe the exit code leads to the memory for those strings being freed one time too many.

@paleolimbot
Collaborator

Thanks for continuing to dig into this!

Even though the error is coming from Python, I worry that the reason there's an array that needs cleaning up at all is still R's fault (even if that array originated in Python).

Arrow's memory representation doesn't rely on de-duplicating identical strings (it copies them into one big contiguous buffer, and in this case that buffer would have come from Python anyway). I do think that a "freeing one time too many" kind of thing might be happening, although the crash seems more consistent with attempting to acquire the GIL during finalization than with a straight double free.
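For concreteness, here is a pure-Python sketch (illustrative only, not actual Arrow code) of that "big long buffer" layout for a variable-length string array: one contiguous values buffer plus an offsets buffer, with no per-string allocations that R's string interning could interact with.

```python
# Illustrative sketch of Arrow's variable-length string layout:
# a single contiguous values buffer plus an offsets buffer, where
# string i occupies bytes offsets[i]:offsets[i+1] of the values buffer.
strings = ["a", "bc", "", "def"]

values = "".join(strings).encode()   # one contiguous buffer: b"abcdef"
offsets = [0]
for s in strings:
    offsets.append(offsets[-1] + len(s.encode()))

assert values == b"abcdef"
assert offsets == [0, 1, 3, 3, 6]

# Reconstruct string 1 ("bc") from the two buffers.
assert values[offsets[1]:offsets[2]].decode() == "bc"
```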

Just curious...are you installing both arrow and pyarrow in a conda environment?

@lgautier
Member

I am not using conda. R is compiled from source, arrow is installed using R's package management, pyarrow through pip and in a virtualenv.

@lgautier
Member

The call stack when it segfaults indicates that the shared library (.so) in arrow is the starting point, and it moves to pyarrow mid-way. I looked at the status of the embedded R during that sequence and noticed that the embedded R has already been ended (the embedded R has to be initialized before anything R-related can be done, and ended to run R's cleanup and finalizers). This means that the trigger is not R itself, but rather a registered Python exit handler calling a function in that .so.

If trying to acquire the GIL while the Python process has already shut down is the issue, as you suspect, then it is happening here: https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/python/pyarrow/src/arrow/python/numpy_convert.cc#L56 . And then it means this code should have been called before Python shut down.
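The registered-exit-handler mechanism can be sketched with the stdlib alone (the handlers below are hypothetical stand-ins, not the actual rpy2 ones): Python-level atexit callbacks run in reverse registration order during interpreter finalization, which is the phase where the backtraces above show rpy2's endr() calling R_CleanTempDir.

```python
# Stdlib-only sketch: atexit callbacks run LIFO during interpreter
# finalization. A handler that ends embedded R here would run before
# C-level cleanup in shared libraries loaded by the process.
import subprocess
import sys

script = """
import atexit
atexit.register(lambda: print("shut down R"))    # registered first, runs last
atexit.register(lambda: print("flush buffers"))  # registered last, runs first
print("script body done")
"""

out = subprocess.run(
    [sys.executable, "-c", script], capture_output=True, text=True
).stdout

assert out.splitlines() == ["script body done", "flush buffers", "shut down R"]
```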

The segfault happens even when both the R and Python arrow objects are deleted and garbage collection is performed in both languages. It even happens when the R code that creates the R data.frame from the pyarrow-created array fails (see the snippet below, a change to my minimal example a few comments above).

code = """
    function(df) {
        # df$col1  # no segfault on exit
        # I(df$col1)  # no segfault on exit
        # df$col2  # no segfault on exit
        tmp <- I(df$col2)  # segfault on exit
        "a" + 1  # Error here
        tmp
    }
"""

). This means that materializing a pyarrow-created array in R (or whatever I() is doing) is sufficient to cause the segfault at exit, even without an R object still protecting that data from collection. The problem is limited to string arrays, though. The issue is almost certainly wherever things differ between, say, FloatArrays and StringArrays, but I am unfamiliar with Arrow's code base.

@paleolimbot
Collaborator

Thank you again for this! I'll investigate from both ends: I think there were some PRs that added exit handlers to pyarrow recently. Also, I know of at least one place in the R code base where strings and non-strings get handled differently that was touched recently (this is the ALTREP stuff I keep mentioning).

@lgautier
Member

lgautier commented May 20, 2023

I just tried again with release 12.0.0 (on both the arrow and pyarrow sides). It still segfaults, and the backtrace in gdb looks similar.

I have observed that the following small change in my minimal example (see earlier in this thread)
toggles the segfault:

identity(df$col2)  # no segfault
I(df$col2)  # segfault 

This seems like additional evidence of some form of unwanted or incomplete copy of the underlying data / memory regions. The issue is only present with strings, so maybe it is caused by mismatched expectations between shallow and deep copies of a string array?
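In plain Python terms (a stdlib sketch, not Arrow's actual semantics), the shallow-versus-deep distinction hypothesized here looks like this: a shallow copy shares the underlying storage, so freeing that storage from both owners is the classic double-free shape, while a deep copy owns independent storage.

```python
# Stdlib sketch of shallow vs deep copies: a shallow copy shares the
# underlying container, so ownership must be tracked to release it
# exactly once; a deep copy owns an independent allocation.
import copy

table = {"col2": ["a"] * 10}
shallow = copy.copy(table)
deep = copy.deepcopy(table)

assert shallow["col2"] is table["col2"]   # shared underlying list
assert deep["col2"] is not table["col2"]  # independent allocation
assert deep["col2"] == table["col2"]      # but same contents
```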

@lgautier
Member

Fixed upstream (apache/arrow#35812).

paleolimbot added a commit to apache/arrow that referenced this issue May 30, 2023
… any Array references (#35812)

This was identified and 99% debugged by @lgautier on rpy2/rpy2-arrow#11 . Thank you!

I have no idea why this does anything; however, the `RStringViewer` class *was* holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference? The reprex I was using (provided by @lgautier) was:

Install fresh deps:

```bash
pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
```

Run this python script:

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra
base = importr('base')
nanoarrow = importr('nanoarrow')

code = """
    function(df) {
        # df$col1  # no segfault on exit
        # I(df$col1)  # no segfault on exit
        # df$col2  # no segfault on exit
        I(df$col2)  # segfault on exit
    }
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2":["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(nanoarrow.as_nanoarrow_array_stream(r_tbl))

output = rfunction(r_df)
print(output)
```

Before this PR (installing R/arrow from main) I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

zsh: segmentation fault  python reprex-arrow.py
```

After this PR I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

(with no segfault)

I wonder if this also will help with #35391 since it's also a segfault involving the Python <-> R bridge.
* Closes: #34897

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Jun 6, 2023
…ot own any Array references (apache#35812)

This was identified and 99% debugged by @ lgautier on rpy2/rpy2-arrow#11 . Thank you!

I have no idea why this does anything; however, the `RStringViewer` class *was* holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference? The reprex I was using (provided by @ lgautier) was:

Install fresh deps:

```bash
pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
```

Run this python script:

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra
base = importr('base')
nanoarrow = importr('nanoarrow')

code = """
    function(df) {
        # df$col1  # no segfault on exit
        # I(df$col1)  # no segfault on exit
        # df$col2  # no segfault on exit
        I(df$col2)  # segfault on exit
    }
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2":["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(nanoarrow.as_nanoarrow_array_stream(r_tbl))

output = rfunction(r_df)
print(output)
```

Before this PR (installing R/arrow from main) I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

zsh: segmentation fault  python reprex-arrow.py
```

After this PR I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

(with no segfault)

I wonder if this also will help with apache#35391 since it's also a segfault involving the Python <-> R bridge.
* Closes: apache#34897

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Jun 13, 2023
…ot own any Array references (apache#35812)

This was identified and 99% debugged by @ lgautier on rpy2/rpy2-arrow#11 . Thank you!

I have no idea why this does anything; however, the `RStringViewer` class *was* holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference? The reprex I was using (provided by @ lgautier) was:

Install fresh deps:

```bash
pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
```

Run this python script:

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra
base = importr('base')
nanoarrow = importr('nanoarrow')

code = """
    function(df) {
        # df$col1  # no segfault on exit
        # I(df$col1)  # no segfault on exit
        # df$col2  # no segfault on exit
        I(df$col2)  # segfault on exit
    }
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2":["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(nanoarrow.as_nanoarrow_array_stream(r_tbl))

output = rfunction(r_df)
print(output)
```

Before this PR (installing R/arrow from main) I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

zsh: segmentation fault  python reprex-arrow.py
```

After this PR I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

(with no segfault)

I wonder if this also will help with apache#35391 since it's also a segfault involving the Python <-> R bridge.
* Closes: apache#34897

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>