Using pyarrow with pypy #2089

bivald · 2018-05-29T18:23:38Z

Hi,

I'm trying to create parquet files with pypy (using pyarrow). After spending quite a few hours on this, I'm stuck. My basic question is:

Is it futile to even try to use pyarrow with pypy? 😄

The prebuilt wheels can't be used with pypy (of course), so I'm trying to build pyarrow from source. To simplify things, I'm using the libarrow Debian apt repository.

I'm using Debian Stretch since that's what the repository supports
I'm installing libarrow+dev packages via apt repository
I've checked out the arrow source code
(I do some source folder trickery, cp -r /arrow/cpp/src/arrow /arrow/python/build/temp.linux-x86_64-2.7/)
I run pypy setup.py build_ext --build-type=release

This fails with:

CMakeFiles/lib.dir/build.make:62: recipe for target 'CMakeFiles/lib.dir/lib.cpp.o' failed
make[2]: *** [CMakeFiles/lib.dir/lib.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/lib.dir/all' failed
make[1]: *** [CMakeFiles/lib.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
error: command 'make' failed with exit status 2

Full output on https://gist.github.com/bivald/01ab26bd6e5cedcf4d34354095d33bf2

I'm running all of this via Docker and can provide the Dockerfiles if anyone is interested. I guess my main question is:

Does anyone know of anything fundamental that hinders pyarrow on pypy?

Nowadays pypy supports numpy and pandas (I'm not sure about Cython).

Normally you can use pure python implementations when you're on pypy, but the only one I found for parquet is read-only. Worst case I'll posix spawn a "normal" python process, but I would love to get it working properly.

The background is that I have several workers which run on pypy and I'm shifting them to produce parquet files over csv. The next step in the process uses CPython so parquet works great there.
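
For reference, the fallback mentioned above (handing the parquet write to a regular CPython process) could look roughly like the sketch below. The helper name `run_in_cpython` and the child script are assumptions for illustration, and the child script assumes pandas and pyarrow are installed for the CPython interpreter being called.

```python
import subprocess

# Script executed by the CPython child. It assumes pandas and pyarrow are
# importable there; the file names arrive as command-line arguments.
CSV_TO_PARQUET = """
import sys
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_path, parquet_path = sys.argv[1:3]
table = pa.Table.from_pandas(pd.read_csv(csv_path))
pq.write_table(table, parquet_path)
"""


def run_in_cpython(python_exe, script, *args):
    """Run `script` with the given interpreter and return its stdout.

    Raises subprocess.CalledProcessError if the child exits non-zero.
    """
    cmd = [python_exe, "-c", script] + list(args)
    return subprocess.check_output(cmd)
```

A pypy worker would then call something like `run_in_cpython("/usr/bin/python", CSV_TO_PARQUET, "out.csv", "out.parquet")`, where the interpreter path and file names are placeholders.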


xhochy · 2018-05-29T21:15:26Z

The failure does not seem to be related to pypy but rather a version mismatch between the libarrow package and what you install as Python. For building with pypy I would rather suggest that you use the plain build from source https://arrow.apache.org/docs/python/development.html#development

I'm not aware of anyone that uses pyarrow together with pypy so it might or might not work. Probably the latter but then we should have a quick look if it may be possible to easily fix it or if it's a larger task.

bivald · 2018-05-30T11:25:10Z

I semi-followed the instructions (with a few modifications, such as editing arrow/cpp/CMakeCache.txt to set PYTHON_EXECUTABLE). I was able to build both arrow and parquet.

I am able to write a simple parquet file (haven't yet tried a more advanced scenario):

pypy -c """import numpy as np
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')"""

shasum example.parquet
d11ec654e1bed28fcfad2d60e8820b1cfbc8f837  example.parquet
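
The same checksum can be computed from Python, which is handy when comparing a file written under pypy with one written under CPython from inside a test. This is only a sketch, and `sha1_of_file` is a hypothetical helper name. A differing digest does not by itself mean the file is wrong, since parquet files embed writer metadata that may vary between builds.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```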

So it might work. I will try to find some time to test a more "real world" scenario.

Tests
The tests appear to segfault where dates are involved. A few tests fail as well, but mostly the outcome is either pass or segfault (after installing the futures module for python2). Note that I'm running the tests on python2, which may or may not affect the results.

pyarrow/tests/test_array.py segfaults
- test_cast_date32_to_int
pyarrow/tests/test_builder.py passed
pyarrow/tests/test_convert_builtin.py segfaults
- test_limited_iterator_size_overflow
pyarrow/tests/test_convert_pandas.py segfaults
- test_datetime64_to_date32
pyarrow/tests/test_cython.py passes
pyarrow/tests/test_deprecations.py no tests
pyarrow/tests/test_feather.py passes
pyarrow/tests/test_hdfs.py skipped
pyarrow/tests/test_io.py a few fails (getrefcount, which is not to be implemented in pypy)
pyarrow/tests/test_ipc.py passed
pyarrow/tests/test_misc.py passed
pyarrow/tests/test_parquet.py passed (1 fail)
pyarrow/tests/test_plasma.py (I didn't build it)
pyarrow/tests/test_scalars.py passes
pyarrow/tests/test_schema.py passes
pyarrow/tests/test_serialization.py segfaults
- test_primitive_serialization
pyarrow/tests/test_table.py passes
pyarrow/tests/test_tensor.py passes, 1 fail (getrefcount, which is not to be implemented in pypy)
pyarrow/tests/test_types.py passes
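
The getrefcount failures above are expected on pypy, so such tests would normally be skipped there rather than counted as failures. A minimal sketch of that pattern follows, using the standard-library unittest module to stay self-contained; the test class and test name are made up for illustration and are not taken from the pyarrow test suite.

```python
import platform
import sys
import unittest

# True when running under PyPy rather than CPython.
IS_PYPY = platform.python_implementation() == "PyPy"


class RefcountTest(unittest.TestCase):
    @unittest.skipIf(IS_PYPY, "sys.getrefcount is not available on PyPy")
    def test_refcount_is_positive(self):
        # Only meaningful on interpreters that use reference counting.
        obj = object()
        self.assertGreater(sys.getrefcount(obj), 0)
```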

So most of them pass (and a few fail), but there are 4 segfaults:

pyarrow/tests/test_array.py segfaults
- test_cast_date32_to_int
pyarrow/tests/test_convert_builtin.py segfaults
- test_limited_iterator_size_overflow
pyarrow/tests/test_convert_pandas.py segfaults
- test_datetime64_to_date32
pyarrow/tests/test_serialization.py segfaults
- test_primitive_serialization

Each file might have more than one test that segfaults; I only listed the one that aborted the run (I didn't go in and manually exclude the segfaulting tests to see if there are more).
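
One way to find every segfaulting file or test without excluding them by hand is to run each one in its own process and inspect the exit status. The sketch below assumes a POSIX system, where a child killed by a signal reports a negative return code; the function names are hypothetical.

```python
import signal
import subprocess


def classify_exit(returncode):
    """Map a child's return code to 'passed', 'failed', or 'segfault'."""
    if returncode == 0:
        return "passed"
    # On POSIX, a negative return code means the child was killed by a signal.
    if returncode == -signal.SIGSEGV:
        return "segfault"
    # Any other non-zero status, including other signals, counts as failed.
    return "failed"


def run_isolated(commands):
    """Run each command in its own process so one crash cannot hide others."""
    results = {}
    for name, cmd in commands.items():
        results[name] = classify_exit(subprocess.call(cmd))
    return results
```

For example, `run_isolated({"test_array": ["pypy", "-m", "pytest", "pyarrow/tests/test_array.py"]})` would report that file on its own, and the same approach works per test when a single test is passed to pytest instead of a whole file.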

xhochy · 2018-05-31T09:47:13Z

I have opened https://issues.apache.org/jira/browse/ARROW-2651 to track the PyPy support of Arrow. I must say that your outcome is better than I had expected. Regarding the segfaults, it is hard to estimate how difficult they are to fix.

To provide more information on the segmentation faults, you could run the code with coredumps enabled and afterwards inspect the coredump and post the backtrace here. This will work roughly as follows:

ulimit -c unlimited  # enables coredumps of unlimited size
py.test
# The above command will produce a file called `core` or `core.<pid>`
gdb python core
> thread apply all bt

Then post the output of thread apply all bt here. You might need to replace python above with the name of your Python executable (e.g. pypy).
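
As a complement to the core dump, the `faulthandler` module can write the Python-level traceback when the process receives a fatal signal such as SIGSEGV. It is in the standard library from Python 3.3 on; on Python 2 it is a separate backport, and whether it behaves the same under pypy is an assumption here. The helper name and log path below are hypothetical.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump Python tracebacks of all threads to `log_path` on a fatal signal.

    The returned file object must stay open for as long as the handler is
    enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this once at the start of the test run (for instance from a conftest.py) shows which Python call was active when the crash happened, which helps to narrow down the gdb backtrace.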

bivald · 2018-05-31T12:42:09Z

I've also opened an issue on PyPy at https://bitbucket.org/pypy/pypy/issues/2842/running-pyarrow-on-pypy-segfaults and created a reproducible sample at https://github.com/bivald/pyarrow-docker-test

bivald · 2018-05-31T12:44:16Z

Backtrace:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/venv/bin/pypy /venv/bin/py.test pyarrow'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fdc9c399cab in pypy_debug_catch_fatal_exception () from /venv/bin/libpypy-c.so
(gdb) thread apply all bt

Thread 1 (Thread 0x7fdc9e21c780 (LWP 10)):
#0  0x00007fdc9c399cab in pypy_debug_catch_fatal_exception () from /venv/bin/libpypy-c.so
#1  0x00007fdc9af21d04 in pypy_g_ccall_pypy_debug_catch_fatal_exception_ () from /venv/bin/libpypy-c.so
#2  0x00007fdc9b5815ca in pypy_g_unexpected_exception () from /venv/bin/libpypy-c.so
#3  0x00007fdc9b54f6fd in pypy_g_wrapper_second_level.star_1_13 () from /venv/bin/libpypy-c.so
#4  0x00007fdc95c55182 in arrow::py::PyDate_to_ms(PyDateTime_Date*) () from /repos/arrow/python/pyarrow/libarrow_python.so.10
#5  0x00007fdc95c64624 in arrow::Status arrow::py::internal::VisitSequence<arrow::py::TypedConverterVisitor<arrow::NumericBuilder<arrow::Date64Type>, arrow::py::Date64Converter>::AppendMultiple(_object*, long)::{lambda(_object*)#1}&>(_object*, arrow::py::TypedConverterVisitor<arrow::NumericBuilder<arrow::Date64Type>, arrow::py::Date64Converter>::AppendMultiple(_object*, long)::{lambda(_object*)#1}&) () from /repos/arrow/python/pyarrow/libarrow_python.so.10
#6  0x00007fdc95c64ca5 in arrow::py::TypedConverterVisitor<arrow::NumericBuilder<arrow::Date64Type>, arrow::py::Date64Converter>::AppendMultiple(_object*, long)
    () from /repos/arrow/python/pyarrow/libarrow_python.so.10
#7  0x00007fdc95c5597c in arrow::py::AppendPySequence(_object*, long, std::shared_ptr<arrow::DataType> const&, arrow::ArrayBuilder*) ()
   from /repos/arrow/python/pyarrow/libarrow_python.so.10
#8  0x00007fdc95c562f2 in arrow::py::ConvertPySequenceReal(_object*, long, std::shared_ptr<arrow::DataType> const*, arrow::MemoryPool*, std::shared_ptr<arrow::Array>*) () from /repos/arrow/python/pyarrow/libarrow_python.so.10
#9  0x00007fdc95c564e8 in arrow::py::ConvertPySequence(_object*, arrow::MemoryPool*, std::shared_ptr<arrow::Array>*) ()
   from /repos/arrow/python/pyarrow/libarrow_python.so.10
#10 0x00007fdc964f2429 in __pyx_pw_7pyarrow_3lib_77array(_object*, _object*, _object*) () from /repos/arrow/python/pyarrow/lib.pypy-41.so
#11 0x00007fdc9b663853 in pypy_g_generic_cpy_call__StdObjSpaceConst_funcPtr_SomeI_6 () from /venv/bin/libpypy-c.so
#12 0x00007fdc9b675815 in pypy_g_W_PyCFunctionObject_call_keywords () from /venv/bin/libpypy-c.so
#13 0x00007fdc9b1ad7ea in pypy_g_BuiltinCodePassThroughArguments1_funcrun_obj () from /venv/bin/libpypy-c.so
#14 0x00007fdc9b976a01 in pypy_g_call_args () from /venv/bin/libpypy-c.so
#15 0x00007fdc9b22735d in pypy_g_call_valuestack__AccessDirect_None () from /venv/bin/libpypy-c.so
#16 0x00007fdc9bb27e7d in pypy_g_CALL_METHOD__AccessDirect_star_1 () from /venv/bin/libpypy-c.so
#17 0x00007fdc9b220cfe in pypy_g_dispatch_bytecode__AccessDirect_None () from /venv/bin/libpypy-c.so
#18 0x00007fdc9b222e40 in pypy_g_handle_bytecode__AccessDirect_None () from /venv/bin/libpypy-c.so
#19 0x00007fdc9b921ff2 in pypy_g_portal_28 () from /venv/bin/libpypy-c.so
#20 0x00007fdc9bcd990d in pypy_g_ll_portal_runner__Unsigned_Bool_pypy_interpreter () from /venv/bin/libpypy-c.so
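
Frame #4 is the last Arrow frame before control passes back into pypy: `arrow::py::PyDate_to_ms`, which, going by its name and the Date64 builder around it, converts a Python date into milliseconds since the Unix epoch. A pure-Python sketch of that conversion is shown below to make clear what value is being computed. It is not Arrow's implementation, and `date_to_ms` is a hypothetical name.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
MS_PER_DAY = 24 * 60 * 60 * 1000


def date_to_ms(value):
    """Return milliseconds between the Unix epoch and midnight of `value`."""
    return (value - EPOCH).days * MS_PER_DAY
```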

wesm · 2018-07-09T20:48:36Z

Closing this in favor of tracking progress in ARROW-2651

As described in the [ARROW-2651](https://issues.apache.org/jira/browse/ARROW-2651) issue, this patch fixes the C datetime module import mechanism for PyPy. This is related to #2089, which was closed in favor of the JIRA issue. Authored-by: mattip. Signed-off-by: Antoine Pitrou.

wesm closed this as completed Jul 9, 2018

mattip mentioned this issue Oct 28, 2022

ARROW-2651: [Python] Fix datetime C API import for PyPy #14539

Merged

asfimport mentioned this issue Jan 7, 2020

[C++/Python] Document how to provide information on segfaults #19047

Open

asfimport mentioned this issue May 3, 2023

[Python] Build & Test with PyPy #19046

Open
