Using pyarrow with pypy #2089

Closed
bivald opened this issue May 29, 2018 · 6 comments

bivald commented May 29, 2018

Hi,

I'm trying to create parquet files with pypy (using pyarrow). After spending quite a few hours on this, I'm stuck. My basic question is:

  • Is it futile to even try to use pyarrow with pypy? 😄

The prebuilt wheels can't be used with pypy (of course), so I'm trying to build pyarrow from source. To simplify things, I'm using the libarrow debian apt repository.

  1. I'm using Debian Stretch since that's what the repository supports
  2. I'm installing libarrow+dev packages via apt repository
  3. I've checked out the arrow source code
  4. (I do some source folder trickery, cp -r /arrow/cpp/src/arrow /arrow/python/build/temp.linux-x86_64-2.7/)
  5. I run pypy setup.py build_ext --build-type=release

This fails with:

CMakeFiles/lib.dir/build.make:62: recipe for target 'CMakeFiles/lib.dir/lib.cpp.o' failed
make[2]: *** [CMakeFiles/lib.dir/lib.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/lib.dir/all' failed
make[1]: *** [CMakeFiles/lib.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
error: command 'make' failed with exit status 2

Full output on https://gist.github.com/bivald/01ab26bd6e5cedcf4d34354095d33bf2

I'm running all of this via Docker and can provide the Dockerfiles if anyone is interested. I guess my main question is:

Does anyone know of any fundamental issues that hinder pyarrow on pypy?

Nowadays pypy supports numpy and pandas (I'm not sure about Cython).

Normally you can often use pure python implementations when you're on pypy, but the only one I found for parquet is read-only. Worst case, I'll posix spawn a "normal" python process, but I would love to get it working properly.
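The fallback mentioned above (handing the parquet write to a "normal" python process) could look roughly like the following. This is only a sketch: the helper name `run_in_other_python`, the `CSV_TO_PARQUET` script and the interpreter name `python` are assumptions made for illustration, and the embedded script requires pandas and pyarrow to be installed in the CPython interpreter, not in pypy.

```python
import subprocess
import sys

# Script executed by the helper interpreter (assumed to be a CPython with
# pandas and pyarrow installed). It is never imported by the calling process.
CSV_TO_PARQUET = """
import sys
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_path, parquet_path = sys.argv[1:3]
table = pa.Table.from_pandas(pd.read_csv(csv_path))
pq.write_table(table, parquet_path)
"""


def run_in_other_python(script, args, python="python"):
    """Run `script` with another interpreter and return its stdout."""
    cmd = [python, "-c", script] + list(args)
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        # Forward the helper's error output before failing loudly.
        sys.stderr.write(err)
        raise RuntimeError("helper interpreter exited with %d" % proc.returncode)
    return out
```

A worker running on pypy would then call something like `run_in_other_python(CSV_TO_PARQUET, ["in.csv", "out.parquet"])`, where both file names are placeholders. The cost is one process start per file, which is usually acceptable for batch output but not for many small writes.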

The background is that I have several workers which run on pypy and I'm shifting them to produce parquet files over csv. The next step in the process uses CPython so parquet works great there.

Member

xhochy commented May 29, 2018

The failure does not seem to be related to pypy but rather to a version mismatch between the libarrow package and what you install on the Python side. For building with pypy, I would suggest using the plain build from source: https://arrow.apache.org/docs/python/development.html#development

I'm not aware of anyone who uses pyarrow together with pypy, so it might or might not work. Probably the latter, but then we should take a quick look at whether it can be fixed easily or whether it's a larger task.

Author

bivald commented May 30, 2018

I semi-followed the instructions (with a few modifications, such as setting PYTHON_EXECUTABLE in arrow/cpp/CMakeCache.txt). I was able to build both arrow and parquet.

I am able to write a simple parquet file (haven't yet tried a more advanced scenario):

pypy -c """import numpy as np
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')"""

shasum example.parquet
d11ec654e1bed28fcfad2d60e8820b1cfbc8f837  example.parquet

So... it might work. I will try to find some time to test a more "real world" scenario.

Tests
The tests appear to segfault in connection with dates. A few of them fail as well, but mostly the result is either pass or segfault (after installing the futures module for python2). Note that I'm running the tests on python2, which may or may not affect the results.

  • pyarrow/tests/test_array.py segfaults
    • test_cast_date32_to_int
  • pyarrow/tests/test_builder.py passed
  • pyarrow/tests/test_convert_builtin.py segfaults
    • test_limited_iterator_size_overflow
  • pyarrow/tests/test_convert_pandas.py segfaults
    • test_datetime64_to_date32
  • pyarrow/tests/test_cython.py passes
  • pyarrow/tests/test_deprecations.py no tests
  • pyarrow/tests/test_feather.py passes
  • pyarrow/tests/test_hdfs.py skipped
  • pyarrow/tests/test_io.py a few fails (getrefcount, which is not to be implemented in pypy)
  • pyarrow/tests/test_ipc.py passed
  • pyarrow/tests/test_misc.py passed
  • pyarrow/tests/test_parquet.py passed (1 fail)
  • pyarrow/tests/test_plasma.py (I didn't build it)
  • pyarrow/tests/test_scalars.py passes
  • pyarrow/tests/test_schema.py passes
  • pyarrow/tests/test_serialization.py segfaults
    • test_primitive_serialization
  • pyarrow/tests/test_table.py passes
  • pyarrow/tests/test_tensor.py passes, 1 fail (getrefcount, which is not to be implemented in pypy)
  • pyarrow/tests/test_types.py passes
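The getrefcount failures listed above are expected on pypy, so such tests could be skipped there rather than counted as failures. A minimal sketch, assuming a hypothetical test class `RefcountExample` and using only the standard library (with pytest, `pytest.mark.skipif` plays the same role):

```python
import platform
import sys
import unittest

# True when the tests run under PyPy, which does not provide sys.getrefcount.
IS_PYPY = platform.python_implementation() == "PyPy"


class RefcountExample(unittest.TestCase):
    @unittest.skipIf(IS_PYPY, "sys.getrefcount is not available on PyPy")
    def test_refcount_unchanged_after_use(self):
        obj = object()
        before = sys.getrefcount(obj)
        holder = [obj]  # take an extra reference...
        del holder      # ...and release it again
        self.assertEqual(sys.getrefcount(obj), before)
```

The decorator is evaluated when the class is defined, so the body of the test is never executed on pypy and the missing attribute is never touched.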

So most of them pass (and a few fail), but there are 4 segfaults:

  • pyarrow/tests/test_array.py segfaults
    • test_cast_date32_to_int
  • pyarrow/tests/test_convert_builtin.py segfaults
    • test_limited_iterator_size_overflow
  • pyarrow/tests/test_convert_pandas.py segfaults
    • test_datetime64_to_date32
  • pyarrow/tests/test_serialization.py segfaults
    • test_primitive_serialization

Each file might have more than one test that segfaults; I just listed the one that aborted the test run (I didn't go in and manually exclude the segfaulting tests to see if there are more).
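One way to find every crashing test without excluding them by hand is to run each test file (or each test id) in its own process and look at the exit status. The sketch below assumes a POSIX system, where a child killed by signal N is reported with return code -N; the function names are made up for illustration.

```python
import signal
import subprocess
import sys


def pytest_command(path, python=sys.executable):
    """Build the command line that runs one test file in its own process."""
    return [python, "-m", "pytest", path]


def classify(returncode):
    """Translate a child's return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode == -signal.SIGSEGV:
        return "segfault"
    if returncode < 0:
        # Negative codes mean the child was killed by a signal (POSIX only).
        return "killed by signal %d" % -returncode
    return "failed"


def run_isolated(commands):
    """Run each named command separately and return {name: label}."""
    results = {}
    for name, cmd in commands.items():
        results[name] = classify(subprocess.call(cmd))
    return results
```

For example, `run_isolated({p: pytest_command(p, python="pypy") for p in paths})`, with `paths` being the test files listed above, would report a label per file while a crash in one file no longer aborts the rest.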

Member

xhochy commented May 31, 2018

I have opened https://issues.apache.org/jira/browse/ARROW-2651 to track the PyPy support of Arrow. I must say that your outcome is better than I had expected. Regarding the segfaults, it is hard to estimate how difficult they are to fix.

To provide more information on the segmentation faults, you could run the code with coredumps enabled and afterwards inspect the coredump and post the backtrace here. This will work roughly as follows:

ulimit -c unlimited  # enables coredumps of unlimited size
py.test
# The above command will produce a file called `core` or `core.<pid>`
gdb python core
> thread apply all bt

Then post the output of thread apply all bt here. You might need to replace python above with the name of your Python executable (e.g. pypy).
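If the tests are started from a script rather than an interactive shell, the core-dump limit can also be raised from Python. The following is a sketch for POSIX systems; `run_with_core_dumps` is a hypothetical helper, and unlike `ulimit -c unlimited` it raises the soft limit only as far as the current hard limit, which an unprivileged process is always allowed to do.

```python
import resource
import subprocess
import sys


def _allow_core_dumps():
    # Runs in the child just before exec: raise the soft core-size limit
    # to the hard limit so that a crash can leave a core file behind.
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_with_core_dumps(cmd=None):
    """Run cmd with core dumps enabled and return its exit status."""
    if cmd is None:
        # Default: run pytest under the interpreter executing this script.
        cmd = [sys.executable, "-m", "pytest"]
    return subprocess.call(cmd, preexec_fn=_allow_core_dumps)
```

Whether and where a core file is actually written still depends on the system's core pattern settings. Once one exists, gdb can print the backtrace non-interactively with `gdb -batch -ex "thread apply all bt" pypy core`.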

Author

bivald commented May 31, 2018

I've also opened an issue on PyPy https://bitbucket.org/pypy/pypy/issues/2842/running-pyarrow-on-pypy-segfaults and created a reproducible sample at https://github.com/bivald/pyarrow-docker-test

Author

bivald commented May 31, 2018

Backtrace:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/venv/bin/pypy /venv/bin/py.test pyarrow'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fdc9c399cab in pypy_debug_catch_fatal_exception () from /venv/bin/libpypy-c.so
(gdb) thread apply all bt

Thread 1 (Thread 0x7fdc9e21c780 (LWP 10)):
#0  0x00007fdc9c399cab in pypy_debug_catch_fatal_exception () from /venv/bin/libpypy-c.so
#1  0x00007fdc9af21d04 in pypy_g_ccall_pypy_debug_catch_fatal_exception_ () from /venv/bin/libpypy-c.so
#2  0x00007fdc9b5815ca in pypy_g_unexpected_exception () from /venv/bin/libpypy-c.so
#3  0x00007fdc9b54f6fd in pypy_g_wrapper_second_level.star_1_13 () from /venv/bin/libpypy-c.so
#4  0x00007fdc95c55182 in arrow::py::PyDate_to_ms(PyDateTime_Date*) () from /repos/arrow/python/pyarrow/libarrow_python.so.10
#5  0x00007fdc95c64624 in arrow::Status arrow::py::internal::VisitSequence<arrow::py::TypedConverterVisitor<arrow::NumericBuilder<arrow::Date64Type>, arrow::py::Date64Converter>::AppendMultiple(_object*, long)::{lambda(_object*)#1}&>(_object*, arrow::py::TypedConverterVisitor<arrow::NumericBuilder<arrow::Date64Type>, arrow::py::Date64Converter>::AppendMultiple(_object*, long)::{lambda(_object*)#1}&) () from /repos/arrow/python/pyarrow/libarrow_python.so.10
#6  0x00007fdc95c64ca5 in arrow::py::TypedConverterVisitor<arrow::NumericBuilder<arrow::Date64Type>, arrow::py::Date64Converter>::AppendMultiple(_object*, long)
    () from /repos/arrow/python/pyarrow/libarrow_python.so.10
#7  0x00007fdc95c5597c in arrow::py::AppendPySequence(_object*, long, std::shared_ptr<arrow::DataType> const&, arrow::ArrayBuilder*) ()
   from /repos/arrow/python/pyarrow/libarrow_python.so.10
#8  0x00007fdc95c562f2 in arrow::py::ConvertPySequenceReal(_object*, long, std::shared_ptr<arrow::DataType> const*, arrow::MemoryPool*, std::shared_ptr<arrow::Array>*) () from /repos/arrow/python/pyarrow/libarrow_python.so.10
#9  0x00007fdc95c564e8 in arrow::py::ConvertPySequence(_object*, arrow::MemoryPool*, std::shared_ptr<arrow::Array>*) ()
   from /repos/arrow/python/pyarrow/libarrow_python.so.10
#10 0x00007fdc964f2429 in __pyx_pw_7pyarrow_3lib_77array(_object*, _object*, _object*) () from /repos/arrow/python/pyarrow/lib.pypy-41.so
#11 0x00007fdc9b663853 in pypy_g_generic_cpy_call__StdObjSpaceConst_funcPtr_SomeI_6 () from /venv/bin/libpypy-c.so
#12 0x00007fdc9b675815 in pypy_g_W_PyCFunctionObject_call_keywords () from /venv/bin/libpypy-c.so
#13 0x00007fdc9b1ad7ea in pypy_g_BuiltinCodePassThroughArguments1_funcrun_obj () from /venv/bin/libpypy-c.so
#14 0x00007fdc9b976a01 in pypy_g_call_args () from /venv/bin/libpypy-c.so
#15 0x00007fdc9b22735d in pypy_g_call_valuestack__AccessDirect_None () from /venv/bin/libpypy-c.so
#16 0x00007fdc9bb27e7d in pypy_g_CALL_METHOD__AccessDirect_star_1 () from /venv/bin/libpypy-c.so
#17 0x00007fdc9b220cfe in pypy_g_dispatch_bytecode__AccessDirect_None () from /venv/bin/libpypy-c.so
#18 0x00007fdc9b222e40 in pypy_g_handle_bytecode__AccessDirect_None () from /venv/bin/libpypy-c.so
#19 0x00007fdc9b921ff2 in pypy_g_portal_28 () from /venv/bin/libpypy-c.so
#20 0x00007fdc9bcd990d in pypy_g_ll_portal_runner__Unsigned_Bool_pypy_interpreter () from /venv/bin/libpypy-c.so

Member

wesm commented Jul 9, 2018

Closing this in favor of tracking progress in ARROW-2651

wesm closed this as completed Jul 9, 2018
pitrou pushed a commit that referenced this issue Oct 31, 2022
As described in the [ARROW-2651](https://issues.apache.org/jira/browse/ARROW-2651) issue, this patch fixes the C datetime module import mechanism for PyPy. 

This is related to #2089 which was closed in favor of the JIRA issue.

Authored-by: mattip
Signed-off-by: Antoine Pitrou