[Python] Passing back and forth from Python and C++ with Pyarrow C++ extension and pybind11. #10488
You need to link to arrow_python_shared, too. It's necessary to use the unwrap functions in https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/pyarrow.h to retrieve the C++ object inside the Python wrapper objects. If you still have trouble, can you write to [email protected]? Thanks
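As a rough illustration of that unwrap path, here is a minimal sketch assuming the Result-based helpers that ship with Arrow 3.x/4.x (get_cpp_array is a hypothetical name, not part of the library):

#include <arrow/api.h>
#include <arrow/python/pyarrow.h>

std::shared_ptr<arrow::Array> get_cpp_array(PyObject* py_array) {
    // import_pyarrow() must succeed once before any wrap/unwrap call;
    // it returns 0 on success.
    if (arrow::py::import_pyarrow() != 0) {
        return nullptr;
    }
    // unwrap_array retrieves the C++ Array held inside the pyarrow
    // wrapper object, without copying the data.
    arrow::Result<std::shared_ptr<arrow::Array>> result =
        arrow::py::unwrap_array(py_array);
    if (!result.ok()) {
        return nullptr;  // the object was not a pyarrow Array
    }
    return result.ValueOrDie();
}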
Perhaps @maartenbreddels can help here as he wrote the original example.
It's been a while since I wrote the code in that repo, but it seems I only added Double support:
Thank you everyone for helping me out with this example. I think the issues were twofold and largely answered by Maarten and Wes. These issues were:
However, I now seem to be running into an odd error where the critical arrow::py::import_pyarrow() call is causing a segmentation fault. The behavior is a bit strange and varies between situations. To debug, the modified code now looks like this:
When the compiled helperfuncs.cpp extension is imported into the Python file I want to use it in while debugging, it imports without any problems, and I can even call the arrow::py::import_pyarrow() or arrow::py::import_pyarrow2() function, and both successfully return 0. However, it throws a segmentation fault while performing the first UnsafeAppend. It behaves a bit differently when I import it in a standalone Python interpreter shell. It leads to 2 different memory-related errors depending on how it's imported:
or
Uncommenting the arrow::py::import_pyarrow(); line in the PYBIND11_MODULE function, which is how the code was in the VAEX repository, also fails, but with a segmentation fault during import. Does anyone know why import_pyarrow() works from the Python file but not from a raw Python interpreter shell, and am I missing something that is causing segmentation faults? Is the libarrow_python.so.400 library the appropriate pyarrow library to link?
I would suggest debugging these crashes using gdb.
After running Python with debug symbols in GDB, here is the relevant part of the GDB backtrace, with directory names redacted:
in addition to:
As suspected, the issue is in UnsafeAppend, but I'm not sure why this file is missing from my install/system. Do I need to build pyarrow from source to get this to install? It looks like it is a reference to an x86 AVX-512 SIMD instruction, as referenced here. The missing file in question can presumably be found publicly online at places like this. I think part of the problem may be that I have a 3rd-gen Ryzen processor, which according to some reports does not support AVX-512. I can't really tell if it isn't working because of hardware limitations, or because the capability and file exist but I am not linking all the dependencies I need.
@frmnboi Thanks for the backtrace. Judging by the code and the error, I think the problem is simple: you're probably calling UnsafeAppend without calling Reserve on the builder first. Your loop appends values into capacity that was never allocated.
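For illustration, the safe pattern looks roughly like this (a sketch, not the poster's actual code):

#include <arrow/api.h>
#include <vector>

arrow::Result<std::shared_ptr<arrow::Array>> build_doubles(const std::vector<double>& values) {
    arrow::DoubleBuilder builder;
    // Reserve allocates capacity up front; UnsafeAppend skips the
    // capacity check and writes directly into that memory.
    ARROW_RETURN_NOT_OK(builder.Reserve(values.size()));
    for (double v : values) {
        builder.UnsafeAppend(v);  // no allocation, no bounds check
    }
    std::shared_ptr<arrow::Array> out;
    ARROW_RETURN_NOT_OK(builder.Finish(&out));
    return out;
}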
It sounds like you are reacting to the AVX symbol name in the backtrace. It's also not a linking error at this point, so don't worry about that. The function __memmove_avx_unaligned_erms is part of memcpy, which is what UnsafeAppend is doing under the hood. The error is reporting that the destination of the copy is not valid. Antoine's guess seems the most likely reason for this.
I have changed the code as suggested. From what I can tell by inserting print statements, the loop is failing the very first time it is called. Is there a good way to test whether pyarrow is working properly with regard to its memory management?
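One way to keep an eye on Arrow's allocations from the C++ side is to query the default memory pool. This is a sketch (it reports allocation counters rather than proving correctness); on the Python side, pyarrow.total_allocated_bytes() exposes the same counter:

#include <arrow/api.h>
#include <iostream>

void report_arrow_allocations() {
    arrow::MemoryPool* pool = arrow::default_memory_pool();
    // bytes_allocated() is the current outstanding allocation,
    // max_memory() the high-water mark.
    std::cout << "allocated: " << pool->bytes_allocated()
              << " bytes, peak: " << pool->max_memory() << " bytes" << std::endl;
}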
To give an update, I've tried this on a second, Intel-based computer and wasn't able to get it to run without segfaulting.
@frmnboi If you can make a github repo that compiles and exhibits the error I'd be willing to help you debug the issue.
Thanks @westonpace! I've put the necessary files into this repo: https://github.com/frmnboi/Arrow_Ext_Debug To build, the CMake file paths will need to be modified to properly link the shared libraries on your particular device. I have noted where this may be required in the README. I have also copied in the version of pybind11 I am using as a dependency, to reduce variability between different devices.
So if I use the attached CMakeLists.txt then everything seems to work. My setup is conda-based: I am using an environment arrow-release-4 which has arrow-cpp installed (which is where the shared libraries in the attached CMakeLists come from). I get the following output.
My suspicion is that the issue is here:
It appears you are linking arrow statically but linking arrow_python as a shared library. I'm not sure whether that is valid. Even if it is valid, there is no static arrow library supplied by pyarrow, which means that find_package must be finding Arrow from some other location. The end result is some kind of library ABI mismatch. I'm still getting up to speed on CMake myself, but what happens if you try...
I'm pretty sure at this point you can get rid of
Forgot to attach the "attached CMakeLists" :)
I think there might be something wrong with my arrow install. It turns out I had arrow installed in 2 locations (at least 2 of the locations Python checks for modules). I installed pyarrow inside a Python virtualenv and was using that for my CMake file. I have tried rebuilding using syntax of the same structure as the CMakeLists, using both shared library locations, but it appears I am getting a linking error:
I will try reinstalling and trying this again. The odd state of the install may be due to an unsuccessful attempt I made earlier to build and install arrow before realizing that pyarrow was a pip package, and the fact that I am using it as part of a virtualenv.
I reinstalled Python and pyarrow, and am now linking with the bottom 3 libraries without using a virtual environment. On my device, the shared libraries are located in:
Update: However, I appear to still have the same issue as before, with the same error message. Do you think installing with Anaconda, like your setup, is required to get this to work @westonpace?
This error (I think, a bit out of my depth here) means that two of your components are built with different versions of glibc.
This probably indicates a mismatch between the Python headers you compiled with and the Python .so file that you are dynamically linking against. Sorry for the delay, I missed the earlier ping. Installing with Anaconda shouldn't be required, but getting a correct setup can be tricky.
I'm not too sure why it is happening myself either. I suppose there could be an incompatibility between pyarrow and the version of gcc I am using. In the future, I may try to build and install arrow from source to avoid this problem. I'm going to keep this topic closed, as I'm not going to be in a position to debug at that granularity in the near future, and I can currently operate on the arrays in numpy.
I'm still working through this, and I might be completely off:
FWIW this seems to be much easier with the new PyCapsule protocol. I have a working example (albeit only for Schema and raw CPython bindings) on the linked PR; I will do array + table and pybind shortly. Reading a schema and returning some info from it as a Python string, from Python to C++ without pyarrow:

PyObject* schema_info_py(PyObject* self, PyObject* args) {
    PyObject* source;
    // parse arguments
    if(!PyArg_ParseTuple(args, "O", &source)) {
        PyErr_SetString(PyExc_TypeError, "Bad value provided");
        return NULL;
    }
    // check that the attribute holding the capsule exists
    if(!PyObject_HasAttrString(source, "__arrow_c_schema__")) {
        PyErr_SetString(PyExc_TypeError, "Bad value provided");
        return NULL;
    }
    // extract the capsule (both the bound method and the capsule are new
    // references; a production version should Py_DECREF them)
    PyObject* schema_capsule = PyObject_CallNoArgs(PyObject_GetAttrString(source, "__arrow_c_schema__"));
    struct ArrowSchema* c_schema = (struct ArrowSchema*) PyCapsule_GetPointer(schema_capsule, "arrow_schema");
    // Convert C schema to C++ schema and extract info
    std::shared_ptr<arrow::Schema> arrow_schema = arrow::ImportSchema(c_schema).ValueOrDie();
    std::string info = schema_info(arrow_schema);
    return PyUnicode_FromStringAndSize(info.c_str(), info.length());
}
The key bits are basically:

std::shared_ptr<arrow::Array> unpack_array(PyObject* array) {
    // call the method and get the tuple
    PyObject* array_capsule_tuple = PyObject_CallNoArgs(PyObject_GetAttrString(array, "__arrow_c_array__"));
    PyObject* schema_capsule_obj = PyTuple_GetItem(array_capsule_tuple, 0);
    PyObject* array_capsule_obj = PyTuple_GetItem(array_capsule_tuple, 1);
    // extract the capsule
    struct ArrowArray* c_array = (struct ArrowArray*) PyCapsule_GetPointer(array_capsule_obj, "arrow_array");
    // Convert C array to C++ array
    std::shared_ptr<arrow::Array> arrow_array = arrow::ImportArray(c_array, unpack_dtype(schema_capsule_obj)).ValueOrDie();
    return arrow_array;
}

PyObject* pack_array(std::shared_ptr<arrow::Array> array) {
    // Convert to C api
    struct ArrowArray* c_array = (struct ArrowArray*)malloc(sizeof(struct ArrowArray));
    struct ArrowSchema* c_schema = (struct ArrowSchema*)malloc(sizeof(struct ArrowSchema));
    (void)arrow::ExportArray(*array, c_array, c_schema);
    // Hoist out to pycapsule
    PyObject* array_capsule = PyCapsule_New(c_array, "arrow_array", ReleaseArrowArrayPyCapsule);
    PyObject* schema_capsule = PyCapsule_New(c_schema, "arrow_schema", ReleaseArrowSchemaPyCapsule);
    return PyTuple_Pack(2, schema_capsule, array_capsule);
}

std::shared_ptr<arrow::DataType> unpack_dtype(PyObject* dtype_capsule) {
    // extract the capsule
    struct ArrowSchema* c_dtype = (struct ArrowSchema*) PyCapsule_GetPointer(dtype_capsule, "arrow_schema");
    std::shared_ptr<arrow::DataType> arrow_dtype = arrow::ImportType(c_dtype).ValueOrDie();
    return arrow_dtype;
}

std::shared_ptr<arrow::Schema> unpack_schema(PyObject* schema) {
    // extract the capsule
    PyObject* schema_capsule = PyObject_CallNoArgs(PyObject_GetAttrString(schema, "__arrow_c_schema__"));
    struct ArrowSchema* c_schema = (struct ArrowSchema*) PyCapsule_GetPointer(schema_capsule, "arrow_schema");
    // Convert C schema to C++ schema
    std::shared_ptr<arrow::Schema> arrow_schema = arrow::ImportSchema(c_schema).ValueOrDie();
    return arrow_schema;
}

PyObject* pack_schema(std::shared_ptr<arrow::Schema> schema) {
    // Convert to C api
    struct ArrowSchema* c_schema = (struct ArrowSchema*)malloc(sizeof(struct ArrowSchema));
    (void)arrow::ExportSchema(*schema, c_schema);
    // Hoist out to pycapsule
    return PyCapsule_New(c_schema, "arrow_schema", ReleaseArrowSchemaPyCapsule);
}
@timkpaine Note that packing and unpacking in your snippet are not symmetrical. Calling unpack_array on the result of pack_array would not work: pack_array returns a bare (schema, array) capsule tuple, while unpack_array expects an object exposing an __arrow_c_array__ method.
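To make that concrete, a symmetric counterpart would have to consume the capsule tuple directly. This is a hypothetical sketch, reusing the includes and helpers from the snippet above (arrow/c/bridge.h for ImportArray/ImportType):

std::shared_ptr<arrow::Array> unpack_packed_array(PyObject* capsule_tuple) {
    // PyTuple_GetItem returns borrowed references
    PyObject* schema_capsule = PyTuple_GetItem(capsule_tuple, 0);
    PyObject* array_capsule = PyTuple_GetItem(capsule_tuple, 1);
    struct ArrowArray* c_array = (struct ArrowArray*) PyCapsule_GetPointer(array_capsule, "arrow_array");
    struct ArrowSchema* c_schema = (struct ArrowSchema*) PyCapsule_GetPointer(schema_capsule, "arrow_schema");
    // Rebuild the C++ array from the C-interface structs
    return arrow::ImportArray(c_array, arrow::ImportType(c_schema).ValueOrDie()).ValueOrDie();
}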
I'm trying to write a C++ extension to add a new column to a table I have. I create the table with pyarrow in Python, but I want to call a function in C++ to operate on the data, in-place if possible. Currently, I have this:
helperfuncs.cpp
This was taken from the one example I could find on Pybind11 and Pyarrow working together:
https://github.com/vaexio/vaex-arrow-ext
I compile this using CMake with the following excerpt:
CMake
and call it in Python with the following excerpt:
test.py
where the unchunked data['close'] is a pyarrow.lib.DoubleArray object and unchunked data['volume'] is a pyarrow.lib.Int64Array object.
Using CMake, this code will compile to a shared library and can be successfully imported into Python as the helperfuncs module. However, there are two issues that arise:
vol_adj_close(): incompatible function arguments. The following argument types are supported:
1. (arg0: arrow::NumericArray<arrow::DoubleType>, arg1: arrow::NumericArray<arrow::Int64Type>) -> arrow::NumericArray<arrow::DoubleType>
This one confuses me greatly, as what I can see from the documentation and code testing is:
pa.Array ----------------> <class 'pyarrow.lib.Array'>
pa.NumericArray -----> <class 'pyarrow.lib.NumericArray'>
The documentation seems to indicate that a NumericArray is a specific type of Array, so an implicit conversion should not be an issue. Otherwise, I do not see any way in the documentation to convert an Array to a NumericArray or vice versa.
Is there a difference between Python's pyarrow.lib.DoubleArray and C++'s arrow::NumericArray<arrow::DoubleType>?
On a final note, I know that pyarrow has a division operation that can perform element-wise division, like I need here, but in this case I am trying to see if I can get a C++ extension up and running for more complex problems.
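For reference, a hedged sketch of what such a binding could look like using pyarrow's wrap/unwrap helpers instead of custom type casters (hypothetical code, not the repo's actual helperfuncs.cpp; it assumes Arrow 4.x and that both arrow and arrow_python are linked as shared libraries):

#include <arrow/api.h>
#include <arrow/python/pyarrow.h>
#include <pybind11/pybind11.h>
#include <memory>
#include <stdexcept>

namespace py = pybind11;

py::object vol_adj_close(py::object close_obj, py::object volume_obj) {
    // Unwrap the pyarrow wrapper objects into C++ arrays.
    auto close = arrow::py::unwrap_array(close_obj.ptr()).ValueOrDie();
    auto volume = arrow::py::unwrap_array(volume_obj.ptr()).ValueOrDie();
    auto closes = std::static_pointer_cast<arrow::DoubleArray>(close);
    auto volumes = std::static_pointer_cast<arrow::Int64Array>(volume);

    arrow::DoubleBuilder builder;
    // Reserve before UnsafeAppend, per the earlier discussion.
    if (!builder.Reserve(closes->length()).ok()) {
        throw std::runtime_error("Reserve failed");
    }
    for (int64_t i = 0; i < closes->length(); ++i) {
        // element-wise division of close price by volume
        builder.UnsafeAppend(closes->Value(i) / volumes->Value(i));
    }
    std::shared_ptr<arrow::Array> result;
    if (!builder.Finish(&result).ok()) {
        throw std::runtime_error("Finish failed");
    }
    // Wrap the C++ array back into a pyarrow object.
    return py::reinterpret_steal<py::object>(arrow::py::wrap_array(result));
}

PYBIND11_MODULE(helperfuncs, m) {
    arrow::py::import_pyarrow();  // required before any wrap/unwrap call
    m.def("vol_adj_close", &vol_adj_close, "element-wise close/volume");
}

Taking py::object and unwrapping manually sidesteps the pybind11 "incompatible function arguments" error above, since pybind11 never has to know how to cast pyarrow's wrapper types.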