Load arrow tables into Perspective without converting to a binary buffer #1157

sc1f · 2020-08-19T17:13:30Z

Feature Request

Description of Problem:

Perspective expects arrow data to be loaded as an ArrayBuffer in Javascript, and a binary string in Python. This requires, in PyArrow at least, a few lines to convert an Arrow Table into binary:

stream = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(stream, arrow_table.schema)
writer.write_table(arrow_table)
writer.close()
perspective_table = perspective.Table(stream.getvalue().to_pybytes())

When loading arrow Tables, I expect Perspective to be compatible with Arrow Tables without having to do any conversion. The requirement of bytes for an Arrow binary is outlined slightly in the Python user guide, but the conversion process from an Arrow Table -> bytes is unclear.

Potential Solutions:

Write in an Arrow Table to binary conversion layer solely in the binding layer (using PyArrow or Arrow typescript), which would be simple but incomplete (would require reimplementation in future binding languages) and less performant. If there is a way to transfer a pointer to an arrow table from the binding layer into C++, then we should be able to write the conversion entirely in C++. We already malloc and memcpy the arrow binary from JS/Python into C++, so there might be something already there worth looking into.

The text was updated successfully, but these errors were encountered:

timkpaine · 2020-08-19T17:21:13Z

There is a way to transfer the pointer, I discussed this with @wesm a while back and it was eventually implemented to optimize transferring arrow tables between R and python. Let me look up the details of that thread

timkpaine · 2020-08-19T17:24:28Z

On Sep 10, 2019, at 19:35, Wes McKinney [email protected] wrote:

Hi Tim,

I see what you're saying now, sorry that I didn't understand sooner.

We actually need this feature to be able to pass instances of
shared_ptr (under very controlled conditions) into R using
reticulate, where T is any of

Array

ChunkedArray

DataType

RecordBatch

Table

and some other classes

I would suggest introducing a property on pyarrow Python objects that
returns the memory address of the wrapped shared_ptr (i.e. the
integer leading to shared_ptr*). Then you can create your copy of
that. Would that work? The only reason this is not implemented is that
no one has needed it yet, mechanically it does not strike me as that
complex.

See https://issues.apache.org/jira/browse/ARROW-3750. My comment in
November 2018 "Methods would need to be added to the Cython extension
types to give the memory address of the smart pointer object they
contain". I agree with my younger self. Are you up to submit a PR?

Wes

On Tue, Sep 10, 2019 at 6:31 PM Tim Paine [email protected] wrote:

The end goal is to go direct from pyarrow to wasm without intermediate transforms. I can definitely make it work as is, we'll just have to be careful that the code we compile to webassembly matches exactly either our local copy of arrow if the user hasn't installed pyarrow, otherwise their installed copy.

On Sep 10, 2019, at 19:12, Tim Paine [email protected] wrote:

We're building webassembly, so we obviously don't want to introduce a pyarrow dependency. I don't want to do any pyarrow manipulations in c++, just get the c++ table. I was hoping pyarrow might expose a raw pointer or have something castable.

It seems to be a big limitation, there is no way of communicating a pyarrow table to a c++ library that uses arrow without that library linking against pyarrow.

Tim

On Sep 10, 2019, at 17:44, Wes McKinney [email protected] wrote:

The Python extension types are defined in Cython, not C or C++ so you need
to load the Cython extensions in order to instantiate the classes.

Why do you have 2 copies of the C++ library? That seems easy to fix. If you
are using wheels from PyPI I would recommend that you switch to conda or
your own wheels without the C++ libraries bundled.

On Tue, Sep 10, 2019, 4:23 PM Tim Paine [email protected] wrote:

Is there no way to do it without PyArrow? My C++ library is building arrow
itself, which means if I use PyArow I’ll end up having 2 copies: one from
my local C++ only build, and one from PyArrow.

On Sep 10, 2019, at 5:18 PM, Wes McKinney [email protected] wrote:

hi Tim,

You can use the functions in

https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/pyarrow.h

You need to call "import_pyarrow()" from C++ before these APIs can be
used. It's similar to the NumPy C API in that regard

Wes

On Tue, Sep 10, 2019 at 4:13 PM Tim Paine [email protected] wrote:

Hey all, following up on a question I asked on stack overflow <
https://stackoverflow.com/questions/57863751/how-to-convert-pyarrow-table-to-arrow-table-when-interfacing-between-pyarrow-in
.

It seems there is some code <
https://arrow.apache.org/docs/python/extending.html#_CPPv412unwrap_tableP8PyObjectPNSt10shared_ptrI5TableEE>
in PyArrow’s C++ to convert from a PyArrow table to an Arrow table. The
problem with this is that my C++ library <
https://github.com/finos/perspective> is going to build and link against
Arrow on the C++ side rather than PyArrow side (because it will also be
consumed in WebAssembly), so I want to avoid also linking against PyArrow’s
copy of the arrow library. I also need to look for PyArrow’s header files,
which might conflict with the version in the local C++ code.

My solution right now is to just assert that PyArrow version == Arrow
version and do some pruning (so I link against local libarrow and PyArrow’s
libarrow_python rather than use PyArrow’s libarrow), but ideally it would
be great if there was a clean way to hand a PyArrow Table over to C++
without requiring the C++ to have PyArrow (e.g. using only a PyObject *).
Please forgive my ignorance/google skills if its already possible!

unwrap_table code:

https://github.com/apache/arrow/blob/c39e3508f93ea41410c2ae17783054d05592dc0e/python/pyarrow/public-api.pxi#L310
<
https://github.com/apache/arrow/blob/c39e3508f93ea41410c2ae17783054d05592dc0e/python/pyarrow/public-api.pxi#L310

library pruning:

https://github.com/finos/perspective/blob/python_arrow/cmake/modules/FindPyArrow.cmake#L53
<
https://github.com/finos/perspective/blob/python_arrow/cmake/modules/FindPyArrow.cmake#L53

Tim

timkpaine · 2020-08-19T17:27:05Z

unwrap_array was the original way, but we weren't interested at the time because it required linking against libarrow_python. But since we do that anyway it should be ok
https://arrow.apache.org/docs/python/extending.html#_CPPv412unwrap_arrayP8PyObjectPNSt10shared_ptrI5ArrayEE

sc1f · 2020-08-19T17:30:31Z

@timkpaine is there a way to do the same in the WASM? I'm personally ok with only being able to load pyarrow Tables in our python binding but for the sake of completeness it would be helpful in WASM

wesm · 2020-08-19T18:54:13Z

You should be able to use the C interface now. cc @pitrou @nealrichardson

timkpaine · 2023-07-29T17:07:16Z

Ok I think this is an example now:
https://github.com/timkpaine/arrow-cpp-python-nocopy

It does both parts in one, but basically the README explains it:

This utilizes Apache Arrow's ABI-stable C interface, which would, for example, allow your combined C++/python extension to compile and link against one version of arrow and be utilized with pyarrow which would otherwise be ABI-incompatible.

Does this seem like the right thing to be doing @wesm @pitrou @nealrichardson @westonpace? AFAIK you need to force your python extension to interact with your C++ library via the arrow C ABI, otherwise your extension might pickup a libarrow from pyarrow that is C++ ABI incompatible with the libarrow your C++ code linked against?

pitrou · 2023-08-08T14:24:54Z

Does this seem like the right thing to be doing @wesm @pitrou @nealrichardson @westonpace?

Yes, it does!

AFAIK you need to force your python extension to interact with your C++ library via the arrow C ABI, otherwise your extension might pickup a libarrow from pyarrow that is C++ ABI incompatible with the libarrow your C++ code linked against?

You are right indeed. This is how we provide zero-copy data sharing between R and Python.

Also @jorisvandenbossche FYI

davenza · 2023-10-07T00:43:35Z

@timkpaine (@wesm @pitrou too) I checked your solution in https://github.com/timkpaine/arrow-cpp-python-nocopy, but it is a bit confusing for me and I believe there is something wrong. I would be grateful if you could give me some advice.

I have Arrow C++ 13 installed on my system. I can compile your example with Arrow C++ 13, and then I install it along with pyarrow 13 in a virtualenv. This works fine.

Then, I force the installation of pyarrow 12 in the virtualenv (keeping your example compiled with Arrow C++ 13). In this case, there is a MemoryError: std::bad_alloc in test_create_schema_in_python and a segmentation fault in test_create_schema_in_cpp.

I believe this is what happens:

When arrow_python_nocopy is imported in Python, then libarrow.so.1300 is loaded (the library is compiled linking to that version).
Then, when pyarrow is imported (import pyarrow as pa), libarrow.so.1200 is imported (from the site-packages/pyarrow).
Now, we have two different versions of libarrow.so loaded and this leads to a segmentation fault.

I may be misunderstanding something, but how does your example solve this problem using the C ABI?

In fact, you keep using unwrap_array() from libarrow_python to do the C++ <-> Python conversions (I don't see the C ABI being used in the pybind11 caster).

timkpaine · 2023-10-07T01:23:16Z

When arrow_python_nocopy is imported in Python, then libarrow.so.1300 is loaded (the library is compiled linking to that version).

The user library should pull it from pyarrow instead of the system. That repo is me testing stuff, it might have more or less content than is necessary to do this properly.

davenza · 2023-10-07T09:21:47Z

Ok, so the library will always be compiled with the same version as pyarrow.

I probably misunderstood it, because I thought that the C ABI could be used to make different versions of Arrow work together (one version for the library and other version for pyarrow).

Anyway, thank you @timkpaine .

pitrou · 2023-10-07T09:32:00Z

Le 07/10/2023 à 11:21, David Atienza a écrit :

Ok, so the library will always be compiled with the same version as pyarrow. I probably misunderstood it, because I thought that the C ABI could be used to make different versions of Arrow work together (one version for the library and other version for pyarrow).

It can, but you have to be careful with library loading issues. In particular, loading two different versions *from Python* is probably not possible, because the symbols will clash.

davenza · 2023-10-07T16:47:47Z

For future reference for other readers who have stumbled upon the same problem, I have been thinking about how we could design an Arrow C++ <-> Python library that is decoupled from the installed pyarrow version, and now I think it is not possible.

Even if we solve the name clashes caused by the different versions of libarrow.so (mangling the symbols with the version, for example), the big problem here is that the ABI is not stable, so the library should be recompiled when the pyarrow version changes.

For example, I could compile a library that connects Arrow version X to pyarrow version Y using the C ABI. However, this compiled library would not work if I change the pyarrow version to Z (until I recompile and link to the Z version). This makes really really hard to decouple the library from the pyarrow version.

pitrou · 2023-10-07T17:05:41Z

the big problem here is that the ABI is not stable,

There is a misunderstanding here. The Arrow C++ ABI is not stable. The Arrow C Data Interface, however, is a stable ABI by design. This is even spelled out in the spec for it:

Goals

Expose an ABI-stable interface.

https://arrow.apache.org/docs/format/CDataInterface.html#goals

davenza · 2023-10-07T23:22:59Z

Yes, I know that the C ABI is stable, but you need the C++ and Cython ABIs to pass the data between C++ and Python.

As I see it, this is what you need to transform a C++ Arrow array of version X to a pyarrow.Array of version Y (back and forth):

              std::shared<arrow::Array>               (version X, C++ array, libarrow.so.X)
                           ^
                           |
    ImportArray()          |          ExportArray()   (you need this functions from libarrow.so.X)
                           |
                           v
              struct ArrowArray*          (C ABI, this should be compatible in libarrow.so.X and libarrow.so.Y)
                           ^
                           |
    ExportArray()          |          ImportArray()   (you need this functions from libarrow.so.Y)
                           |
                           v
              std::shared<arrow::Array>               (version Y, pyarrow version, libarrow.so.Y)
                           ^
                           |
    unwrap_array()         |          wrap_array()    (you need this functions from libarrow_python.so.Y)
                           |
                           v
              PyObject*                               (version Y, this is a pyarrow.Array in Python)

I hope the diagram is understandable. The functions on the left, transform the data from Python to C++, and the functions on the right do the opposite. As you can see, you need to link two versions of libarrow.so and one version of libarrow_python.so.

If the C++ and Cython ABIs were stable, I could statically link the Arrow C++ version X (using some type of mangling for the version), and then I could use my library with any pyarrow version. Since the C++ and Cython ABIs are not stable, I have to recompile each time I change the pyarrow version to link to the new version.

Of course, I understand the difficulties of creating a stable ABI for C++ and Cython, so I think this is not very realistic (especially in the short-term). However, I am concerned about how Arrow can create a rich ecosystem if two libraries that depend on two different versions of pyarrow cannot coexist in the same virtualenv.

Is there anything obvious that I am not seeing @pitrou? At this point I do not see a winning solution for the libraries that need C++ <-> Python interoperability in the same process.

timkpaine · 2023-10-07T23:52:37Z

your c++ library statically links libarrow, you ship a separate python library which dynamically links against libarrow and libarrow_python but does not vendor them. Pyarrow loads its libarrow and libarrow_python, then your python library uses those to create the C api data structures which it hands to your c++ library. My repo doesn't do all this yet.

davenza · 2023-10-08T09:15:46Z

@timkpaine I thought about a similar design, but I don't think it will work.

You link your python library dynamically against libarrow and libarrow_python. But, you link against specific versions of libarrow and libarrow_python. If the pyarrow version changes, pyarrow will load a different libarrow.so and libarrow_python.so, and the Python library will crash because there is no ABI stability. I have tested this myself in a little project

Maybe there is a compilation/linker flag I'm not aware of that can solve this. I am very interested to see how your minimal work example could be completed.

timkpaine · 2023-10-10T17:42:52Z

ref: apache/arrow#37797

davenza · 2023-10-10T18:08:28Z

@timkpaine thank you! I think this solves my problem. I hope this gets merged soon.

timkpaine · 2024-02-23T17:05:22Z

@davenza timkpaine/arrow-cpp-python-nocopy#3 removes the need to have pyarrow, and should solve all ABI problems as the linkage between the python layer and the C++ layer is completely separated by the C API layer. I'm only doing a raw CPython binding for now, will do pybind after I finish array and table support.

timkpaine · 2024-02-23T22:02:58Z

ok pybind is done too, this is pretty easy and seems correct.

@pitrou sorry to bother you, does this look like the current best practice? timkpaine/arrow-cpp-python-nocopy#3

pitrou · 2024-02-26T16:43:22Z

@timkpaine I see three problems:

pack/unpack will not roundtrip, see [Python] Passing back and forth from Python and C++ with Pyarrow C++ extension and pybind11. apache/arrow#10488 (comment)
that code seems to lack error detection and handling, and might have reference leaks
the no-op capsule destructors (ReleaseArrowArrayPyCapsule and ReleaseArrowSchemaPyCapsule) will lead to massive memory leaks. Why didn't you reuse the code in https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#create-a-pycapsule?

timkpaine · 2024-02-26T17:01:58Z

Thanks for looking, sounds like it's good in general, this is just a demo so I just wanted to demonstrate working.

I will adjust to check if it has those OR if it's already a correct capsule
This I'll add later
Laziness during debugging, was getting segfaults but it ended up being that c array returns a tuple and I didn't read close enough at the type annotations.

dhirschfeld · 2024-03-17T11:30:09Z

From reading this issue, my understanding is that, at present, the only way to pass a pyarrow.Table to perspective is to first convert it to a binary buffer as shown in the original post - is that correct?

mhkeller · 2024-03-26T04:22:10Z

For anyone looking for how to do this in Node.js, this seems to work

const table = new arrow.Table(recordBatches);
const ipcStream = arrow.tableToIPC(table, 'stream');
const bytes = Buffer.from(ipcStream, 'utf-8');

If there's a better way in Node to pass the table directly, please let me know!

sc1f added C++ enhancement Feature requests or improvements JS Python labels Aug 19, 2020

texodus mentioned this issue Jan 18, 2021

Statically link perspective-python and libarrow #1290

Merged

4 tasks

dennybritz mentioned this issue May 15, 2022

Feed a stream of RecordBatches to a table (Python) #1812

Open

timkpaine mentioned this issue Jul 29, 2023

[Python] Passing back and forth from Python and C++ with Pyarrow C++ extension and pybind11. apache/arrow#10488

Closed

timkpaine mentioned this issue Feb 23, 2024

WIP move to pycapsule timkpaine/arrow-cpp-python-nocopy#3

Merged

mhkeller mentioned this issue Mar 26, 2024

Feature request: parquet data files support #342

Closed

timkpaine mentioned this issue Aug 30, 2024

Can we use the arrow PyCapsule/C api when ingesting from pyarrow.Table? #2729

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Load arrow tables into Perspective without converting to a binary buffer #1157

Load arrow tables into Perspective without converting to a binary buffer #1157

sc1f commented Aug 19, 2020

timkpaine commented Aug 19, 2020

timkpaine commented Aug 19, 2020 •

edited

Loading

timkpaine commented Aug 19, 2020

sc1f commented Aug 19, 2020

wesm commented Aug 19, 2020

timkpaine commented Jul 29, 2023 •

edited

Loading

pitrou commented Aug 8, 2023 •

edited

Loading

davenza commented Oct 7, 2023

timkpaine commented Oct 7, 2023

davenza commented Oct 7, 2023

pitrou commented Oct 7, 2023 via email

davenza commented Oct 7, 2023

pitrou commented Oct 7, 2023

davenza commented Oct 7, 2023

timkpaine commented Oct 7, 2023

davenza commented Oct 8, 2023

timkpaine commented Oct 10, 2023

davenza commented Oct 10, 2023

timkpaine commented Feb 23, 2024

timkpaine commented Feb 23, 2024

pitrou commented Feb 26, 2024

timkpaine commented Feb 26, 2024

dhirschfeld commented Mar 17, 2024

mhkeller commented Mar 26, 2024

Load arrow tables into Perspective without converting to a binary buffer #1157

Load arrow tables into Perspective without converting to a binary buffer #1157

Comments

sc1f commented Aug 19, 2020

Feature Request

Description of Problem:

Potential Solutions:

timkpaine commented Aug 19, 2020

timkpaine commented Aug 19, 2020 • edited Loading

timkpaine commented Aug 19, 2020

sc1f commented Aug 19, 2020

wesm commented Aug 19, 2020

timkpaine commented Jul 29, 2023 • edited Loading

pitrou commented Aug 8, 2023 • edited Loading

davenza commented Oct 7, 2023

timkpaine commented Oct 7, 2023

davenza commented Oct 7, 2023

pitrou commented Oct 7, 2023 via email

davenza commented Oct 7, 2023

pitrou commented Oct 7, 2023

davenza commented Oct 7, 2023

timkpaine commented Oct 7, 2023

davenza commented Oct 8, 2023

timkpaine commented Oct 10, 2023

davenza commented Oct 10, 2023

timkpaine commented Feb 23, 2024

timkpaine commented Feb 23, 2024

pitrou commented Feb 26, 2024

timkpaine commented Feb 26, 2024

dhirschfeld commented Mar 17, 2024

mhkeller commented Mar 26, 2024

timkpaine commented Aug 19, 2020 •

edited

Loading

timkpaine commented Jul 29, 2023 •

edited

Loading

pitrou commented Aug 8, 2023 •

edited

Loading