Load arrow tables into Perspective without converting to a binary buffer #1157
Comments
There is a way to transfer the pointer; I discussed this with @wesm a while back, and it was eventually implemented to optimize transferring Arrow tables between R and Python. Let me look up the details of that thread.
@timkpaine is there a way to do the same in WASM? I'm personally OK with only being able to load pyarrow Tables in our Python binding, but for the sake of completeness it would be helpful in WASM too.
You should be able to use the C interface now. cc @pitrou @nealrichardson |
OK, I think I have an example now. It does both parts in one, but the README explains it.
Does this seem like the right thing to be doing @wesm @pitrou @nealrichardson @westonpace? AFAIK you need to force your Python extension to interact with your C++ library via the Arrow C ABI; otherwise your extension might pick up an incompatible copy of the Arrow C++ library.
Yes, it does!
You are right indeed. This is how we provide zero-copy data sharing between R and Python. Also @jorisvandenbossche FYI.
@timkpaine (@wesm @pitrou too) I checked your solution in https://github.com/timkpaine/arrow-cpp-python-nocopy, but it is a bit confusing to me and I believe there is something wrong. I would be grateful if you could give me some advice. I have Arrow C++ 13 installed on my system. I can compile your example with Arrow C++ 13, and then install it along with pyarrow 13 in a virtualenv; this works fine. Then, I force the installation of pyarrow 12 in the virtualenv (keeping your example compiled with Arrow C++ 13). In this case, there is a crash. I believe this is what happens: […]
I may be misunderstanding something, but how does your example solve this problem using the C ABI? In fact, you keep using […]
The user library should pull it from pyarrow instead of the system. That repo is me testing stuff; it might have more or less content than is necessary to do this properly.
Ok, so the library will always be compiled with the same version as pyarrow. I probably misunderstood it, because I thought that the C ABI could be used to make different versions of Arrow work together (one version for the library and another version for pyarrow). Anyway, thank you @timkpaine.
On 07/10/2023 at 11:21, David Atienza wrote:

> Ok, so the library will always be compiled with the same version as pyarrow. I probably misunderstood it, because I thought that the C ABI could be used to make different versions of Arrow work together (one version for the library and another version for pyarrow).

It can, but you have to be careful with library loading issues. In particular, loading two different versions *from Python* is probably not possible, because the symbols will clash.
For future reference for other readers who have stumbled upon the same problem: I have been thinking about how we could design an Arrow C++ <-> Python library that is decoupled from the installed pyarrow version, and now I think it is not possible. Even if we solve the name clashes caused by the different versions of libarrow.so (mangling the symbols with the version, for example), the big problem here is that the ABI is not stable, so the library has to be recompiled when the pyarrow version changes. For example, I could compile a library that connects Arrow version X to pyarrow version Y using the C ABI. However, this compiled library would not work if I change the pyarrow version to Z (until I recompile and link against version Z). This makes it really hard to decouple the library from the pyarrow version.
There is a misunderstanding here. The Arrow C++ ABI is not stable. The Arrow C Data Interface, however, is a stable ABI by design. This is even spelled out in the spec for it:
https://arrow.apache.org/docs/format/CDataInterface.html#goals
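For readers unfamiliar with it: the C Data Interface consists of a few plain C structs whose layout is frozen by the spec, so it can even be described from Python with the stdlib. A sketch of the `ArrowSchema` struct as the spec defines it (ctypes is used here only to illustrate the layout):

```python
# Layout sketch of the ArrowSchema struct from the Arrow C Data Interface,
# written with Python's stdlib ctypes to show it is plain, version-stable C.
import ctypes

class ArrowSchema(ctypes.Structure):
    pass  # incomplete type first, so the struct can reference itself

ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),      # type description, e.g. b"l" for int64
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),    # binary-encoded key/value metadata
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),     # really void (*release)(struct ArrowSchema*)
    ("private_data", ctypes.c_void_p),
]

# Because this layout is fixed by the spec, a producer built against Arrow X
# and a consumer built against Arrow Y agree on it without sharing a C++ ABI.
print(ctypes.sizeof(ArrowSchema))
```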
Yes, I know that the C ABI is stable, but you need the C++ and Cython ABIs to pass the data between C++ and Python. As I see it, this is what you need to transform a C++ Arrow array of version X into a pyarrow.Array of version Y (back and forth):

[diagram omitted]
I hope the diagram is understandable. The functions on the left transform the data from Python to C++, and the functions on the right do the opposite. As you can see, you need to link two versions of libarrow.so and one version of libarrow_python.so.

If the C++ and Cython ABIs were stable, I could statically link Arrow C++ version X (using some type of symbol mangling for the version), and then I could use my library with any pyarrow version. Since the C++ and Cython ABIs are not stable, I have to recompile each time I change the pyarrow version, to link against the new version.

Of course, I understand the difficulties of creating a stable ABI for C++ and Cython, so I think this is not very realistic (especially in the short term). However, I am concerned about how Arrow can create a rich ecosystem if two libraries that depend on two different versions of pyarrow cannot coexist in the same virtualenv. Is there anything obvious that I am not seeing, @pitrou? At this point I do not see a winning solution for libraries that need C++ <-> Python interoperability in the same process.
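As a stdlib-only aside for readers, the stability argument can be made concrete: below, a hypothetical "producer" fills the spec's `ArrowArray` struct for an int64 column and a "consumer" reads the very same memory back through it. Nothing here touches the C++ or Cython ABI; only the frozen C struct layout is shared.

```python
# Minimal zero-copy handoff through the C Data Interface's ArrowArray struct.
import ctypes

class ArrowArray(ctypes.Structure):
    pass  # incomplete type so the struct can reference itself

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),     # really void (*release)(struct ArrowArray*)
    ("private_data", ctypes.c_void_p),
]

# "Producer": an int64 column [1, 2, 3] laid out as Arrow expects
# (buffer 0 = validity bitmap, NULL here meaning "no nulls"; buffer 1 = values).
values = (ctypes.c_int64 * 3)(1, 2, 3)
buffers = (ctypes.c_void_p * 2)(None, ctypes.cast(values, ctypes.c_void_p))

arr = ArrowArray()
arr.length = 3
arr.null_count = 0
arr.offset = 0
arr.n_buffers = 2
arr.n_children = 0
arr.buffers = ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p))

# "Consumer": reads the same memory back through the struct -- no copy happened.
data_ptr = ctypes.cast(ctypes.c_void_p(arr.buffers[1]),
                       ctypes.POINTER(ctypes.c_int64))
print([data_ptr[i] for i in range(arr.length)])
```

A real producer would also set the `release` callback so the consumer can free the buffers; that part is elided here.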
Your C++ library statically links libarrow; you ship a separate Python library which dynamically links against libarrow and libarrow_python but does not vendor them. Pyarrow loads its libarrow and libarrow_python, then your Python library uses those to create the C API data structures, which it hands to your C++ library. My repo doesn't do all this yet.
@timkpaine I thought about a similar design, but I don't think it will work. You link your Python library dynamically against libarrow and libarrow_python, but you link against specific versions of them. If the pyarrow version changes, pyarrow will load a different libarrow.so and libarrow_python.so, and the Python library will crash because there is no ABI stability. I have tested this myself in a little project. Maybe there is a compilation/linker flag I'm not aware of that can solve this. I am very interested to see how your minimal working example could be completed.
ref: apache/arrow#37797
@timkpaine thank you! I think this solves my problem. I hope this gets merged soon. |
@davenza timkpaine/arrow-cpp-python-nocopy#3 removes the need to have pyarrow, and should solve all ABI problems as the linkage between the python layer and the C++ layer is completely separated by the C API layer. I'm only doing a raw CPython binding for now, will do pybind after I finish array and table support. |
OK, pybind is done too; this is pretty easy and seems correct. @pitrou sorry to bother you, does this look like the current best practice? timkpaine/arrow-cpp-python-nocopy#3
@timkpaine I see three problems: […]
Thanks for looking. Sounds like it's good in general; this is just a demo, so I just wanted to demonstrate that it works.
From reading this issue, my understanding is that, at present, the only way to pass an Arrow Table to Perspective is still to serialize it into a binary IPC buffer first.
For anyone looking for how to do this in Node.js, this seems to work:

```js
const table = new arrow.Table(recordBatches);
const ipcStream = arrow.tableToIPC(table, 'stream'); // returns a Uint8Array
const bytes = Buffer.from(ipcStream); // no encoding argument needed for a Uint8Array
```

If there's a better way in Node to pass the table directly, please let me know!
Feature Request
Description of Problem:
Perspective expects Arrow data to be loaded as an ArrayBuffer in JavaScript, and a binary string in Python. This requires, in PyArrow at least, a few lines to convert an Arrow `Table` into binary.

When loading Arrow tables, I expect Perspective to be compatible with Arrow `Table`s without having to do any conversion. The requirement of `bytes` for an Arrow binary is outlined briefly in the Python user guide, but the conversion process from an Arrow `Table` to `bytes` is unclear.

Potential Solutions:

- Write an Arrow `Table`-to-binary conversion layer solely in the binding layer (using PyArrow or Arrow TypeScript), which would be simple but incomplete (it would require reimplementation in future binding languages) and less performant.
- If there is a way to transfer a pointer to an Arrow table from the binding layer into C++, then we should be able to write the conversion entirely in C++. We already malloc and memcpy the Arrow binary from JS/Python into C++, so there might be something already there worth looking into.