-
Notifications
You must be signed in to change notification settings - Fork 907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Improve protobuf compatibility #15511
Comments
CC @vuule for evaluating the feasibility of option 1 from a cuIO perspective |
Current implementation of |
We do have |
That sounds awesome to me, I would love to avoid needing the protobuf library altogether if we can. |
@vuule I started looking at this. The catch here is not cudf/cpp/include/cudf/io/orc_metadata.hpp Lines 174 to 183 in 9f2fdf8
|
Darn. |
It's not that hard to write our own Cython wrappers for STL types if we need them. We can just implement the bare minimum that we need, then work on upstreaming to Cython proper over time (i.e. addressing cython/cython#4677). The linked issue shows roughly what we need. I'd probably take the verbatim C++ approach to handle the variadic template as shown there, it's easier than figuring out a way to treat variadics more generally. |
Sorry I realize that is perhaps a cursory answer. Let me know if you need help understanding how to translate that advice into actual code. I don't think it'll be too bad to maintain! |
I think I can give it a shot. I’ll open a draft PR and ask for help if I get stuck. |
PR to remove protobuf dependency: #15564 I got this code to work (!) |
This PR removes the cuDF Python dependencies on `protobuf` and `protoc-wheel`. Closes #15511. The only use case for the `protobuf` dependency was reading ORC file/stripe statistics. However, we have code in libcudf that can do this without requiring `protobuf`. In this PR, we expose the C++ code for parsing ORC statistics from libcudf to Cython and remove all references to `protobuf`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Kyle Edwards (https://github.com/KyleFromNVIDIA) - Jake Awe (https://github.com/AyodeAwe) - David Wendt (https://github.com/davidwendt) - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) URL: #15564
Is your feature request related to a problem? Please describe.
cuDF currently contains a single
.proto
file that is used in the ORC reader for extracting column statistics. This feature periodically causes us headaches due to incompatibilities between our libprotobuf version and the one required by other libraries in common environments (particularly TensorFlow). We would like to find a way to relax this restriction.Describe the solution you'd like
I see two potential paths out of this dilemma. Option 1 is the cleaner long-term solution IMO, but option 2 is the easier one to implement today and should be nearly as good. If we can't find an easy path forward on option 1, I think we can move on option 2 in the near term. Both options also have the additional benefit of exposing ORC column statistics in libcudf instead of only in cuDF Python.
Option 1: Implement the necessary parsing logic in our own protobuf reader
Currently our ORC reader has its own protobuf reader. We could extend this reader to support reading the statistics out of the header as well. Then we would no longer have any dependency on libprotobuf at all. I don't know how difficult this would be offhand though since I don't know the protobuf spec well at all.
Option 2: Switch to using the protobuf C++ API and statically link it
Right now we use the protobuf compiler on the
.proto
file to generate a Python module via CMake, then use that module in our Python code as shown above. However,.proto
files can also be compiled down to C++. To change that, we would simply change--python_out
to--cpp_out
in this line. We could then use the protobuf C++ API instead of our current use of the Python API.The key point here is that with this approach we could choose to statically link to protobuf, isolating ourselves from the version of protobuf contained in the environment. Note that this is actually protobuf's own suggestion (admittedly for Windows on that page, but all the same compatibility requirements apply). With this change, we could continue using the protobuf library, but we would no longer have any runtime dependency on the library because the necessary symbols would be baked into our executable.
The text was updated successfully, but these errors were encountered: