Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Improve protobuf compatibility #15511

Closed
vyasr opened this issue Apr 11, 2024 · 10 comments · Fixed by #15564
Closed

[FEA] Improve protobuf compatibility #15511

vyasr opened this issue Apr 11, 2024 · 10 comments · Fixed by #15564
Assignees
Labels
feature request New feature or request

Comments

@vyasr
Copy link
Contributor

vyasr commented Apr 11, 2024

Is your feature request related to a problem? Please describe.
cuDF currently contains a single .proto file that is used in the ORC reader for extracting column statistics. This feature periodically causes us headaches due to incompatibilities between our libprotobuf version and the one required by other libraries in common environments (particularly TensorFlow). We would like to find a way to relax this restriction.

Describe the solution you'd like
I see two potential paths out of this dilemma. Option 1 is the cleaner long-term solution IMO, but option 2 is the easier one to implement today and should be nearly as good. If we can't find an easy path forward on option 1, I think we can move on option 2 in the near term. Both options also have the additional benefit of exposing ORC column statistics in libcudf instead of only in cuDF Python.

Option 1: Implement the necessary parsing logic in our own protobuf reader
Currently our ORC reader has its own protobuf reader. We could extend this reader to support reading the statistics out of the header as well. Then we would no longer have any dependency on libprotobuf at all. I don't know how difficult this would be offhand though since I don't know the protobuf spec well at all.

Option 2: Switch to using the protobuf C++ API and statically link it
Right now we use the protobuf compiler on the .proto file to generate a Python module via CMake, then use that module in our Python code as shown above. However, .proto files can also be compiled down to C++. To change that, we would simply change --python_out to --cpp_out in this line. We could then use the protobuf C++ API instead of our current use of the Python API.

The key point here is that with this approach we could choose to statically link to protobuf, isolating ourselves from the version of protobuf contained in the environment. Note that this is actually protobuf's own suggestion (admittedly for Windows on that page, but all the same compatibility requirements apply). With this change, we could continue using the protobuf library, but we would no longer have any runtime dependency on the library because the necessary symbols would be baked into our executable.

@vyasr vyasr added the feature request New feature or request label Apr 11, 2024
@vyasr
Copy link
Contributor Author

vyasr commented Apr 11, 2024

CC @vuule for evaluating the feasibility of option 1 from a cuIO perspective
CC @beckernick @GregoryKimball @quasiben @bdice @aravenel for viz

@vuule
Copy link
Contributor

vuule commented Apr 15, 2024

Current implementation of read_orc_statistics(Python) calls read_raw_orc_statistics(C++) to read blobs of statistics data.
We have a C++ API read_parsed_orc_statistics which also uses read_raw_orc_statistics and then parses it with out C++ protobuf reader implementation.
If we use read_parsed_orc_statistics in read_orc_statistics we only need to convert the C++ returned types (bunch of std::optional) to whatever Python expects. No need to directly use the C++ protobuf reader in Python.
If Cython still does not support std::optional this becomes a bigger issue. Also, the in-house C++ implementation might not be as robust as what Python offers, but, at this point, the gap should be very small.

@bdice
Copy link
Contributor

bdice commented Apr 15, 2024

We do have optional support in Cython now. This seems like a great solution.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 17, 2024

That sounds awesome to me, I would love to avoid needing the protobuf library altogether if we can.

@bdice
Copy link
Contributor

bdice commented Apr 17, 2024

@vuule I started looking at this. The catch here is not optional. It's variant. This isn't supported in Cython afaict and might require very creative hacks, if it can be done at all. @vyasr Have you faced this before?

std::variant<no_statistics,
integer_statistics,
double_statistics,
string_statistics,
bucket_statistics,
decimal_statistics,
date_statistics,
binary_statistics,
timestamp_statistics>
type_specific_stats; ///< type-specific statistics

cython/cython#4677

@vuule
Copy link
Contributor

vuule commented Apr 17, 2024

Darn.
We can always provide a wrapper that returns something that Cython can digest.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 17, 2024

It's not that hard to write our own Cython wrappers for STL types if we need them. We can just implement the bare minimum that we need, then work on upstreaming to Cython proper over time (i.e. addressing cython/cython#4677). The linked issue shows roughly what we need. I'd probably take the verbatim C++ approach to handle the variadic template as shown there, it's easier than figuring out a way to treat variadics more generally.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 17, 2024

Sorry I realize that is perhaps a cursory answer. Let me know if you need help understanding how to translate that advice into actual code. I don't think it'll be too bad to maintain!

@bdice
Copy link
Contributor

bdice commented Apr 17, 2024

I think I can give it a shot. I’ll open a draft PR and ask for help if I get stuck.

@bdice
Copy link
Contributor

bdice commented Apr 18, 2024

PR to remove protobuf dependency: #15564

I got this code to work (!) but the Cython layer is a bit gross, as expected and got a reasonably clean Cython implementation. Feedback is welcome!

@bdice bdice self-assigned this Apr 18, 2024
rapids-bot bot pushed a commit that referenced this issue Apr 19, 2024
This PR removes the cuDF Python dependencies on `protobuf` and `protoc-wheel`. Closes #15511.

The only use case for the `protobuf` dependency was reading ORC file/stripe statistics. However, we have code in libcudf that can do this without requiring `protobuf`.

In this PR, we expose the C++ code for parsing ORC statistics from libcudf to Cython and remove all references to `protobuf`.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Jake Awe (https://github.com/AyodeAwe)
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15564
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Apr 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants