Parquet output #1229
Comments
I have no objections to adding Parquet support; it looks useful and interesting. I do object to adding several large dependencies and dramatically increasing binary size to add it. If a smaller implementation (in C/C++) can be found (or developed), I don't see any other roadblocks to adding support.
Yeah, I'd forgotten that Santa needs to sign all shipped binaries, so a compile-time switch is not a good option. I'm still trying to shrink the build, and I filed an issue with Arrow asking for help. I'll call this option (1). Option (2) is to build libparquet as a dylib, bundle it and then load it on demand. This way, santad doesn't get more jolly until you enable parquet support at runtime. Option (3) is to use one of the Rust crates (parquet or parquet2) instead of the C++ implementation. At a cursory glance, those look smaller and more reasonable. Options 2 & 3 aren't mutually exclusive - I actually quite like the idea of shipping as a dylib, because it leaves things as they are for people who don't care about parquet. Option (4) is building a new miniparquet writer-only library.
Actually, @russellhancox how would you feel about having a dependency on a Rust crate?
As long as it builds with bazel and bridges to C++ without performance issues, I see no problem.
It should. Main concern I'd have is we'd have to make some changes internally, but they should be minimal.
Related: PoC for building Rust with bazel and bridging to C/C++ (WIP, #1240)
The draft PR #1240 now has some bona fide Rust code to write a couple of columns into a parquet file. Still TODO:
Additionally, I'm not super happy with the fact that each row group flush requires allocating memory, but the crate's design prioritizes async safety over allocation efficiency, and it's kind of hard to work around that. I don't think the outcome is terrible - most likely, it's about as allocation-efficient as proto2 with arenas, but it should be profiled.
OK, I've done all of the above.
Thanks for the chat on Monday. To summarize:
I would like to add parquet output support to Santa, however there are some trade-offs that might not be acceptable to you. I'd like to have the discussion and the pros/cons in one place (this issue).
Goals
Why parquet?
Parquet is the de facto standard columnar interchange format - most data platforms can ingest it natively, it supports fairly rich schemas and has decent performance.
Most Santa deployments are probably converting logs to a columnar format on the backend to get more efficient storage and compute. Those same advantages already apply on the host itself - a parquet file is going to be smaller than the equivalent fsspool folder, and will therefore spend less compute on network IO. (This trade-off was already known to the designers of protocol buffers, hence variable-length integer (varint) encoding.)
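As a small illustration of the size argument in the parenthetical above, here is a minimal sketch of the base-128 varint scheme that protocol buffers use for integers: small values take fewer bytes on the wire than a fixed-width field would. The function name `encode_varint` is made up for this example and is not part of Santa or protobuf.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf-style base-128 varint.

    Each output byte carries 7 bits of the value, least significant group
    first; the high bit is set on every byte except the last one.
    """
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)  # final byte, continuation bit clear
            return bytes(out)


if __name__ == "__main__":
    for n in (1, 300, 2**32 - 1):
        print(n, len(encode_varint(n)), "byte(s)")
```

A value like a small enum or a short count fits in one byte this way, while a full 32-bit value costs five bytes instead of four. Columnar formats push the same idea further by encoding and compressing many values of one field together.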
In summary: parquet support makes it easier to adopt Santa for people who already use the format on their backend. This is already a good reason to adopt it. Additionally, it may turn out to save CPU time and bandwidth for existing users of protobuf + fsspool.
Why not parquet?
Briefly, code quality and dependency size. The main implementation of parquet is in the Apache Arrow library, which is complex and has a lot of dependencies, including thrift and boost. The codebase itself is large with no obvious modularization or layering - even though different parts have different coding styles and build options (e.g. exceptions vs no exceptions), they are heavily interdependent in all directions. The external dependencies are mostly transparently fetched from upstreams, or expected to be installed on the system, which largely breaks reproducible builds and adds a supply chain problem. All of this makes it difficult to add Arrow as a dependency.
Building Santa with parquet support would likely add ~20 MiB to binary size and require at least the following extra dependencies:
Implementation sketch
We can add the dependencies to WORKSPACE as `http_archive`, and check a BUILD file into `external_patches`. Here's what that looks like for thrift:
The BUILD file can shell out to make or cmake, or just use `cc_library`. It looks like the latter just works for most libraries.
A serializer would need to keep a reasonable number of recent messages in memory, already converted to column chunks. A single parquet file can contain one or multiple chunks. To begin with, we could tune this to target 1-10 chunks per file, depending on how busy the machine is.
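The buffering described above could be sketched roughly as follows. This is only an illustration of the idea, not the proposed implementation: the class `ColumnChunkBuffer`, its methods, and the column names in the usage example are all hypothetical, and no actual parquet encoding happens here. In Parquet's own terminology, each sealed batch below would correspond roughly to a row group holding one column chunk per column.

```python
from typing import Any, Dict, List


class ColumnChunkBuffer:
    """Hypothetical helper: keeps recent events in memory as per-column lists."""

    def __init__(self, columns: List[str], rows_per_chunk: int) -> None:
        if not columns or rows_per_chunk < 1:
            raise ValueError("need at least one column and one row per chunk")
        self.columns = columns
        self.rows_per_chunk = rows_per_chunk
        self._rows = 0  # rows accumulated in the pending (unsealed) chunk
        self._pending: Dict[str, List[Any]] = {name: [] for name in columns}
        self.chunks: List[Dict[str, List[Any]]] = []  # sealed, ready to write

    def append(self, event: Dict[str, Any]) -> None:
        """Split one row-shaped event across the column lists."""
        for name in self.columns:
            self._pending[name].append(event.get(name))  # None if missing
        self._rows += 1
        if self._rows >= self.rows_per_chunk:
            self._seal()

    def _seal(self) -> None:
        """Move the pending columns to the sealed list and start a new chunk."""
        self.chunks.append(self._pending)
        self._pending = {name: [] for name in self.columns}
        self._rows = 0

    def take_file(self, max_chunks: int) -> List[Dict[str, List[Any]]]:
        """Remove and return up to max_chunks sealed chunks for one output file."""
        batch = self.chunks[:max_chunks]
        self.chunks = self.chunks[max_chunks:]
        return batch


if __name__ == "__main__":
    buf = ColumnChunkBuffer(["pid", "path"], rows_per_chunk=2)
    for i in range(5):
        buf.append({"pid": i, "path": f"/bin/{i}"})
    print(len(buf.take_file(max_chunks=10)), "chunks ready for one file")
```

Tuning `rows_per_chunk` and `max_chunks` is where the "1-10 chunks per file" target would come in: a busy machine fills chunks quickly and can write several per file, while a quiet one might flush a single partly filled chunk on a timer (not shown here).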
File output can use the existing fsspool implementation, just swapping protobuf files for parquet files.
We can avoid building all of the dependencies by bundling them under an `@org_apache_arrow` target and only depending on that from the new serializer.
Alternatives
Parquet is a standardized, stable format, and implementations other than the official one exist.
It may be worthwhile to implement a minimal version of parquet without all the dependencies. Such a project exists for Go, and it could serve as a blueprint. (In fact, we'd only need writer support and don't need the higher-level schema support, so we could end up with an even smaller codebase.)
I'm not sure who has time to do this, though.