-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Significantly improve C++ logging performance by using C FFI instead of arrow IPC #4273
Conversation
ca6a0cf
to
b54b195
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but can't we statically allocate schemas and pass static pointers around?
/// TODO(andreas): Use a schema registry that identifies this and the component name with a unique schema ID. | ||
ArrowSchema schema; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just statically allocate the schema and always pass the same pointer to it in here? Why do we need a registry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment below: I want to also remove the Rust sided cost of FFI schema parsing & building. Also practically speaking we'd still need to allocate it somewhere since we don't get this schema allocated by libarrow, it only fills out a struct (which in turn holds on to allocs)
/// TODO(andreas): Use a schema registry. | ||
pub schema: arrow2::ffi::ArrowSchema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here; I'd expect this to be a static pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually planning to have the registry side over on Rust and give out a handle: Arrow2 doesn't take ownership of the ArrowSchema
and instead builds up a new Rust data structure. So I figure if we cache schemas then let's go all the way.
I'll do so in a follow-up PR
Yes, I wanted to do that as a follow-up! :) |
…atch the desired datatype (#4280) ### What * Prerequisite for #4273 Previously, not knowing about `arrow::MakeBuilder`, we created the hierarchy of arrow array builders manually in a utility method dubbed `new_arrow_array_builder`. While working on C FFI it turned out that this way we end up with an arrow array whose data type doesn't have the correct nullability settings - in other words, on creating the arrow array builders we would have needed to pass the exact data types. This is what I tried at first and it worked, but then I noticed that there's already `arrow::MakeBuilder` which internally uses a bunch of rolled out switch/case statements to create the hierarchy programmatically. It turns out that this in combination with the work done on switching from IPC to C FFI gives another significant speed-up. When tested in isolation however (i.e. still on IPC), the gains seem to be rather modest and within the (large!) measurement noise. ### Checklist * [x] I have read and agree to [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md) * [x] I've included a screenshot or gif (if applicable) * [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/4280) (if applicable) * [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG - [PR Build Summary](https://build.rerun.io/pr/4280) - [Docs preview](https://rerun.io/preview/030bafa44afbd77f510d15db7c3d254386f1aae9/docs) <!--DOCS-PREVIEW--> - [Examples preview](https://rerun.io/preview/030bafa44afbd77f510d15db7c3d254386f1aae9/examples) <!--EXAMPLES-PREVIEW--> - [Recent benchmark results](https://build.rerun.io/graphs/crates.html) - [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)
e84f1b7
to
cb476d8
Compare
…ing a component type registry (#4296) ### What * Fixes #4287 * Follow-up to #4273 As expected, not doing the C++ datatype -> C FFI schema -> Rust datatype roundtrip for each log call helps perf quite a bit, especially when we do a lot of smaller log calls. The registry a single RwLock protected Vec (we never deregister) which is exposed via a single c entry point. On the C++ side we use the local `static` variable mechanism for threadsafe lazy registration (slight codegen adjustment). Indicator components had some special handling before and were refactored to fit in this system - in the process I made their arrow array shared across all instantiations, further cutting down on per-log work. --- Benchmark results: * large point cloud: `0.15s` -> `0.14s` * many points: `7.52s` -> `4.52s` * large images: `0.57s` -> `0.51s` Old values from previous PR. New values are median over three runs, single executable run (this makes more and more of a difference with all these registries!), timings without prepare step, same M1 macbook. A quick look over the profiler for running `log_benchmark points3d_many_individual` in isolation tells us that of the actual benchmark running time we spend.. * 35% of the the time in `rr_recording_stream_log` (of which in turn 20%, so 7% overall, is still arrow FFI translation of the array!!) * 30% in the various `to_data_cell` methods * 10% in exporting arrow arrays to C FFI * 6% in setting the time * the rest in various allocations along the way (taken via `Instruments` on my Mac) <img width="969" alt="image" src="https://github.com/rerun-io/rerun/assets/1220815/5632589f-52b1-4e92-b7a0-1482e69528ad"> --- ### Checklist * [x] I have read and agree to [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md) * [x] I've included a screenshot or gif (if applicable) * [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/4296) (if applicable) * [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG - [PR Build Summary](https://build.rerun.io/pr/4296) - [Docs preview](https://rerun.io/preview/8bf1ee59d9a2bc5e192c1c8169c98dd40b621100/docs) <!--DOCS-PREVIEW--> - [Examples preview](https://rerun.io/preview/8bf1ee59d9a2bc5e192c1c8169c98dd40b621100/examples) <!--EXAMPLES-PREVIEW--> - [Recent benchmark results](https://build.rerun.io/graphs/crates.html) - [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)
What
So far we've been using the arrow IPC format to hand data from C++ to C/Rust. It turned out that this incurred a major performance cost on our api. Instead, using the C FFI interface (which still isn't exactly free) gives us a major performance boost in all of today's logging benchmarks:
large point cloud:
0.58s -> 0.15s
many points:
17.10 -> 7.52s
large images:
2.90s -> 0.57s
Execute times without prepare, single sample on before, median of three on after, all ran in a single process
for comparison, numbers on main for Rust:
large point clouds:
0.82s
many points:
3.87s
large images:
1.00s
Execute times without prepare via Puffin, single sample, all ran in a single process. As always these comparisons are very tricky. Also there's lots of noise!
There's still some significant discrepancy on
many points
. One likely source of this is repeated schema transfer. Need to do some more profiling before continuing.This PR also simplifies the serialization pipeline a little bit by removing
SerializedComponentBatch
, replacing it in favor of the existingrerun::DataCell
in a few places. Left a comment in the code on the next steps towards evolving it towards an interface that's more similar to how things work in Rust.Checklist