Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significantly improve C++ logging performance by using C FFI instead of arrow IPC #4273

Merged
merged 11 commits into from
Nov 21, 2023

Conversation

Wumpf
Copy link
Member

@Wumpf Wumpf commented Nov 20, 2023

What

So far we've been using the arrow IPC format to hand data from C++ to C/Rust. It turned out that this incurred a major performance cost on our api. Instead, using the C FFI interface (which still isn't exactly free) gives us a major performance boost in all of today's logging benchmarks:

large point cloud: 0.58s -> 0.15s
many points: 17.10 -> 7.52s
large images: 2.90s -> 0.57s
Execute times without prepare, single sample on before, median of three on after, all ran in a single process

for comparison, numbers on main for Rust:

large point clouds: 0.82s
many points: 3.87s
large images: 1.00s
Execute times without prepare via Puffin, single sample, all ran in a single process. As always these comparisons are very tricky. Also there's lots of noise!

There's still some significant discrepancy on many points. One likely source of this is repeated schema transfer. Need to do some more profiling before continuing.

This PR also simplifies the serialization pipeline a little bit by removing SerializedComponentBatch, replacing it in favor of the existing rerun::DataCell in a few places. Left a comment in the code on the next steps towards evolving it towards an interface that's more similar to how things work in Rust.

Checklist

  • I have read and agree to Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested demo.rerun.io (if applicable)
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG

@Wumpf Wumpf added 🚀 performance Optimization, memory use, etc 🌊 C++ API C/C++ API specific include in changelog labels Nov 20, 2023
crates/rerun_c/src/lib.rs Outdated Show resolved Hide resolved
@Wumpf Wumpf marked this pull request as draft November 20, 2023 14:41
@Wumpf Wumpf force-pushed the andreas/cpp/use-c-ffi branch from ca6a0cf to b54b195 Compare November 20, 2023 17:54
@Wumpf Wumpf changed the base branch from main to andreas/cpp/fix-arrow-array-datatypes November 20, 2023 17:54
@Wumpf Wumpf marked this pull request as ready for review November 20, 2023 17:59
@teh-cmc teh-cmc self-requested a review November 21, 2023 08:18
Copy link
Member

@teh-cmc teh-cmc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but can't we statically allocate schemas and pass static pointers around?

rerun_cpp/src/rerun/data_cell.hpp Outdated Show resolved Hide resolved
Comment on lines +144 to +145
/// TODO(andreas): Use a schema registry that identifies this and the component name with a unique schema ID.
ArrowSchema schema;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just statically allocate the schema and always pass the same pointer to it in here? Why do we need a registry?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment below: I want to also remove the Rust sided cost of FFI schema parsing & building. Also practically speaking we'd still need to allocate it somewhere since we don't get this schema allocated by libarrow, it only fills out a struct (which in turn holds on to allocs)

crates/rerun_c/src/rerun.h Outdated Show resolved Hide resolved
Comment on lines +117 to +118
/// TODO(andreas): Use a schema registry.
pub schema: arrow2::ffi::ArrowSchema,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here; I'd expect this to be a static pointer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually planning to have the registry side over on Rust and give out a handle: Arrow2 doesn't take ownership of the ArrowSchema and instead builds up a new Rust data structure. So I figure if we cache schemas then let's go all the way.
I'll do so in a follow-up PR

@Wumpf
Copy link
Member Author

Wumpf commented Nov 21, 2023

but can't we statically allocate schemas and pass static pointers around?

Yes, I wanted to do that as a follow-up! :)

Base automatically changed from andreas/cpp/fix-arrow-array-datatypes to main November 21, 2023 09:06
Wumpf added a commit that referenced this pull request Nov 21, 2023
…atch the desired datatype (#4280)

### What

* Prerequisite for #4273

Previously, not knowing about `arrow::MakeBuilder`, we created the
hierarchy of arrow array builders manually in a utility method dubbed
`new_arrow_array_builder`. While working on C FFI it turned out that
this way we end up with an arrow array whose data type doesn't have the
correct nullability settings - in other words, on creating the arrow
array builders we would have needed to pass the exact data types. This
is what I tried at first and it worked, but then I noticed that there's
already `arrow::MakeBuilder` which internally uses a bunch of rolled out
switch/case statements to create the hierarchy programmatically.

It turns out that this in combination with the work done on switching
from IPC to C FFI gives another significant speed-up. When tested in
isolation however (i.e. still on IPC), the gains seem to be rather
modest and within the (large!) measurement noise.

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/4280) (if
applicable)
* [x] The PR title and labels are set such as to maximize their
usefulness for the next release's CHANGELOG

- [PR Build Summary](https://build.rerun.io/pr/4280)
- [Docs
preview](https://rerun.io/preview/030bafa44afbd77f510d15db7c3d254386f1aae9/docs)
<!--DOCS-PREVIEW-->
- [Examples
preview](https://rerun.io/preview/030bafa44afbd77f510d15db7c3d254386f1aae9/examples)
<!--EXAMPLES-PREVIEW-->
- [Recent benchmark results](https://build.rerun.io/graphs/crates.html)
- [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)
@Wumpf Wumpf force-pushed the andreas/cpp/use-c-ffi branch from e84f1b7 to cb476d8 Compare November 21, 2023 09:10
@Wumpf Wumpf merged commit b5918ff into main Nov 21, 2023
42 checks passed
@Wumpf Wumpf deleted the andreas/cpp/use-c-ffi branch November 21, 2023 09:59
Wumpf added a commit that referenced this pull request Nov 22, 2023
…ing a component type registry (#4296)

### What

* Fixes #4287
* Follow-up to #4273

As expected, not doing the C++ datatype -> C FFI schema -> Rust datatype
roundtrip for each log call helps perf quite a bit, especially when we
do a lot of smaller log calls.

The registry a single RwLock protected Vec (we never deregister) which
is exposed via a single c entry point.
On the C++ side we use the local `static` variable mechanism for
threadsafe lazy registration (slight codegen adjustment).
Indicator components had some special handling before and were
refactored to fit in this system - in the process I made their arrow
array shared across all instantiations, further cutting down on per-log
work.

---

Benchmark results:
* large point cloud: `0.15s` -> `0.14s`
* many points: `7.52s` -> `4.52s`
* large images: `0.57s` -> `0.51s`

Old values from previous PR. New values are median over three runs,
single executable run (this makes more and more of a difference with all
these registries!), timings without prepare step, same M1 macbook.

A quick look over the profiler for running `log_benchmark
points3d_many_individual` in isolation tells us that of the actual
benchmark running time we spend..
* 35% of the the time in `rr_recording_stream_log` (of which in turn
20%, so 7% overall, is still arrow FFI translation of the array!!)
* 30% in the various `to_data_cell` methods
* 10% in exporting arrow arrays to C FFI
* 6% in setting the time
* the rest in various allocations along the way

(taken via `Instruments` on my Mac)
<img width="969" alt="image"
src="https://github.com/rerun-io/rerun/assets/1220815/5632589f-52b1-4e92-b7a0-1482e69528ad">


---

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/4296) (if
applicable)
* [x] The PR title and labels are set such as to maximize their
usefulness for the next release's CHANGELOG

- [PR Build Summary](https://build.rerun.io/pr/4296)
- [Docs
preview](https://rerun.io/preview/8bf1ee59d9a2bc5e192c1c8169c98dd40b621100/docs)
<!--DOCS-PREVIEW-->
- [Examples
preview](https://rerun.io/preview/8bf1ee59d9a2bc5e192c1c8169c98dd40b621100/examples)
<!--EXAMPLES-PREVIEW-->
- [Recent benchmark results](https://build.rerun.io/graphs/crates.html)
- [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🌊 C++ API C/C++ API specific include in changelog 🚀 performance Optimization, memory use, etc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't go to arrow IPC and back on C++ -> Rust, use C FFI instead
2 participants