Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17259: [C++] Do not use shared_ptr<DataType> with kernel function signatures, do less copying of shared_ptrs #13753

Closed
wants to merge 11 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Jul 29, 2022

This is something that I've hacked on a little bit while on airplanes and in idle moments for my own curiosity and just cleaned up to turn into a PR. This PR changes arrow::compute::InputType to only store const DataType* instead of shared_ptr<DataType>, which means that we don't have to copy the smart pointers for every kernel and function that's registered. Amazingly, this trims the size of libarrow.so by about 1% (~300KB) because of not having a bunch of inlined shared_ptr code.

This change does have consequences -- developers need to be careful about creating function signatures with data types that may have temporary storage duration (all of the APIs like arrow::int64() return static instances, and I changed arrow::duration and some other type factories to return static instances, too). But I think the benefits outweigh the costs.

@wesm wesm changed the title ARROW-17259: [C++] Do not shared_ptr<DataType> with kernel function signatures, do less copying of shared_ptrs ARROW-17259: [C++] Do not use shared_ptr<DataType> with kernel function signatures, do less copying of shared_ptrs Jul 29, 2022
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

// Integer, floating point, base binary, and temporal
ARROW_EXPORT
const std::vector<std::shared_ptr<DataType>>& PrimitiveTypes();

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There wasn't really a good reason for these functions to be here -- they probably even should be in an internal namespace in arrow::compute

@wesm wesm force-pushed the leaner-kernel-registry branch from b0d5cdc to a3a1820 Compare August 1, 2022 22:41
@pitrou
Copy link
Member

pitrou commented Aug 3, 2022

@wesm Would you like to take a look at the compile failures?

@wesm
Copy link
Member Author

wesm commented Aug 3, 2022

@pitrou yes, I'll fix them today

@wesm wesm force-pushed the leaner-kernel-registry branch from a3a1820 to 766f292 Compare August 3, 2022 14:55
@wesm
Copy link
Member Author

wesm commented Aug 3, 2022

@pitrou there's a doctest failure here unrelated to this PR, seems to be caused by the 9.0.0 to 10.0.0 version bump (which made the metadata 1 byte bigger):

2022-08-03T17:03:42.8027721Z =================================== FAILURES ===================================
2022-08-03T17:03:42.8028142Z ___________________ [doctest] pyarrow.parquet.read_metadata ____________________
2022-08-03T17:03:42.8028491Z 3406 
2022-08-03T17:03:42.8028721Z 3407     Examples
2022-08-03T17:03:42.8029041Z 3408     --------
2022-08-03T17:03:42.8029336Z 3409     >>> import pyarrow as pa
2022-08-03T17:03:42.8029679Z 3410     >>> import pyarrow.parquet as pq
2022-08-03T17:03:42.8030399Z 3411     >>> table = pa.table({'n_legs': [4, 5, 100],
2022-08-03T17:03:42.8030920Z 3412     ...                   'animal': ["Dog", "Brittle stars", "Centipede"]})
2022-08-03T17:03:42.8031437Z 3413     >>> pq.write_table(table, 'example.parquet')
2022-08-03T17:03:42.8031751Z 3414 
2022-08-03T17:03:42.8032144Z 3415     >>> pq.read_metadata('example.parquet')
2022-08-03T17:03:42.8032625Z Differences (unified diff with -expected +actual):
2022-08-03T17:03:42.8033004Z     @@ -1,7 +1,7 @@
2022-08-03T17:03:42.8033412Z     -<pyarrow._parquet.FileMetaData object at ...>
2022-08-03T17:03:42.8033907Z     -  created_by: parquet-cpp-arrow version ...
2022-08-03T17:03:42.8034339Z     +<pyarrow._parquet.FileMetaData object at 0x7fd2089b3830>
2022-08-03T17:03:42.8034877Z     +  created_by: parquet-cpp-arrow version 10.0.0-SNAPSHOT
2022-08-03T17:03:42.8035225Z        num_columns: 2
2022-08-03T17:03:42.8035477Z        num_rows: 3
2022-08-03T17:03:42.8035752Z        num_row_groups: 1
2022-08-03T17:03:42.8036047Z        format_version: 2.6
2022-08-03T17:03:42.8036407Z     -  serialized_size: 561
2022-08-03T17:03:42.8036708Z     +  serialized_size: 562
2022-08-03T17:03:42.8037076Z 
2022-08-03T17:03:42.8037502Z /opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/parquet/__init__.py:3415: DocTestFailure

@pitrou
Copy link
Member

pitrou commented Aug 3, 2022

Hmm... I assume it's because of the serialized_size difference? It's weird that it would differ by one byte, but perhaps that's due to a different Thrift version. We should probably simply relax this test.

@wesm
Copy link
Member Author

wesm commented Aug 3, 2022

I think the reason is the change from 9.0.0 to 10.0.0 which added one byte, here is the doctest

    >>> pq.read_metadata('example.parquet')
    <pyarrow._parquet.FileMetaData object at ...>
      created_by: parquet-cpp-arrow version ...
      num_columns: 2
      num_rows: 3
      num_row_groups: 1
      format_version: 2.6
      serialized_size: 561

@wesm
Copy link
Member Author

wesm commented Aug 3, 2022

I patched this in #13790

@wesm wesm force-pushed the leaner-kernel-registry branch from 766f292 to ec53645 Compare August 3, 2022 21:12
@wesm
Copy link
Member Author

wesm commented Aug 3, 2022

I think this can be merged once the build is green if the changes look okay.

@pitrou
Copy link
Member

pitrou commented Aug 4, 2022

InputType and OutputType instances are only created at kernel registration, if I'm not mistaken?

@lidavidm
Copy link
Member

lidavidm commented Aug 4, 2022

But those Input/OutputType instances are stored in the KernelSignature, so those references need to stay valid.

Comment on lines +288 to +289
namespace {} // namespace

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: empty namespace?

std::shared_ptr<HashAggregateFunction> func,
FunctionRegistry* registry) {
static std::shared_ptr<DataType> decimal128_example = decimal128(1, 1);
static std::shared_ptr<DataType> decimal256_example = decimal256(1, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extreme nit: there's some inconsistency here and above, maybe PopulateHashGenericKernels should just call PopulateNumericHashKernels?

@pitrou
Copy link
Member

pitrou commented Aug 4, 2022

But those Input/OutputType instances are stored in the KernelSignature, so those references need to stay valid.

My question was more along the lines of: what is the point of micro-optimizing a number of shared_ptr copies that only happen at startup?

@wesm
Copy link
Member Author

wesm commented Aug 4, 2022

@pitrou this isn’t a micro-optimization imho — when 1% of your binary size is concerned with inline shared_ptr code in this narrow part of the library that speaks to a design problem. The kernel registry population is still very bloated and should be able to be streamlined further to yield smaller binary size. The use of shared pointers also causes extra overhead in kernel dispatching and input type checking, but we would need to write some benchmarks to quantify that better

it might be a useful exercise at some point to analyze the disassembly of libarrow.so to understand the total contribution of inline shared_ptr code to the binary size. There are a lot of places where we inline constructors for example where there is probably no need

@pitrou
Copy link
Member

pitrou commented Aug 4, 2022

@pitrou this isn’t a micro-optimization imho — when 1% of your binary size is concerned with inline shared_ptr code in this narrow part of the library that speaks to a design problem.

Is this looking at the text segment, or the entire binary size? If the latter, that might mean debug symbols or something... (those can be stripped for distribution purposes)

@pitrou
Copy link
Member

pitrou commented Aug 4, 2022

The use of shared pointers also causes extra overhead in kernel dispatching and input type checking, but we would need to write some benchmarks to quantify that better

I would like to see benchmarks indeed, because as long as you don't copy them the overhead should be minimal.

@pitrou
Copy link
Member

pitrou commented Aug 4, 2022

I took the PR and rebased it internally to then compare the sizes:

  • master:
$ size build-release/release/libarrow.so
   text	   data	    bss	    dec	    hex	filename
23471557	 317416	2568225	26357198	1922dce	build-release/release/libarrow.so
$ ls -la build-release/release/libarrow.so.1000.0.0 
-rwxrwxr-x 1 antoine antoine 37289616 août   4 16:48 build-release/release/libarrow.so.1000.0.0
$ strip build-release/release/libarrow.so.1000.0.0 
$ ls -la build-release/release/libarrow.so.1000.0.0 
-rwxrwxr-x 1 antoine antoine 23794552 août   4 16:48 build-release/release/libarrow.so.1000.0.0
  • PR:
$ size build-release/release/libarrow.so
   text	   data	    bss	    dec	    hex	filename
23314548	 314736	2568801	26198085	18fc045	build-release/release/libarrow.so
$ ls -la build-release/release/libarrow.so.1000.0.0 
-rwxrwxr-x 1 antoine antoine 37097032 août   4 16:49 build-release/release/libarrow.so.1000.0.0
$ strip build-release/release/libarrow.so.1000.0.0 
$ ls -la build-release/release/libarrow.so.1000.0.0 
-rwxrwxr-x 1 antoine antoine 23634808 août   4 16:50 build-release/release/libarrow.so.1000.0.0

So it's a 0.7% saving here in both text segment size and stripped library size.

@wesm
Copy link
Member Author

wesm commented Aug 5, 2022

Any objections to merging this?

@pitrou
Copy link
Member

pitrou commented Aug 5, 2022

If there's no sizable performance improvement then I'm against merging this. Robustness should be a guiding principle and replacing shared pointers with raw pointers goes against it.

@wesm
Copy link
Member Author

wesm commented Aug 5, 2022

I do not think the implementation is less robust — the only change is that ephemeral DataType instances cannot be used to instantiate InputType without creating a TypeMatcher that wraps and retains a shared_ptr for the cases where it is truly needed. If the developer does create a KernelSignature / InputType with an ephemeral shared_ptr<DataType>, the signature would be unusable so any unit test which attempts to use the kernel would segfault. The probability of an Arrow user ever facing a bug related to this essentially zero — it would amount to us merging untested kernels code (because tests would immediately reveal the programming error).

It also comes down to a cost philosophy in this code:

  • Pay the cost of using shared_ptr always — both in generated code (because most shared_ptr operations are inlined by the compiler) and in performance
  • Pay the cost only when it's needed

It turns out that in this code, that retaining a shared_ptr<DataType> for kernel function signatures and type validation is rarely needed. Why pay the cost for > 95% of cases (or whatever the number is?) when the shared_ptr is only actually needed in < 5% of cases? I recall that we've already seen shared_ptr contention show up unfavorably in performance profiles (e.g. ARROW-16161).

I will take the time when I can to write some benchmarks (for me, the present-and-future savings in generated code is evidence enough) that will hopefully provide some additional evidence to support this. In the meantime I hope that the rebase conflicts will not be too annoying to keep up with.

@wesm
Copy link
Member Author

wesm commented Aug 8, 2022

I've done a little exploration locally and concluded that shared_ptr<DataType> versus const DataType* has no impact on the performance of Function::DispatchExact (at least on recent clang), which confirms the general guidance that dereferencing shared pointers is no different than a raw pointer — it's the creation, copying, and destruction of shared pointers that's expensive, especially in a multithreaded setting.

The last thing I'm interested in is the "bootup time" of the function registry, so I'm going to compare the before and after performance of that and report it here. If there is no significant difference, I will refactor this PR to restore shared pointers to InputType and OutputType but rather avoid allowing shared pointer copies to get inlined in these functions, so hopefully that will yield the same kind of reduction in binary bloat. There are some other beneficial changes here so I'd like to get those in — I'll report back when I have that extra performance data and have done the refactoring.

There's probably other places where we're either passing a shared pointer where there is no need or where we inline non-performance-sensitive-constructors which copy shared pointers where there is similarly little need, so let's be an eye out for htat.

@wesm
Copy link
Member Author

wesm commented Sep 8, 2022

I had some personal stuff come up in August that got in the way of completing this work -- I will pick this up hopefully sometime this month (September) and hopefully reach a point that is satisfactory to all.

@amol-
Copy link
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants