Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements for BytesRefHash #8788

Merged
merged 13 commits into from
Aug 25, 2023

Conversation

ketanv3
Copy link
Contributor

@ketanv3 ketanv3 commented Jul 19, 2023

Description

Performance improvements for BytesRefHash which includes:

  1. Remove the extra bit-mixer step (here) now that BytesRef uses a stronger hash function (here).
  2. Use larger initial capacity (1 → 32) to avoid frequent resizing.
  3. Use fingerprint to short-circuit equality checks during linear probing caused by collisions.
  4. Re-organize hash table values to minimize the longest probe sequence length.
  5. Refactor code to avoid unnecessary paths, branches, and defensive checks in the hot path.

Related Issues

Meta issue: #8710

JMH Benchmarks

These results indicate the time to perform 1 million add(...) operations with varying number of unique keys. These operations were repeated and interleaved between 20 hash tables in order to make caches less effective. The following implementations are being compared:

  • (A) is the baseline implementation.
  • (B) adds fingerprinting, avoids unnecessary paths, branches, and defensive checks.
  • (C) uses a faster hash function on top of (B).

Peak improvement

Key length A B C
8 bytes -33.32% -37.47%
32 bytes -34.55% -49.00%
128 bytes -21.58% -63.92%
Screenshot 2023-07-17 at 7 26 06 AM Screenshot 2023-07-17 at 7 26 29 AM Screenshot 2023-07-17 at 7 25 31 AM

Whether or not to minimize the longest probe sequence length

The following implementations are being compared:

  • (A) CompactBytesRefHash which performs simple linear probing for inserts.
  • (B) ReorganizingBytesRefHash which re-organizes the hash table values during inserts to minimize the longest PSL.

(A) is simpler and marginally faster for insert-heavy workloads. It also packs 32-bits of fingerprint information which is 65,536 times of (B), but false positives are already rare with (B). On the other hand, (B) provides marginally faster worst-case lookups due to better use of CPU cache lines.

Benchmarks show no appreciable performance difference between (A) and (B) since the latency is dominated by the time to copy the key.

Screenshot 2023-07-20 at 8 57 02 PM

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.index.translog.RemoteFSTranslogTests.testMetadataFileDeletion

@codecov
Copy link

codecov bot commented Jul 20, 2023

Codecov Report

Merging #8788 (58d3394) into main (a4024e7) will increase coverage by 0.14%.
The diff coverage is 90.14%.

@@             Coverage Diff              @@
##               main    #8788      +/-   ##
============================================
+ Coverage     71.01%   71.15%   +0.14%     
- Complexity    57420    57497      +77     
============================================
  Files          4778     4779       +1     
  Lines        270922   270995      +73     
  Branches      39585    39589       +4     
============================================
+ Hits         192403   192840     +437     
+ Misses        62338    61938     -400     
- Partials      16181    16217      +36     
Files Changed Coverage Δ
...g/opensearch/common/util/ReorganizingLongHash.java 93.65% <75.00%> (-1.67%) ⬇️
.../java/org/opensearch/common/util/BytesRefHash.java 89.18% <89.70%> (+1.31%) ⬆️
...rc/main/java/org/opensearch/common/hash/T1ha1.java 90.76% <90.76%> (ø)
...n/src/main/java/org/opensearch/common/Numbers.java 89.69% <100.00%> (+1.14%) ⬆️
...ggregations/bucket/terms/BytesKeyedBucketOrds.java 87.50% <100.00%> (ø)
.../aggregations/bucket/terms/SignificanceLookup.java 96.49% <100.00%> (ø)
...ations/bucket/terms/StringRareTermsAggregator.java 92.42% <100.00%> (ø)

... and 464 files with indirect coverage changes

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteStoreIT.testStaleCommitDeletionWithInvokeFlush

@ketanv3
Copy link
Contributor Author

ketanv3 commented Jul 24, 2023

@backslasht @dblock Need your inputs!

  1. Please check the section "Whether or not to minimize the longest probe sequence length".
    I'm suggesting (A) as it's simpler, and marginal lookup improvements in (B) are dwarfed by expensive "equals" check which compares two keys byte-by-byte. This isn't the case with ReorganizingLongHash where equality check simply meant comparing two longs.
  2. I'm also suggesting to completely replace the existing BytesRefHash instead of keeping two implementations (like before). Regardless of our choice between (A) and (B), both are better across the board over the existing implementation.

What do you think?

@backslasht
Copy link
Contributor

backslasht commented Jul 24, 2023

@ketanv3 - Would the benefit of CPU cache lines applicable when large numbers of keys are used in ReorganizingBytesRefHash and also assuming the keys arrive out of order. Considering that is the worst case would it still out perform CompactBytesRefHash?

@ketanv3
Copy link
Contributor Author

ketanv3 commented Jul 24, 2023

@ketanv3 - Would the benefit of CPU cache lines applicable when large numbers of keys are used in ReorganizingLongHash and also assuming the keys arrive out of order. Considering that is the worst case would it still out perform CompactBytesRefHash?

Recency criteria (i.e. correlated adds) isn't considered for BytesRefHash; it made more sense for LongHash where similar timestamps across consecutive hits made that optimization possible. Only the PSL criteria is considered which makes CPU caches more effective. Performance should be similar no matter whether keys arrive in or out of order.

Theoretically, ReorganizingBytesRefHash should perform better than CompactBytesRefHash for lookups at the slight expense of inserts. But another factor to consider is the difference in byte packing:

  1. CompactBytesRefHash uses [32 fingerprint, 32 ordinal] bits.
  2. ReorganizingBytesRefHash uses [1 discard, 15 PSL, 16 fingerprint, 32 ordinal] bits.

Since the number of fingerprint bits is more in (1) (over 65K times more), it's not a 1-1 comparison between the two. With the current byte-packing scheme, performance is balanced between the two. But we can experiment with [1 discard, 7 PSL, 24 fingerprint, 32 ordinal] bits to see if this helps us speed up lookups even more.

@ketanv3
Copy link
Contributor Author

ketanv3 commented Jul 25, 2023

@backslasht Here's some follow up:

I wrote an alternative benchmark that highlights the overhead of "new inserts" and the improvement it brings to "subsequent lookups". I ran it for large table sizes (10K to 100K) interleaved across 50 hash tables.

  • Insert – time to insert N unique keys across all hash tables.
  • Lookup – time to lookup those N unique keys across all tables.
Screenshot 2023-07-25 at 5 12 00 PM

On average, pure inserts were 8.68% slower and pure lookups were 3.47% faster when re-organizing. Inserts were about 30% more expensive than lookups, so a key needs to be repeated at least 4 times for the lookup improvements to compensate for the insertion overhead.

Note: Take these numbers with a grain of salt; I faced a lot of variance while benchmarking these large tables.


In hindsight, this overhead wasn't noticeable with ReorganizingLongHash as the performance was being compared to the baseline implementation. Instead, here our comparison is against CompactBytesRefHash which itself has several enhancements on top of the baseline implementation (better locality, reduced branching, fingerprinting, etc.). Would be interesting to see how an equivalent CompactLongHash stacks up (and possibly replace LongHash).

@backslasht
Copy link
Contributor

Thanks @ketanv3 for the experiments.

I think we can go with CompactBytesRefHash for the below reasons.

Given the number of operations for each key is going to be 2 (1 insert and 1 lookup), the benefit CompactBytesRefHash provides during insertion is marginally better than the benefit ReorganizingBytesRefHash provides during the lookup. Also, implementation wise, CompactBytesRefHash is relatively simpler when compared to ReorganizingBytesRefHash.

@ketanv3
Copy link
Contributor Author

ketanv3 commented Jul 25, 2023

Given the number of operations for each key is going to be 2 (1 insert and 1 lookup) ...

@backslasht
This isn't fully true; let me explain why. Several hits may produce the same key, so during aggregation, the BytesRefHash::add(key) method will simply return the ordinal of the key when it was first added (i.e. a lookup). In other words, the number of buckets should be much less than the number of hits (i.e. lookups >> inserts) for re-organization to be effective.

To summarise:

  1. Inserts are always more expensive.
  2. For small tables (< 10K keys), lookups are identical as the worst-case PSL isn't too high to begin with.
  3. For large tables, lookups perform better and may compensate for the extra overhead of inserts if keys are repeated enough times.

This presents me with an idea: what if we re-organize only when the table grows? It will give us a permutation that sits between simple linear probing and Robin Hood hashing. On a second thought, this wont help us either. Growing the table cuts down the load factor in half, thus, breaking down the clusters and reducing the PSL. A more complicated approach would be to measure "how many times was a key inserted too far away from its home slot", and then use that heuristic to trigger re-organization. But all this to gain only a little. Let's stick with CompactBytesRefHash! :)

@backslasht
Copy link
Contributor

Thanks for the detailed explanation @ketanv3

I agree, it boils down to size of the table and the number of repeated keys where ReorganizingBytesRefHash performs marginally better. Looking at the additional complexity and very marginal benefits ReorganizingBytesRefHash brings, it makes sense to stick to CompactBytesRefHash and revisit ReorganizingBytesRefHash in future when compelling use case arise.

@ketanv3
Copy link
Contributor Author

ketanv3 commented Jul 28, 2023

@dblock Would like to know your thoughts too.

@dblock
Copy link
Member

dblock commented Jul 28, 2023

@dblock Would like to know your thoughts too.

I am a bit out of my depth here - would love to see you and @backslasht agree on what should be a mergable state of this PR, first. If you need someone to break a tie I can give it a try for sure!

@ketanv3 ketanv3 changed the title Hash table improvements for BytesRefHash Performance improvements for BytesRefHash Jul 30, 2023
@ketanv3 ketanv3 force-pushed the performance/bytes-ref-hash branch from 67ee51c to 8870419 Compare July 30, 2023 18:34
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@ketanv3 ketanv3 force-pushed the performance/bytes-ref-hash branch from 8870419 to 6ca2eb1 Compare July 31, 2023 03:43
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@ketanv3 ketanv3 marked this pull request as ready for review July 31, 2023 04:28
@ketanv3
Copy link
Contributor Author

ketanv3 commented Aug 20, 2023

On top of the encoding improvements (#9412), this change would further reduce the latency by a decent amount. These tests in particular have pretty small composite keys (< 20 bytes), but improvements would be larger for larger keys.

Test p50 before p90 before p50 after p90 after
PR #2687 / mix 4023 ms 4185 ms 3857 ms 4024 ms
PR #2687 / numeric 538 ms 591 ms 485 ms 530 ms
OSB / http_logs / multi_term_agg 2241 ms 2282 ms 2102 ms 2147 ms

@ketanv3
Copy link
Contributor Author

ketanv3 commented Aug 22, 2023

Hi @dblock, can you take a look now? Thank you!

@ketanv3
Copy link
Contributor Author

ketanv3 commented Aug 23, 2023

Hi @reta @backslasht!

I was profiling hot methods (for a different optimization) and noticed around 3.93% CPU time being spent in the "reinsert" logic. I had deliberately removed the use of pre-computed hashes as it felt less useful with the use of a faster hash function and fingerprinting. But it seems like adding it back can still make things faster, especially on pathological cases where the number of buckets (i.e. keys in the table) are large.

We can expect another 5% reduction in latency compared to the table I shared above, and around 10% reduction in latency compared to the current baseline.

Test p50 before p90 before p50 after p90 after
PR #2687 / mix 4023 ms 4185 ms 3661 ms 3859 ms
PR #2687 / numeric 538 ms 591 ms 465 ms 502 ms
OSB / http_logs / multi_term_agg 2241 ms 2282 ms 2012 ms 2055 ms

I have made these minor changes. Please review the diff. Thank you!

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@reta
Copy link
Collaborator

reta commented Aug 23, 2023

I have made these minor changes. Please review the diff. Thank you!

Thanks @ketanv3 , it looks reasonable, I also see that the same optimization is present in the current BytesRefHash (over int-based hashes).

@backslasht
Copy link
Contributor

Looks good to me! thanks @ketanv3

@ketanv3
Copy link
Contributor Author

ketanv3 commented Aug 24, 2023

Hi @dblock, did you get a chance to review this?

@ketanv3 ketanv3 requested a review from msfroh as a code owner August 25, 2023 03:47
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@ketanv3 ketanv3 force-pushed the performance/bytes-ref-hash branch from 8268950 to 58d3394 Compare August 25, 2023 04:40
@github-actions
Copy link
Contributor

Compatibility status:

Checks if related components are compatible with change 58d3394

Incompatible components

Incompatible components: [https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/asynchronous-search.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.client.PitIT.testDeleteAllAndListAllPits

@andrross andrross added backport 2.x Backport to 2.x branch v2.10.0 labels Aug 25, 2023
@andrross andrross merged commit 3a8bbe9 into opensearch-project:main Aug 25, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 25, 2023
* Performance improvements for BytesRefHash

Signed-off-by: Ketan Verma <[email protected]>

* Replace BytesRefHash and clean up alternative implementations

Signed-off-by: Ketan Verma <[email protected]>

* Added t1ha1 to replace xxh3 hash function

Signed-off-by: Ketan Verma <[email protected]>

* Update t1ha1 to use unsignedMultiplyHigh on JDK 18 and above

Signed-off-by: Ketan Verma <[email protected]>

* Add link to the reference implementation for t1ha1

Signed-off-by: Ketan Verma <[email protected]>

* Annotate t1ha1 with @opensearch.internal

Signed-off-by: Ketan Verma <[email protected]>

* Run spotless

Signed-off-by: Ketan Verma <[email protected]>

* Add pre-computed hashes to speed up reinserts

Signed-off-by: Ketan Verma <[email protected]>

* Refactor HashFunctionTestCase

Signed-off-by: Ketan Verma <[email protected]>

---------

Signed-off-by: Ketan Verma <[email protected]>
(cherry picked from commit 3a8bbe9)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross pushed a commit that referenced this pull request Aug 25, 2023
* Performance improvements for BytesRefHash



* Replace BytesRefHash and clean up alternative implementations



* Added t1ha1 to replace xxh3 hash function



* Update t1ha1 to use unsignedMultiplyHigh on JDK 18 and above



* Add link to the reference implementation for t1ha1



* Annotate t1ha1 with @opensearch.internal



* Run spotless



* Add pre-computed hashes to speed up reinserts



* Refactor HashFunctionTestCase



---------


(cherry picked from commit 3a8bbe9)

Signed-off-by: Ketan Verma <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Gaganjuneja pushed a commit to Gaganjuneja/OpenSearch that referenced this pull request Aug 28, 2023
* Performance improvements for BytesRefHash

Signed-off-by: Ketan Verma <[email protected]>

* Replace BytesRefHash and clean up alternative implementations

Signed-off-by: Ketan Verma <[email protected]>

* Added t1ha1 to replace xxh3 hash function

Signed-off-by: Ketan Verma <[email protected]>

* Update t1ha1 to use unsignedMultiplyHigh on JDK 18 and above

Signed-off-by: Ketan Verma <[email protected]>

* Add link to the reference implementation for t1ha1

Signed-off-by: Ketan Verma <[email protected]>

* Annotate t1ha1 with @opensearch.internal

Signed-off-by: Ketan Verma <[email protected]>

* Run spotless

Signed-off-by: Ketan Verma <[email protected]>

* Add pre-computed hashes to speed up reinserts

Signed-off-by: Ketan Verma <[email protected]>

* Refactor HashFunctionTestCase

Signed-off-by: Ketan Verma <[email protected]>

---------

Signed-off-by: Ketan Verma <[email protected]>
Gaganjuneja pushed a commit to Gaganjuneja/OpenSearch that referenced this pull request Aug 28, 2023
* Performance improvements for BytesRefHash

Signed-off-by: Ketan Verma <[email protected]>

* Replace BytesRefHash and clean up alternative implementations

Signed-off-by: Ketan Verma <[email protected]>

* Added t1ha1 to replace xxh3 hash function

Signed-off-by: Ketan Verma <[email protected]>

* Update t1ha1 to use unsignedMultiplyHigh on JDK 18 and above

Signed-off-by: Ketan Verma <[email protected]>

* Add link to the reference implementation for t1ha1

Signed-off-by: Ketan Verma <[email protected]>

* Annotate t1ha1 with @opensearch.internal

Signed-off-by: Ketan Verma <[email protected]>

* Run spotless

Signed-off-by: Ketan Verma <[email protected]>

* Add pre-computed hashes to speed up reinserts

Signed-off-by: Ketan Verma <[email protected]>

* Refactor HashFunctionTestCase

Signed-off-by: Ketan Verma <[email protected]>

---------

Signed-off-by: Ketan Verma <[email protected]>
Signed-off-by: Gagan Juneja <[email protected]>
kkmr pushed a commit to kkmr/OpenSearch that referenced this pull request Aug 28, 2023
* Performance improvements for BytesRefHash

Signed-off-by: Ketan Verma <[email protected]>

* Replace BytesRefHash and clean up alternative implementations

Signed-off-by: Ketan Verma <[email protected]>

* Added t1ha1 to replace xxh3 hash function

Signed-off-by: Ketan Verma <[email protected]>

* Update t1ha1 to use unsignedMultiplyHigh on JDK 18 and above

Signed-off-by: Ketan Verma <[email protected]>

* Add link to the reference implementation for t1ha1

Signed-off-by: Ketan Verma <[email protected]>

* Annotate t1ha1 with @opensearch.internal

Signed-off-by: Ketan Verma <[email protected]>

* Run spotless

Signed-off-by: Ketan Verma <[email protected]>

* Add pre-computed hashes to speed up reinserts

Signed-off-by: Ketan Verma <[email protected]>

* Refactor HashFunctionTestCase

Signed-off-by: Ketan Verma <[email protected]>

---------

Signed-off-by: Ketan Verma <[email protected]>
Signed-off-by: Kiran Reddy <[email protected]>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
* Performance improvements for BytesRefHash

Signed-off-by: Ketan Verma <[email protected]>

* Replace BytesRefHash and clean up alternative implementations

Signed-off-by: Ketan Verma <[email protected]>

* Added t1ha1 to replace xxh3 hash function

Signed-off-by: Ketan Verma <[email protected]>

* Update t1ha1 to use unsignedMultiplyHigh on JDK 18 and above

Signed-off-by: Ketan Verma <[email protected]>

* Add link to the reference implementation for t1ha1

Signed-off-by: Ketan Verma <[email protected]>

* Annotate t1ha1 with @opensearch.internal

Signed-off-by: Ketan Verma <[email protected]>

* Run spotless

Signed-off-by: Ketan Verma <[email protected]>

* Add pre-computed hashes to speed up reinserts

Signed-off-by: Ketan Verma <[email protected]>

* Refactor HashFunctionTestCase

Signed-off-by: Ketan Verma <[email protected]>

---------

Signed-off-by: Ketan Verma <[email protected]>
Signed-off-by: Kaushal Kumar <[email protected]>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
* Performance improvements for BytesRefHash

Signed-off-by: Ketan Verma <[email protected]>

* Replace BytesRefHash and clean up alternative implementations

Signed-off-by: Ketan Verma <[email protected]>

* Added t1ha1 to replace xxh3 hash function

Signed-off-by: Ketan Verma <[email protected]>

* Update t1ha1 to use unsignedMultiplyHigh on JDK 18 and above

Signed-off-by: Ketan Verma <[email protected]>

* Add link to the reference implementation for t1ha1

Signed-off-by: Ketan Verma <[email protected]>

* Annotate t1ha1 with @opensearch.internal

Signed-off-by: Ketan Verma <[email protected]>

* Run spotless

Signed-off-by: Ketan Verma <[email protected]>

* Add pre-computed hashes to speed up reinserts

Signed-off-by: Ketan Verma <[email protected]>

* Refactor HashFunctionTestCase

Signed-off-by: Ketan Verma <[email protected]>

---------

Signed-off-by: Ketan Verma <[email protected]>
Signed-off-by: Ivan Brusic <[email protected]>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
* Performance improvements for BytesRefHash

Signed-off-by: Ketan Verma <[email protected]>

* Replace BytesRefHash and clean up alternative implementations

Signed-off-by: Ketan Verma <[email protected]>

* Added t1ha1 to replace xxh3 hash function

Signed-off-by: Ketan Verma <[email protected]>

* Update t1ha1 to use unsignedMultiplyHigh on JDK 18 and above

Signed-off-by: Ketan Verma <[email protected]>

* Add link to the reference implementation for t1ha1

Signed-off-by: Ketan Verma <[email protected]>

* Annotate t1ha1 with @opensearch.internal

Signed-off-by: Ketan Verma <[email protected]>

* Run spotless

Signed-off-by: Ketan Verma <[email protected]>

* Add pre-computed hashes to speed up reinserts

Signed-off-by: Ketan Verma <[email protected]>

* Refactor HashFunctionTestCase

Signed-off-by: Ketan Verma <[email protected]>

---------

Signed-off-by: Ketan Verma <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch v2.10.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants