Performance: use parallel accumulators to speed up PanamaFloatVectorOps dotProduct and l2Distance (96% recall at 204 qps) #620
Related Issue
#611 #617
Changes
This copies the implementations of dotProduct and l2Distance from Lucene's PanamaVectorUtilSupport. Lucene calls the latter "squaredDistance" rather than "l2Distance", but it's the same computation.
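For reference, the scalar semantics of the two operations look roughly like this (a plain-loop sketch of the definitions, not the SIMD code; method names mirror the Elastiknn ones):

```java
// Plain scalar reference for the two operations. Not the optimized code,
// just the definitions the SIMD implementations must agree with.
static float dotProduct(float[] a, float[] b) {
  float sum = 0f;
  for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Lucene names this squaredDistance; Elastiknn names it l2Distance.
// Both compute the squared Euclidean distance (no square root).
static float l2Distance(float[] a, float[] b) {
  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}
```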
The authors of Lucene have made some really nice optimizations here, which I'm shamelessly copying into Elastiknn. I also considered using their class and methods directly, but they are private or package-private and tricky to instantiate, so I copied over the implementations instead.
The rough idea behind this optimization: when doing anything with Panama vectors, we iterate over the original array in chunks sized to the processor's SIMD width. But we can go faster by unrolling the loop to process four chunks per iteration, each feeding its own accumulator. Because the four accumulators are independent, there is no loop-carried dependency between them, so the CPU can compute the four chunks in parallel; with a single accumulator, each iteration waits on the previous one, and the JVM does not extract that parallelism on its own.
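Here is a minimal sketch of the pattern for dotProduct, assuming JDK 21's incubating Vector API. It illustrates the four-accumulator unrolling, not the exact code copied from Lucene; the class name and bounds handling are illustrative:

```java
// Compile and run with: --add-modules jdk.incubator.vector (JDK 21).
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class UnrolledDotProduct {
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  // Assumes a.length == b.length.
  static float dotProduct(float[] a, float[] b) {
    int len = SPECIES.length();
    int bound = SPECIES.loopBound(a.length);
    int i = 0;
    // Four independent accumulators: no accumulator depends on the previous
    // iteration's result, so the CPU can keep several FMA units busy at once.
    FloatVector acc0 = FloatVector.zero(SPECIES);
    FloatVector acc1 = FloatVector.zero(SPECIES);
    FloatVector acc2 = FloatVector.zero(SPECIES);
    FloatVector acc3 = FloatVector.zero(SPECIES);
    // Unrolled loop: four SIMD chunks per iteration.
    for (; i + 4 * len <= bound; i += 4 * len) {
      acc0 = FloatVector.fromArray(SPECIES, a, i).fma(FloatVector.fromArray(SPECIES, b, i), acc0);
      acc1 = FloatVector.fromArray(SPECIES, a, i + len).fma(FloatVector.fromArray(SPECIES, b, i + len), acc1);
      acc2 = FloatVector.fromArray(SPECIES, a, i + 2 * len).fma(FloatVector.fromArray(SPECIES, b, i + 2 * len), acc2);
      acc3 = FloatVector.fromArray(SPECIES, a, i + 3 * len).fma(FloatVector.fromArray(SPECIES, b, i + 3 * len), acc3);
    }
    // Leftover full-width chunks, one at a time.
    for (; i < bound; i += len) {
      acc0 = FloatVector.fromArray(SPECIES, a, i).fma(FloatVector.fromArray(SPECIES, b, i), acc0);
    }
    float sum = acc0.add(acc1).add(acc2).add(acc3).reduceLanes(VectorOperators.ADD);
    // Scalar tail for the final few elements.
    for (; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }
}
```

The same four-accumulator shape applies to l2Distance, with a lanewise subtract before the fused multiply-add.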
In doing this I also updated from JDK 19 to 21 in the `.tool-versions` file and in the GitHub workflows. JDK 21 is used by the latest Elasticsearch; I was just behind in this repo. I updated the JMH micro benchmarks, and the new implementation is indeed ~4x faster.
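For context, the JMH comparison has roughly this shape (the class name, method names, and vector size here are illustrative, not the exact benchmark code in the repo; it calls the sketch above):

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(1)
public class FloatVectorOpsBenchmark {
  float[] a, b;

  @Setup
  public void setup() {
    // Fixed seed so both benchmarks see identical inputs across runs.
    Random rng = new Random(0);
    a = new float[1024];
    b = new float[1024];
    for (int i = 0; i < a.length; i++) {
      a[i] = rng.nextFloat();
      b[i] = rng.nextFloat();
    }
  }

  @Benchmark
  public float dotProductUnrolled() {
    return UnrolledDotProduct.dotProduct(a, b);
  }

  @Benchmark
  public float dotProductScalarBaseline() {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }
}
```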
The ann-benchmarks also got faster:
There was some variability in the results; they reached as high as 96% recall at 211 qps in f7d082f.
Testing and Validation
Standard CI and benchmarking (see above)