Performance: use parallel accumulators to speed up PanamaFloatVectorOps dotProduct and l2Distance (96% recall at 204 qps) #620
Related Issue
#611 #617
Changes
This copies the implementations of dotProduct and l2Distance from Lucene's PanamaVectorUtilSupport. Lucene calls the latter "squaredDistance" rather than "l2Distance", but it's the same computation.
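For reference, the scalar semantics of the two operations look roughly like this (a plain-loop sketch of the definitions, not the SIMD code; method names mirror the Elastiknn ones):

```java
// Plain scalar reference for the two operations. Not the optimized code,
// just the definitions the SIMD implementations must agree with.
static float dotProduct(float[] a, float[] b) {
  float sum = 0f;
  for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Lucene names this squaredDistance; Elastiknn names it l2Distance.
// Both compute the squared Euclidean distance (no square root).
static float l2Distance(float[] a, float[] b) {
  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}
```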
The authors of Lucene have made some really nice optimizations here, which I'm shamelessly copying into Elastiknn. I also considered using their class and methods directly, but they are private or package-private and tricky to instantiate, so I copied over the implementations instead.
The rough idea behind this optimization: when doing anything with Panama vectors, we iterate over the original array in chunks sized to the processor's SIMD width. But we can go faster by unrolling the loop to process four chunks per iteration, each feeding its own accumulator. Because the four accumulators are independent, there is no loop-carried dependency between them, so the CPU can compute the four chunks in parallel; with a single accumulator, each iteration waits on the previous one, and the JVM does not extract that parallelism on its own.
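Here is a minimal sketch of the pattern for dotProduct, assuming JDK 21's incubating Vector API. It illustrates the four-accumulator unrolling, not the exact code copied from Lucene; the class name and bounds handling are illustrative:

```java
// Compile and run with: --add-modules jdk.incubator.vector (JDK 21).
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class UnrolledDotProduct {
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  // Assumes a.length == b.length.
  static float dotProduct(float[] a, float[] b) {
    int len = SPECIES.length();
    int bound = SPECIES.loopBound(a.length);
    int i = 0;
    // Four independent accumulators: no accumulator depends on the previous
    // iteration's result, so the CPU can keep several FMA units busy at once.
    FloatVector acc0 = FloatVector.zero(SPECIES);
    FloatVector acc1 = FloatVector.zero(SPECIES);
    FloatVector acc2 = FloatVector.zero(SPECIES);
    FloatVector acc3 = FloatVector.zero(SPECIES);
    // Unrolled loop: four SIMD chunks per iteration.
    for (; i + 4 * len <= bound; i += 4 * len) {
      acc0 = FloatVector.fromArray(SPECIES, a, i).fma(FloatVector.fromArray(SPECIES, b, i), acc0);
      acc1 = FloatVector.fromArray(SPECIES, a, i + len).fma(FloatVector.fromArray(SPECIES, b, i + len), acc1);
      acc2 = FloatVector.fromArray(SPECIES, a, i + 2 * len).fma(FloatVector.fromArray(SPECIES, b, i + 2 * len), acc2);
      acc3 = FloatVector.fromArray(SPECIES, a, i + 3 * len).fma(FloatVector.fromArray(SPECIES, b, i + 3 * len), acc3);
    }
    // Leftover full-width chunks, one at a time.
    for (; i < bound; i += len) {
      acc0 = FloatVector.fromArray(SPECIES, a, i).fma(FloatVector.fromArray(SPECIES, b, i), acc0);
    }
    float sum = acc0.add(acc1).add(acc2).add(acc3).reduceLanes(VectorOperators.ADD);
    // Scalar tail for the final few elements.
    for (; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }
}
```

The same four-accumulator shape applies to l2Distance, with a lanewise subtract before the fused multiply-add.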
In doing this I also updated from JDK 19 to 21 in the `.tool-versions` file and in the GitHub workflows. JDK 21 is used by the latest Elasticsearch; I was just behind in this repo. I updated the JMH micro benchmarks, and the new implementation is indeed ~4x faster.
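For context, the JMH comparison has roughly this shape (the class name, method names, and vector size here are illustrative, not the exact benchmark code in the repo; it calls the sketch above):

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(1)
public class FloatVectorOpsBenchmark {
  float[] a, b;

  @Setup
  public void setup() {
    // Fixed seed so both benchmarks see identical inputs across runs.
    Random rng = new Random(0);
    a = new float[1024];
    b = new float[1024];
    for (int i = 0; i < a.length; i++) {
      a[i] = rng.nextFloat();
      b[i] = rng.nextFloat();
    }
  }

  @Benchmark
  public float dotProductUnrolled() {
    return UnrolledDotProduct.dotProduct(a, b);
  }

  @Benchmark
  public float dotProductScalarBaseline() {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }
}
```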
The ann-benchmarks also got faster:
There was some variability in the results; they reached as high as 96% recall at 211 qps in f7d082f.
Testing and Validation
Standard CI and benchmarking (see above)